![Page 1: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/1.jpg)
A Data Transformation Service in Cloud Infrastructures
Κατρής Δημήτριος
1 February 2010
National and Kapodistrian University of Athens
![Page 2: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/2.jpg)
Outline
• Introduction
  – Data transformation
  – gDTS
• Transformation Model
• Core Functionality
• System Model
  – Architecture
  – Designating the number of workers
• Evaluation
![Page 3: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/3.jpg)
Introduction: DT Usefulness
• Digital Libraries
  – Digital data preservation
    • old data to new specifications
  – Content Security
    • watermarking
• Adaptive content delivery
  – bandwidth limitations + special characteristics of internet devices
    • require transformations of the source data to different quality and/or formats
• Content visualization
  – Data representation can be different from its visualization
    • presenting content requires a sort of transformation
• Text extraction
• Others
  – data migration, database wrappers, ontology mappings, etc.
• We focus on per-document transformations (transcoding)
![Page 4: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/4.jpg)
Introduction: gDTS features
• Generic transformation framework
  – based on pluggable components (transformation programs)
    • reveal the transformation capabilities of the framework
    • we are able to furnish domain- and application-specific data transformations
• Automatic transformation discovery
  – content type of a source object + target content type
    • appropriate transformation is automatically selected
    • a chain of transformations may be performed
• Operates in several environments
    • WSRF-compliant service
    • stand-alone executable
• Workload distribution
  – harnesses computational resources from a cloud infrastructure
![Page 5: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/5.jpg)
Transformation Model
• Content Types identification
  – MIME type specification
    • media type + subtype + set of parameters “attribute=value”, e.g.
      – text/html; charset=“iso-8859-7”
      – image/jpeg; width=“1024”, height=“768”
    • provides compliance with mainstream applications
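A minimal Python sketch of how a content type string of this form could be split into media type, subtype, and parameters. The `ContentType` class and `parse_content_type` function are hypothetical names and not part of gDTS. The parser accepts both `;` and `,` between parameters because the examples on this slide use both.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContentType:
    media_type: str
    subtype: str
    params: dict = field(default_factory=dict)


def parse_content_type(text: str) -> ContentType:
    """Split 'type/subtype; a="1", b="2"' into its parts."""
    # Normalize the typographic quotes that appear in the slides.
    text = text.replace("“", '"').replace("”", '"')
    head, _, tail = text.partition(";")
    media_type, _, subtype = head.strip().partition("/")
    params = {}
    # Parameters may be separated by ';' or ',' in the slide examples.
    for chunk in re.split(r"[;,]", tail):
        if "=" in chunk:
            name, _, value = chunk.partition("=")
            params[name.strip().lower()] = value.strip().strip('"')
    return ContentType(media_type.lower(), subtype.lower(), params)


if __name__ == "__main__":
    print(parse_content_type('image/jpeg; width="1024", height="768"'))
```

Keeping the parameters in a dictionary makes the later comparison of content type parameters a plain key/value lookup.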
![Page 6: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/6.jpg)
Transformation Model
• Programs
  – the software used to perform the conversion
• Transformation Programs
  – references one program
  – describes its transformation capabilities
    • contains one or more transformation capabilities (Transformation Units)
![Page 7: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/7.jpg)
Transformation Model
• Transformation Program Example
<Name>ImageMagickWrapper</Name>
<Program>
  <Software>
    <Package>
      <ID>dts_programs_bundle</ID>
      <Location>http://repo.di.uoa.gr/programs/dts_programs_bundle.tar.gz</Location>
    </Package>
    <Package>
      <ID>package_apache_poi</ID>
      <Location>http://repo.di.uoa.gr/programs/imagemagick.tar.gz</Location>
    </Package>
  </Software>
  <Class>org.gcube.datatransformation.datatransformationlibrary.programs.applications.ImageMagickWrapper</Class>
</Program>
<TransformationUnits>
.
.
.
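As an illustration, the program part of such a descriptor could be read with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the slide shows no enclosing root element, so `<TransformationProgram>` is a hypothetical wrapper. The elided `<TransformationUnits>` section is left out rather than guessed. `read_program` is an illustrative name, not a gDTS API.

```python
import xml.etree.ElementTree as ET

# The root element name is an assumption; the slide shows only its children.
DESCRIPTOR = """
<TransformationProgram>
  <Name>ImageMagickWrapper</Name>
  <Program>
    <Software>
      <Package>
        <ID>dts_programs_bundle</ID>
        <Location>http://repo.di.uoa.gr/programs/dts_programs_bundle.tar.gz</Location>
      </Package>
    </Software>
    <Class>org.gcube.datatransformation.datatransformationlibrary.programs.applications.ImageMagickWrapper</Class>
  </Program>
</TransformationProgram>
"""


def read_program(xml_text: str) -> dict:
    """Extract the program name, entry class, and software packages."""
    root = ET.fromstring(xml_text)
    packages = [
        (pkg.findtext("ID"), pkg.findtext("Location"))
        for pkg in root.findall("./Program/Software/Package")
    ]
    return {
        "name": root.findtext("Name"),
        "entry_class": root.findtext("./Program/Class"),
        "packages": packages,
    }


if __name__ == "__main__":
    print(read_program(DESCRIPTOR))
```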
![Page 8: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/8.jpg)
Transformation Model
• Transformation Unit
  – Describes
    • one program capability
    • the way the program is to be used in order to perform a transformation
      – sets program parameters
  – Contains
    • one or more source content types
    • single target content type
    • proper program parameters
  – Can be composite
    • references other transformation units
    • performs consecutive transformations over a source object
  – Can have multiple sources
    • in order to combine documents
    • handling multipart documents
      – cleaner approach
  – Other features
    • wildcards in the content types of transformation units
      – image/jpeg -> image/jpeg; width=“*”, height=“*”
      – */* -> application/zip
    • wildcards in the program parameters
      – the ‘-’ enforces the presence of a program parameter value
![Page 9: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/9.jpg)
Transformation Model
• Transformation Unit Example
  – Source content type (image/tiff)
  – Target content type (image/tiff; security=“watermarking”)
  – Program parameters
    • name="method" value="composite" isOptional="false"
    • name="dissolve" value="15" isOptional="false"
    • name="tile" value="-" isOptional="false"
• Transformation Graph
  – nodes
    • content types
  – edges
    • transformation units
  – usage
    • finds transformation units so as to perform an object transformation from its content type (source) to a target content type
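The graph lookup can be sketched in Python as a search over content types (nodes) connected by transformation units (edges). Breadth-first search is an assumption chosen here because it returns a shortest chain. The slides do not say which search gDTS uses. The unit names in `UNITS` are hypothetical, and content types are compared as plain strings for brevity.

```python
from collections import deque

# Hypothetical registry: each transformation unit is (name, source, target).
UNITS = [
    ("png2jpeg", "image/png", "image/jpeg"),
    ("jpeg2tiff", "image/jpeg", "image/tiff"),
    ("tiff_watermark", "image/tiff", 'image/tiff; security="watermarking"'),
]


def find_path(units, source, target):
    """Return the shortest list of unit names from source to target, or None."""
    edges = {}
    for name, src, dst in units:
        edges.setdefault(src, []).append((name, dst))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for name, nxt in edges.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [name]))
    return None


if __name__ == "__main__":
    print(find_path(UNITS, "image/png", "image/tiff"))
```

A returned list with more than one unit corresponds to a chain of consecutive transformations.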
![Page 10: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/10.jpg)
Core Functionality
• Data Handlers
  – Data Sources
    • supply gDTS with input data
  – Data Sinks
    • store the resulting transformed data
  – Data element
    • Envelope of a data object
    • Contains
      – content type of the object
      – reference to the content or the raw content itself
    • single or multipart
      – Multipart
        » contains nested envelopes
        » Content types: multipart/mixed, multipart/alternative
  – Data element buffers
    • Are both sources and sinks
    • Internal use
![Page 11: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/11.jpg)
Core Functionality
• Data Handlers:
  – Initialization:
    • Caller specifies
      – transfer mechanism/protocol
        » e.g. ftp, http
      – parameters for I/O
        » e.g. hostname, port, password
  – Function:
    • Data source
      – sequential access of data elements
    • Data sink
      – write the transformed data elements to the destination
  – Advantage
    • abstraction over the original data source and destination
    • uniform means to process data
      – transfer protocols can be different.
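The abstraction described above can be sketched in Python with two small interfaces. All class and function names here (`DataElement`, `DataSource`, `DataSink`, `InMemorySource`, `InMemorySink`, `transform_all`) are hypothetical and only illustrate the idea. Real handlers would wrap ftp, http, or another protocol behind the same methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class DataElement:
    content_type: str
    content: bytes


class DataSource(ABC):
    @abstractmethod
    def elements(self) -> Iterator[DataElement]:
        """Give sequential access to the data elements."""


class DataSink(ABC):
    @abstractmethod
    def write(self, element: DataElement) -> None:
        """Store one transformed data element."""


class InMemorySource(DataSource):
    def __init__(self, items: Iterable[DataElement]):
        self._items = list(items)

    def elements(self) -> Iterator[DataElement]:
        yield from self._items


class InMemorySink(DataSink):
    def __init__(self) -> None:
        self.stored: List[DataElement] = []

    def write(self, element: DataElement) -> None:
        self.stored.append(element)


def transform_all(source: DataSource, sink: DataSink,
                  convert: Callable[[DataElement], DataElement]) -> None:
    # The loop does not know which protocol sits behind source or sink.
    for element in source.elements():
        sink.write(convert(element))
```

Because `transform_all` depends only on the two interfaces, the source and the sink can use different transfer protocols without any change to the processing loop.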
![Page 12: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/12.jpg)
Core Functionality
• Transformation Graph
  – Usage
    • finds applicable transformation units from source to target content types
  – Result
    • transformation unit or path
  – Exact match
    » media type of Cs matches the media type of Ctu-s,
    » subtype of Cs matches the subtype of Ctu-s,
    » number of CTP of Cs equals the number of CTP of Ctu-s
    » each CTP in Cs matches a CTP in Ctu-s
  – Approximate match
    » #CTP of Cs can be greater than #CTP of Ctu-s
  – Same conditions must hold between Ct and Ctu-t
Cs Source content type
Ct Target content type
Ctu-s Source content type of TU
Ctu-t Target content type of TU
CTP Content type parameters
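The two matching rules can be sketched in Python as follows. Content types are represented as `(media_type, subtype, params)` tuples, which is an assumption made for brevity. The approximate rule is read here as "every parameter of Ctu-s must appear in Cs with the same value, while Cs may carry extra parameters", which is one interpretation of the slide. Wildcards are not handled.

```python
def exact_match(cs, ctu):
    """cs and ctu are (media_type, subtype, params_dict) tuples."""
    same_type = cs[0] == ctu[0] and cs[1] == ctu[1]
    # Equal dictionaries imply the same number of parameters and equal values.
    return same_type and cs[2] == ctu[2]


def approximate_match(cs, ctu):
    """Cs may have extra parameters; each parameter of Ctu must match in Cs."""
    if cs[0] != ctu[0] or cs[1] != ctu[1]:
        return False
    return all(name in cs[2] and cs[2][name] == value
               for name, value in ctu[2].items())


if __name__ == "__main__":
    source = ("image", "png", {"width": "1024px", "height": "1024px"})
    unit_source = ("image", "png", {})
    print(exact_match(source, unit_source), approximate_match(source, unit_source))
```

The same two functions would be applied a second time to compare Ct with Ctu-t.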
![Page 13: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/13.jpg)
Core Functionality
• Transformation Graph
  – Overall steps
    • Search for an existing transformation unit with exact match
    • Search for paths in the graph with exact match
      – composite transformation unit created
        » references the transformation units that comprise the path
        » registered in the transformation program registry
    • Perform the same steps with approximate match instead of exact.
  – Approximate matches occur often
    • Transformation may not be affected by CT parameters, e.g.
      – Source object: image/png; width=“1024px”, height=“1024px”
      – TU: image/png -> image/jpeg
      – Target Content Type: image/jpeg
  – Maintenance
    • transformation program registry
      – contains the transformation programs
    • periodic update
      – if change happens in registry
![Page 14: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/14.jpg)
Core Functionality
• Program Execution
  – Deployment
    • transformation program includes the location of the software
      – download and install this software
    • loads all the deployed libraries
  – Invocation
    • entry class is specified in the transformation program
![Page 15: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/15.jpg)
Core Functionality
• Internal operation
![Page 16: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/16.jpg)
Core Functionality
• Comment
  – download, conversion, and storing time periods overlap
    • can improve the performance
    • one procedure can be a bottleneck
      – buffers accept a certain amount of objects
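The overlap of download, conversion, and storing can be sketched in Python with one thread per stage and bounded queues between them. This illustrates the idea only and is not the gDTS implementation. The names `stage` and `run_pipeline` and the default `buffer_size` are assumptions. A full bounded queue blocks the faster stage, which is how a slow stage becomes visible as the bottleneck.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(work, inbox, outbox):
    """Apply work to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(work(item))


def run_pipeline(items, download, convert, store, buffer_size=4):
    to_download = queue.Queue(maxsize=buffer_size)
    to_convert = queue.Queue(maxsize=buffer_size)
    to_store = queue.Queue(maxsize=buffer_size)
    results = queue.Queue()  # unbounded, so the last stage never blocks
    threads = [
        threading.Thread(target=stage, args=(download, to_download, to_convert)),
        threading.Thread(target=stage, args=(convert, to_convert, to_store)),
        threading.Thread(target=stage, args=(store, to_store, results)),
    ]
    for t in threads:
        t.start()
    for item in items:
        to_download.put(item)  # blocks while the first buffer is full
    to_download.put(_DONE)
    for t in threads:
        t.join()
    out = []
    while True:
        result = results.get()
        if result is _DONE:
            return out
        out.append(result)
```

A call such as `run_pipeline(names, fetch, transcode, save, buffer_size=2)` would keep at most two objects waiting in front of each stage, where `fetch`, `transcode`, and `save` stand for caller-supplied functions.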
![Page 17: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/17.jpg)
System Model
• gDTS targets
  – high transformation rates
  – effective utilization of hardware resources
• Why cloud
  – Usability
    • due to virtualization technologies
      – OS or any pre-installed programs are specified by the user
      – root access to VM (facilitates program deployment)
  – On-demand resource provisioning
    • VMs easily created and destroyed on demand
    • other job submission frameworks (Torque in clusters or grid)
      – jobs may wait in queues for hours or even days until other jobs end
      – we need to control and adjust the number of workers participating in each transformation
![Page 18: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/18.jpg)
System Model
• Architecture
  – Master – Worker pattern
  – Master (Coordinator)
    • Supplies workers with objects to transform
    • Designates the number of workers
![Page 19: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/19.jpg)
System Model
![Page 20: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/20.jpg)
System Model
• Comments
  – gDTS implemented as a stateful WSRF service
    • WSRF: Factory design pattern
  – Data elements are requested by the workers
    • underlying infrastructure may pose network restrictions on the hosting nodes
      – Firewalls, NAT
    • outbound http calls issued by the workers towards the coordinator service are generally permitted
  – Transformation graph
    • implemented as a remote web service
      – workers are not overloaded with graph maintenance
![Page 21: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/21.jpg)
System Model
• Designating the number of workers
  – Problem: How many workers to use?
    • Considerations
      – Performance
      – Cost (resources and money)
  – Performance
    • Using “many” workers may increase performance but
    • Bottlenecks may appear
      – Possible causes
        » Bandwidth or CPU limitations of sources or sinks
        » Lack of resources in the cloud
      – Result
        » Under-utilization of workers
  – Cost
    • Under-utilization has two drawbacks
      – Resources are occupied without being used
        » Other apps may need to use these resources
      – In commercial clouds we pay to occupy resources
![Page 22: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/22.jpg)
System Model
• Designating the number of workers
  – Solutions
    • Client specifies the number of workers
      – Is supported
      – But clients are agnostic to the runtime conditions of the infrastructure
    • gDTS estimates its needs before the transformation starts
      – i.e. calculates bandwidth and computing requirements
      – Drawbacks
        » The amount and nature of data stored in the sources may not be known in advance
        » The transformations are not known in advance
        » Complicated
    • Adaptive approach
      – The number of worker nodes is managed at runtime.
        » Based on the transformation rate
![Page 23: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/23.jpg)
System Model
• Adaptive approach
  – iterative procedure
    • monitor the transformation rate for a period of time
    • alter the number of workers (workers step)
      – a policy module determines the workers step based on:
        » transformation rate
        » the number of workers used during each measurement
  – Policy modules
    • any policy that fits a specific deployment environment can be plugged in
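The iterative procedure can be sketched in Python as a loop that measures the rate, asks a pluggable policy for the next workers step, and applies it. Everything here is illustrative: `adapt_workers`, `greedy_policy`, the 5% threshold, the worker limits, and the simulated rate function are assumptions and do not come from gDTS.

```python
def adapt_workers(measure_rate, policy, start_workers=1, first_step=1,
                  rounds=10, max_workers=64):
    """Measure the rate repeatedly and let the policy choose each workers step.

    Returns the list of (workers, rate) measurements.
    """
    workers = start_workers
    history = [(workers, measure_rate(workers))]
    step = first_step
    for _ in range(rounds):
        prev_workers, prev_rate = history[-1]
        workers = max(1, min(max_workers, prev_workers + step))
        if workers == prev_workers:
            break  # the policy asked for no change, or a limit was reached
        rate = measure_rate(workers)
        history.append((workers, rate))
        # The policy sees both measurements and the step that separated them.
        step = policy(prev_rate, rate, prev_workers, workers - prev_workers)
    return history


def greedy_policy(prev_rate, rate, prev_workers, step):
    """Hypothetical policy: add one worker while the rate grows by over 5%."""
    return 1 if rate > prev_rate * 1.05 else 0


if __name__ == "__main__":
    # Simulated environment in which a bottleneck caps the rate at 4 workers.
    print(adapt_workers(lambda w: min(w, 4) * 10.0, greedy_policy))
```

Passing the policy as a plain function keeps the loop unchanged when a different deployment environment needs a different rule.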
![Page 24: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/24.jpg)
System Model
• Simple policy used by gDTS
  – We define the variable ratio as:
    • ratio = (current_rate - prev_rate) / (prev_rate * (prev_workers_num + prev_workers_step) / prev_workers_num - prev_rate)
    • If ratio > ratio_of_efficiency (value set by the client)
      – Continue adding workers
    • If ratio = ratio_of_efficiency
      – We do not change the number of workers
    • If ratio < ratio_of_efficiency
      – We remove workers
  – Example
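A small Python sketch of this policy, reading the ratio as the measured rate gain divided by the gain that linear scaling with the added workers would predict. The grouping of terms is an assumption about the intended formula. The function names and all numbers in the usage lines are hypothetical and are not measurements from the presentation.

```python
def efficiency_ratio(current_rate, prev_rate, prev_workers_num, prev_workers_step):
    """Actual rate gain divided by the gain expected under linear scaling."""
    expected_rate = prev_rate * (prev_workers_num + prev_workers_step) / prev_workers_num
    return (current_rate - prev_rate) / (expected_rate - prev_rate)


def next_action(ratio, ratio_of_efficiency):
    """Map the ratio to the three cases of the simple policy."""
    if ratio > ratio_of_efficiency:
        return "add"
    if ratio < ratio_of_efficiency:
        return "remove"
    return "keep"


if __name__ == "__main__":
    # Hypothetical numbers: 4 workers gave 100 objects/min, 2 more gave 130.
    ratio = efficiency_ratio(130.0, 100.0, 4, 2)
    print(ratio, next_action(ratio, 0.5))
```

With these numbers linear scaling would predict 150 objects/min, so the 30 gained out of an expected 50 gives a ratio of 0.6, and a client threshold of 0.5 leads to adding more workers.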
![Page 25: A Data Transformation Service in Cloud Infrastructures Κατρής Δημήτριος 1 Φεβρουαρίου 2010 Εθνικό και Καποδιστριακό Πανεπιστήμιο](https://reader036.fdocument.pub/reader036/viewer/2022062304/56649e885503460f94b8c937/html5/thumbnails/25.jpg)
Evaluation
• CPU Intensive transformation