TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage...
TIE-22306 Data-intensive Programming
Dr. Timo Aaltonen, Department of Pervasive Computing
Data-Intensive Programming
• Lecturer: Timo Aaltonen
  – [email protected]
• Assistants
  – Adnan Mushtaq
  – MSc Antti Luoto
  – MSc Antti Kallonen
Lecturer
• University Lecturer
• Doctoral degree in Software Engineering, TUT, Software Engineering, 2005
• Work history
  – Various positions, TUT, 1995-2010
  – Principal Researcher, System Software Engineering, Nokia Research Center, 2010-2012
  – University lecturer, TUT
Working at the course
• Lectures on Fridays
• Weekly exercises
  – beginning from week #2
• Course work
  – announced next Friday
• Communication
  – http://www.cs.tut.fi/~dip/
• Exam
Weekly Exercises
• Linux class TC217
• At the beginning of the course: hands-on training
• At the end of the course: reception for problems with the course work
• Enrolment is open
• Not compulsory, no credit points
• Two more instances will be added
Course Work
• Using Hadoop tools and framework to solve a typical Big Data problem (in Java)
• Groups of three
• Hardware
  – Your own laptop with self-installed Hadoop
  – Your own laptop with VirtualBox 5.1 and an Ubuntu VM
  – A TUT virtual machine
Exam
• Electronic exam after the course
• Tests understanding rather than exact syntax
• “Use pseudocode to write a MapReduce program which…”
• General questions on Hadoop and related technologies
Today
• Big data
• Data Science
• Hadoop
• HDFS
• Apache Flume
1: Big Data
• The world is drowning in data
  – clickstream data is collected by web servers
  – NYSE generates 1 TB of trade data every day
  – MTC collects 5000 attributes for each call
  – Smart marketers collect purchasing habits
• “More data usually beats better algorithms”
Three Vs of Big Data
• Volume: amount of data
  – Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data
• Velocity: speed of data in and out
  – streaming data from RFID, sensors, …
• Variety: range of data types and sources
  – structured, unstructured
Big Data
• Variability
  – Data flows can be highly inconsistent, with periodic peaks
• Complexity
  – Data comes from multiple sources
  – Linking, matching, cleansing and transforming data across systems is a complex task
Data Science
• Definition: Data science is the activity of extracting insights from messy data
• Facebook analyzes location data
  – to identify global migration patterns
  – to find out the fan bases of different sports teams
• A retailer might track purchases both online and in-store for targeted marketing
Data Science
New Challenges
• Compute-intensiveness
  – raw computing power
• Challenges of data intensiveness
  – amount of data
  – complexity of data
  – speed at which data is changing
Data Storage Analysis
• Hard drive from 1990
  – stores 1,370 MB
  – speed 4.4 MB/s
• Hard drive, 2010s
  – stores 1 TB
  – speed 100 MB/s
Scalability
• The system grows without requiring developers to re-architect their algorithms/application
• Horizontal scaling
• Vertical scaling
Parallel Approach
• Reading from multiple disks in parallel
  – 100 drives each holding 1/100 of the data => 1/100 of the reading time
• Problem: hardware failures
  – replication
• Problem: most analysis tasks need to be able to combine data in some way
  – MapReduce
• Hadoop
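The motivation for reading in parallel can be checked with a little arithmetic. The sketch below uses the drive figures from the Data Storage Analysis slide and assumes decimal units (1 TB = 1,000,000 MB); the function name is hypothetical.

```python
def full_read_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to read all data when it is spread evenly
    over `drives` disks that are read in parallel."""
    return capacity_mb / speed_mb_per_s / drives

# Figures from the slides; 1 TB is taken as 1,000,000 MB (assumption).
drive_1990 = full_read_seconds(1_370, 4.4)
drive_2010 = full_read_seconds(1_000_000, 100)
parallel_2010 = full_read_seconds(1_000_000, 100, drives=100)

print(f"1990 drive, full read:   {drive_1990 / 60:.1f} min")
print(f"2010s drive, full read:  {drive_2010 / 3600:.1f} h")
print(f"100 drives in parallel:  {parallel_2010:.0f} s")
```

Capacity has grown far faster than transfer speed: a full read went from roughly five minutes to almost three hours, while splitting the same terabyte over 100 drives brings it back to about 100 seconds.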
2: Apache Hadoop
• Hadoop is a framework of tools
  – libraries and methodologies
• Operates on large unstructured data sets
• Open source (Apache License)
• Simple programming model
• Scalable
Hadoop
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main systems:
  – Hadoop Distributed File System: self-healing high-bandwidth clustered storage
  – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
Hadoop
• Administrators
  – Installation
  – Monitor/Manage Systems
  – Tune Systems
• End Users
  – Design MapReduce Applications
  – Import and export data
  – Work with various Hadoop Tools
Hadoop
• Developed by Doug Cutting and Michael J. Cafarella
• Based on Google MapReduce technology
• Designed to handle large amounts of data and be robust
• Donated to Apache Foundation in 2006 by Yahoo
Hadoop Design Principles
• Moving computation is cheaper than moving data
• Hardware will fail
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
• Hadoop is not a replacement for SQL, not always fast and efficient, and not meant for quick ad-hoc querying
Hadoop MapReduce
• MapReduce (MR) is the original programming model for Hadoop
• Collocate data with the compute node
  – data access is fast since it is local (data locality)
• Network bandwidth is the most precious resource in the data center
  – MR implementations explicitly model the network topology
Hadoop MapReduce
• MR operates at a high level of abstraction
  – programmer thinks in terms of functions of key and value pairs
• MR is a shared-nothing architecture
  – tasks do not depend on each other
  – failed tasks can be rescheduled by the system
• MR was introduced by Google
  – used for producing search indexes
  – applicable to many other problems too
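To make "functions of key and value pairs" concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It is plain Python, not the Hadoop API; `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the grouping loop only imitates what the framework's shuffle phase would do across machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(_key: int, line: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: sum all counts that were emitted for one word."""
    return word, sum(counts)

def run_job(lines: list[str]) -> dict[str, int]:
    # Shuffle: group intermediate values by key, as the framework would.
    groups: dict[str, list[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    # Each key is reduced independently of all others (shared-nothing).
    return dict(reduce_fn(key, values) for key, values in groups.items())

result = run_job(["big data beats algorithms", "more data more insight"])
print(result)
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, a failed call could simply be run again, which is the property the shared-nothing bullet refers to.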
Hadoop Components
• Hadoop Common
  – A set of components and interfaces for distributed file systems and general I/O
• Hadoop Distributed Filesystem (HDFS)
• Hadoop YARN
  – a resource-management platform, scheduling
• Hadoop MapReduce
  – Distributed programming model and execution environment
Hadoop Stack Transition
Hadoop Ecosystem
• HBase – a scalable, distributed database with support for large tables
• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
• Pig – a high-level data-flow language and execution framework for parallel computation
• Spark – a fast and general compute engine for Hadoop data. Wide range of applications: ETL, machine learning, stream processing, and graph analytics
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDes.
6. Oozie: A workflow engine that enables creating a workflow of jobs composed of any of the above.
3: Hadoop Distributed File System
• Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)
• Based on Google’s GFS (Google File System)
• HDFS provides redundant storage for massive amounts of data
  – using commodity hardware
• Data in HDFS is distributed across all data nodes
  – Efficient for MapReduce processing
HDFS Design
• File system on commodity hardware
  – Survives even with high failure rates of the components
• Supports lots of large files
  – File sizes of hundreds of GB or several TB
• Main design principles
  – Write once, read many times
  – Streaming reads rather than frequent random access
  – High throughput is more important than low latency
HDFS Architecture
• HDFS operates on top of an existing file system
• Files are stored as blocks (default size 128 MB, different from file system blocks)
• File reliability is based on block-based replication
  – Each block of a file is typically replicated across several DataNodes (default replication is 3)
• NameNode stores metadata, manages replication and provides access to files
• No data caching (because of large data sets), but direct reading/streaming from DataNode to client
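As a worked example of the block and replication defaults above, the following sketch computes how many blocks a file occupies and how much raw disk space its replicas need. The 1,000 MB file and the function name are hypothetical; the sketch relies on the fact that the last block of a file only takes as much disk space as the data it holds.

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size stated on the slide
REPLICATION = 3       # default replication factor stated on the slide

def block_layout(file_size_mb: float) -> tuple[int, int, float]:
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # A partially filled last block is not padded to 128 MB on disk,
    # so raw storage is simply file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = block_layout(1_000)  # hypothetical 1,000 MB file
print(blocks, replicas, raw_mb)
```

For this file the NameNode would track 8 blocks and 24 block replicas, and the cluster would use about 3,000 MB of raw disk space.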
HDFS Architecture
• NameNode stores HDFS metadata
  – file names, locations of blocks, file attributes
  – Metadata is kept in RAM for fast lookups
• The number of files in HDFS is limited by the amount of available RAM in the NameNode
  – HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace
HDFS Architecture
• DataNode stores file contents as blocks
  – Different blocks of the same file are stored on different DataNodes
  – The same block is typically replicated across several DataNodes for redundancy
  – Periodically sends a report of all existing blocks to the NameNode
  – DataNodes exchange heartbeats with the NameNode
HDFS Architecture
• Built-in protection against DataNode failure
• If the NameNode does not receive any heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost
• In case of a failing DataNode, block replication is actively maintained
  – The NameNode determines which blocks were on the lost DataNode
  – The NameNode finds other copies of these lost blocks and replicates them to other nodes
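The failure handling above can be sketched as a small simulation. This is an illustration only, not NameNode code: the function names, node names, timestamps and the 30-second timeout are all hypothetical, and new targets are picked in alphabetical order rather than by HDFS's real placement policy.

```python
def lost_nodes(last_heartbeat: dict[str, float], now: float, timeout_s: float) -> set[str]:
    """DataNodes whose latest heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout_s}

def rereplicate(block_map: dict[str, set[str]], all_nodes: set[str],
                lost: set[str], target: int = 3) -> dict[str, set[str]]:
    """Return a new block -> DataNodes map with replication restored."""
    live = all_nodes - lost
    repaired = {}
    for block, holders in block_map.items():
        survivors = holders - lost
        # Live nodes that do not hold this block yet, in a fixed order.
        candidates = sorted(live - survivors)
        # A block can only be copied if at least one replica survived.
        while survivors and len(survivors) < target and candidates:
            survivors = survivors | {candidates.pop(0)}
        repaired[block] = survivors
    return repaired

# Hypothetical cluster state: dn3 has been silent for 61 seconds.
last_heartbeat = {"dn1": 100.0, "dn2": 99.0, "dn3": 40.0, "dn4": 98.0, "dn5": 100.0}
block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}

lost = lost_nodes(last_heartbeat, now=101.0, timeout_s=30.0)
repaired = rereplicate(block_map, set(last_heartbeat), lost)
print(lost, repaired)
```

Both blocks end up with three replicas again, none of them on the lost node.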
HDFS
• HDFS Federation
  – Multiple NameNode servers
  – Multiple namespaces
• High Availability
  – redundant NameNodes
• Heterogeneous Storage and Archival Storage
  – ARCHIVE, DISK, SSD, RAM_DISK
High-Availability (HA) Issues: NameNode Failure
• NameNode failure corresponds to losing all files on a file system
  % sudo rm --dont-do-this /
• For recovery, Hadoop provides two options
  – Backup files that make up the persistent state of the file system
  – Secondary NameNode
• Also some more advanced techniques exist
HA Issues: the secondary NameNode
• The secondary NameNode is not a mirrored NameNode
• It performs required memory-intensive administrative functions
  – The NameNode keeps metadata in memory and writes changes to an edit log
  – The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large
• Keeps a copy of the merged namespace image, which can be used in the event of NameNode failure
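The checkpoint step described above amounts to replaying the edit log on top of the previous namespace image. The sketch below illustrates the idea with plain dictionaries and tuples; the real on-disk image and edit log formats are different, and the operation names and paths used here are hypothetical.

```python
def checkpoint(image: dict[str, dict], edit_log: list[tuple]) -> dict[str, dict]:
    """Replay the edit log on a copy of the previous image and return the new image."""
    new_image = {path: dict(meta) for path, meta in image.items()}
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = {"replication": args[0]}
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "set_replication":
            new_image[path]["replication"] = args[0]
    return new_image

# Hypothetical previous image and the edits recorded since it was written.
image = {"/data/a.log": {"replication": 3}}
edit_log = [
    ("create", "/data/b.log", 3),
    ("set_replication", "/data/a.log", 2),
    ("delete", "/data/b.log"),
    ("create", "/data/c.log", 1),
]

new_image = checkpoint(image, edit_log)
edit_log = []  # after a checkpoint the log can start from empty again
print(new_image)
```

Once the merged image exists, the old edits are no longer needed, which is why periodic checkpoints keep the log from growing without bound.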
Network Topology
• HDFS is aware of how close two nodes are in the network
• Distances, from closer to further
  0: Processes in the same node
  2: Different nodes in the same rack
  4: Nodes in different racks in the same data center
  6: Nodes in different data centers
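One way to arrive at the distances 0, 2, 4 and 6 is to treat the network as a tree (data center, rack, node) and count the hops from each node up to their closest common ancestor. The sketch below assumes locations are written as `/datacenter/rack/node`; that notation and the function name are assumptions made for illustration.

```python
def distance(a: str, b: str) -> int:
    """Network distance between two locations such as '/d1/r1/n1'."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(pa) - common) + (len(pb) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

The four calls print 0, 2, 4 and 6, matching the list above.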
Network Topology
File Block Placement
• Clients always read from the closest node
• Default placement strategy
  – First replica on the same local node as the client
  – Second replica in a different rack
  – Third replica on a different, randomly selected node in the same rack as the second replica
• Additional (3+) replicas are placed randomly
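The default strategy for the first three replicas can be written down as a short function. This is a simplified sketch, not the HDFS implementation: the rack and node names are hypothetical, the client is assumed to run on a DataNode, and only racks with at least two nodes are considered for the second replica so that the third one always fits.

```python
import random

def rack_of(node: str, topology: dict[str, list[str]]) -> str:
    """Return the rack that contains `node`."""
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(client: str, topology: dict[str, list[str]],
                   rng: random.Random) -> list[str]:
    """Choose three DataNodes for one block, following the strategy above."""
    first = client  # replica 1: the client's own node
    local_rack = rack_of(client, topology)
    # Replica 2: a node in a different rack.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])
    # Replica 3: a different node in the same rack as replica 2.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]

# Hypothetical three-rack cluster.
topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas)
```

The result always spans exactly two racks: a whole-rack failure cannot destroy every copy, while only one of the three writes has to cross between racks.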
Balancing
• Hadoop works best when blocks are evenly spread out
• Support for DataNodes of different sizes
  – In the optimal case the disk usage percentage is at approximately the same level in all DataNodes
• Hadoop provides a balancer daemon
  – Re-distributes blocks
  – Should be run when new DataNodes are added
Running Hadoop
• Three configurations
  – standalone
  – pseudo-distributed
  – fully-distributed
  – https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuring HDFS
• Variable HADOOP_CONF_DIR defines the directory for the Hadoop configuration files
• core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>
• hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/NN/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/NN/hadoop/datanode</value>
  </property>
</configuration>
Accessing Data
• Data can be accessed using various methods
  – Java API
  – C API
  – Command line / POSIX (FUSE mount)
  – Command line / HDFS client: Demo
  – HTTP
  – Various tools
HDFS URI
• All HDFS (CLI) commands take path URIs as arguments
• URI example
  – hdfs://localhost:9000/user/hduser/log-data/file1.log
• The scheme and authority are optional
  – /user/hduser/log-data/file1.log
• Relative paths are resolved against the user's home directory
  – log-data/file1.log
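The three path forms above all name the same file. The sketch below shows how they could be expanded to the full URI; it is an illustration rather than the client's actual resolution code, and the default file system `hdfs://localhost:9000` and the user `hduser` are taken from the slide's example as assumptions.

```python
def resolve(path: str, default_fs: str = "hdfs://localhost:9000",
            user: str = "hduser") -> str:
    """Expand a full, absolute or relative HDFS path into a full URI."""
    if "://" in path:
        # Scheme and authority are already present.
        return path
    if path.startswith("/"):
        # Absolute path: prepend the default file system.
        return default_fs + path
    # Relative path: resolved against the home directory /user/<user>.
    return f"{default_fs}/user/{user}/{path}"

for p in ("hdfs://localhost:9000/user/hduser/log-data/file1.log",
          "/user/hduser/log-data/file1.log",
          "log-data/file1.log"):
    print(resolve(p))
```

All three calls print `hdfs://localhost:9000/user/hduser/log-data/file1.log`.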
RDBMS vs HDFS
• Schema-on-Write (RDBMS)
  – Schema must be created before any data can be loaded
  – An explicit load operation transforms data into the DB's internal structure
  – New columns must be added explicitly before new data for such columns can be loaded into the DB
• Schema-on-Read (HDFS)
  – Data is simply copied to the file store, no transformation is needed
  – A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
  – New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
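A small sketch can illustrate late binding. Raw lines are stored untouched, and a parser (standing in for a SerDe) is applied only when the data is read. The record layout, field names and both parser versions below are hypothetical.

```python
# Raw lines as they might sit in the file store; nothing is transformed on write.
RAW = [
    "2016-09-01,alice,42",
    "2016-09-02,bob,17,mobile",  # a fourth field started flowing later
]

def serde_v1(line: str) -> dict:
    """First reader version: knows only three columns, ignores the rest."""
    date, user, amount = line.split(",")[:3]
    return {"date": date, "user": user, "amount": int(amount)}

def serde_v2(line: str) -> dict:
    """Updated reader: also exposes the optional 'channel' column."""
    fields = line.split(",")
    row = serde_v1(line)
    row["channel"] = fields[3] if len(fields) > 3 else None
    return row

rows_v1 = [serde_v1(line) for line in RAW]
rows_v2 = [serde_v2(line) for line in RAW]
print(rows_v1)
print(rows_v2)
```

The stored data never changes; the new column becomes visible for already stored records as soon as the reader is updated, which is what "appear retroactively" means above.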
Conclusions
• Pros
  – Support for very large files
  – Designed for streaming data
  – Commodity hardware
• Cons
  – Not designed for low-latency data access
  – Architecture does not support lots of small files
  – No support for multiple writers / arbitrary file modifications (writes always at the end of the file)
Reading data
Flume
4: Data Modeling
• HDFS is a schema-on-read system
  – allows storing all of your raw data
• Still, the following must be considered
  – Data storage formats
  – Multitenancy
  – Schema design
  – Metadata management
Data Storage Options
• No standard data storage format
  – Hadoop allows storing of data in any format
• Major considerations for data storage include
  – File format (e.g. plain text, SequenceFile or more complex but more functionally rich options, such as Avro and Parquet)
  – Compression (splittability)
  – Data storage system (HDFS, HBase, Hive, Impala)
File Formats: Text File
• Common use case: web logs and server logs
  – come in many formats
• Organization of the files in the file system
• Text files consume space -> compression
• Overhead for conversion (‘123’ -> 123)
• Structured text data
  – XML and JSON present challenges to Hadoop
    • hard to split
  – Dedicated libraries exist
File Formats: Binary Data
• Hadoop can be used to process binary files
  – e.g. images
• Container format is preferred
  – e.g. SequenceFile
• If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format
Hadoop File Types
• Hadoop-specific file formats are specifically created to work well with MapReduce
  – file-based data structures such as sequence files,
  – serialization formats like Avro, and
  – columnar formats such as RCFile and Parquet
• Splittable compression
  – These formats support common compression formats and are also splittable
• Agnostic compression
  – codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec
File-Based Data Structures
• SequenceFile format is one of the most commonly used file-based formats in Hadoop
  – other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …
  – stores data as binary key-value pairs
  – three formats available for records: uncompressed, record-compressed, block-compressed
SequenceFile
• Header metadata
  – compression codec, key and value class names, user-defined metadata, randomly generated sync marker
• Often used as a container for smaller files
Compression
• Also for speeding up MapReduce
  – Not only for reducing storage requirements
• Compression must be splittable
  – The MapReduce framework splits data for input to multiple tasks
HDFS Schema Design
• Hadoop is often a data hub for the entire organization
  – data is shared by many departments and teams
• A carefully structured and organized repository has several benefits
  – standard directory structure makes it easier to share data between teams
  – allows for enforcing access rights and quota
  – conventions regarding e.g. staging data lead to fewer errors
  – code reuse
  – Hadoop tools make assumptions about the data placement
Recommended Locations of Files
• /user/<username>
  – data, JARs, and config files of a specific user
• /etl
  – data in all phases of an ETL workflow
  – /etl/<group>/<application>/<process>/{input,processing,output,bad}
• /tmp
  – temporary data
Recommended Locations of Files
• /data
– datasets shared across the organization
– data is written by automated ETL processes
– read-only for users
– subdirectories for each dataset
• /app
– JARs, Oozie workflow definitions, Hive HQL files, …
– /app/<group>/<application>/<version>/<artifact directory>/<artifact>
![Page 62: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/62.jpg)
Recommended Locations of Files
• /metadata
– the metadata required by some tools
![Page 63: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/63.jpg)
Partitioning
• HDFS has no indexes
– pro: fast to ingest data
– con: might lead to a full table scan (FTS), even when only a portion of the data is needed
• Solution: break the dataset into smaller subsets (partitions)
– an HDFS subdirectory for each partition
– allows queries to read only the specific partitions
![Page 64: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/64.jpg)
Partitioning: Example
• Assume a dataset of all orders from various pharmacies
• Without partitioning, checking the order history for just one physician over the past three months leads to a full table scan
• medication_orders/date=20160824/{order1.csv,order2.csv}
– only 90 directories must be scanned (one per day for three months)
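The date-partitioned layout can be sketched as a function that lists exactly the directories a three-month query has to read. This is a minimal illustration, assuming one `date=YYYYMMDD` subdirectory per day as in the example path; the function name `partition_dirs` is hypothetical.

```python
from datetime import date, timedelta


def partition_dirs(dataset: str, end: date, days: int) -> list[str]:
    """List the daily partition directories covering the last `days` days up to `end`."""
    return [
        f"{dataset}/date={(end - timedelta(days=i)).strftime('%Y%m%d')}"
        for i in range(days)
    ]


dirs = partition_dirs("medication_orders", date(2016, 8, 24), 90)
print(len(dirs))  # number of directories to scan
print(dirs[0])    # most recent partition
```

A query engine that understands this layout can skip every directory outside the list instead of reading the whole dataset.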
![Page 65: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/65.jpg)
5: Data Movement
• File system client for simple usage
• Common data sources for Hadoop include
– traditional data management systems such as relational databases and mainframes
– logs, machine-generated data, and other forms of event data
– files being imported from existing enterprise data storage systems
![Page 66: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/66.jpg)
Data Movement: Considerations
• Timeliness of data ingestion and accessibility
– What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?
• Incremental updates
– How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?
![Page 67: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/67.jpg)
Data Movement: Considerations
• Data access and processing
– Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?
• Source system and data structure
– Where is the data coming from? A relational database? Logs? Is it structured, semistructured, or unstructured data?
![Page 68: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/68.jpg)
Data Movement: Considerations
• Partitioning and splitting of data
– How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?
• Storage format
– What format will the data be stored in?
• Data transformation
– Does the data need to be transformed in flight?
![Page 69: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/69.jpg)
Timeliness of Data Ingestion
• Time lag from when data is available for ingestion to when it's accessible in Hadoop
• Classifications of ingestion requirements:
• Macro batch
– anything over 15 minutes to hours, or even a daily job.
• Micro batch
– fired off every 2 minutes or so, but no more than 15 minutes in total.
• Near-Real-Time Decision Support
– “immediately actionable” by the recipient of the information
– delivered in less than 2 minutes but greater than 2 seconds.
• Near-Real-Time Event Processing
– under 2 seconds, and can be as fast as the 100-millisecond range.
• Real Time
– anything under 100 milliseconds.
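The classification above amounts to a lookup on the acceptable time lag. The sketch below encodes the thresholds from the slide; the function name `ingestion_class` and the treatment of the exact boundary values (for example, whether precisely 15 minutes counts as micro or macro batch) are assumptions, since the slide gives only approximate ranges.

```python
def ingestion_class(lag_seconds: float) -> str:
    """Map an acceptable ingestion lag (in seconds) to a timeliness class."""
    if lag_seconds < 0:
        raise ValueError("lag must be non-negative")
    if lag_seconds < 0.1:          # under 100 milliseconds
        return "real time"
    if lag_seconds < 2:            # 100 ms up to 2 seconds
        return "near-real-time event processing"
    if lag_seconds < 120:          # 2 seconds up to 2 minutes
        return "near-real-time decision support"
    if lag_seconds <= 15 * 60:     # 2 minutes up to 15 minutes (boundary is an assumption)
        return "micro batch"
    return "macro batch"           # anything over 15 minutes


for lag in (0.05, 1, 30, 300, 3600):
    print(lag, "->", ingestion_class(lag))
```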
![Page 70: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/70.jpg)
Incremental Updates
• Data is either appended to an existing dataset or modified
– HDFS works fine for append-only implementations, where new data arrives as new files
• The downside of HDFS is the inability to do appends or random writes to files after they're created
• HDFS is optimized for large files
– if the requirements call for a two-minute append process that ends up producing lots of small files, a periodic process that combines the smaller files is required to get the benefits of larger files
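The periodic compaction step can be sketched as follows. This is an illustration only: it runs on a local file system as a stand-in for HDFS, assumes newline-delimited text records, and the function name `compact_small_files` is hypothetical. On a real cluster the same idea would typically be implemented as a scheduled job.

```python
from pathlib import Path


def compact_small_files(small_files, target) -> Path:
    """Concatenate many small record files into one larger file, in name order."""
    target = Path(target)
    with target.open("wb") as out:
        for path in sorted(Path(f) for f in small_files):
            data = path.read_bytes()
            out.write(data)
            # Keep records separated if a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return target
```

A typical use would pass the files produced during the last hour or day and write the merged result next to them, removing the small files only after the merged file has been written successfully.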
![Page 71: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/71.jpg)
Original Source System and Data Structure
• Original file type
– any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …
• Hadoop can accept any file format
– not all formats are optimal for particular use cases
– not all file formats can work with all tools in the Hadoop ecosystem, for example variable-length files
![Page 72: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/72.jpg)
Compression
• Pro
– transferring a compressed file over the network requires less I/O and network bandwidth
• Con
– most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)
![Page 73: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/73.jpg)
Misc
• RDBMS
– Tool: Sqoop
• Streaming data
– Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server
– Tools: Flume or Kafka
• Log files
– an anti-pattern is to read the log files from disk as they are written, because this is almost impossible to implement without losing data
– the correct way of ingesting log files is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead
![Page 74: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/74.jpg)
Transformations
• Modifications on incoming data, distributing the data into partitions or buckets, sending the data to more than one store or location
– Transformation: XML or JSON is converted to delimited data
– Partitioning: incoming data is stock trade data and partitioning by ticker is required
– Splitting: the data needs to land in HDFS and HBase for different access patterns
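The first two kinds of in-flight work, transformation and partitioning, can be sketched together. The example below converts JSON records to delimited (CSV) text and groups them by ticker; the field names `ticker`, `price`, and `volume`, the `ticker=<symbol>` partition naming, and the function name `json_to_partitioned_csv` are all assumptions made for the illustration.

```python
import csv
import io
import json
from collections import defaultdict


def json_to_partitioned_csv(lines, fields=("ticker", "price", "volume")):
    """Convert JSON trade records to CSV text, grouped into one partition per ticker."""
    rows_by_ticker = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        rows_by_ticker[record["ticker"]].append([record[f] for f in fields])

    partitions = {}
    for ticker, rows in rows_by_ticker.items():
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        partitions[f"ticker={ticker}"] = buf.getvalue()
    return partitions


# Made-up sample input.
trades = [
    '{"ticker": "AAA", "price": 10.5, "volume": 100}',
    '{"ticker": "BBB", "price": 3, "volume": 7}',
    '{"ticker": "AAA", "price": 11, "volume": 50}',
]
for name, text in json_to_partitioned_csv(trades).items():
    print(name)
    print(text, end="")
```

Splitting would add a second sink for the same records; it is left out here because it depends on the target systems.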
![Page 75: TIE-22306 Data-intensive Programming - TUNIdip/slides/slides1.1.pdf · 2016-09-01 · Data Storage Options • No standard data storage format – Hadoop allows storing of data in](https://reader033.fdocument.pub/reader033/viewer/2022042219/5ec540914db2f566f21dae32/html5/thumbnails/75.jpg)
Data Ingestion Options
• File transfers
• Tools like Flume, Sqoop, and Kafka