Google Cloud Computing Solutions 5: Hadoop Technical Review
Google Cluster Computing Faculty Training Workshop
Module V: Hadoop Technical Review
© Spinnaker Labs, Inc.

Overview
  • Hadoop Technical Walkthrough
  • HDFS
  • Databases
  • Using Hadoop in an Academic Environment
  • Performance tips and other tools

You Say, “tomato…”
  Google calls it:      Hadoop equivalent:
  MapReduce             Hadoop
  GFS                   HDFS
  Bigtable              HBase
  Chubby                (nothing yet… but planned)

Some MapReduce Terminology
  • Job – A “full program”: an execution of a Mapper and Reducer across a data set
  • Task – An execution of a Mapper or a Reducer on a slice of data
      • a.k.a. Task-In-Progress (TIP)
  • Task Attempt – A particular instance of an attempt to execute a task on a machine

Terminology Example
  • Running “Word Count” across 20 files is one job
  • 20 files to be mapped imply 20 map tasks + some number of reduce tasks
  • At least 20 map task attempts will be performed… more if a machine crashes, etc.

Task Attempts
  • A particular task will be attempted at least once, possibly more times if it crashes
      • If the same input causes crashes over and over, that input will eventually be abandoned
  • Multiple attempts at one task may occur in parallel with speculative execution turned on
      • Task ID from TaskInProgress is not a unique identifier; don’t use it that way

MapReduce: High Level

Node-to-Node Communication
  • Hadoop uses its own RPC protocol
  • All communication begins in slave nodes
      • Prevents circular-wait deadlock
      • Slaves periodically poll for “status” message
  • Classes must provide explicit serialization

Nodes, Trackers, Tasks
  • Master node runs JobTracker instance, which accepts Job requests from clients
  • TaskTracker instances run on slave nodes
  • TaskTracker forks separate Java process for task instances

Job Distribution
  • MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
  • Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
  • … Where’s the data distribution?

Data Distribution
  • Implicit in design of MapReduce!
      • All mappers are equivalent; so map whatever data is local to a particular node in HDFS
  • If lots of data does happen to pile up on the same node, nearby nodes will map instead
      • Data transfer is handled implicitly by HDFS

Configuring With JobConf
  • MR Programs have many configurable options
  • JobConf objects hold (key, value) components mapping String → 'a
      • e.g., “mapred.map.tasks” → 20
      • JobConf is serialized and distributed before running the job
  • Objects implementing JobConfigurable can retrieve elements from a JobConf

What Happens In MapReduce?
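
The JobConf slide describes a table of String keys mapped to option values that is serialized before the job runs. The sketch below illustrates that idea in plain Java only. `ConfSketch` and its methods are hypothetical names, not Hadoop's `JobConf` API, and the `key=value` text output is a simplification (the slides say Hadoop serializes the configuration to XML).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a JobConf-style option table; not Hadoop's class. */
public class ConfSketch {
    // Options are kept as text so the whole table is trivially serializable.
    private final Map<String, String> options = new LinkedHashMap<>();

    /** Stores an option under a String key. */
    public void set(String key, String value) {
        options.put(key, value);
    }

    /** Reads an option as an int, falling back to defaultValue when the key is unset. */
    public int getInt(String key, int defaultValue) {
        String raw = options.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    /** Serializes all options as key=value lines, in insertion order. */
    public String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ConfSketch conf = new ConfSketch();
        conf.set("mapred.map.tasks", "20");   // the example key from the slide
        System.out.println(conf.getInt("mapred.map.tasks", 1));
        System.out.print(conf.serialize());
    }
}
```

Storing every value as text is one possible design choice here: it keeps serialization simple, at the cost of parsing on each typed read.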

Depth First

Job Launch Process: Client
  • Client program creates a JobConf
      • Identify classes implementing Mapper and Reducer interfaces
          JobConf.setMapperClass(), setReducerClass()
      • Specify inputs, outputs
          JobConf.setInputPath(), setOutputPath()
      • Optionally, other options too:
          JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…

Job Launch Process: JobClient
  • Pass JobConf to JobClient.runJob() or submitJob()
      • runJob() blocks, submitJob() does not
  • JobClient:
      • Determines proper division of input into InputSplits
      • Sends job data to master JobTracker server

Job Launch Process: JobTracker
  • JobTracker:
      • Inserts jar and JobConf (serialized to XML) in shared location
      • Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker
  • TaskTrackers running on slave nodes periodically query JobTracker for work
  • Retrieve job-specific jar and config
  • Launch task in separate instance of Java
      • main() is provided by Hadoop

Job Launch Process: Task
  • TaskTracker.Child.main():
      • Sets up the child TaskInProgress attempt
      • Reads XML configuration
      • Connects back to necessary MapReduce components via RPC
      • Uses TaskRunner to launch user process

Job Launch Process: TaskRunner
  • TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
      • Task knows ahead of time which InputSplits it should be mapping
      • Calls Mapper once for each record retrieved from the InputSplit
  • Running the Reducer is much the same

Creating the Mapper
  • You provide the instance of Mapper
      • Should extend MapReduceBase
  • One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
      • Exists in separate process from all other instances of Mapper – no data sharing!

Mapper
  void map(WritableComparable key,
           Writable value,
           OutputCollector output,
           Reporter reporter)

What is Writable?
  • Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
  • All values are instances of Writable
  • All keys are instances of WritableComparable

Writing For Cache Coherency
  while (more input exists) {
    myIntermediate = new intermediate(input);
    myIntermediate.process();
    export outputs;
  }

Writing For Cache Coherency
  myIntermediate = new intermediate(junk);
  while (more input exists) {
    myIntermediate.setupState(input);
    myIntermediate.process();
    export outputs;
  }

Writing For Cache Coherency
  • Running the GC takes time
  • Reusing locations allows better cache usage
  • Speedup can be as much as two-fold
  • All serializable types must be Writable anyway, so make use of the interface

Getting Data To The Mapper

Reading Data
  • Data sets are specified by InputFormats
      • Defines input data (e.g., a directory)
      • Identifies partitions of the data that form an InputSplit
      • Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends
  • TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
  • KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
  • SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
  • SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())

Filtering File Inputs
  • FileInputFormat will read all files out of a specified directory and send them to the mapper
  • Delegates filtering this file list to a method subclasses may override
      • e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list

Record Readers
  • Each InputFormat provides its own RecordReader implementation
      • Provides (unused?) capability multiplexing
  • LineRecordReader – Reads a line from a text file
  • KeyValueRecordReader – Used by KeyValueTextInputFormat

Input Split Size
  • FileInputFormat will divide large files into chunks
      • Exact size controlled by mapred.min.split.size
  • RecordReaders receive file, offset, and length of chunk
  • Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”

Sending Data To Reducers
  • Map function receives OutputCollector object
      • OutputCollector.collect() takes (k, v) elements
  • Any (WritableComparable, Writable) can be used

WritableComparator
  • Compares WritableComparable data
      • Will call WritableComparable.compare()
      • Can provide fast path for serialized data
  • JobConf.setOutputValueGroupingComparator()

Sending Data To The Client
  • Reporter object sent to Mapper allows simple asynchronous feedback
      • incrCounter(Enum key, long amount)
      • setStatus(String msg)
  • Allows self-identification of input
      • InputSplit getInputSplit()

Partition And Shuffle

Partitioner
  • int getPartition(key, val, numPartitions)
      • Outputs the partition number for a given key
      • One partition == values sent to one Reduce task
  • HashPartitioner used by default
      • Uses key.hashCode() to return partition num
  • JobConf sets Partitioner implementation

Reduction
  • reduce(WritableComparable key,
           Iterator values,
           OutputCollector output,
           Reporter reporter)
  • Keys & values sent to one partition all go to the same reduce task
  • Calls are sorted by key – “earlier” keys are reduced and output before “later” keys

Finally: Writing The Output

OutputFormat
  • Analogous to InputFormat
  • TextOutputFormat – Writes “key val\n” strings to output file
  • SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
  • NullOutputFormat – Discards output

HDFS

HDFS Limitations
  • “Almost” GFS
      • No file update options (record append, etc); all files are write-once
  • Does not implement demand replication
  • Designed for streaming
      • Random seeks devastate performance

NameNode
  • “Head” interface to HDFS cluster
  • Records all global metadata

Secondary NameNode
  • Not a failover NameNode!
  • Records metadata snapshots from “real” NameNode
      • Can merge update logs in flight
      • Can upload snapshot back to primary

NameNode Death
  • No new requests can be served while NameNode is down
      • Secondary will not fail over as new primary
  • So why have a secondary at all?

NameNode Death, cont’d
  • If NameNode dies from software glitch, just reboot
  • But if machine is hosed, metadata for cluster is irretrievable!

Bringing the Cluster Back
  • If original NameNode can be restored, secondary can re-establish the most current metadata snapshot
  • If not, create a new NameNode, use secondary to copy metadata to new primary, restart whole cluster ()
  • Is there another way…?

Keeping the Cluster Up
  • Problem: DataNodes “fix” the address of the NameNode in memory, can’t switch in flight
  • Solution: Bring new NameNode up, but use DNS to make cluster believe it’s the original one
      • Secondary can be the “new” one

Further Reliability Measures
  • Namenode can output multiple copies of metadata files to different directories
      • Including an NFS mounted one
      • May degrade performance; watch for NFS locks

Databases

Life After GFS
  • Straight GFS files are not the only storage option
  • HBase (on top of GFS) provides column-oriented storage
  • MySQL and other db engines still relevant

HBase
  • Can interface directly with Hadoop
  • Provides its own Input- and OutputFormat classes; sends rows directly to mapper, receives new rows from reducer
  • … But might not be ready for classroom use (least stable component)

MySQL Clustering
  • MySQL database can be sharded on multiple servers
      • For fast IO, use same machines as Hadoop
  • Tables can be split across machines by row key range
      • Multiple replicas can serve same table

Sharding & Hadoop Partitioners
  • For best performance, Reducer should go straight to local mysql instance
      • Get all data in the right machine in one copy
  • Implement custom Partitioner to ensure particular key range goes to mysql-aware Reducer

Academic Hadoop Requirements

Server Profile
  • UW cluster: 40 nodes, 80 processors total
      • 2 GB ram / processor
      • 24 TB raw storage space (8 TB replicated)
  • One node reserved for JobTracker/NameNode
  • Two more wouldn’t cooperate
  • … But still vastly overpowered

Setup & Maintenance
  • Took about two days to set up and configure
      • Mostly hardware-related issues
      • Hadoop setup was only a couple hours
  • Maintenance: only a few hours/week
      • Mostly rebooting the cluster when jobs got stuck

Total Usage
  • About 15,000 CPU-hours consumed by 20 students
      • … Out of 130,000 available over quarter
  • Average load is about 12%

Analyzing student usage patterns

Not Quite the Whole Story
  • Realistically, students did most work very close to deadline
      • Cluster sat unused for a few days, followed by overloading for two days straight

Analyzing student usage patterns
  • Lesson: Resource demands are NOT constant!

Hadoop Job Scheduling
  • FIFO queue matches incoming jobs to available nodes
      • No notion of fairness
      • Never switches out running job
  • Run-away tasks could starve other student jobs

Hadoop Security
  • But on the bright (?) side:
      • No security system for jobs
      • Anyone can start a job; but they can also cancel other jobs
  • Realistically, students did not cancel other student jobs, even when they should

Hadoop Security: The Dark Side
  • No permissions in HDFS either
      • Just now added in 0.16
  • One student deleted the common data set for a project
      • Email subject: “Oops…”
      • No students could test their code until data set restored from backup

Job Scheduling Lessons
  • Getting students to “play nice” is hard
      • No incentive
      • Just plain bad/buggy code
  • Cluster contention caused problems at deadlines
      • Work in groups
      • Stagger deadlines

Another Possibility
  • Amazon EC2 provides on-demand servers
  • May be able to have students use these for jobs
      • “Lab fee” would be ~$150/student
      • Simple web-based interfaces exist
  • Hadoop On Demand (HOD) coming soon
      • Injects new nodes into live clusters

More Performance & Scalability

Number of Tasks
  • Mappers = 10 * nodes (or 3/2 * cores)
  • Reducers = 2 * nodes (or 1.05 * cores)
  • Two degrees of freedom in mapper run time: Number of tasks/node, and size of InputSplits
  • See /lucene-hadoop/HowManyMapsAndReduces

More Performance Tweaks
  • Hadoop defaults to heap cap of 200 MB
      • Set: mapred.child.java.opts = -Xmx512m
      • 1024 MB / process may also be appropriate
  • DFS block size is 64 MB
      • For huge files, set dfs.block.size = 134217728
  • mapred.reduce.parallel.copies
      • Set to 15-50; more data => more copies

Dead Tasks
  • Student jobs would “run away”, admin restart needed
  • Very often stuck in huge shuffle process
      • Students did not know about Partitioner class, may have had non-uniform distribu
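
The Partitioner slide says the default partitioner uses key.hashCode() to pick a partition, and the Dead Tasks slide blames stuck shuffles partly on uneven key distribution. The following is a minimal plain-Java sketch of hash partitioning and of how a skewed key set can land on a single reducer. `PartitionSketch` and its method names are hypothetical and do not use Hadoop's classes; masking the sign bit before the modulo is the usual way to keep the result non-negative.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of hash partitioning; not Hadoop's Partitioner interface. */
public class PartitionSketch {

    /** Maps a key to a partition in [0, numPartitions) using its hash code. */
    public static int hashPartition(Object key, int numPartitions) {
        // Clear the sign bit so negative hash codes still give a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Counts how many keys fall into each partition. */
    public static int[] histogram(List<?> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (Object key : keys) {
            counts[hashPartition(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Every key is a multiple of 4, so with 4 partitions they all collide.
        List<Integer> skewedKeys = Arrays.asList(4, 8, 12, 16, 20, 24);
        System.out.println(Arrays.toString(histogram(skewedKeys, 4)));
    }
}
```

With these keys, one partition receives all six records and the other three receive none, which is the kind of imbalance a custom partitioner is meant to avoid.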

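The rules of thumb on the Number of Tasks and More Performance Tweaks slides, and the load figure on the Total Usage slide, are simple arithmetic. The sketch below only reproduces those calculations; `TuningSketch` and its method names are hypothetical, and the formulas are the per-node heuristics quoted in the slides, not values computed by Hadoop itself.

```java
/** Hypothetical helper that reproduces the slide arithmetic; not part of Hadoop. */
public class TuningSketch {

    /** Slide heuristic: mappers = 10 * nodes. */
    public static int suggestedMappers(int nodes) {
        return 10 * nodes;
    }

    /** Slide heuristic: reducers = 2 * nodes. */
    public static int suggestedReducers(int nodes) {
        return 2 * nodes;
    }

    /** Converts megabytes to bytes, e.g. for a dfs.block.size value. */
    public static long megabytesToBytes(long megabytes) {
        return megabytes * 1024L * 1024L;
    }

    /** Percentage of available CPU-hours that were actually used. */
    public static double loadPercent(double usedHours, double availableHours) {
        return 100.0 * usedHours / availableHours;
    }

    public static void main(String[] args) {
        int nodes = 40; // cluster size quoted on the Server Profile slide
        System.out.println("mappers:  " + suggestedMappers(nodes));
        System.out.println("reducers: " + suggestedReducers(nodes));
        System.out.println("128 MB block in bytes: " + megabytesToBytes(128));
        System.out.println("load %: " + Math.round(loadPercent(15000, 130000)));
    }
}
```

The 128 MB conversion matches the dfs.block.size value on the slide, and the rounded load matches the "about 12%" figure.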