
MapReduce & Cloud
Peng Bo
Dec 6, 2010

MapReduce

Imperative Programming
In computer science, imperative programming is a programming paradigm that describes computation in terms of statements that change a program state.

Declarative Programming
In computer science, declarative programming is a programming paradigm that expresses the logic of a computation without describing its control flow.

Functional Language
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
  Applies f to every element of the input list and returns a new list.
fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
  Applies f to every element of the input list together with an accumulator; f returns the next value of the accumulator.

From the Functional Language View
  Functional operations never modify data; they always produce new data.
  map and reduce are inherently parallel.
  Map can be fully parallelized.
  Reduce can run out of order and concurrently when f is associative.

Reduce
foldl: (a -> [a] -> a)

Example
  fun foo(l: int list) = sum(l) + mul(l) + length(l)
  fun sum(lst) = foldl (fn (x, a) => x + a) 0 lst
  fun mul(lst) = foldl (fn (x, a) => x * a) 1 lst
  fun length(lst) = foldl (fn (x, a) => 1 + a) 0 lst

MapReduce is…
"MapReduce is a programming model and an associated implementation for processing and generating large data sets." [1]
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

From the Parallel Computing View
MapReduce is a parallel programming model. Its essence is a single function that executes in parallel on independent data sets, with outputs that are eventually combined to form a single result or a small number of results.
f is a map operator:
  map f (x:xs) = f x : map f xs
g is a reduce operator:
  reduce g y (x:xs) = reduce g (g y x) xs
Homomorphic skeletons

MapReduce Framework

Typical problem solved by MapReduce
Read the input: data as records in key/value pair format.
Map: extract something from each record.
  map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes one input key/value pair and emits intermediate key/value pairs.
Shuffle: exchange data so that intermediate results with the same key are gathered on the same node.
Reduce: aggregate, summarize, filter, etc.
  reduce(out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for one key, computes on them, and emits the combined result (usually just one).
Write the output.

Shuffle Implementation
Partition and Sort Group
  Partition function: hash(key) % reducer number
  Group function: sort by key

Word Frequencies in Web Pages
Input: one document per record.
The user implements the map function, whose input is key = document URL, value = document contents.
map emits (potentially many) key/value pairs: for every occurrence of a word in the document, it emits one record <word, "1">.

Example continued:
The MapReduce runtime system (library) collects all records with the same key together (shuffle/sort).
The user implements the reduce function, which sums the values belonging to one key.
Reduce emits <key, sum>.
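The word-frequency pipeline above can be sketched on a single machine as follows. This is only an illustrative simulation, not the Google runtime: the names `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce`, the two sample documents, and the choice of `crc32` as the hash are all assumptions made for this sketch, and the map step emits the integer 1 rather than the string "1" so the values can be summed directly.

```python
from collections import defaultdict
from zlib import crc32


def map_fn(url, contents):
    """Map: emit <word, 1> for every word occurrence in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the values collected for one key."""
    return word, sum(counts)


def partition(key, num_reducers):
    """Partition function: hash(key) % number of reducers.

    crc32 stands in for the hash so the assignment is stable between runs
    (an assumption of this sketch, not something the slides specify).
    """
    return crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(documents, num_reducers=3):
    # Map phase: every record is processed independently of the others.
    intermediate = []
    for url, contents in documents.items():
        intermediate.extend(map_fn(url, contents))

    # Shuffle phase: route each pair to a reducer, then group values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        buckets[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each reducer walks its own keys in sorted order.
    results = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reduce_fn(key, bucket[key])
            results[word] = total
    return results


# Hypothetical input: document URL -> document contents.
docs = {
    "http://example.com/a": "the quick brown fox",
    "http://example.com/b": "the lazy dog and the fox",
}
word_counts = run_mapreduce(docs)
print(word_counts["the"], word_counts["fox"])  # prints: 3 2
```

Because the partition function sends every occurrence of a key to the same reducer, the result does not depend on how many reducers are used; only the distribution of work changes.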

Inverted Index

Build Inverted Index
Map: <doc#, word> ➝ [<word, doc-num>]
Reduce: <word, [doc1, doc3, ...]> ➝ <word, "doc1, doc3, …">

Build index
Input: web page data
Mapper: <url, document content> ➝ <term, docid, locid>
Shuffle & Sort: sort by term
Reducer: <term, docid, locid>* ➝ <term, <docid, locid>*>
Result: global index file, which can be split by docid range

Quiz
PageRank Algorithm, Clustering Algorithm, Recommendation Algorithm
Describe the serial algorithm: its core formulas, a description and explanation of its steps, the input data representation, and the core data structures.
Implementation under MapReduce: how to write map and reduce, and what the input and output of each are.

Stories of the Cloud…

A Picture is Worth…

The Information Factories
  Googleplex servers number 450,000, according to the lowest estimate
  200 petabytes of hard disk storage
  Four petabytes of RAM
  To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second

The Supercomputer that Connects Everything and Everyone
LARRY PAGE: And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long ways from having smart computers.

The Prototype (1995)
Early Google System
Spring 2000 Design
Late 2000 Design
Spring 2001 Design
Empty Google Cluster
Three Days Later…

Age of Data Centers
High-end mainframe: scale in, high reliability.
Commodity PC cluster: good price/performance, scale out, but poor reliability.

High Capability System
SC5832
  5832 Gigaflops
  7776 Gigabytes ECC memory
  972 6-core 64-bit nodes
  2916 2 GByte/s fabric links
  about 1 microsecond MPI latency
  108 8-lane PCI-Express
  18 KW
  1 Cabinet

Millicomputers 2007
Millicomputers 2008
Guesses for 2010??
Packaging Comparisons in 1U

Cloud Computing
"The desktop is dead. Welcome to the Internet cloud, where massive facilities across the globe will store all the data you'll ever use."

What is Cloud Computing?
First write down your own opinion about "cloud computing", whatever comes to mind.
Questions: What? Who? Why? How? Pros and cons?
The most important question is: what does it have to do with me?

Cloud Computing is…
  No software
  Access everywhere by Internet
  Power -- large-scale data processing
  Appeal for startups
  Cost efficiency
  It is simply so convenient
  Software as platform
Cons
  Security
  Data lock-in

SaaS, PaaS, Utility Computing

Software as a Service (SaaS)
A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

Platform as a Service (PaaS)
For developing web applications and services, PaaS provides a complete, Internet-based integrated environment that covers development, testing, deployment, operation, and maintenance. In particular, it is built on a multi-tenant architecture from the start, so users do not need to deal with multi-user concurrency; the platform handles it, including concurrency management, scalability, failure recovery, and security.

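Utility Computing
"Pay-as-you-go". It is like plugging a power cord into a wall socket: the voltage you get is the same as the voltage Microsoft gets; you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service: users can draw on the computing resources that a Fortune 500 company has, and simply use less, pay less. This is an important aspect of cloud computing.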

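Cloud Computing is…

Key Characteristics
  The illusion of infinite computing resources available on demand.
  The elimination of an up-front commitment by cloud users (startup costs).
  The ability to pay for use of computing resources on a short-term basis as needed. This means billing in small time slices; the report points out that utility computing failed in practice on this point.
  Very large data centers
  Large-scale software infrastructure
  Operational expertise

Why now?
  Experience with operating very large-scale data centers, made possible by new technology trends and business models
  Pay-as-you-go computing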

Key Players
  Amazon Web Services
  Google App Engine
  Microsoft Windows Azure

Key Applications
Mobile interactive applications. Tim O'Reilly believes the future belongs to services that can deliver information to users in real time. Mobile is bound to be key, and running the back end in a data center is a natural model, especially for mashup-style services.
Parallel batch processing. Using cloud computing technology for large-scale data processing is natural, and MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a large cost, so Amazon has started to host large public data sets for free.
The rise of analytics. Among database applications, transaction-based applications are still growing, while analytics applications are growing rapidly, driven strongly by data mining, user behavior analysis, and similar applications.
Extension of compute-intensive desktop applications. For compute-intensive tasks, for example, MATLAB and Mathematica both now have cloud computing extensions.

Cloud Computing = Silver Bullet?
On March 7, Google Docs suffered an incident in which a large number of users' files were exposed. A US privacy protection organization then asked the government to take action against Google to make it strengthen the security of its cloud computing products.

Problem of Data Lock-in

Challenges

Some Other Voices
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable -- and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true."
Richard Stallman, quoted in The Guardian, September 29, 2008
"The interesting thing about Cloud Computing is that we've redefined Cloud Computing to include everything that we already do.... I don't understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads."
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008

What does it have to do with ME?!
What would you want to do with 1,000 PCs, or even 100,000 PCs?

Cloud is coming…

Cloud Computing Initiative
Google and IBM team up on a cloud computing initiative for universities (2007-2008).
It provides access to several hundred computers through the Internet to test parallel programming projects.
The idea for the program came from Google senior software engineer Christophe Bisciglia.

Google Code University

M45: Open Academic Clusters
Collaboration with Major Research Universities
  Foster open research
  Focus on large-scale, highly parallel computing
Seed Facility: Datacenter in a Box (DiB)
  500 nodes, 4000 cores, 3 TB RAM, 1.5 PB disk
  High-bandwidth connection to the Internet
  Located on the Yahoo! corporate campus
Runs Yahoo!/Apache Grid Stack
Carnegie Mellon University is Initial Partner
Public Announcement 11/12/07

Summary
MapReduce
  Distributed Programming Model
  It's fun!
Infrastructure
  Cloud computing
Imagination!

Readings
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

Resources
[Ghemawat, 2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
[Gruber, 2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber,
