并行处理与体系结构课件hitsz-lec05

上传人：9*** IP属地：湖北上传时间：2023-02-06 格式：PPTX 页数：30 大小：981.78KB 积分：30 举报 版权申诉

已阅读5页，还剩25页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

Threading&SimultaneousMultithreading

SlidesadaptedfromDavidPatterson,UC-Berkeleycs252-s0612OutlineThreadLevelParallelismMultithreadingSimultaneousMultithreadingPower4vs.Power5HeadtoHead:VLIWvs.Superscalarvs.SMTCommentaryConclusion3PerformancebeyondsinglethreadILPILPforarbitrarycodeislimitednowto3to6issues/cycle,therecanbemuchhighernaturalparallelisminsomeapplications(e.g.,databaseorscientificcodes)Explicit(specifiedbycompiler)ThreadLevelParallelismorDataLevelParallelismThread:aprocesswithitsowninstructionsanddata(ormuchharderoncompiler:carefullyselectedcodesegmentsinthesameprocessthatrarelyinteract)Athreadmaybeoneprocessthatispartofaparallelprogramofmultipleprocesses,oritmaybeanindependentprogramEachthreadhasallthestate(instructions,data,PC,registerstate,andsoon)necessarytoallowittoexecuteDataLevelParallelism:Performidentical(lock-step)operationsondatawhenhavelotsofdata.4ThreadLevelParallelism(TLP)ILP(lastlectures)exploitsimplicitlyparalleloperationswithinalooporstraight-linecodesegmentTLPisexplicitlyrepresentedbytheuseofmultiplethreadsofexecutionthatareinherentlyparallelGoal:UsemanyinstructionstreamstoimproveThroughputofcomputersthatrunmanyprogramsExecutiontimeofmulti-threadedprogramsTLPcouldbemorecost-effectivetoexploitthanILPformanyapplications.5NewApproach:MultithreadedExecutionMultithreading:multiplethreadstosharethefunctionalunitsofoneprocessorviaoverlappedexecutionprocessormustduplicateindependentstateofeachthread,e.g.,aseparatecopyoftheregisterfile,aseparatePC,andifrunningasindependentprograms,aseparatepagetablememorysharedthroughthevirtualmemorymechanisms,whichalreadysupportmultipleprocessesHWforfastthreadswitch(0.1to10clocks)ismuchfasterthanafullprocessswitch(100sto1000sofclocks)thatcopiesstate(state=registers,memory,andfileaccesstables)Whenswitchamongthreads?Alternateinstructionsfromnewthreads(finegrain)Whenathreadisstalled,perhapsforacachemiss,anotherthreadcanbeexecuted(coarsegrain)Incache-lessmultiprocessors,atstartofeachmemoryaccess6Formostapplications,theprocessingunit(s)stall80%ormoreoftimeduring“execution”From:Tullsen,Eggers,andLevy,“SimultaneousMultithreading:MaximizingOn-chipParallelism,ISCA1995.(FromUWash.)Just18%ofissueslotsOKforan8-waysuperscalar.<=#1<=#218

18%CPUissueslots

usefullybusy7MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%8Fine-GrainedMultithreadingSwitchesbetweenthreadsoneachinstructioncycle,causingtheexecutionofmultiplethreadstobeinterleavedUsuallydoneinaround-robinfashion,skippinganystalledthreadsCPUmustbeabletoswitchthreadseveryclockAdvantageisthatitcanhidebothshortandlongstalls,sinceinstructionsfromotherthreadsexecutedwhenonethreadstallsDisadvantageisitslowsdownexecutionofindividualthreads,sinceathreadreadytoexecutewithoutstallswillbedelayedbyinstructionsfromotherthreadsUsedonSun’sNiagarachip(with8cores,willseelater)9Course-GrainedMultithreadingSwitchesthreadsforcostlystalls,suchasL2cachemisses(oronanydatamemoryreferenceifnocaches)AdvantagesRelievesneedtohaveveryfastthread-switching(ifusecaches).Doesnotslowdownanythread,sinceinstructionsfromotherthreadsissuedonlywhenactivethreadencountersacostlystall

Disadvantageisthatitishardtoovercomethroughputlossesfromshorterstalls,becauseofpipelinestart-upcostsSinceCPUnormallyissuesinstructionsfromjustonethread,whenastalloccurs,thepipelinemustbeemptiedorfrozenNewthreadmustfillpipelinebeforeinstructionscancompleteBecauseofthisstart-upoverhead,coarse-grainedmultithreadingisefficientforreducingpenaltyonlyofhighcoststalls,wherestalltime>>pipelinerefilltimeUsedIBMAS/400(1988,forsmalltomediumbusinesses)10(UWash=>Intel)SimultaneousMulti-threading…

“Hyper-threading”123456789MMFXFXFPFPBRCCCycleOnethread,8funcunitsM=Load/Store,FX=FixedPoint,FP=FloatingPoint,BR=Branch,CC=ConditionCodes123456789MMFXFXFPFPBRCCCycleTwothreads,8unitsBusy:13/72=18.0%Busy:30/72=41.7%11UsebothILPandTLP?(UWash:“Yes”)TLPandILPexploittwodifferentkindsofparallelstructureinaprogramCouldaprocessororientedtowardILPbeusedtoexploitTLP?functionalunitsareoftenidleindatapathsdesignedforILPbecauseofeitherstallsordependencesinthecodeCouldtheTLPbeusedasasourceofindependentinstructionsthatmightkeeptheprocessorbusyduringstalls?CouldTLPbeusedtoemploythefunctionalunitsthatwouldotherwiselieidlewheninsufficientILPexists?

12SimultaneousMultithreading(SMT)Simultaneousmultithreading(SMT):insightthatadynamicallyscheduledprocessoralreadyhasmanyHWmechanismstosupportmultithreadingLargesetofvirtualregistersthatcanbeusedtoholdtheregistersetsofindependentthreadsRegisterrenamingprovidesuniqueregisteridentifiers,soinstructionsfrommultiplethreadscanbemixedindatapathwithoutconfusingsourcesanddestinationsacrossthreadsOut-of-ordercompletionallowsthethreadstoexecuteoutoforder,andgetbetterutilizationoftheHWJustneedtoaddaper-threadrenamingtableandkeepingseparatePCsIndependentcommitmentcanbesupportedby“logically”keepingaseparatereorderbufferforeachthreadSource:MicrprocessorReport,December6,1999

“CompaqChoosesSMTforAlpha”13DesignChallengesinSMTSinceSMTmakessenseonlywithfine-grainedimplementation,impactoffine-grainedschedulingonsinglethreadperformance?Doesdesignatingapreferredthreadallowsacrificingneitherthroughputnorsingle-threadperformance?Unfortunately,withapreferredthread,processorislikelytosacrificesomethroughputwhenthepreferredthreadstallsLargerregisterfileisneededtoholdmultiplecontextsTrynottoaffectclockcycletime,especiallyinInstructionissue-morecandidateinstructionsneedtobeconsideredInstructioncompletion-choosingwhichinstructionstocommitmaybechallengingEnsurethatcacheandTLBconflictsgeneratedbySMTdonotdegradeperformance14MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%15Power4Single-threadedpredecessortoPower5.Eightexecutionunitsinanout-of-orderengine,eachunitmayissueoneinstructioneachcycle.Instructionpipeline(IF:instructionfetch,IC:instructioncache,BP:branchpredict,D0:decodestage0,Xfer:transfer,GD:groupdispatch,MP:mapping,ISS:instructionissue,RF:registerfileread,EX:execute,EA:computeaddress,DC:datacaches,F6:six-cyclefloating-pointexecutionpipe,Fmt:dataformat,WB:writeback,andCP:groupcommit)16Power4-1threadPower5-2threads2fetch(PC),

2initialdecodes2completes(architectedregistersets)See/servers/eserver/pseries/news/related/2004/m2040.pdfPower5instructionpipeline(IF=instructionfetch,IC=instructioncache,BP=branchpredict,D0=decodestage0,Xfer=transfer,GD=groupdispatch,MP=mapping,ISS=instructionissue,RF=registerfileread,EX=execute,EA=computeaddress,DC=datacaches,F6=six-cyclefloating-pointexecutionpipe,Fmt=dataformat,WB=writeback,andCP=groupcommit)Page43.17Power5dataflow...Whyonly2threads?With4,somesharedresource(physicalregisters,cache,memorybandwidth)wouldoftenbottleneck

LSU=load/storeunit,FXU=fixed-pointexecutionunit,FPU=floating-pointunit,BXU=branchexecutionunit,andCRL=conditionregisterlogicalexecutionunit.18Power5threadperformance...Relativepriorityofeachthreadcontrollableinhardware.Forbalancedoperation,boththreadsrunslowerthanifthey“owned”themachine.19ChangesinPower5tosupportSMTIncreasedassociativityofL1instructioncacheandtheinstructionaddresstranslationbuffersAddedperthreadloadandstorequeuesIncreasedsizeoftheL2(1.92vs.1.44MB)andL3cachesAddedseparateinstructionprefetchandbufferingperthreadIncreasedthenumberofvirtualregistersfrom152to240IncreasedthesizeofseveralissuequeuesThePower5coreisabout24%largerthanthePower4corebecauseoftheadditionofSMTsupport20InitialPerformanceofSMTPentium4ExtremeSMTyields1.01speedupforSPECint_ratebenchmarkand1.07forSPECfp_ratePentium4isdual-threadedSMTSPECRaterequiresthateachSPECbenchmarkberunagainstavendor-selectednumberofcopiesofthesamebenchmarkRunningonPentium4witheachof26SPECbenchmarkspairedwitheveryother(26*26runs)gavespeed-upsfrom0.90to1.58;averagewas1.20Power5,8processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_ratePower5running2“same”copiesofeachapplicationgavespeedupsfrom0.89to1.41,comparedto1.01and1.07averagesforPentium4.MostgainedsomeFloatingPt.applicationshadmostcacheconflictsandleastgains21ProcessorMicroarchitectureFetch/Issue/ExecuteFunct.UnitsClockRate(GHz)Transis-tors

DiesizePowerIntelPentium4ExtremeSpeculativedynamicallyscheduled;deeplypipelined;SMT3/3/47int.1FP3.8125M122mm2115WAMDAthlon64FX-57Speculativedynamicallyscheduled3/3/46int.3FP2.8114M

115mm2104WIBMPower5

(1CPUonly)Speculativedynamicallyscheduled;SMT;

2CPUcores/chip8/4/86int.2FP1.9200M300mm2(est.)80W(est.)IntelItanium2Staticallyscheduled

VLIW-style6/5/119int.2FP1.6592M423mm2130

WHeadtoHeadILPcompetition22PerformanceonSPECint200023PerformanceonSPECfp200024NormalizedPerformance:EfficiencyRankItanium2PentIum4AthlonPower5Int/Trans4213FP/Trans4213Int/area4213FP/area4213Int/Watt4312FP/Watt243125NoSilverBulletforILPNoobviousover-allleaderinperformanceTheAMDAthlonleadsonSPECIntperformancefollowedbythePentium4,Itanium2,andPower5Itanium2andPower5,whichperformsimilarlyonSPECFP,clearlydominatetheAthlonandPentium4onSPECFPItanium2isthemostinefficientprocessorbothforFl.Pt.andintegercodeforallbutoneefficiencymeasure(SPECFP/Watt)AthlonandPentium4bothmakegooduseoftransistorsandareaintermsofefficiency,IBMPower5isthemosteffectiveuserofenergyonSPECFPandessentiallytiedonSPECINT26LimitstoILPDoublingissueratesabovetoday’s3-6instructionsperclock,sayto6to12instructions,probablyrequiresaprocessortoissue3or4datamemoryaccessespercycle,resolve2or3branchespercycle,renameandaccessmorethan20registerspercycle,andfetch12to24instructionspercycle.Thecomplexitiesofimplementingthesecapabilitiesislikelytomeansacrificesinthemaximumclockrate

E.g,widestissueprocessoristheItanium2,butitalsohastheslowestclockrate,despitethefactthatitconsumesthemostelectricalpower!27MosttechniquesforincreasingperformanceincreasepowerconsumptionThekeyquestioniswhetheratechniqueisenergyefficient:doesitincreaseperformancefasterthanitincreasespowerconsumption?Multipleissueprocessortechniquesallareenergyinefficient:Issuingmultipleinstructionsincurssomeoverheadinlogicthatgrowsfaster(I2)thantheissuerategrowsGrowinggapbetweenpeakissueratesandsustainedperformanceNumberoftransistorsswitching=f(peakissuerate),andperformance=f(sustainedrate),

growinggapbetweenpeakandsustainedperformance

increasingenergyp

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

并行处理与体系结构课件hitsz-lec05

文档简介

温馨提示

最新文档

评论

并行处理与体系结构课件hitsz-lec05

文档简介

温馨提示

最新文档

评论

相关文档