版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Threading&SimultaneousMultithreading
SlidesadaptedfromDavidPatterson,UC-Berkeleycs252-s0612OutlineThreadLevelParallelismMultithreadingSimultaneousMultithreadingPower4vs.Power5HeadtoHead:VLIWvs.Superscalarvs.SMTCommentaryConclusion3PerformancebeyondsinglethreadILPILPforarbitrarycodeislimitednowto3to6issues/cycle,therecanbemuchhighernaturalparallelisminsomeapplications(e.g.,databaseorscientificcodes)Explicit(specifiedbycompiler)ThreadLevelParallelismorDataLevelParallelismThread:aprocesswithitsowninstructionsanddata(ormuchharderoncompiler:carefullyselectedcodesegmentsinthesameprocessthatrarelyinteract)Athreadmaybeoneprocessthatispartofaparallelprogramofmultipleprocesses,oritmaybeanindependentprogramEachthreadhasallthestate(instructions,data,PC,registerstate,andsoon)necessarytoallowittoexecuteDataLevelParallelism:Performidentical(lock-step)operationsondatawhenhavelotsofdata.4ThreadLevelParallelism(TLP)ILP(lastlectures)exploitsimplicitlyparalleloperationswithinalooporstraight-linecodesegmentTLPisexplicitlyrepresentedbytheuseofmultiplethreadsofexecutionthatareinherentlyparallelGoal:UsemanyinstructionstreamstoimproveThroughputofcomputersthatrunmanyprogramsExecutiontimeofmulti-threadedprogramsTLPcouldbemorecost-effectivetoexploitthanILPformanyapplications.5NewApproach:MultithreadedExecutionMultithreading:multiplethreadstosharethefunctionalunitsofoneprocessorviaoverlappedexecutionprocessormustduplicateindependentstateofeachthread,e.g.,aseparatecopyoftheregisterfile,aseparatePC,andifrunningasindependentprograms,aseparatepagetablememorysharedthroughthevirtualmemorymechanisms,whichalreadysupportmultipleprocessesHWforfastthreadswitch(0.1to10clocks)ismuchfasterthanafullprocessswitch(100sto1000sofclocks)thatcopiesstate(state=registers,memory,andfileaccesstables)Whenswitchamongthreads?Alternateinstructionsfromnewthreads(finegrain)Whenathreadisstalled,perhapsforacachemiss,anotherthreadcanbeexecuted(coarsegrain)Incache-lessmultiprocessors,atstartofeachmemoryaccess6Formostapplications,theprocessingunit(s)stall80%ormoreoftimeduring“execution”From:Tullsen,Eggers,andLevy,“SimultaneousMultithreading:MaximizingOn-chipParallelism,ISCA1995.(FromUWash.)Just18%ofissueslotsOKforan8-waysuperscalar.<=#1<=#218
18%CPUissueslots
usefullybusy7MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%8Fine-GrainedMultithreadingSwitchesbetweenthreadsoneachinstructioncycle,causingtheexecutionofmultiplethreadstobeinterleavedUsuallydoneinaround-robinfashion,skippinganystalledthreadsCPUmustbeabletoswitchthreadseveryclockAdvantageisthatitcanhidebothshortandlongstalls,sinceinstructionsfromotherthreadsexecutedwhenonethreadstallsDisadvantageisitslowsdownexecutionofindividualthreads,sinceathreadreadytoexecutewithoutstallswillbedelayedbyinstructionsfromotherthreadsUsedonSun’sNiagarachip(with8cores,willseelater)9Course-GrainedMultithreadingSwitchesthreadsforcostlystalls,suchasL2cachemisses(oronanydatamemoryreferenceifnocaches)AdvantagesRelievesneedtohaveveryfastthread-switching(ifusecaches).Doesnotslowdownanythread,sinceinstructionsfromotherthreadsissuedonlywhenactivethreadencountersacostlystall
Disadvantageisthatitishardtoovercomethroughputlossesfromshorterstalls,becauseofpipelinestart-upcostsSinceCPUnormallyissuesinstructionsfromjustonethread,whenastalloccurs,thepipelinemustbeemptiedorfrozenNewthreadmustfillpipelinebeforeinstructionscancompleteBecauseofthisstart-upoverhead,coarse-grainedmultithreadingisefficientforreducingpenaltyonlyofhighcoststalls,wherestalltime>>pipelinerefilltimeUsedIBMAS/400(1988,forsmalltomediumbusinesses)10(UWash=>Intel)SimultaneousMulti-threading…
“Hyper-threading”123456789MMFXFXFPFPBRCCCycleOnethread,8funcunitsM=Load/Store,FX=FixedPoint,FP=FloatingPoint,BR=Branch,CC=ConditionCodes123456789MMFXFXFPFPBRCCCycleTwothreads,8unitsBusy:13/72=18.0%Busy:30/72=41.7%11UsebothILPandTLP?(UWash:“Yes”)TLPandILPexploittwodifferentkindsofparallelstructureinaprogramCouldaprocessororientedtowardILPbeusedtoexploitTLP?functionalunitsareoftenidleindatapathsdesignedforILPbecauseofeitherstallsordependencesinthecodeCouldtheTLPbeusedasasourceofindependentinstructionsthatmightkeeptheprocessorbusyduringstalls?CouldTLPbeusedtoemploythefunctionalunitsthatwouldotherwiselieidlewheninsufficientILPexists?
12SimultaneousMultithreading(SMT)Simultaneousmultithreading(SMT):insightthatadynamicallyscheduledprocessoralreadyhasmanyHWmechanismstosupportmultithreadingLargesetofvirtualregistersthatcanbeusedtoholdtheregistersetsofindependentthreadsRegisterrenamingprovidesuniqueregisteridentifiers,soinstructionsfrommultiplethreadscanbemixedindatapathwithoutconfusingsourcesanddestinationsacrossthreadsOut-of-ordercompletionallowsthethreadstoexecuteoutoforder,andgetbetterutilizationoftheHWJustneedtoaddaper-threadrenamingtableandkeepingseparatePCsIndependentcommitmentcanbesupportedby“logically”keepingaseparatereorderbufferforeachthreadSource:MicrprocessorReport,December6,1999
“CompaqChoosesSMTforAlpha”13DesignChallengesinSMTSinceSMTmakessenseonlywithfine-grainedimplementation,impactoffine-grainedschedulingonsinglethreadperformance?Doesdesignatingapreferredthreadallowsacrificingneitherthroughputnorsingle-threadperformance?Unfortunately,withapreferredthread,processorislikelytosacrificesomethroughputwhenthepreferredthreadstallsLargerregisterfileisneededtoholdmultiplecontextsTrynottoaffectclockcycletime,especiallyinInstructionissue-morecandidateinstructionsneedtobeconsideredInstructioncompletion-choosingwhichinstructionstocommitmaybechallengingEnsurethatcacheandTLBconflictsgeneratedbySMTdonotdegradeperformance14MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%15Power4Single-threadedpredecessortoPower5.Eightexecutionunitsinanout-of-orderengine,eachunitmayissueoneinstructioneachcycle.Instructionpipeline(IF:instructionfetch,IC:instructioncache,BP:branchpredict,D0:decodestage0,Xfer:transfer,GD:groupdispatch,MP:mapping,ISS:instructionissue,RF:registerfileread,EX:execute,EA:computeaddress,DC:datacaches,F6:six-cyclefloating-pointexecutionpipe,Fmt:dataformat,WB:writeback,andCP:groupcommit)16Power4-1threadPower5-2threads2fetch(PC),
2initialdecodes2completes(architectedregistersets)See/servers/eserver/pseries/news/related/2004/m2040.pdfPower5instructionpipeline(IF=instructionfetch,IC=instructioncache,BP=branchpredict,D0=decodestage0,Xfer=transfer,GD=groupdispatch,MP=mapping,ISS=instructionissue,RF=registerfileread,EX=execute,EA=computeaddress,DC=datacaches,F6=six-cyclefloating-pointexecutionpipe,Fmt=dataformat,WB=writeback,andCP=groupcommit)Page43.17Power5dataflow...Whyonly2threads?With4,somesharedresource(physicalregisters,cache,memorybandwidth)wouldoftenbottleneck
LSU=load/storeunit,FXU=fixed-pointexecutionunit,FPU=floating-pointunit,BXU=branchexecutionunit,andCRL=conditionregisterlogicalexecutionunit.18Power5threadperformance...Relativepriorityofeachthreadcontrollableinhardware.Forbalancedoperation,boththreadsrunslowerthanifthey“owned”themachine.19ChangesinPower5tosupportSMTIncreasedassociativityofL1instructioncacheandtheinstructionaddresstranslationbuffersAddedperthreadloadandstorequeuesIncreasedsizeoftheL2(1.92vs.1.44MB)andL3cachesAddedseparateinstructionprefetchandbufferingperthreadIncreasedthenumberofvirtualregistersfrom152to240IncreasedthesizeofseveralissuequeuesThePower5coreisabout24%largerthanthePower4corebecauseoftheadditionofSMTsupport20InitialPerformanceofSMTPentium4ExtremeSMTyields1.01speedupforSPECint_ratebenchmarkand1.07forSPECfp_ratePentium4isdual-threadedSMTSPECRaterequiresthateachSPECbenchmarkberunagainstavendor-selectednumberofcopiesofthesamebenchmarkRunningonPentium4witheachof26SPECbenchmarkspairedwitheveryother(26*26runs)gavespeed-upsfrom0.90to1.58;averagewas1.20Power5,8processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_ratePower5running2“same”copiesofeachapplicationgavespeedupsfrom0.89to1.41,comparedto1.01and1.07averagesforPentium4.MostgainedsomeFloatingPt.applicationshadmostcacheconflictsandleastgains21ProcessorMicroarchitectureFetch/Issue/ExecuteFunct.UnitsClockRate(GHz)Transis-tors
DiesizePowerIntelPentium4ExtremeSpeculativedynamicallyscheduled;deeplypipelined;SMT3/3/47int.1FP3.8125M122mm2115WAMDAthlon64FX-57Speculativedynamicallyscheduled3/3/46int.3FP2.8114M
115mm2104WIBMPower5
(1CPUonly)Speculativedynamicallyscheduled;SMT;
2CPUcores/chip8/4/86int.2FP1.9200M300mm2(est.)80W(est.)IntelItanium2Staticallyscheduled
VLIW-style6/5/119int.2FP1.6592M423mm2130
WHeadtoHeadILPcompetition22PerformanceonSPECint200023PerformanceonSPECfp200024NormalizedPerformance:EfficiencyRankItanium2PentIum4AthlonPower5Int/Trans4213FP/Trans4213Int/area4213FP/area4213Int/Watt4312FP/Watt243125NoSilverBulletforILPNoobviousover-allleaderinperformanceTheAMDAthlonleadsonSPECIntperformancefollowedbythePentium4,Itanium2,andPower5Itanium2andPower5,whichperformsimilarlyonSPECFP,clearlydominatetheAthlonandPentium4onSPECFPItanium2isthemostinefficientprocessorbothforFl.Pt.andintegercodeforallbutoneefficiencymeasure(SPECFP/Watt)AthlonandPentium4bothmakegooduseoftransistorsandareaintermsofefficiency,IBMPower5isthemosteffectiveuserofenergyonSPECFPandessentiallytiedonSPECINT26LimitstoILPDoublingissueratesabovetoday’s3-6instructionsperclock,sayto6to12instructions,probablyrequiresaprocessortoissue3or4datamemoryaccessespercycle,resolve2or3branchespercycle,renameandaccessmorethan20registerspercycle,andfetch12to24instructionspercycle.Thecomplexitiesofimplementingthesecapabilitiesislikelytomeansacrificesinthemaximumclockrate
E.g,widestissueprocessoristheItanium2,butitalsohastheslowestclockrate,despitethefactthatitconsumesthemostelectricalpower!27MosttechniquesforincreasingperformanceincreasepowerconsumptionThekeyquestioniswhetheratechniqueisenergyefficient:doesitincreaseperformancefasterthanitincreasespowerconsumption?Multipleissueprocessortechniquesallareenergyinefficient:Issuingmultipleinstructionsincurssomeoverheadinlogicthatgrowsfaster(I2)thantheissuerategrowsGrowinggapbetweenpeakissueratesandsustainedperformanceNumberoftransistorsswitching=f(peakissuerate),andperformance=f(sustainedrate),
growinggapbetweenpeakandsustainedperformance
increasingenergyp
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 排他性合作协议
- 运营隧道的养护与维修施工工艺隧道工艺标准系列之十五模板
- 婴幼儿护理技能培训课件
- 娱乐行业介绍
- 2026年工业锅炉运行培训试题及答案
- 2026年四川医疗卫生面试常见题型解析
- 2026年呼吸内科临床综合能力训练题及详细解答
- 2026年医患关系与纠纷处理能力试题含答案
- 2026年新疆油田稠油开发与处理工艺测试含答案
- 2026年股市熔断机制小测含答案
- 2026年建筑物智能化与电气节能技术发展
- 半导体产业人才供需洞察报告 202511-猎聘
- 电梯救援安全培训课件
- 2025年青岛市国企社会招聘笔试及答案
- 2026届江西省抚州市临川区第一中学高二上数学期末考试模拟试题含解析
- 云南省大理州2024-2025学年七年级上学期期末考试数学试卷(含解析)
- 物业管理法律法规与实务操作
- 高压避雷器课件
- 体检中心收费与财务一体化管理方案
- 四川省内江市2024-2025学年高二上学期期末检测化学试题
- 广东省深圳市龙岗区2024-2025学年二年级上学期学科素养期末综合数学试卷(含答案)
评论
0/150
提交评论