




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Threading&SimultaneousMultithreading
SlidesadaptedfromDavidPatterson,UC-Berkeleycs252-s0612OutlineThreadLevelParallelismMultithreadingSimultaneousMultithreadingPower4vs.Power5HeadtoHead:VLIWvs.Superscalarvs.SMTCommentaryConclusion3PerformancebeyondsinglethreadILPILPforarbitrarycodeislimitednowto3to6issues/cycle,therecanbemuchhighernaturalparallelisminsomeapplications(e.g.,databaseorscientificcodes)Explicit(specifiedbycompiler)ThreadLevelParallelismorDataLevelParallelismThread:aprocesswithitsowninstructionsanddata(ormuchharderoncompiler:carefullyselectedcodesegmentsinthesameprocessthatrarelyinteract)Athreadmaybeoneprocessthatispartofaparallelprogramofmultipleprocesses,oritmaybeanindependentprogramEachthreadhasallthestate(instructions,data,PC,registerstate,andsoon)necessarytoallowittoexecuteDataLevelParallelism:Performidentical(lock-step)operationsondatawhenhavelotsofdata.4ThreadLevelParallelism(TLP)ILP(lastlectures)exploitsimplicitlyparalleloperationswithinalooporstraight-linecodesegmentTLPisexplicitlyrepresentedbytheuseofmultiplethreadsofexecutionthatareinherentlyparallelGoal:UsemanyinstructionstreamstoimproveThroughputofcomputersthatrunmanyprogramsExecutiontimeofmulti-threadedprogramsTLPcouldbemorecost-effectivetoexploitthanILPformanyapplications.5NewApproach:MultithreadedExecutionMultithreading:multiplethreadstosharethefunctionalunitsofoneprocessorviaoverlappedexecutionprocessormustduplicateindependentstateofeachthread,e.g.,aseparatecopyoftheregisterfile,aseparatePC,andifrunningasindependentprograms,aseparatepagetablememorysharedthroughthevirtualmemorymechanisms,whichalreadysupportmultipleprocessesHWforfastthreadswitch(0.1to10clocks)ismuchfasterthanafullprocessswitch(100sto1000sofclocks)thatcopiesstate(state=registers,memory,andfileaccesstables)Whenswitchamongthreads?Alternateinstructionsfromnewthreads(finegrain)Whenathreadisstalled,perhapsforacachemiss,anotherthreadcanbeexecuted(coarsegrain)Incache-lessmultiprocessors,atstartofeachmemoryaccess6Formostapplications,theprocessingunit(s)stall80%ormoreoftimeduring“execution”From:Tullsen,Eggers,andLevy,“SimultaneousMultithreading:MaximizingOn-chipParallelism,ISCA1995.(FromUWash.)Just18%ofissueslotsOKforan8-waysuperscalar.<=#1<=#218
18%CPUissueslots
usefullybusy7MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%8Fine-GrainedMultithreadingSwitchesbetweenthreadsoneachinstructioncycle,causingtheexecutionofmultiplethreadstobeinterleavedUsuallydoneinaround-robinfashion,skippinganystalledthreadsCPUmustbeabletoswitchthreadseveryclockAdvantageisthatitcanhidebothshortandlongstalls,sinceinstructionsfromotherthreadsexecutedwhenonethreadstallsDisadvantageisitslowsdownexecutionofindividualthreads,sinceathreadreadytoexecutewithoutstallswillbedelayedbyinstructionsfromotherthreadsUsedonSun’sNiagarachip(with8cores,willseelater)9Course-GrainedMultithreadingSwitchesthreadsforcostlystalls,suchasL2cachemisses(oronanydatamemoryreferenceifnocaches)AdvantagesRelievesneedtohaveveryfastthread-switching(ifusecaches).Doesnotslowdownanythread,sinceinstructionsfromotherthreadsissuedonlywhenactivethreadencountersacostlystall
Disadvantageisthatitishardtoovercomethroughputlossesfromshorterstalls,becauseofpipelinestart-upcostsSinceCPUnormallyissuesinstructionsfromjustonethread,whenastalloccurs,thepipelinemustbeemptiedorfrozenNewthreadmustfillpipelinebeforeinstructionscancompleteBecauseofthisstart-upoverhead,coarse-grainedmultithreadingisefficientforreducingpenaltyonlyofhighcoststalls,wherestalltime>>pipelinerefilltimeUsedIBMAS/400(1988,forsmalltomediumbusinesses)10(UWash=>Intel)SimultaneousMulti-threading…
“Hyper-threading”123456789MMFXFXFPFPBRCCCycleOnethread,8funcunitsM=Load/Store,FX=FixedPoint,FP=FloatingPoint,BR=Branch,CC=ConditionCodes123456789MMFXFXFPFPBRCCCycleTwothreads,8unitsBusy:13/72=18.0%Busy:30/72=41.7%11UsebothILPandTLP?(UWash:“Yes”)TLPandILPexploittwodifferentkindsofparallelstructureinaprogramCouldaprocessororientedtowardILPbeusedtoexploitTLP?functionalunitsareoftenidleindatapathsdesignedforILPbecauseofeitherstallsordependencesinthecodeCouldtheTLPbeusedasasourceofindependentinstructionsthatmightkeeptheprocessorbusyduringstalls?CouldTLPbeusedtoemploythefunctionalunitsthatwouldotherwiselieidlewheninsufficientILPexists?
12SimultaneousMultithreading(SMT)Simultaneousmultithreading(SMT):insightthatadynamicallyscheduledprocessoralreadyhasmanyHWmechanismstosupportmultithreadingLargesetofvirtualregistersthatcanbeusedtoholdtheregistersetsofindependentthreadsRegisterrenamingprovidesuniqueregisteridentifiers,soinstructionsfrommultiplethreadscanbemixedindatapathwithoutconfusingsourcesanddestinationsacrossthreadsOut-of-ordercompletionallowsthethreadstoexecuteoutoforder,andgetbetterutilizationoftheHWJustneedtoaddaper-threadrenamingtableandkeepingseparatePCsIndependentcommitmentcanbesupportedby“logically”keepingaseparatereorderbufferforeachthreadSource:MicrprocessorReport,December6,1999
“CompaqChoosesSMTforAlpha”13DesignChallengesinSMTSinceSMTmakessenseonlywithfine-grainedimplementation,impactoffine-grainedschedulingonsinglethreadperformance?Doesdesignatingapreferredthreadallowsacrificingneitherthroughputnorsingle-threadperformance?Unfortunately,withapreferredthread,processorislikelytosacrificesomethroughputwhenthepreferredthreadstallsLargerregisterfileisneededtoholdmultiplecontextsTrynottoaffectclockcycletime,especiallyinInstructionissue-morecandidateinstructionsneedtobeconsideredInstructioncompletion-choosingwhichinstructionstocommitmaybechallengingEnsurethatcacheandTLBconflictsgeneratedbySMTdonotdegradeperformance14MultithreadingCategoriesTime(processorcycle)Pipes:1234SuperscalarNewThread/cycFine-GrainedManyCyc/threadCoarse-GrainedSeparateJobsMultiprocessingFUs:1234SimultaneousMultithreadingThread1Thread2Thread3Thread4Thread5Idleslot16/48=33.3%27/48=56.3%27/48=56.3%29/48=60.4%42/48=87.5%15Power4Single-threadedpredecessortoPower5.Eightexecutionunitsinanout-of-orderengine,eachunitmayissueoneinstructioneachcycle.Instructionpipeline(IF:instructionfetch,IC:instructioncache,BP:branchpredict,D0:decodestage0,Xfer:transfer,GD:groupdispatch,MP:mapping,ISS:instructionissue,RF:registerfileread,EX:execute,EA:computeaddress,DC:datacaches,F6:six-cyclefloating-pointexecutionpipe,Fmt:dataformat,WB:writeback,andCP:groupcommit)16Power4-1threadPower5-2threads2fetch(PC),
2initialdecodes2completes(architectedregistersets)See/servers/eserver/pseries/news/related/2004/m2040.pdfPower5instructionpipeline(IF=instructionfetch,IC=instructioncache,BP=branchpredict,D0=decodestage0,Xfer=transfer,GD=groupdispatch,MP=mapping,ISS=instructionissue,RF=registerfileread,EX=execute,EA=computeaddress,DC=datacaches,F6=six-cyclefloating-pointexecutionpipe,Fmt=dataformat,WB=writeback,andCP=groupcommit)Page43.17Power5dataflow...Whyonly2threads?With4,somesharedresource(physicalregisters,cache,memorybandwidth)wouldoftenbottleneck
LSU=load/storeunit,FXU=fixed-pointexecutionunit,FPU=floating-pointunit,BXU=branchexecutionunit,andCRL=conditionregisterlogicalexecutionunit.18Power5threadperformance...Relativepriorityofeachthreadcontrollableinhardware.Forbalancedoperation,boththreadsrunslowerthanifthey“owned”themachine.19ChangesinPower5tosupportSMTIncreasedassociativityofL1instructioncacheandtheinstructionaddresstranslationbuffersAddedperthreadloadandstorequeuesIncreasedsizeoftheL2(1.92vs.1.44MB)andL3cachesAddedseparateinstructionprefetchandbufferingperthreadIncreasedthenumberofvirtualregistersfrom152to240IncreasedthesizeofseveralissuequeuesThePower5coreisabout24%largerthanthePower4corebecauseoftheadditionofSMTsupport20InitialPerformanceofSMTPentium4ExtremeSMTyields1.01speedupforSPECint_ratebenchmarkand1.07forSPECfp_ratePentium4isdual-threadedSMTSPECRaterequiresthateachSPECbenchmarkberunagainstavendor-selectednumberofcopiesofthesamebenchmarkRunningonPentium4witheachof26SPECbenchmarkspairedwitheveryother(26*26runs)gavespeed-upsfrom0.90to1.58;averagewas1.20Power5,8processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_ratePower5running2“same”copiesofeachapplicationgavespeedupsfrom0.89to1.41,comparedto1.01and1.07averagesforPentium4.MostgainedsomeFloatingPt.applicationshadmostcacheconflictsandleastgains21ProcessorMicroarchitectureFetch/Issue/ExecuteFunct.UnitsClockRate(GHz)Transis-tors
DiesizePowerIntelPentium4ExtremeSpeculativedynamicallyscheduled;deeplypipelined;SMT3/3/47int.1FP3.8125M122mm2115WAMDAthlon64FX-57Speculativedynamicallyscheduled3/3/46int.3FP2.8114M
115mm2104WIBMPower5
(1CPUonly)Speculativedynamicallyscheduled;SMT;
2CPUcores/chip8/4/86int.2FP1.9200M300mm2(est.)80W(est.)IntelItanium2Staticallyscheduled
VLIW-style6/5/119int.2FP1.6592M423mm2130
WHeadtoHeadILPcompetition22PerformanceonSPECint200023PerformanceonSPECfp200024NormalizedPerformance:EfficiencyRankItanium2PentIum4AthlonPower5Int/Trans4213FP/Trans4213Int/area4213FP/area4213Int/Watt4312FP/Watt243125NoSilverBulletforILPNoobviousover-allleaderinperformanceTheAMDAthlonleadsonSPECIntperformancefollowedbythePentium4,Itanium2,andPower5Itanium2andPower5,whichperformsimilarlyonSPECFP,clearlydominatetheAthlonandPentium4onSPECFPItanium2isthemostinefficientprocessorbothforFl.Pt.andintegercodeforallbutoneefficiencymeasure(SPECFP/Watt)AthlonandPentium4bothmakegooduseoftransistorsandareaintermsofefficiency,IBMPower5isthemosteffectiveuserofenergyonSPECFPandessentiallytiedonSPECINT26LimitstoILPDoublingissueratesabovetoday’s3-6instructionsperclock,sayto6to12instructions,probablyrequiresaprocessortoissue3or4datamemoryaccessespercycle,resolve2or3branchespercycle,renameandaccessmorethan20registerspercycle,andfetch12to24instructionspercycle.Thecomplexitiesofimplementingthesecapabilitiesislikelytomeansacrificesinthemaximumclockrate
E.g,widestissueprocessoristheItanium2,butitalsohastheslowestclockrate,despitethefactthatitconsumesthemostelectricalpower!27MosttechniquesforincreasingperformanceincreasepowerconsumptionThekeyquestioniswhetheratechniqueisenergyefficient:doesitincreaseperformancefasterthanitincreasespowerconsumption?Multipleissueprocessortechniquesallareenergyinefficient:Issuingmultipleinstructionsincurssomeoverheadinlogicthatgrowsfaster(I2)thantheissuerategrowsGrowinggapbetweenpeakissueratesandsustainedperformanceNumberoftransistorsswitching=f(peakissuerate),andperformance=f(sustainedrate),
growinggapbetweenpeakandsustainedperformance
increasingenergyp
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 条例效力高于管理办法
- 校外实践教师管理办法
- 泉州临时仓储管理办法
- 部编本新人教版八年级上册音乐教学计划
- 湖南省湘西自治州2024-2025学年高一上学期期中模拟历史试题(解析版)
- 成品工序工价管理办法
- 科技研发团队管理办法
- 2024-2025学年湘教版数学八年级下册教学计划
- 离岗创业编制管理办法
- 学校物业材料管理办法
- 民兵应急知识培训课件
- 酒吧装修施工方案
- 初中生田径队训练计划
- 暨南大学《微观经济学》2023-2024学年第一学期期末试卷
- 班组安全工作总结汇报
- 高中英语必背3500单词表(完整版)
- DB11T 1911-2021 专业应急救援队伍能力建设规范 防汛排水
- 2024年版《输变电工程标准工艺应用图册》
- 北京市东城区东直门中学2024-2025学年七年级上学期分班考数学试卷
- 国家开放大学本科《理工英语3》一平台机考总题库2025珍藏版
- 陕西省咸阳市2023-2024学年高一下学期7月期末考试物理试题(原卷版)
评论
0/150
提交评论