第三章流水线_第1页
第三章流水线_第2页
第三章流水线_第3页
第三章流水线_第4页
第三章流水线_第5页
已阅读5页,还剩136页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

第三章流水线A.流水线基本定义Whatispipelining?(什么是流水线)HowisthepipeliningImplemented?(流水线是怎么实现的)Whatmakespipelininghardtoimplement?

(流水线实现的困难是什么)Basicmethodsimprovingpipelining(提高流水线性能的一些基本方法)WhatisPipelining?Pipelining(流水线定义):“Atechniquedesignedintosomecomputerstoincreasespeedbystartingtheexecutionofoneinstructionbeforecompletingthepreviousone.”

ModernEnglish-ChineseDictionaryimplementationtechniquewherebydifferentinstructionsareoverlappedinexecutionatthesametime.(多条指令同时重叠执行)implementationtechniquetomakefastCPUsWhyPipelining:ItsNature(流水线的本质)Laundry(洗衣店)Ann,Brian,Cathy,Dave

eachhaveoneloadofclothes

towash,dry,andfoldWasher(洗衣机)takes30minutesDryer(干衣机)takes40minutes“Folder”(叠衣人)takes20minutesABCD顺序执行的洗衣Sequentiallaundrytakes6hoursfor4loadsIftheylearnedpipelining,howlongwouldlaundrytake?ABCD3040203040203040203040206PM7891011MidnightTimeTaskOrder流水线的洗衣

开始工作越快越好Pipelinedlaundrytakes3.5hoursfor4loads

ABCD6PM7891011MidnightTaskOrderTime304040404020Whypipelining:overlapped(重叠)Latches(锁存器),calledpipelineregistersbreakupcomputationinto5stagesDeal5tasksatthesametime.Onlydealonetaskeachtime.Thistasktakes“suchalongtime”ns:nanosecond纳秒即十亿分之一秒Whypipelining:morefasterCan“launch(开始)”anewcomputationevery100nsinthisstructureCanfinish107computationspersecondCanlaunchanewcomputationevery20nsinpipelinedstructureCanfinish5×107computationspersecondWhypipelining:conclusionThekeyimplementationtechniqueusedtoMakefastCPU:decreaseCPUtime.(减少CPU时间的关键技术)ImprovingofThroughput(ratherthanindividualexecutiontime)(提高吞吐量,而不是单条指令的执行时间)Improvingofefficiencyforresources(functionalunit)

(提高了资源的有限利用率)Whatisapipeline?ApipelineislikeanautoassemblelineApipelinehasmanystages(流水线有多个阶段—称流水级或流水节拍)Eachstagecarriesoutadifferentpartofinstructionoroperation(每一个流水线完成指令操作的不同部分)Thestages,whichcooperatesatasynchronizedclock,areconnectedtoformapipe(各个流水级在同步时钟的控制下连成一条流水线)Aninstructionoroperationentersthroughoneendandprogressesthroughthestagesandexitthroughtheotherend(一条指令从一流水线的一端进入从另一端出来,完成整条指令的操作)Pipeliningisanimplementationtechniquethatexploitsparallelismamongtheinstructionsinasequentialinstructionstream

(流水线挖掘了指令流中的并行性)吞吐量Throughput(吞吐量)Numberofitems(cars,instructions)exitingthepipelineperunittime.Thethroughputoftheassemblylineisthenumberofproductscompletedperhour.ThethroughputofaCPUpipelineisthenumberofinstructionscompletedpersecond.(单位时间内完成的指令数)流水线特征

--时钟周期vs.机器周期Clockcycle(时钟周期)EverythinginaCPUmovesinlockstep(n.步伐一致),synchronizedbytheclock(TheHeartbeatofCPU)Machinecycle(Processorcycle,Stagetime)timerequiredtocompleteasinglepipelinestage.Thepipelinedesigner’sgoalistobalancethelengthofeachpipelinestage.(机器周期完成一个流水级所需要的时间)Inmanyinstances,machinecycle=max(timesforallstages).(机器周期=各个流水级完成时间的最大值)Amachinecycleisusuallyone,sometimestwo,clockcycleslong,butrarelymore.(一般是一到两个时间周期)流水线的描述Spatio-temporal(时空的)chartRelationofpipelineandtasksinsequenceFillbalancedDrainS412345n-1nS312345n-1nS212345n-1nS112345n-1nt1t2t3t4t5tntn+3time4+n-1流水节拍时间,以机器周期为单位流水线的理想性能SpeedupAssume:stages:k tasks:nTk=(k+(n-1))τp流水线实现所需要的时钟数 T1=nkτup非流水线实现所需要的时钟数τTTk1n1k1k-+=τ1)

p(nτkpnkup-+==peedupSspeedup=TimetasksonunpipelinedmachineTimesametasksonpipelinedmachinen—→∞ Speedup—→KK级流水线完成n个任务(指令)的加速比流水线的理想性能Ifthestagesareperfectlybalanced,Thetimeperinstructiononthepipelinedprocessorequalto(在理想状态下,流水线中每条指令所需的时间为)

TimeperinstructiononunpipelinedmachineNumberofpipestagesSo,Idealspeedup

equalto

Numberofpipestages.

(理想状态下,加速比等于流水线的级数)HowMIPSinstructionsetisimplementedwithoutpipelining?Fivephases(MIPS指令执行的五个阶段即五个流水级)IF:Instructionfetchcycle(取指令周期)ID:Instructiondecode/registerfetchcycle(指令解码/寄存器取数据周期)

EX:Execution/effectiveaddresscycle(指令执行/有效地址计算周期)MemoryreferenceR-R/R-IALUBranchMEM:Memoryaccess(存储器访问周期)WB:Write-backcycle(写结果周期)单周期实现seldomused!多周期实现流水线的MIPS指令集Sincetherearefiveseparatestages,wecanhaveapipelineinwhichoneinstructionisineachstage.CPIisdecreasedto1,sinceoneinstructionwillbeissued(orfinished)eachcycle.(CPI为1,每一周期流出一条指令)Duringanycycle,oneinstructionispresentineachstage.(在任何时候,每一个流水级里都有一条指令在执行)Ideally,performanceisincreasedfivefold!pipelinetimingchartstoreload5-stageVersionof

MIPSDatapath(数据通路)pipelineregistersorlatches

Howpipeliningdecreasetheexecutiontime?(流水线如何减少执行时间)

Ifyourstartingpointis(如果你的出发点是….)asingleclockcycleperinstructionmachinethen(单周期每指令的机器,则)

pipeliningdecreasescycletime.

(流水线减少周期时间)amultipleclockcycleperinstructionmachinethen(多周期每指令的机器,则)

pipeliningdecreasesCPI.

(流水线减少每条指令的平均周期数即CPI)单周期vs.流水线LoadIFIDEXMEMWBIFIDEXMEMWBStoreIFIDEXMEMWBR-typeSingleCycleImplementation:

CPI=1,longclockcycle

ClkLoadStoreWasteCycle1Cycle2PipelineImplementation:

CPI=1,clockcyclelongclockcycle/5ClkCycle1Cycle2Cycle3Cycle4Cycle5Cycle6Cycle7Cycle8Cycle9Cycle10多周期vs.流水线LoadStoreR-typeClkCycle1Cycle2Cycle3Cycle4Cycle5Cycle6Cycle7Cycle8Cycle9Cycle10ClkCycle1Cycle2Cycle3Cycle4Cycle5Cycle6Cycle7Cycle8Cycle9Cycle10Multip-CycleImplementation:

CPI=5,

PipelineImplementation:

CPI=1,LoadIFIDEXMEMWBIFIDEXMEMWBStoreIFIDEXMEMWBR-typeIFIDEXMEMWBIFIDEXMEMWBWhatweknewaboutpipelinePipeliningimplementationtechniquetoexecuteinstructionsinaoverlappedwaytomakefastCPUs(decreaseCPUtime,improvethroughput)(流水线通过让指令重叠执行的方式提高了吞吐量)Idealspeedupofpipelineequalto-Numberofpipestages(理想状态下流水线加速比是流水级数)IfthestartingpointisamultipleclockcycleperinstructionmachinethenpipeliningdecreasesCPI.

(在多周期指令计算机中,可以认为流水线降低了CPI)WorksintheMIPS5stagepipelineIF(Instructionfetchcycle)IRMem[PC];NPCPC=PC+4;ID(Instructiondecode/registerfetchcycle)ARegs[rs];BRegs[rt];Immsign-extendedimmediatefieldofIR;Note:ThefirsttwostagesofMIPSpipelinedothesamefunctionsforallkindsofinstructions.ThethirdstageofMIPSpipelineEX(Execution/effectiveaddresscycle)Memoryreference:ALUoutputA+ImmRegister-RegisterALUinstruction:ALUoutputAfuncB;Register-ImmediateALUinstruction:ALUoutputAopImm;Branch:ALUoutputNPC+(Imm<<2);Cond(A==0)Thelasttwostagesof

MIPSpipelineMEM(Memoryacces/branchcompletioncycle)Memoryreference(Load或Store指令起作用):LMDMem[ALUoutput]or(注:LMD—loadmemorydataregister)Mem[ALUoutput]BBranch:If(cond)PCALUoutputWB(Writebackcycle)Register-RegisterALUinstructionRegs[rd]ALUoutput;Register-ImmediateALUinstructionRegs[rt]ALUoutput;LoadInstruction:Regs[rt]LMD;Table:EventsoneverystageStageAnyinstructionIFIF/ID.IRMem[PC];IF/ID.NPC,PC(if((EX/MEM.opcode==branch)&EX/MEM.cond){EX/MEM.ALUoutput}else{PC+4});IDID/EX.ARegs[IF/ID.IR[rs]];ID/EX.BRegs[IF/ID.IR[rt]];ID/EX.NPCIF/ID.NPC;ID/EX.IRIF/ID.IR;ID/EX.Immsign-extend(IF/ID.IR[immediatefield]);ALUinstructionLd/stinstructionBranchinstructionEXEX/MEM.IRID/EX.IR;EX/MEM.ALUoutputID/EX.AfuncID/EX.B;orEX/MEM.ALUoutputID/EX.AopID/EX.Imm;EX/MEM.IRID/EX.IR;EX/MEM.ALUoutputID/EX.A+ID/EX.Imm;EX/MEM.BID/EX.B;EX/MEM.ALUoutputID/EX.NPC+(ID/EX.Imm<<2);EX/MEM.cond(ID/EX.A==0);MEMMEM/WB.IR

EX/MEM.IR;MEM/WB.ALUoutputEX/MEM.ALUoutput;MEM/WB.IR

EX/MEM.IR;MEM/WB.LMDMem[EX/MEM.ALUoutput];OrMEM/WB.LMDMem[EX/MEM.ALUoutput];WBRegs[MEM/WB.IR[rd]]MEM/WB.ALUoutput;orRegs[MEM/WB.IR[rt]]MEM/WB.ALUoutput;ForLoadonly;Regs[MEM/WB.IR[rt]]MEM/WB.LMDTheMIPSpipeliningProblemsthatpipeliningintroducesFocus:nodifferentoperationswiththesamedatapathresourceonthesameclockcycle.(structurehazard)(在同一时钟周期里,在同一资源上不会发生多个操作)Thereisconflict(冲突)aboutthememory!MemInstr.OrderTime(clockcycles)Ld/StInstr1Instr2Instr3ALUMemRegMemRegALUMemRegMemRegALUMemRegMemRegALURegMemRegSeparateinstructionanddatamemoriesusesplitinstructionanddatacachethememorysystemmustdeliver5timesthebandwidthovertheunpipelinedversion.IMInstr.OrderTime(clockcycles)Ld/StInstr1Instr2Instr3ALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALURegDMRegTheconflictabouttheregisters!SometimeswecanredesigntheresourceallowWRITE-then-READ

inoneclockcycle(doublepump前半拍寄存器组写入,后半拍寄存器组读出)Tworeadsandonewriterequiredperclock.Needtoprovidetworeadportandonewriteport.Whathappenswhenareadandawriteoccurtothesameregister?(Datahazard)ConflictoccurswhenPCupdateMustincrementandstorethePCeveryclock.Whathappenswhenmeetabranch?BrancheschangethevalueofthePC--buttheconditionisnotevaluateduntilID!Ifthebranchistaken,theinstructionsfetchedbehindthebranchareinvalid!Thisisclearlyaseriousproblem(Controlhazard)thatneedstobeaddressed.Pipelinehazard(流水线竞争):

themajorhurdle(流水线的主要障碍)AhazardisaconditionthatpreventsaninstructioninthepipefromexecutingitsnextscheduledpipestageTaxonomyofhazard(流水线竞争的分类)Structuralhazards

(结构竞争)Theseareconflictsoverhardwareresources.(硬件资料不够用导致的竞争)

Datahazards(数据竞争)Instructiondependsonresultofpriorcomputationwhichisnotready(computedorstored)yet(指令之间的相互依赖性导致的竞争)Controlhazards(控制竞争)branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclock(转移条件或转移地址未能及时知道而导致的竞争)HazardscanalwaysberesolvedbyStall(停顿)Thesimplestwayto“fix”hazardsistostallthepipeline.

(处理竞争最简单的办法是暂停流水线的执行)Stallmeanssuspendingthepipelineforsomeinstructionsbyoneormoreclockcycles.Thestalldelaysallinstructionsissuedaftertheinstructionthatwasstalled,whileotherinstructionsinthepipelinegoonproceeding.

(停顿指令之后的所有指令都被暂停,而其之前的指令继续执行)Apipelinestallisalsocalledapipelinebubble(流水汽泡)orsimplybubble.Nonewinstructionsarefetchedduringastall.

(停顿时,不会再取新的指令)PerformanceofpipelinewithstallsPipelinestallsdecreaseperformancefromtheideal(流水线停顿降低性能)Recallthespeedupformula:AssumptionsforcalculationTheidealCPIonapipelinedprocessorisalmostalways1.SoIgnoretheoverheadofpipeliningclockcycle.Pipestagesareidealbalanced.Twodifferentimplementationsingle-cycleimplementationmulti-cycleimplementationCaseofsingle-cycleimplementationCPIunpipelined=1

Clockcyclepipelined=

ClockcycleunpipelinedpipelinedepthCaseofmulti-cycleimplementationClockcycleunpipelined=ClockcyclepipeliningCPlunpipelined=pipelinedepthStructuralhazard:

PipeStageContention(争夺)Structuralhazards(结构竞争)Occurswhentwoormoreinstructionswanttousethesamehardwareresourceinthesamecycle(多条指令同时需要使用相同硬件资源时产生)Causesbubble(stall)inpipelinedmachines(引起停顿)OvercomebyreplicatinghardwareresourcesMultipleaccessestotheregisterfileMultipleaccessestomemorysomefunctionalunitisnotfullypipelined.MultiaccesstotheregisterfileSimplyinsertastall,speedupwillbedecreased.Wehaveresolveditwith“doublebump”DoubleBumpWorks!MultiaccesstoSingleMemoryPortInsertstallprovideanothermemoryportsplitinstructionmemoryanddatamemoryuseinstructionbufferMemInstr.OrderTime(clockcycles)Ld/StInstr1Instr2Instr3ALUMemRegMemRegALUMemRegMemRegALUMemRegMemRegALURegMemRegInsertStallInstr.OrderTime(clockcycles)Ld/StInstr1Instr2Instr3ALUMemRegMemRegALUMemRegMemRegALUMemRegMemRegALUMemRegMemRegBubbleBubbleBubbleBubbleBubbleStallIMInstr.OrderTime(clockcycles)Ld/StInstr1Instr2Instr3ALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALURegDMRegSplitinstructionanddatamemorySplitinstructionanddatamemory/multiplememoryport/instructionbuffermeans:

fetchtheinstructionanddatainferenceusingdifferenthardwareresources.Notfullypipelinedfunctionunit:

maycausestructuralhazardMachinewithoutstructuralhazardswillalwayshavealowerCPIExample(howmuchtheloadstructuralhazardmightcost)Datareferenceconstitute40%ofthemixIdealCPIignoringthestructuralhazardis1Theprocessorwiththestructuralhazardhasaclockratethatis1.05timeshigherthanthatofaprocessorwithoutstructuralhazard.AnswerAverageinstructiontime=CPIClockcycletime=(1+0.41)CCideal/1.05=1.3CcidealClearly,theprocessorwithoutthestructuralhazardisfaster.Whyallowmachinewithstructuralhazard?Theprimaryreasonistoreducecost.(降低成本)i.e.addingsplitcaches,requirestwicethememorybandwidth.alsofullypipelinedfloatingpointunitscostslotsofgates.Itisnotworththecostifthehazarddoesnotoccurveryoften.

Toreducelatencyoftheunit.(减少部件延迟)MakingfunctionalunitspipelinedaddsdelayAnunpipelinedversionmayrequirefewerclocksperoperation.Example:impactofstructuralhazardtoperformanceExampleManymachineshaveunpipelinedfloat-pointmultiplier.ThefunctionunittimeofFPmultiplieris6clockcyclesFPmultiplyhasafrequencyof14%inaSPECfpbenchmarkWillthestructuralhzardhavealargeperformanceimpactontheSPECfpbenchmark?AnswertotheexampleInthebestcase:FPmultipliesaredistributeduniformly.Thereisonemultiplyinevery7clock.1/14%Thentherewillbenostructuralhazard,thenthereisnoperformancepenaltyatall.Intheworstcase:themultipliesareallclusteredwithnointerveninginstructions.Theneverymultiplyinstructionhavetostall5clockcyclestowaitforthemultiplierbereleased.TheCPIwillincrease70%to1.7(注:1+5*0.14),iftheidealCPIis1.Experimentresult:Thisstructuralhazardincreaseexecutiontimebylessthan3%.TaxonomyofHazards

-Structuralhazards

Theseareconflictsoverhardwareresources.

OK,maybeaddextrahardwareresources accordingtoAmdahl’sLaw;orfullpipelinedthefunctionalunits;otherwisestillhavetostall-DatahazardsInstructiondependsonresultofpriorcomputationwhichisnotready(computedorstored)yet-Controlhazards

branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclockSummaryofStructuralhazard

PipeliningHazardsTaxonomyofHazards

Structuralhazards

Theseareconflictsoverhardwareresources.

DatahazardsInstructiondependsonresultofpriorcomputationwhichisnotready(computedorstored)yetControlhazards

branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclockDatahazard(数据竞争)Datahazardsoccurwhenthepipelinechangestheorderofread/writeaccessestooperandscomparingwiththatinsequentialexecuting.(当流水导致读写操作数的顺序发生改变时可能发生此竞争)Let’sseeanExampleDADDR1,R1,R3DSUBR4,R1,R5ANDR6,R1,R7ORR8,R1,R9XORR10,R1,R11Datahazard(数据竞争)Basicstructure(数据竞争的基本情形)Aninstructioninflight(在执行过程当中)wantstouseadatavaluethat’snot“done”yet“Done”means“it’sbeencomputed”and“it’slocatedwhereIwouldnormallyexpecttogolookinthepipehardwaretofindit”Basiccause(数据竞争产生的基本原因)

YouareusedtoassumingapurelysequentialmodelofinstructionexecutionInstructionNfinishesbeforeinstructionN+k,fork>=1Therearedependenciesnowbetween“nearby”instructions(“near”insequentialorderoffetchfrommemory)Copingwithdatahazards:exampleSomecases“DoubleBump”cando!BubbleBubbleBubbleBubbleBubbleBubbleBubbleBubbleBubbleHowdowestall?

Insertnopbycompiler(编译时插入空操作指令)Howdowestall?

AddhardwareInterlock!(采用流水线内锁电路)AddextrahardwaretodetectstallsituationsWatchestheinstructionfieldbits(监视指令操作数域)Looksfor“readversuswrite”conflictsinparticularpipestages(查找指令间的“读写”冲突)AddextrahardwaretopushbubblesthrupipeActually,relativelyeasyCanjustlettheinstructionyouwanttostallGOFORWARDthroughthepipe…(让要停顿的指令继续执行)…but,TURNOFFthebitsthatallowanyresultstogetwrittenintothemachinestate(但不允许其执行结果保存下来)So,theinstruction“executes”(itdoesthework),butdoesn’t“save”(“执行”但不“保存”)Interlock:insertstallsBubbleBubbleEmptyslotsinthepipecalledbubbles;meansnorealinstructionworkgettingsavedhereHowtheinterlockisimplementated?Forwarding:(提前—又称旁路技术)

reducedatahazardstallsIftheresultyouneeddoesnotexistATALLyet,youareoutofluck,sorry.(如果你需要的结果值不存在,那是没办法的)But,whatiftheresultexists,butisnotstoredbackyet?(如果需要的结果值存在但还没存放到相应的位置上,怎么办呢)Insteadofstallinguntiltheresultisstoredbackinits“natural”home…grabtheresult“onthefly”from“inside”thepipe,andsendittotheotherinstruction(anotherpipestage)thatwantstouseit

(把结果值从当前位置上取出来,发送给目前需要使用这个值的指令)ForwardingGenericname:forwarding(bypass,short-circuiting)(常用的名称提前电路,旁路电路,短路电路)Insteadofwaitingtostoretheresult,weforwarditimmediately(moreorless)totheinstructionthatwantsit(把结果立即转发给需要它的指令,而不是等待直至其放到相应的位置上)Mechanically,weaddbusestothedatapathtomovethesevalues(在实现上,在数据通路上增加一些传输总线)andthesebusesalways“pointbackwards”inthedatapath,fromlaterstagestoearlierstages(这些总线总是往回指,从后面流水节拍往前面的节拍指)Forwarding:

reducedatahazardstalls

Datamaybealreadycomputed-justnotintheRegisterFileEX/MEM.ALUoutputALUinputportMEM/WB.ALUoutputALUinputport

HardwareChangeforForwarding

旁路技术的硬件实现MEM/WRID/EXEX/MEMDataMemoryALUmuxmuxRegistersNextPCImmediatemuxEX/Mem.ALUoutputALUinputMEM/WB.ALUoutputALUinputMEM/WB.LMDALUinputForwardingpathtootherinputentrystoreloadMEM/WB.LMDDMinputForwardingisn’tAlwaysavailableSowehavetoinsertstall:

LoadstallTime(clockcycles)Instr.Orderlwr1,0(r2)subr4,r1,r6andr6,r1,r7RegALUDMemIfetchRegRegIfetchALUDMemRegBubbleIfetchALUDMemRegBubbleRegIfetchALUDMemBubbleRegorr8,r1,r9Exampleof

ForwardingandLoadDelayWhyforwarding?ADDR4,R5,R2LWR15,0(R4)SWR15,4(R2)Whyloaddelay?ADDR4,R5,R2LWR15,0(R4)SWR15,4(R2)Solution(withoutforwarding)Solution(withforwarding)InstructionreorderingbycompilertoavoidloadstallTryproducingfastcodefor

a=b+c; d=e–f;

assuminga,b,c,d,e,andfinmemory.

Slowcode: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,RdFastcode: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,RdSummaryofDataHazardTaxonomyofHazards

StructuralhazardsTheseareconflictsoverhardwareresources.

DatahazardsInstructiondependsonresultofpriorcomputationwhichisnotready(computedorstored)yetOK,wedidthese,DoubleBump,Forwardingpath,softwarescheduling,otherwisehavetostallControlhazards

branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclockTheControlhazard(控制竞争)Cause(产生原因)branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclock(转移条件及转移地址未能及时知晓)ThenextPCtakestimetocompute(转移地址计算要花时间)Forconditionalbranches,thebranchdirectiontakestimetocompute.(对于条件转移,转移条件的计算要也花时间)ControlhazardscancauseagreaterandgreaterperformancelossforMIPSpipelinethandodatahazards.(控制竞争会导致流水线性能的大量损失!!)Example:BranchesBranchesofBasicPipelinedDatapathControlhazardb=3DealingwiththecontrolhazardFoursimplesolutions(四种简单的解决办法)Freezeorflushthepipeline(冻结或冲洗流水线)Predict-not-taken(Predict-untaken)(预测转移不成功)TreateverybranchasnottakenPredict-taken(预测转移成功)TreateverybranchastakenDelayedbranch(延迟转移技术)solvethehazardbyinsertingstalls48or72

ThepipelinestatusBranchinstructionIFIDEXMEMWBBranchSuccessorstallstallstallidleidleBranchsuccessor+1IFIDEXBranchsuccessor+2IFIDBranchsuccessor+3IFFlushingthepipeline(流水线冲洗)Simplesthardware:Holdingordeletinganyinstructionafterbranchuntilthebranchdestinationisknow.

(保持/删除转移指令之后所取的指令,直至转移目标确定)Penaltyisfixed.(损失固定)Cannotbereducedbysoftware.(不可能通过软件来减少损失,如调度等)StallsgreatlyhurttheperformanceProblem:Witha30%branchfrequencyandanidealCPIof1,howmuchtheperformaceisbyinsertingstalls?Answer:CPI=1+30%3=1.9thissimplesolutionachievesonlyabouthalfoftheidealperformance.StallingAlwaysHurtstheNottakencase(目前,在转移不成功的情况下,停顿也总是影响性能)HowaboutassumeBranchNotTaken(如果我们一开始就假设转移不成功,如何?)WhatIfBranchWasTaken…?

(万一转移又成功了又该怎么办呢?)Howtodowiththebranchtaken(怎么办?justkillthem!)Predict–not-taken(预测转移不成功)Hardware:Treateverybranchasnottaken(orastheformalinstruction)(对待每一个转移先当作它们没有发生)Whenbranchisnottaken,thefetchedinstructionjustcontinuestoflowon.Nostallatall.Ifthebranchistaken,thenrestartthefetchatthebranchtarget,whichcause3stall.(shouldturnthefetchedinstructionintoano-op)Compiler:Canimprovetheperformancebycodingthemostfrequentcaseintheuntakenpath.

(编译器可以通过把最有可能发生的代码放在转移不成功的序列里面来提高性能)Predict–taken(预测转移成功)HardwareTreateverybranchastaken(evidence:morethan60%brachesaretaken)(假设转移总是成功的)Assoonasthebranchtargetaddressiscomputed,assumethebranchtobetakenandbeginfetchingandexecutingatthetarget.(地址一计算出来,就根据这个地址去取指令)Onlyusefulwhenthetargetisknownbeforethebranchoutcome.(只有转移地址早于转移判断结果出来的情况才有意义)NoadvantageatallforMIPS5-stagepipeline.(对MIPS流水线而言没什么优势)CompilerCanimprovetheperformancebycodingthemostfrequentcaseinthetakenpath.(编译器可以通过把最有可能发生的代码放在转移成功的序列里面来提高性能)MovetheBranchComputationForward(把转移条件及转移地址的计算提前到ID级,如何?)storeloadResult:New&ImprovedMIPSDatapath(结果就是产生了新的更有效的MIPS数据通路)Needjust1extracycleaftertheBEQbranchtoknowrightaddress(仅需要一个周期的停顿来等待转移条件的计算)OnMIPS,itscalled-thebranchdelayslot(分支延时槽)48or72

Flushing:needonlytoinsertonestalltoresolvecontrolhazard48or72

40ADDR30,R30,R30IFIDEXMEMWB44BEQR1,R3,24IFIDEXMEMWB48ANDR12,R2,R5IFidleidleidleidle48or72IFIDEXMEMThepipelinestatusforpredict-not-taken40ADDR30,R30,R30IFIDEXMEMWB44BEQR1,R3,24IFIDEXMEMWB48ANDR12,R2,R5IFIDEXMEMWB52ORR13,R6,R2IFIDEXMEMBranchisnottaken:NostallBranchistaken:1stall40ADDR30,R30,R30IFIDEXMEMWB44BEQR1,R3,24IFIDEXMEMWB48ANDR12,R2,R5IFidleidleidleidle72LWR4,50(R7)IFIDEXMEM76IFIDEXDelayedbranch(延时转移技术)GoodnewsJust1cycletofigureoutwhattherightbranchaddressis(仅需要一个周期来计算转移条件与转移地址)So:OK,it’salways1cycle,andwealwayshavetowait.OnMIPS,wecaninsertainstructionthatalwaysexecutes,nomatterwhetherthebranchtakenornottaken.(byhardwarescheme)(事实上,我们也许可以调一条无论是否转换都要执行的指令到转移指令的后面,就不再损失一个周期了)Branchdelayslot(转移延时槽)Hencethename:branchdelayslotTheinstructioncycleafterthebranchisusedforaddresscalculation,1cycledelaynecessarySO…weregardthisasafreeinstructioncycle,andwejustDOITConsequenceYou(oryourcompiler)willneedtoadjustyourcodetoputsomeusefulworkinthat“slot”,sincejustputtinginaNOPiswasteful(compilerscheme)Howtoadjustthecodes?Example:rewritethecode(a)Example:rewritethecode(b-1)Loop:LWR2,0(R1)ADDR3,R2,R4SWR3,0(R1)……SUBR1,R1,#4BNEZR1,Loop

LWR2,0(R1)Loop:ADDR3,R2,R4SWR3,0(R1)……

SUBR1,R1,#4

BNEZR1,Loop

LWR2,0(R1)Example:rewritethecode(b-2)Loop:LWR2,0(R1)ADDR3,R2,R4SWR3,0(R1)DIV…..……SUBR1,R1,#4BNEZR1,LoopLoop:LWR2,0(R1)ADDR3,R2,R4DIV…...…...SUBR1,R1,#4BNEZR1,Loop

SWR3,+4(R1)Schedualingstrategyvs.performanceimprovementPipelinestatusfordelayedbranchConstraintsofthedelayedbranch

延时转移技术的一些限制TherearerestrictionsontheinstructionsthatarescheduledintothedelayslotsThecompiler‘sabilitytopredictaccuratelywhetherornotabranchistakendetermineshowmuchusefulworkisactuallydone.(编译对转移行为的预测能力很重要)Forschedulingschemebandc,ItmustbeO.K.toexecutetheSUBinstructionifthepredictioniswrong.Usually,thehardwareprovideawayofcancellingtheinstruction.(要引入“硬件清除”转移延时槽的功能)Cancelling(硬件清除)Ifbranchispredictedincorrectly,CPUturnstheinstructioninthebranchdelayslotintoano-op.Delayedbranchwithcancelling

(ofcaseb)PipelinehazardsTaxonomyofHazards

Structuralhazards

Theseareconflictsoverhardwareresources.

DatahazardsInstructiondependsonresultofpriorcomputationwhichisnotready(computedorstored)yetControlhazards

branchconditionandthebranchPCarenotavailableintimetofetchaninstructiononthenextclockOK,wedidthese,calculatethedestinationaddressandcondition,Flushingthepipeline,predict-not-taken,predict-taken,delayedbranch(with/withoutcancelling)WhatMakesPipelinesHardtoImplement?Detectingandresolvinghazards(检测和处理竞争)OK.Wehavesolvedthisproblem.ExceptionsandInterrupts(例外和中断)InstructionSetcomplications(指令集复杂)VerycomplexmulticycleinstructionsaredifficulttopipelineExample:stringMovfrom0x1234,to0x4000,0x1000bytesExceptioncauses(

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论