计算机体系结构8_第1页
计算机体系结构8_第2页
计算机体系结构8_第3页
计算机体系结构8_第4页
计算机体系结构8_第5页
已阅读5页,还剩56页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、1Branch InstructionsSima, Fountain and KacsukChapter 8CSE3304 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19972Major Chapter GoalslTo understand how to minimise the performance degradation of branches Basic approach to branch handling Delayed branching Branch processi

2、ng Multi-way branching Guarded Execution David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19973Types of branch instructionsUnconditionalBranchesConditionalBranchesSimpleUnconditionalBranchBranch toSubroutineReturn fromSubroutineLoop closingconditionalbranchOtherconditional

3、branchAlpha:BR taPowPC: b taBSR tabl taRET tabclrBTL R1, tabdnz taBNE R1, tabne ta David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19974Types of branch instructionsUnconditionalBranchesConditionalBranchesSimpleUnconditionalBranchBranch toSubroutineReturn fromSubroutineLoo

4、p closingconditionalbranchOtherconditionalbranchAlpha:BR taPowPC: b taBSR tabl taRET tabclrBTL R1, tabdnz taBNE R1, tabne taBR tata:BSR tata:RETta:SUBL R1,1,R1BLT R1,taBNE R1,tata: David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19975Why worry about different types of bra

5、nches?lBranches cause stalls of pipelineslDifferent techniques for minimizing branch effects lTake advantage of different type of branch e.g. Unconditional branch ALWAYS occurs. Therefore, hardware can plan for the branch in advance. e.g. difference between loop closing and normal conditional David

6、Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19976Ways of checking conditionlConditional branches need to evaluate a predicatelTwo main approaches Result state IBM 360, 370, PDP-11, VAX-11, x86, Pentium, MC68000, Sparc, PowerPC. Direct Check PDP-10, Cyber/70, PDP-8, CRAY, MI

7、PS, HPPA, Dec Alpha David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19977Result StatelResult state is declared to hold status information related to result of operationlTypical implementation is condition codes or flag registers which are updated after every arithmetic re

8、sult is producedlConditional branch instructions interrogate the flags in subsequent instructions David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19978Result state disadvantageslThe generation of result state is not straightforward; irregular structure occupies additional

9、 chip arealMakes pipeline longerALUTest 0 0= 01 Clock Cycle David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 19979Result state disadvantages .lSequential in concept. How do we pack instructions in Superscalar and VLIW machine?add r1, r2, r3sub r5, r6, r7blt ta1bgt ta2add r

10、1, r2, r3sub r5, r6, r7blt ta1bgt ta2 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199710Direct ChecklNo result state is declaredlSpecified conditions are directly checked by explicit instructionslConditional branching can be requested if the specified conditions are m

11、et. lMay involve one or two instructionslFits better into superscalar architectureslShorter clock cycle because only check when necessary David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199711Comparison of conditional branchesTwo instruction Implementationadd r1, r2, r3cm

12、peq r7, r1, 0bt r7, labeldiv r5, r4, r1One instruction Implementationadd r1, r2, r3beq r1, labeldiv r5, r4, r1 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199712The effect of brancheslNeed to understand whether branches are taken or notlThen tailor the architecture to

13、 perform fastest on most common case.lIt turns out that most branches are taken .FetchDec/RdALUWriteFetchDec/RdALUWriteFetch David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199713Branch statistics .UnconditionalBranchesConditionalBranchesSimpleUnconditionalBranchBranch to

14、SubroutineReturn fromSubroutineLoop closingconditionalbranchOtherconditionalbranch 1/3 1/3 1/3Taken for first n-1 iterationsTakenNotTaken 1/6 1/6Taken 5/6 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199714Branch HandlingUtilizing branch delay slotsHandling of unresolv

15、ed conditional branches Avoiding cond. branches Delayed BranchingPerformanceBlockingbranch proc.Speculativebranch proc.Multiwaybranch proc.Branch processingGuardedExecution David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199715Branch HandlinglDelayed branching Utilise oth

16、erwise wasted cycles following branches. Achieved by inserting an instruction behind the branch and executing it before the branch Utilized on Early and Subsequent RISC architectures David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199716Delayed branchingFetchDec/RdALUWrit

17、eFetchDec/RdALUWriteFetchDeadDeadaddbsubadd r1, r2, r3b anywherediv r5, r4, r1anywhere: sub .add r3, r2, r6 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199717Delayed branching .FetchDec/RdALUWriteFetchDec/RdALUWriteaddbdivadd r1, r2, r3bd anywherediv r5, r4, r1anywher

18、e: sub .add r3, r2, r6FetchDec/RdALUWriteFetchDec/RdALUWriteaddDelayed BranchFetchDec/RdALUWritesub David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199718Delayed branching optionslCan reduce to single delay slot if target address available at end of decode phaselCompiler

19、places NOPs in slots and migrates instructions into slots using code migrationlCompiler must perform dataflow analysis David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199719Code migration to fill slotsadd r8, r9, r10bd anywherediv r5, r4, r1anywhere: sub .add r3, r2, r6NO

20、PNOPadd r3, r2, r6bd anywherediv r5, r4, r1anywhere: sub .add r8, r9, r10 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199720Does this work with conditional branches/?lYes, but code migrated into delay slot must be performed unconditionallyadd r8, r9, r10beq r6,anywher

21、ediv r5, r4, r1add r3, r2, r6NOPNOPadd r3, r2, r6div r5, r4, r1add r8, r9, r10beq r6,anywhere David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199721Optimisations - AnnulmentlAnnul delay slot if branch not taken Useful for backward conditional branches Makes it possible to

22、 move a loop body instruction into the delay slotlAnnul delay slot if branch is taken Useful for forward conditional branches Makes it possible to relocate an instruction from sequential path into the delay slot David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199722Annul

23、delay slot if branch not taken1234 BrCDelayLoop bodyConditionalBranch with no AnnulmentLoop requires 5 instructions David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199723Annul delay slot if branch not taken .1234 BrCDelayLoop bodyConditionalBranch withAnnulment1Loop requi

24、res 4 instructions David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199724Annul delay slot if branch is taken1ConditionalBranch with no Annulment23BrCDelay45 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199725Annul delay slot if branch is tak

25、en1ConditionalBranch withAnnulment23BrC45 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199726FetchDec/RdALUWriteFetchDec/RdALUWriteFetchDec/RdALUWriteFetchDec/RdALUWriteBranch detectionlIn general, the earlier a branch is detected, the better the handling Reduces numbe

26、r of delay slots requiredlEarly branch detection In parallel Look ahead Integrated with fetchFetchDec/RdALUWrite David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199727Early branch detectionlIn Parallel Branches are detected in parallel with decode of other instructions us

27、ing a dedicated branch decoderlLook ahead Branches are detected from the instruction buffer but ahead of general instruction decodinglIntegrated Branches are detected during instruction fetch David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199728Conditional brancheslThe p

28、roblem with conditional branches is that we may not know until late in the instruction whether we wish to take or reject the control transferTake branch?FetchDec/RdALUWritebzr1,taget bzread r1cmp 0WritePC David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199729Handling unre

29、solved conditional brancheslBlocking branch processinglSpeculative branch processing Branch prediction Extent of speculation Recovery from mis-predictionlMulti-way branching David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199730Blocking branch processinglTrivial approachl

30、Execution of conditional is simply stalled until condition is knownlResult state Can possibly order instructions to reduce delay in waiting for condition codes to be setlDirect checking Need to wait for ALU result in this instructionSet CCBrCFetchDec/RdALUWrite David Abramson, 2000Material from Sima

31、, Fountain and Kacsuk, Addison Wesley 199731Speculative branch processinglPipeline stalls can be avoidedlAfter detection of unresolved conditional branch a guess is made of the outcomelIf guess is correct Speculation confirmed and continued executionlIf guess is incorrect Instructions discarded Fetc

32、h restarted David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199732Branch prediction schemeslFixed predictionlTrue prediction A “true” guess is made Static Prediction Based on object code Dynamic Prediction Based on execution history David Abramson, 2000Material from Sima,

33、 Fountain and Kacsuk, Addison Wesley 199733Fixed Prediction (Guess Not Taken)lDetect unresolved conditional branch and guess as not takenlContinue with the execution of the sequential path, but start the execution of taken path (e.g. calculate BTA)lWhen conditional becomes available, check the guess

34、lIf correct, continue with execution delete taken path pre-processinglIf incorrect, delete speculative execution and continue with taken path David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199734Fixed Prediction (Guess Not Taken)ConditionKnownEvaluate BTACORRECT David Ab

35、ramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199735Fixed Prediction (Guess Not Taken)ConditionKnownEvaluate BTAINCORRECT David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199736Fixed Prediction (Guess Taken)ConditionKnownEvaluate BTAINCORRECT David Ab

36、ramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199737Static PredictionlLike fixed prediction, but some properties of the object code are used to decide whether to use always taken or not takenlHas possibility of performing better since some information is used. David Abramson, 2

37、000Material from Sima, Fountain and Kacsuk, Addison Wesley 199738Static Prediction .lTypical options are Opcode based prediction For certain opcodes the branch is assumed to be taken, for others not taken. Displacement based prediction If displacement 0 taken, else not taken Compiler directed Predic

38、t bit in instruction set by compiler analysis or trace dataOpBTASign bit David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199739Dynamic predictionlPrediction is made on branch historylPhilosophy is that history is good guide to future behaviourlGood for loops because tend

39、to iterate multiple timeslGood for some conditional branches depending on behaviour of datalGood for exception condition checking What Happened Last Time? David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199740Dynamic Prediction TechniqueslExplicit technique Branch history

40、 explicitly stated history bits 1, 2 or 3 bit schemes Number of bits represent likelihood of branch being taken in futurelImplicit technique Branch being “recorded” is an indication that branch was taken Roughly equivalent to 1 bit scheme David Abramson, 2000Material from Sima, Fountain and Kacsuk,

41、Addison Wesley 199741Branch History BitslFor each branch, record a status field.lOne bit status Did last occurrence of branch occur?lTwo bit status State table decideswhich statelThree bit scheme History of last 3 branches Majority decision on outcomeStateDesignation11Strongly Taken10Weakly Taken01W

42、eakly not taken00Strongly not taken David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199742State transitions in 2 bit schemelFor each branch AT = Actually taken ANT = Actually not takenStronglyTakenWeaklyTakenWeaklyNOTTakenStronglyNOTTakenATANTANTANTATATATANTPrediction Tak

43、enPrediction Not Taken David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199743State transitions in 2 bit schemelFor each branch AT = Actually taken ANT = Actually not takenStronglyTakenWeaklyTakenWeaklyNOTTakenStronglyNOTTakenATANTANTANTATATATANTStronglyNOTTakenANTWeaklyNO

44、TTakenATWeaklyTakenAT ANTWeaklyNOTTakenATWeaklyTaken David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199744Implicit Dynamic SchemelTwo main schemes Branch Target Access Cache (BTAC) Branch Target Instruction Cache (BTIC)lBoth schemes introduce an extra cachelEntries are o

45、nly stored for taken branches Not taken branches are not storeslIf an entry is in cache, then next branch is takenlBehaves like 1 bit David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199745BTAC and BTIC implementationlStore Branch Address PLUS The target address itself (BT

46、AC) The instruction at the target (BTIC)10002000Load R1, .100020001000Load R1, . David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199746BTAC implementation .Load R1, .1000200010002000Dont take branchSpeculativelyWhoops!Add this branch address to the cacheNext timeTake bran

47、chSpeculatively David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199747Performance differences in BTAC and BTIClBTAC delivers address of branch in time for next fetchlBTIC delivers actual instruction. Can be slower because it does not require fetchLookup BTACFetchDec/RdExe

48、WriteFetchDec/RdExeWriteLookup BTIC David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199748Implementation of History bitslPlacement of history bits I-cache Alpha, UltraSparc Branch History Table (BHT) Power PC 604, 620, R10000 Branch Target Address Cache (BTAC) MC68060, Pe

49、ntium, R8000lI-cache and BHT effectively the same David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199749I-Cache and BHT (Power PC 604)I-cache16 KFour way set associativeInstruction fetch addressBHT128 x 4 entriesPredictionLogic4 instr./cycle2 History BitsTaken/Not takenBT

50、A for a taken guessDecode Queue4 x 1 Instr.Issue Queue David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199750Prediction AccuracyP = fc * Pc + fm * Pmwhere fc: Probability of correctly predicting branches fm: Probability of mis-predicting branches Pc: Penalty of correctly

51、predicted branches Pm: Penalty of mis-predicted branches David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199751Prediction Accuracy .Branch Penalty P0 Against accuracy00.20.40.60.811.21.41.6Fixed - AlwaysTaken (0.625)Static,displacementbased (0.685)Dynamic 1 bit(0.89)Dynam

52、ic 2 bit(0.93)Prediction MethodPenalty (P0) (Cycles)Pm=4Pm=2Pm=1 David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199752Recovery from MispredictionlIf branch prediction hardware makes wrong guess, need to revert to alternative path of execution.FetchDec/RdALUWriteFetchDec/

53、RdALUWriteFetchDec/RdALUWriteFetchDec/RdALUWrite? David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199753Recovery from Misprediction .lFor pipelined machines, may be possible to abort register writeslStore instructions are harder to recoverlCondition codes might also need

54、to be restored.FetchDec/RdALUWriteFetchDec/RdALUWriteFetchDec/RdALUWriteFetchDec/RdALUWriteRegister File David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199754Schemes to shorten mis-prediction recoverylBasic prior measures for recovery In a “taken” guess, save sequential

55、address In a “not taken” guess, pre-calculate and save branch target address Requires two address registers per speculated conditional branchSave this addressSave this address David Abramson, 2000Material from Sima, Fountain and Kacsuk, Addison Wesley 199755Schemes to shorten mis-prediction recovery.lEnhanced prior measures to shorten recovery In a “taken” guess save sequ

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

最新文档

评论

0/150

提交评论