教程讲稿高体vliw

上传人：我*** IP属地：北京上传时间：2022-09-01 格式：PPTX 页数：31 大小：370.65KB 积分：14 举报 版权申诉

免费预览已结束，剩余26页可下载查看

 下载本文档

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、（第五讲）2011年4月16日程旭VLIW/EPIC基于静态调度的ILP高等计算机系统结构复习: 先进超标量技术简单的转移预测技术也会有不错的效果基于路径（Path-based）的预测器可取得95%的正确率BTB可以在流水线的前期就改变控制流, BHT的每个表项成本小，但需要等到指令译码转移错误预测的恢复需要使用流水线的快照（snapshots）以减少损失统一物理寄存器堆的设计，可以避免从不同地方 (ROB+arch regfile)读取数据超标量处理器可以在一个时钟周期对多条相关指令进行重命名需要采用推测式存储缓冲器（speculative store buffer）来避免等待存储操作的确

2、认FetchDecode & RenameReorder BufferPCBranchPredictionUpdate predictorsCommitBranchResolutionBranchUnitALUReg. FileMEMStore BufferD$Executekillkillkillkill复习: 转移预测和推测式执行超标量处理器控制逻辑的可扩展性每条被发射的指令都必须检查与W*L条指令间的关系，即硬件的增长 W*(W*L)对于按序机器，L与流水线延迟相关对于乱序机器, L还包括指令缓冲器消耗的时间（指令窗口或 ROB)随着W的增加，需要更大的指令窗口来找都维持机器忙碌所需的

3、指令级并行性 = L变得更大= 乱序控制逻辑的复杂程度增长速度大于 W2 (W3)Lifetime LIssue GroupPreviously Issued InstructionsIssue Width W乱序控制的复杂度:MIPS R10000Control Logic SGI/MIPS Technologies Inc., 1995 Check instruction dependenciesSuperscalar processor串行ISA的瓶颈a = foo(b);for (i=0, i no x-operation RAW checkNo data use before dat

4、a ready = no data interlocksTwo Integer Units,Single Cycle LatencyTwo Load/Store Units,Three Cycle LatencyTwo Floating-Point Units,Four Cycle LatencyInt Op 2Mem Op 1Mem Op 2FP Op 1FP Op 2Int Op 1VLIW 编译器的职责编译器: 调度成最大并行执行确保指令内的并行性避免数据冒险(无互锁)通常用显式的NOP指令来分隔实际操作早期VLIW机器FPS AP120B (1976)scientific attach

5、ed array processorfirst commercial wide instruction machinehand-coded vector math libraries using software pipelining and loop unrollingMultiflow Trace (1987)commercialization of ideas from Fishers Yale group including “trace scheduling”available in configurations with 7, 14, or 28 operations/instru

6、ction28 operations packed into a 1024-bit instruction wordCydrome Cydra-5 (1987)7 operations encoded in 256-bit instruction wordrotating register file循环的执行for (i=0; iN; i+) Bi = Ai + C;Int1Int 2M1M2FP+FPxloop:How many FP ops/cycle?ld add r1fadd sd add r2 bne 1 fadd / 8 cycles = 0.125loop: ld f1, 0(r

7、1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loopCompileSchedule循环展开（Loop Unrolling）for (i=0; iN; i+) Bi = Ai + C;for (i=0; iN; i+=4) Bi = Ai + C; Bi+1 = Ai+1 + C; Bi+2 = Ai+2 + C; Bi+3 = Ai+3 + C;Unroll inner loop to perform 4 iterations at onceNeed to handle values of N that are

8、 not multiples of unrolling factor with final cleanup loop循环展开后代码的调度loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2)add r2, 32 bne r1, r3, loopScheduleInt1Int 2

9、M1M2FP+FPxloop:Unroll 4 waysld f1ld f2ld f3ld f4add r1fadd f5fadd f6fadd f7fadd f8sd f5sd f6sd f7sd f8add r2bneHow many FLOPS/cycle?4 fadds / 11 cycles = 0.36软件流水（Software Pipelining）loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f

10、3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loopInt1Int 2M1M2FP+FPxUnroll 4 ways firstld f1ld f2ld f3ld f4fadd f5fadd f6fadd f7fadd f8sd f5sd f6sd f7sd f8add r1add r2bneld f1ld f2ld f3ld f4fadd f5fadd f6fadd f7fadd f8sd f5sd f6sd f7sd f8add r1add r2

11、bneld f1ld f2ld f3ld f4fadd f5fadd f6fadd f7fadd f8sd f5add r1loop:iterateprologepilogHow many FLOPS/cycle?4 fadds / 4 cycles = 1软件流水与循环展开timeperformancetimeperformanceLoop UnrolledSoftware PipelinedStartup overheadWind-down overheadLoop IterationLoop IterationSoftware pipelining pays startup/wind-d

12、own costs only once per loop, not once per iteration如果没有循环，怎么办?转移指令限制了控制流敏感的非规则代码中基本块的大小很难从单独的基本块中发现ILP Basic block踪迹调度（Trace Scheduling） Fisher,EllisPick string of basic blocks, a trace, that represents most frequent branch pathUse profiling feedback or compiler heuristics to find common branch pat

13、hs Schedule whole “trace” at onceAdd fixup code to cope with branches jumping out of trace“典型” VLIW的问题目标代码的兼容性需要针对每种机器重新编译所有代码，甚至对于同一代中的两种不同机器也需要如此目标代码的大小指令填充浪费指令存取空间（存储器/cache）循环展开/软件流水将复制代码需要调度访存延迟可变的操作caches 和/或主存块的冲突可能产生静态时刻不可预知的变化确认转移发生概率动态剖视（Profiling）需要增加巨大的一步来进行静态转移预测对于静态时刻不可预测的转移进行调度根据转移路

14、径的不同，最佳调度策略也不同VLIW指令的编码Schemes to reduce effect of unused fieldsCompressed format in memory, expand on I-cache refillused in Multiflow Traceintroduces instruction addressing challengeMark parallel groupsused in TMS320C6x DSPs, Intel IA-64Provide a single-op VLIW instruction Cydra-5 UniOp instruction

15、sGroup 1Group 2Group 3Rotating Register FilesProblems: Scheduled loops require lots of registers, Lots of duplicated code in prolog, epilogSolution: Allocate new set of registers for each loop iterationld r1, ()add r2, r1, #1st r2, ()ld r1, ()add r2, r1, #1st r2, ()ld r1, ()add r2, r1, #1st r2, ()ld

16、 r1, ()add r2, r1, #1st r2, ()ld r1, ()add r2, r1, #1st r2, ()ld r1, ()add r2, r1, #1st r2, ()PrologEpilogLoopRotating Register FileP0P1P2P3P4P5P6P7RRB=3+R1Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical regi

17、ster number. Usually, split into rotating and non-rotating registers.dec RRBbloopdec RRBdec RRBdec RRBld r1, ()add r3, r2, #1st r4, ()ld r1, ()add r3, r2, #1st r4, ()ld r1, ()add r2, r1, #1st r4, ()PrologEpilogLoopLoop closing branch decrements RRBRotating Register File(Previous Loop Example)bloopsd

18、 f9, ()fadd f5, f4, .ld f1, ()Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3)Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 f5 = 4)bloopsd P17, ()fadd P13, P12,ld P9, ()RRB=8bloopsd P16, ()fadd P12, P11,ld P8, ()RRB=

19、7bloopsd P15, ()fadd P11, P10,ld P7, ()RRB=6bloopsd P14, ()fadd P10, P9,ld P6, ()RRB=5bloopsd P13, ()fadd P9, P8,ld P5, ()RRB=4bloopsd P12, ()fadd P8, P7,ld P4, ()RRB=3bloopsd P11, ()fadd P7, P6,ld P3, ()RRB=2bloopsd P10, ()fadd P6, P5,ld P2, ()RRB=1Cydra-5:存储延迟寄存器（Memory Latency Register：MLR)问题: 装入

20、操作可能出现时延变化解决方案：让软件选择所需时延编译对代码进行调度以满足最大装入-使用（load-use）距离软件将MLR设置为机器代码调度的延迟硬件保证装入操作在MLR要求的周期将所需数值返回给处理器流水线硬件将缓存提前返回的装入数值如果装入没有按时返回数值，硬件将暂停处理器Intel EPIC IA-64与传统CISC和RISC相比，EPIC是一种新型的体系结构Explicitly Parallel Instruction ComputingIA-64是Intel最新采用的一种ISAIA-64 = Intel Architecture 64-bit一种目标代码兼容的VLIWItanium

21、 (最早称为Merced)是采用IA-64的第一种处理器1995年预计第一种产品在1997问世 (实际上为2001)McKinley是2002年推出的第二类产品IA-64 的指令格式Template bits describe grouping of these instructions with others in adjacent bundlesEach group contains instructions that can execute in parallelInstruction 2Instruction 1Instruction 0Template128-bit instruct

22、ion bundlegroup igroup i+1group i+2group i-1bundle jbundle j+1bundle j+2bundle j-1IA-64中的寄存器128个通用64位定点寄存器128个通用64/80位浮点寄存器64个1位预测寄存器GPR可以旋转（rotate）以减小软件流水化循环的代码大小IA-64 Predicated ExecutionProblem: Mispredicted branches limit ILPSolution: Eliminate hard to predict branches with predicated executionA

23、lmost all IA-64 instructions can be executed conditionally under predicateInstruction es NOP if predicate register falseInst 1Inst 2br a=b, b2Inst 3Inst 4br b3Inst 5Inst 6Inst 7Inst 8b0:b1:b2:b3:ifelsethenFour basic blocksInst 1Inst 2p1,p2 50% branches removedPredicate Software Pipeline StagesSingle

24、 VLIW InstructionSoftware pipeline stages turned on by rotating predicate registers Much denser encoding of loops(p1) bloop(p3) st r4(p2) add r3(p1) ld r1Dynamic Execution(p1) ld r1(p1) ld r1(p1) ld r1(p1) ld r1(p1) ld r1(p2) add r3(p2) add r3(p2) add r3(p2) add r3(p2) add r3(p3) st r4(p3) st r4(p3)

25、 st r4(p3) st r4(p3) st r4(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloop(p1) bloopIA-64 Speculative ExecutionProblem: Branches restrict compiler code motionInst 1Inst 2br a=b, b2Load r1Use r1Inst 3Cant move load above branch because might cause spurious exceptionLoad.s r1Inst 1Inst 2br

26、 a=b, b2Chk.s r1Use r1Inst 3Speculative load never causes exception, but sets “poison” bit on destination registerCheck for exception in original home block jumps to fixup code if exception detectedParticularly useful for scheduling long latency loads earlySolution: Speculative operations that dont cause exceptionsIA-64

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

教程讲稿高体vliw

文档简介

温馨提示

最新文档

评论

教程讲稿高体vliw

文档简介

温馨提示

最新文档

评论

相关文档