Teaching Slides, Chapter 4: Instruction-Level Parallelism, Software Approaches

Uploaded: 2022-11-26. Format: PPT. Pages: 48. Size: 838.62 KB.


Computer Architecture: A Quantitative Approach
Chapter 4 (2): Instruction-Level Parallelism, Software Approaches

王奕 Estelle.ywang@

Lecture for ILP: Software Approaches

- Basic compiler techniques for exposing ILP: loop unrolling
- Static branch prediction
- Static multiple issue: VLIW
- Advanced compiler support for exposing and exploiting ILP
    - Software pipelining
    - Global code scheduling
- Hardware support for exposing more parallelism at compile time
    - Conditional (predicated) instructions
    - Compiler speculation with hardware support

FP Loop: Where Are the Hazards?

    Loop: LD    F0,0(R1)    ; F0 = vector element
          ADDD  F4,F0,F2    ; add scalar from F2
          SD    0(R1),F4    ; store result
          SUBI  R1,R1,#8    ; decrement pointer by 8 bytes (one double word)
          BNEZ  R1,Loop     ; branch if R1 != zero
          NOP               ; delayed branch slot

Assumed latencies of the FP operations:

    Instruction producing result   Instruction using result   Latency in cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0
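The latency table can be turned into a small stall counter. The sketch below is a simplified single-issue, in-order model that looks only at the data-hazard latencies in the table; it does not model branch-related cycles or the delay slot. The instruction classes, the `data_stalls` helper, and the choice to treat BNEZ as an integer op are assumptions made for illustration.

```python
# Stall cycles required between a producer and a dependent consumer,
# keyed by (producer class, consumer class). Values come from the table;
# pairs not listed there are assumed to need 0 stalls.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

# (mnemonic, class, destination register or None, source registers)
LOOP = [
    ("LD",   "load",   "F0", ["R1"]),
    ("ADDD", "fp_alu", "F4", ["F0", "F2"]),
    ("SD",   "store",  None, ["F4", "R1"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "int",    None, ["R1"]),  # assumption: branch treated as an integer op
]


def data_stalls(instrs):
    """Return the number of data-hazard stall cycles before each instruction."""
    issue = []        # issue cycle of each instruction
    last_writer = {}  # register -> index of the latest instruction writing it
    stalls = []
    for i, (_, cls, dest, srcs) in enumerate(instrs):
        earliest = issue[-1] + 1 if issue else 0
        ready = earliest
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                lat = LATENCY.get((instrs[p][1], cls), 0)
                ready = max(ready, issue[p] + 1 + lat)
        issue.append(ready)
        stalls.append(ready - earliest)
        if dest is not None:
            last_writer[dest] = i
    return stalls


print(data_stalls(LOOP))
```

Under this model ADDD waits one cycle for the load and SD waits two cycles for the add; reordering the list is an easy way to see how a different schedule changes the stall pattern.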

Where are the stalls?

Reducing Stalls by Scheduling Within the Basic Block and Using the Delayed Branch

Unscheduled loop, 10 clock cycles per iteration (F = fetch, D = decode, X = execute, M = memory, W = write-back, A1 to A4 = FP add stages, s = stall):

    Loop: LD    F0,0(R1)     F D X M W
          ADDD  F4,F0,F2     F D s A1 A2 A3 A4 W
          SD    0(R1),F4     F s D s s X M W
          SUBI  R1,R1,#8     F s s D X M W
          BNEZ  R1,Loop      F s D X M W

Scheduled loop, 6 clock cycles per iteration:

    Loop: LD    F0,0(R1)     F D X M W
          SUBI  R1,R1,#8     F D X M W
          ADDD  F4,F0,F2     F D A1 A2 A3 A4 W
          BNEZ  R1,Loop      F D X M W
          SD    +8(R1),F4    F D s X M W

The SD now follows the branch, and its offset becomes +8 because SUBI has already decremented R1.

Unroll Loop Four Times (straightforward way)

Rewrite loop to minimize stalls?

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4       ; drop SUBI & BNEZ
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    -8(R1),F8      ; drop SUBI & BNEZ
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    -16(R1),F12    ; drop SUBI & BNEZ
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SUBI  R1,R1,#32      ; alter to 4*8
          SD    +8(R1),F16
          BNEZ  R1,LOOP
          NOP

Each LD is followed by a 1-cycle stall and each ADDD by a 2-cycle stall.
15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration.
Assumes the number of iterations is a multiple of 4.

Unrolled Loop That Minimizes Stalls

What assumptions were made when the code was moved?
- OK to move the store past SUBI even though SUBI changes the register.
- OK to move loads before stores: do they still get the right data?
- When is it safe for the compiler to make such changes?

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    0(R1),F4
          SD    -8(R1),F8
          SUBI  R1,R1,#32
          SD    +16(R1),F12    ; 16 - 32 = -16
          BNEZ  R1,LOOP
          SD    8(R1),F16      ; 8 - 32 = -24
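One way to convince yourself that the reordering above is safe is to simulate both versions on the same memory image and compare results. The sketch below is only a behavioral model, not a timing model: memory is a dictionary keyed by byte address, and the helper names `make_memory`, `run_original`, and `run_unrolled_scheduled` are hypothetical. It assumes the element count is a multiple of 4 and that the loop ends when R1 reaches zero.

```python
def make_memory(n):
    """Hypothetical memory image: n doubles at byte addresses 8, 16, ..., 8*n."""
    return {8 * k: float(k) for k in range(1, n + 1)}


def run_original(mem, r1, s):
    """One element per pass: M[R1] += s, then R1 -= 8, until R1 == 0."""
    while True:
        mem[r1] = mem[r1] + s        # LD, ADDD, SD 0(R1)
        r1 -= 8                      # SUBI R1,R1,#8
        if r1 == 0:                  # BNEZ falls through when R1 == 0
            return mem


def run_unrolled_scheduled(mem, r1, s):
    """Four elements per pass, in the scheduled order with adjusted offsets."""
    while True:
        f0, f6 = mem[r1], mem[r1 - 8]             # loads hoisted above all stores
        f10, f14 = mem[r1 - 16], mem[r1 - 24]
        f4, f8, f12, f16 = f0 + s, f6 + s, f10 + s, f14 + s
        mem[r1] = f4                              # SD 0(R1),F4
        mem[r1 - 8] = f8                          # SD -8(R1),F8
        r1 -= 32                                  # SUBI R1,R1,#32
        mem[r1 + 16] = f12                        # SD +16(R1): 16 - 32 = -16
        taken = r1 != 0                           # BNEZ R1,LOOP
        mem[r1 + 8] = f16                         # delay slot: 8 - 32 = -24
        if not taken:
            return mem


n = 8
expected = run_original(make_memory(n), 8 * n, 2.5)
actual = run_unrolled_scheduled(make_memory(n), 8 * n, 2.5)
print(expected == actual)
```

The two stores placed after SUBI only land on the right elements because their offsets were increased by 32; dropping that adjustment in the model makes the comparison fail.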

The scheduled unrolled loop takes 14 clock cycles, or 3.5 per iteration.

Using Loop Unrolling and Scheduling with Static Multiple Issue

    Integer instruction          FP instruction        Clock cycle
    Loop: L.D    F0,0(R1)                              1
          L.D    F6,-8(R1)                             2
          L.D    F10,-16(R1)     ADD.D F4,F0,F2        3
          L.D    F14,-24(R1)     ADD.D F8,F6,F2        4
          L.D    F18,-32(R1)     ADD.D F12,F10,F2      5
          S.D    F4,0(R1)        ADD.D F16,F14,F2      6
          S.D    F8,-8(R1)       ADD.D F20,F18,F2      7
          S.D    F12,-16(R1)                           8
          DADDUI R1,R1,#-40                            9
          S.D    F16,16(R1)                            10
          BNE    R1,R2,Loop                            11
          S.D    F20,8(R1)                             12

Static Branch Prediction

Static branch predictors are used in processors where branch behavior is expected to be highly predictable at compile time.

Several different methods:
- Always predict a branch as taken, or always as not taken.
- Predict on the basis of branch direction: backward-going branches are predicted taken, forward-going branches are predicted not taken.
- Profile-based prediction: predict from profile information collected on earlier runs.
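The direction-based scheme can be sketched in a few lines. This is a minimal illustration, assuming a branch is "backward" when its target address is lower than its own address; the addresses and the outcome trace below are made up for the example.

```python
def predict_taken(branch_pc, target_pc):
    """Backward-going branches are predicted taken, forward-going ones not taken."""
    return target_pc < branch_pc


def accuracy(branch_pc, target_pc, outcomes):
    """Fraction of actual outcomes (True = taken) that match the static prediction."""
    prediction = predict_taken(branch_pc, target_pc)
    hits = sum(1 for taken in outcomes if taken == prediction)
    return hits / len(outcomes)


# Hypothetical loop-closing branch at 0x20 jumping back to 0x10:
# taken on nine passes, then falls through once at loop exit.
loop_outcomes = [True] * 9 + [False]
print(accuracy(0x20, 0x10, loop_outcomes))
```

A loop-closing branch is mispredicted only on exit under this rule, which is why backward branches are a good fit for the "taken" guess.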

Static Multiple Issue: VLIW

VLIW: Very Long Instruction Word

- Each "instruction" has explicit coding for multiple operations.
    - In EPIC, the grouping is called a "packet".
    - In Transmeta, the grouping is called a "molecule" (with "atoms" as ops).
- Trade off instruction space for simple decoding.
    - The long instruction word has room for many operations.
    - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
    - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch.
    - 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide.
- Need a compiling technique that schedules across several branches.
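To make the slot layout concrete, here is a small sketch of a seven-slot instruction word matching the example mix above (2 integer, 2 FP, 2 memory, 1 branch). The slot names and the `pack_word` and `word_width` helpers are hypothetical; real VLIW encodings differ in detail.

```python
# One field per functional unit, in a fixed order.
SLOT_KINDS = ("int", "int", "fp", "fp", "mem", "mem", "branch")


def word_width(bits_per_field):
    """Total width of one long instruction word, in bits."""
    return bits_per_field * len(SLOT_KINDS)


def pack_word(ops):
    """Place each (kind, text) op in a free slot of its kind; unused slots become NOPs."""
    word = [None] * len(SLOT_KINDS)
    for kind, text in ops:
        for idx, slot_kind in enumerate(SLOT_KINDS):
            if slot_kind == kind and word[idx] is None:
                word[idx] = text
                break
        else:
            raise ValueError(f"no free {kind} slot for {text}")
    return [op if op is not None else "NOP" for op in word]


print(word_width(16), word_width(24))
print(pack_word([("mem", "LD F0,0(R1)"),
                 ("mem", "LD F6,-8(R1)"),
                 ("fp", "ADDD F12,F10,F2")]))
```

The NOP padding in the packed word is the "instruction space" being traded away: every slot the compiler cannot fill still occupies bits in the encoding.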

Loop Unrolling in VLIW

    Memory reference 1   Memory reference 2   FP operation 1    FP operation 2    Int. op / branch   Clock
    LD F0,0(R1)          LD F6,-8(R1)                                                                1
    LD F10,-16(R1)       LD F14,-24(R1)                                                              2
    LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2     ADDD F8,F6,F2                        3
    LD F26,-48(R1)                            ADDD F12,F10,F2   ADDD F16,F14,F2                      4
                                              ADDD F20,F18,F2   ADDD F24,F22,F2                      5
    SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                        6
    SD -16(R1),F12       SD -24(R1),F16                                                              7
    SD -32(R1),F20       SD -40(R1),F24                                           SUBI R1,R1,#56     8
    SD 8(R1),F28                                                                  BNEZ R1,LOOP       9
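The slot utilization of this schedule can be tallied directly from the table. The sketch below only counts filled slots; the `SCHEDULE` encoding and the `utilization` helper are illustrative names, not part of the slides.

```python
_ = None  # empty slot

# One row per clock; columns: mem ref 1, mem ref 2, FP op 1, FP op 2, int op / branch.
SCHEDULE = [
    ("LD", "LD", _,      _,      _),
    ("LD", "LD", _,      _,      _),
    ("LD", "LD", "ADDD", "ADDD", _),
    ("LD", _,    "ADDD", "ADDD", _),
    (_,    _,    "ADDD", "ADDD", _),
    ("SD", "SD", "ADDD", _,      _),
    ("SD", "SD", _,      _,      _),
    ("SD", "SD", _,      _,      "SUBI"),
    ("SD", _,    _,      _,      "BNEZ"),
]


def utilization(schedule):
    """Return (operations, operations per clock, fraction of slots filled)."""
    ops = sum(1 for row in schedule for op in row if op is not None)
    clocks = len(schedule)
    slots = clocks * len(schedule[0])
    return ops, ops / clocks, ops / slots


ops, per_clock, efficiency = utilization(SCHEDULE)
print(ops, round(per_clock, 2), round(efficiency, 2))
```

The tally is 23 operations over 9 clocks, a little over 2.5 operations per clock with about half of the 45 slots filled, and 9 clocks for 7 elements is roughly 1.3 clocks per iteration.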

Unrolled 7 times to avoid delays.
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
Average: 2.5 ops per clock, 50% efficiency.
Note: VLIW needs more registers (15 vs. 6 in the superscalar version).

Problems for VLIW

Technical problems:
- Increase in code size, from loop unrolling and from unused function slots.
- Limitations of lockstep operation: a stall in any functional unit may cause the entire processor to stall.

Logistical problem:
- Binary code compatibility.

Major challenge for all multiple-issue processors: exploiting large amounts of ILP.

Advanced Compiler Support for Exploiting ILP

- Detecting and enhancing loop-level parallelism
- Eliminating dependent computations
- Software pipelining: symbolic loop unrolling
- Global code scheduling
    - Trace scheduling: focus on the critical path
    - Superblocks

Detecting and Enhancing Loop-Level Parallelism

Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations.

A loop is parallel if it can be written without a cycle in the dependences.

An assumption: all array indices are affine. A one-dimensional array index is affine if it can be written in the form a*i + b.

A dependence exists if two conditions hold:
- There are two iteration indices, j and k, both within the limits of the loop.
- The loop stores into E[a*j + b] and later fetches from the same element E[c*k + d], that is, a*j + b = c*k + d.

GCD (greatest common divisor) test: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b).

Eliminating Dependent Computations

Back-to-back immediate adds are merged:

    Before:  DADDUI R1,R2,#4
             DADDUI R1,R1,#4
    After:   DADDUI R1,R2,#8

A chain of adds computing R8 = R2 + R3 + R6 + R7 is rebalanced by swapping the positions of R1 and R7:

    Before:  ADD R1,R2,R3
             ADD R4,R1,R6
             ADD R8,R4,R7
    After:   ADD R1,R2,R3
             ADD R4,R6,R7
             ADD R8,R1,R4

A recurrence is unrolled and reassociated:

    Before:    SUM = SUM + X
    Unrolled:  SUM = SUM + X1 + X2 + X3 + X4 + X5
    After:     SUM = ((SUM + X1) + (X2 + X3)) + (X4 + X5)

Software Pipelining

Observation: if iterations from loops are independent, then more ILP can be obtained by taking instructions from different iterations.
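Whether iterations are independent is exactly what the GCD test stated earlier probes. A minimal sketch, assuming affine indices a*j + b for the store and c*k + d for the load; the function name and the two example index pairs are hypothetical. The test is conservative: a "may depend" answer does not prove a dependence, because loop bounds are ignored.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """GCD test for a store to E[a*j + b] and a load from E[c*k + d].

    Returns False when the test rules out a loop-carried dependence,
    True when a dependence is still possible.
    """
    g = gcd(a, c)
    if g == 0:
        # Both indices are constants: they clash only if they are equal.
        return b == d
    return (d - b) % g == 0


# Store to X[2*i + 3], load from X[2*i]: GCD(2, 2) = 2 does not divide -3.
print(gcd_test_may_depend(2, 3, 2, 0))
# Store to X[i + 1], load from X[i]: GCD(1, 1) = 1 divides -1.
print(gcd_test_may_depend(1, 1, 1, 0))
```

In the first case the stores hit odd elements and the loads hit even ones, so no iteration can read what another wrote; in the second, each iteration reads the element written by the previous one.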

Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW). The rebuilt loop supplies a continuous stream of instructions to a multiple-issue processor.

Software Pipelining Example

Before: unrolled 3 times

        LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
        LD    F6,-8(R1)
        ADDD  F8,F6,F2
        SD    -8(R1),F8
        LD    F10,-16(R1)
        ADDD  F12,F10,F2
        SD    -16(R1),F12
        SUBI  R1,R1,#24
        BNEZ  R1,LOOP

After: software pipelined

        SD    0(R1),F4      ; stores M[i]
        ADDD  F4,F0,F2      ; adds to M[i-1]
        LD    F0,-16(R1)    ; loads M[i-2]
        SUBI  R1,R1,#8
        BNEZ  R1,LOOP

Symbolic loop unrolling:
- Maximize result-use distance.
- Less code space than unrolling.
- Fill and drain the pipe only once per loop.
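The software-pipelined body can be mirrored at source level. The sketch below adds a scalar to every element, walking from the highest index down as the assembly does; the steady-state loop stores for iteration i, adds for i-1, and loads for i-2. The start-up and wind-down code around the loop is not shown on the slide and is an assumption here, as are the function name and the requirement of at least three elements.

```python
def add_scalar_pipelined(x, s):
    """In place x[i] += s, with store/add/load taken from three different iterations."""
    n = len(x)
    if n < 3:
        raise ValueError("this sketch needs at least 3 elements")

    # Start-up (fill): run the early stages of the first two iterations.
    f0 = x[n - 1]          # load for the first iteration
    f4 = f0 + s            # add for the first iteration
    f0 = x[n - 2]          # load for the second iteration

    # Steady state: one store, one add, one load per pass.
    for i in range(n - 1, 1, -1):
        x[i] = f4          # SD:   store result of iteration i
        f4 = f0 + s        # ADDD: add for iteration i-1
        f0 = x[i - 2]      # LD:   load for iteration i-2

    # Wind-down (drain): finish the last two iterations.
    x[1] = f4
    f4 = f0 + s
    x[0] = f4
    return x


print(add_scalar_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

Inside the steady-state loop no statement consumes a value produced in the same pass, which is the source-level counterpart of maximizing result-use distance.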

In loop unrolling, by contrast, the pipe fills and drains once per unrolled iteration.
5 cycles per iteration.

[Figure: overlapped ops versus time, for the software-pipelined loop and for the unrolled loop]

Trace Scheduling (specific to VLIW)

Parallelism across IF branches vs. LOOP branches.

Two steps:

Trace selection
- Find a likely sequence of basic blocks (the trace) that forms a long run of straight-line code, chosen by static prediction or by profile.
- Starting from the predicted behavior of each branch, the more probable direction is used to extend the basic block; the instructions that follow in that direction form the predicted path.

Trace compaction
- Squeeze the trace into a few VLIW instructions.
- Bookkeeping code is needed in case the prediction is wrong.

This is a form of compiler-generated speculation.
- The compiler must generate "fixup" code to handle cases in which the trace is not the path taken.
- Needs extra registers: a bad guess is undone by discarding.

Example of Trace Scheduling

[Figure: original code and the code after trace scheduling]

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

HW advantages:
- HW is better at memory disambiguation, since it knows the actual addresses.
- HW is better at branch prediction, since it has lower overhead.
- HW maintains a precise exception model.
- HW does not execute bookkeeping (compensation) instructions.
- The same software works across multiple implementations.
- Smaller code size (not as many no-ops filling blank instruction slots).

SW advantages:
- The window of instructions examined for parallelism is much larger.
- Much less hardware is involved in VLIW (unless you are Intel...!).
- More involved types of speculation can be done more easily.
- Speculation can be based on large-scale program behavior, not just local information.

Superscalar v. VLIW

Superscalar:
- Smaller code size.
- Binary compatibility across generations of hardware.

VLIW:
- Simplified hardware for decoding and issuing instructions.
- No interlock hardware (compiler checks?).
- More registers, but simplified hardware for register ports (multiple independent register files?).

Hardware Support for Exploiting ILP at Compile Time

Conditional (predicated) instructions

A conditional instruction refers to a condition that is evaluated as part of the instruction's execution.

Example: if (A == 0) { S = T; }

With a branch:

        BNEZ  R1,L          ; skip the move when A != 0
        ADDU  R2,R3,R0      ; S = T
    L:  ...

With a conditional move:

        CMOV  R2,R3,R1      ; R2 = R3 only if R1 == 0

The CPU always executes the instruction but writes the result only if the condition is met.
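The conditional move above can be mimicked in software with a branch-free select, which shows how the condition becomes an ordinary data input. This is only an analogy for integer values; the function names are hypothetical, and a real CMOV is a single hardware instruction rather than a mask computation.

```python
def assign_with_branch(a, s, t):
    """if (A == 0) { S = T; } written with control flow."""
    if a == 0:
        s = t
    return s


def assign_with_select(a, s, t):
    """Same effect without a branch: the condition gates the write as data."""
    mask = -int(a == 0)              # -1 (all ones) when a == 0, else 0
    return (t & mask) | (s & ~mask)  # picks t under the mask, s otherwise


for a in (0, 1, -3):
    assert assign_with_branch(a, 7, 42) == assign_with_select(a, 7, 42)
print(assign_with_select(0, 7, 42), assign_with_select(5, 7, 42))
```

In the select version the result depends on `a` the same way it depends on `s` and `t`, so a scheduler sees only data dependences and no branch to predict.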

A conditional instruction converts a control dependence into a data dependence.

Conditional Instructions

- With full predication, the execution of every instruction is controlled by a predicate; when the predicate is false, the instruction becomes a no-op.
- Simply convert small blocks of code that are branch dependent.
- Eliminate non-loop branches.
- Can be used to speculatively move an instruction that is time critical.

Compiler Speculation with Hardware Support

Move speculated instructions not only before the branch, but before the condition evaluation.

Four methods for supporting ambitious speculation:
1. Hardware and OS cooperatively ignore exceptions for speculative instructions.
2. Only speculative instructions that never raise exceptions are used.
3. Poison bits are attached to the result registers written by speculative instructions.
4. A mechanism is provided to indicate that an instruction is speculative; the hardware buffers the result until the instruction is no longer speculative.
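The poison-bit method can be illustrated with a toy register file. The behavior modeled here is an assumption based on the usual description of the scheme: a speculative load that would fault sets the poison bit of its destination instead of raising the exception, and the fault surfaces only when a non-speculative instruction reads that register. The class and method names are hypothetical.

```python
class PoisonFault(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""


class RegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = {}

    def speculative_load(self, dest, memory, address):
        """Load that never faults: a bad address just poisons the destination."""
        if address in memory:
            self.value[dest] = memory[address]
            self.poison[dest] = False
        else:
            self.value[dest] = 0
            self.poison[dest] = True   # exception deferred, not raised

    def use(self, src):
        """Non-speculative read: a poisoned source finally raises the fault."""
        if self.poison.get(src, False):
            raise PoisonFault(f"deferred exception surfaced on {src}")
        return self.value[src]


memory = {100: 7}                      # made-up memory contents
rf = RegisterFile()
rf.speculative_load("R4", memory, 100)
rf.speculative_load("R5", memory, 999)  # would fault, but only sets the poison bit
print(rf.use("R4"))
```

If the speculation turns out to be wrong, R5 is simply never read and the deferred fault disappears, which is what lets the compiler hoist the load above the branch.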

