Exploiting Thread-Level Parallelism in General Purpose Applications

Pen-Chung Yew 游本中
Department of Computer Science and Engineering, University of Minnesota
http:///Agassiz
2023/1/11, PCYew-Taiwan

Three cobblers surpass one Zhuge Liang (三個臭皮匠勝過一個諸葛亮)
Three Zhuge Liangs cannot beat one cobbler (三個諸葛亮勝不過一個臭皮匠)

Impact of Hardware Technology on Computer Architectures
  • Performance improvement of microprocessors so far has been driven primarily by higher clock rates: smaller feature sizes (Moore's conjecture), higher power density, higher cooling cost
  • Results? Intel cancelled two Pentium projects recently
  • VLSI technology allows more than 1 billion transistors on a single chip => plenty of gates, what to do with them?
  • Superscalar is hard to scale beyond ~10 instructions per clock cycle: inherent ILP (Instruction-Level Parallelism) limitation in application programs, long wire delays, high power density
  • Memory wall is getting higher between CPU and storage devices
  • Improving single program performance is still very important

Parallel Processing Comes to the Rescue - Finally?
  • Parallel processing has been proposed to salvage clock rate limitation => for the past thirty years!!!
  • Finally? => multiple cores in Intel's roadmap as well as in most embedded processors today
  • What is new here?

      - Use thread-level parallelism (TLP) to improve instruction-level parallelism (ILP) for general-purpose applications

ILP vs. TLP
[Figure: two time diagrams of operations op1 to op24 over time steps t1 to t6; labels: Superscalar, ILP, TLP, and threads Th1 to Th4.]

Parallel Processing Comes to the Rescue - Finally?
  • Is there enough TLP in general-purpose application programs (to improve ILP)? => much harder than scientific applications (floating-point-intensive)

TLP Challenges in General-Purpose Applications
  • Mostly do-while loops
      - Need thread-level control speculation
  • Parallelism exists mostly in outer loops
      - Not good for VLIW (i.e. software pipelining), or vector processing => need thread-level support
  • Pointers complicate alias and data dependence analysis
      - Need runtime disambiguation and data speculation
  • Many small loops and doacross loops
      - Need fast and low overhead communication
  • Small basic blocks - need to exploit both ILP and TLP
  • Need new approaches to apply parallel processing to such applications!!

Outline
  • Multi-threaded architectures
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Multi-Threaded Architectures

  • To improve single-program speedup
      - Multiscalar
      - Superthreaded processors
      - Trace processor
      - Multiprocessor on a chip
  • To improve resource utilization, throughput
      - Simultaneous Multithreading (SMT)
  • To hide memory latency
      - Tera computer, Hyperthreading
  • To support system/application functionality
  • Reference: Speculative Execution in High Performance Computer Architectures, edited by Kaeli and Yew, CRC Press, 2005

Superthreaded Architectures
  • Exploit thread-level parallelism to enhance ILP
  • Multiple vs. single instruction windows (not for scalability as in traditional parallel processing)
  • Control speculation (not stopped by branch instructions)
  • Data speculation (not stopped by data dependences between threads)
  • Fast communication => small task granularity
  • High cache hit rates, automatic data prefetching
  • Need new hardware and compiler/software support
  • Reference: The Superthreaded Processor Architecture, Tsai et al., IEEE Trans. on Computers, Sept. 1999
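
The control and data speculation listed above can be pictured with a small software simulation. The sketch below is only an illustration under stated assumptions, not the Superthreaded hardware protocol: each loop iteration runs as a speculative "thread" with a private write buffer, iterations commit in program order, and an iteration that read an address written by an earlier iteration of the same batch is squashed and re-executed. All names (`SpecThread`, `run_speculative`, `example_body`) are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set

Memory = Dict[str, int]


class SpecThread:
    """One speculative loop iteration with a private write buffer."""

    def __init__(self, memory: Memory) -> None:
        self.memory = memory
        self.write_buffer: Memory = {}
        self.read_set: Set[str] = set()

    def load(self, addr: str) -> int:
        if addr in self.write_buffer:       # forward from this thread's own store
            return self.write_buffer[addr]
        self.read_set.add(addr)             # remember what was read speculatively
        return self.memory[addr]

    def store(self, addr: str, value: int) -> None:
        self.write_buffer[addr] = value     # buffered; invisible until commit


Body = Callable[[SpecThread, int], None]


def run_sequential(body: Body, iterations: Iterable[int], memory: Memory) -> None:
    """Reference execution: every iteration commits before the next one starts."""
    for i in iterations:
        t = SpecThread(memory)
        body(t, i)
        memory.update(t.write_buffer)


def run_speculative(body: Body, iterations: Iterable[int], memory: Memory,
                    width: int = 4) -> int:
    """Run `width` iterations at a time speculatively; return the squash count."""
    todo = list(iterations)
    squashes = 0
    for start in range(0, len(todo), width):
        batch = todo[start:start + width]
        # All threads of a batch execute against the same committed state.
        threads = []
        for i in batch:
            t = SpecThread(memory)
            body(t, i)
            threads.append(t)
        # Commit in program order, checking for dependence violations.
        written: Set[str] = set()
        for i, t in zip(batch, threads):
            if t.read_set & written:        # read a value an earlier thread changed
                squashes += 1
                t = SpecThread(memory)      # recover: re-execute on committed state
                body(t, i)
            memory.update(t.write_buffer)
            written.update(t.write_buffer)
    return squashes


def example_body(t: SpecThread, i: int) -> None:
    """a[i] += 1, with one infrequent cross-iteration dependence at i == 2."""
    x = t.load(f"a{i}")
    if i == 2:
        x += t.load("a1")
    t.store(f"a{i}", x + 1)
```

Running `example_body` over eight iterations with `width=4` squashes only iteration 2, and the final memory matches `run_sequential`. The point of the sketch is that infrequent dependences cost one re-execution instead of serializing every iteration.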

[Figure: superthreaded processor block diagram with an Instruction Cache, a Data Cache, and four thread processing units, each containing an Execution Unit, a Comm. Unit, a Memory Buffer, and a Writeback Unit.]

Speculation: Breaking Program Dependency
  • Control and data dependences limit program performance
  • However,
      - Most branches have good predictability
      - Most data dependences happen infrequently at runtime
  • Speculation is an effective approach to break dependences
      - Optimize program execution by ignoring infrequent data dependences, or taking predicted paths
      - Check possible violation (mis-speculation) at runtime
      - Recover if violation occurs

Types of Speculation
  • Control speculation
      - Speculate on program control flow path
  • Data speculation
      - Speculate on how likely memory references are to the same memory location (address)
  • Value speculation
      - Speculate on the result value of an operation

Outline
  • Multi-threaded architectures
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Speculation on Intel IA64
  • Both control and data speculation are supported on Intel IA64
      - Special instructions and hardware are provided
  • Memory load operation is targeted for speculation
      - Memory delay is usually the bottleneck of performance
      - Memory load is usually the start of speculative operations

Speculating on Data Dependence
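
The optimize, check, recover pattern for data speculation can be sketched in software. On IA-64 the hardware side is an advanced load (`ld.a`) tracked in the Advanced Load Address Table (ALAT) and verified later by a check instruction. The Python class below is only a toy stand-in for that idea; `AdvancedLoadTable`, `speculative_sum`, and the register name `"y"` are hypothetical and do not model the real ALAT organization.

```python
from typing import Dict, Tuple


class AdvancedLoadTable:
    """Toy table that tracks speculatively hoisted loads (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}   # register name -> address loaded

    def advanced_load(self, reg: str, memory: Dict[int, int], addr: int) -> int:
        self._entries[reg] = addr            # remember the speculative load
        return memory[addr]

    def store(self, memory: Dict[int, int], addr: int, value: int) -> None:
        memory[addr] = value
        # A store to a tracked address invalidates the matching entries.
        stale = [r for r, a in self._entries.items() if a == addr]
        for reg in stale:
            del self._entries[reg]

    def check(self, reg: str) -> bool:
        """True if the speculative load for `reg` is still valid."""
        return reg in self._entries


def speculative_sum(memory: Dict[int, int], p: int, q: int, b: int) -> Tuple[int, bool]:
    """Compute x = *q; *p = b; y = *q; return x + y.

    The second load is hoisted above the store on the assumption that
    p and q rarely alias; the flag reports whether recovery code ran.
    """
    table = AdvancedLoadTable()
    x = memory[q]
    y = table.advanced_load("y", memory, q)  # speculative: moved above the store
    table.store(memory, p, b)
    recovered = False
    if not table.check("y"):                 # mis-speculation detected
        y = memory[q]                        # recovery: redo the load
        recovered = True
    return x + y, recovered
```

When `p` and `q` differ, the hoisted load is kept and no recovery runs. When they are equal, the check fails and the load is repeated, so the result still matches the original program order.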

  • More speculative optimizations

        I1: … = *q
        I2: *p = b
        I3: … = *q
        I4: *r = …
        I5: … = *p
        I6: *r = …

  • Speculate on this dependence
  • Redundancy elimination opportunity

Speculate on Data Dependences
  • More speculative optimizations

        I1: … = *q
        I2: *p = b
        I3: … = *q
        I4: *r = …
        I5: … = *p
        I6: *r = …

  • Speculate on this dependence
  • Copy propagation opportunity

Speculate on Data Dependences
  • More speculative optimizations

        I1: … = *q
        I2: *p = b
        I3: … = *q
        I4: *r = …
        I5: … = *p
        I6: *r = …

  • Speculate on this dependence
  • Dead store elimination opportunity

Observations
  • Speculative optimization opportunities exist in many applications (originally, it was only for memory latency hiding during code scheduling)
  • A general compiler framework is needed to support both control and data speculation in optimizations
  • Need to generate recovery code for mis-speculation
  • Need extensive support for data dependence, alias, and value profiling (no longer conservative analysis)
  • Reference: A Compiler Framework for Speculative Analysis and Optimizations, ACM/SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2003; also in ACM Trans. on Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271

A Compiler Framework: Intel Open Research Compiler (ORC)

Performance Improvement of Speculative Register Promotion
  • Based on alias profile and compared with -O3 with type-based alias analysis on the Intel ORC compiler

Value Speculation

  • Value locality: likelihood of a previously-seen value recurring within a storage location
  • Observed in any storage locations
      - Registers
      - Cache memory
      - Main memory
  • Most work focuses on values stored in registers to break potential data dependences: register value locality

Performance of Value Predictors
  • Predictability of Data Values, Sazeides and Smith, Micro-30, 1997
  • Last value prediction varies from 23% to 61%, average about 40%
  • Stride prediction varies from 38% to 80%, average about 56%
  • FCM (finite context method) with an order of 3 varies from 56% to over 90%, with an average of about 78%
      - Improvement diminishes as order increases
  • Less sensitive to different types of instructions

Outline
  • Introduction
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Compiler Optimizations for Speculative Threads
  • Without compiler optimization, there is limited TLP even under perfect hardware support. [Oplinger PACT99]
  • Compilers have to decide
      - Which loops/regions to transform into threads
      - Whether to use synchronization or speculation
      - How to schedule the code to improve overlaps
      - What transformations to use
      - When/how to generate recovery code
  • Profile-based analysis could be very efficient

Loop Selection
[Figure: program speedup.]
  • Carefully selected loops can improve performance significantly!

Speculative Code Motion
[Figure: threads containing stores "*p =" and loads "= *p", shown before code motion and after code motion; labels: stall, critical path, other computation.]

Outline
  • Introduction
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analysis
  • Conclusion

Crucial Considerations in Dependence Profiling
  • Program coverage => need compiler's support or use heuristic rules
  • Input sensitivity
  • Profiling overhead (space and time)
  • Using alias and data dependence profiles is inherently speculative => need hardware support for correct execution

Alias Profiling vs. Static Analysis

  • Most possible data dependences reported by the compiler do not occur at runtime
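
The gap between static analysis and runtime behavior can be measured with a simple alias profile. The sketch below assumes a hypothetical trace format of `(reference_id, address)` pairs and a hypothetical set of may-alias pairs reported by a conservative compiler analysis. It is not how ORC instruments programs, only a minimal illustration of the comparison.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def profile_aliases(trace: Iterable[Tuple[str, int]]) -> Set[Pair]:
    """Return the pairs of memory references that touched a common address."""
    refs_by_addr: Dict[int, Set[str]] = defaultdict(set)
    for ref, addr in trace:
        refs_by_addr[addr].add(ref)
    observed: Set[Pair] = set()
    for refs in refs_by_addr.values():
        # Every two references seen at the same address aliased at runtime.
        for a, b in combinations(sorted(refs), 2):
            observed.add((a, b))
    return observed


def unobserved_fraction(static_pairs: Iterable[Pair], observed: Set[Pair]) -> float:
    """Fraction of statically reported may-alias pairs never seen at runtime."""
    normalized = {(min(a, b), max(a, b)) for a, b in static_pairs}
    if not normalized:
        return 0.0
    return len(normalized - observed) / len(normalized)
```

For a trace in which only `p` and `r` ever touch the same address, a static result listing all three pairs among `p`, `q`, and `r` leaves two thirds of the pairs unobserved. Such pairs are candidates for speculation, subject to the input-sensitivity caveat noted earlier, since a different input may exercise them.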

Data Dependence Profiling
  • Data dependence edges among memory references and function calls
  • Detailed information
      - type: flow, anti, output, or input
      - probability: frequency of occurrence
  • When loops are targeted
      - dependence distance: limited

Overhead of Profiling
[Figure: bar chart of profiling slowdown (X times slower) for bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, and average; series: alias, DD without distance, DD for innermost loops, DD 4-level loops; bar labels 96, 110, 102, 121, 120.]
  • Compiler: ORC version 2.0
  • Machine: Itanium2, 900 MHz and 2G memory
  • Benchmarks: SPEC CPU2000 Int
  • Instrumentation optimization has been done

Techniques to Reduce Profiling Overhead
  • Reduce the space requirement by hash table
      - Larger granularity of address
      - Smaller iteration counter
  • Sampling
      - Sample the snapshots of procedures or loops instead of individual references
      - Use instrumentation-based sampling framework
      - Switch at procedures or loops

Conclusions
  • Microprocessors have caught up with supercomputers in '90 and have gone multi-core
  • It is non-trivial to apply current supercomputing technologies to general-purpose applications
  • New architectural support such as thread-level speculative execution, and new compiler techniques such as speculative optimizations using alias and data dependence profiling, even dynamic optimization at runtime, are crucial - as always
  • A very exciting and new era for parallel processing might have arrived (especially in embedded systems) - finally!

References
(1) J. Lin et al., A Compiler Framework for Speculative Analysis and Optimizations, Proc. of ACM/SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2003; also in ACM Trans. on Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271
(2) J. Lin et al., Recovery Code Generation for General Speculative Optimizations, to appear in ACM Trans. on Architecture and Code Optimization (TACO), 2005
(3) J. Lin et al., Speculative Register Promotion Using Advanced Load Address Table (ALAT), Proc. of IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), March 2003
(4) T. Chen et al., Data Dependence Profiling for Speculative Optimizations, Proc. of Int'l Conf. on Compiler Construction (CC), March 2004
(5) T. Chen et al., An Empirical Study on the Granularity of Pointer Analysis in C Programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), August 2002
(6) J. Y. Tsai et al., The Superthreaded Processor Architecture, IEEE Trans. on Computers, special issue on Multithreaded Architecture, Vol. 48, No. 9, Sept. 1999

Control Speculation
  • ld.s: move the load operation across the barri
