Advanced Computer Architecture 10: Memory Structure (English)

Lecture 10: Memory Hierarchy: Reducing Hit Time, Main Memory, & Examples

Spring 2010, Super Computing Lab.

Review: Reducing Misses
  3 Cs: Compulsory, Capacity, Conflict Misses
  Reducing Miss Rate
    1. Reduce Misses via Larger Block Size
    2. Reduce Misses via Higher Associativity
    3. Reducing Misses via Victim Cache
    4. Reducing Misses via Pseudo-Associativity
    5. Reducing Misses by HW Prefetching Instr, Data
    6. Reducing Misses by SW Prefetching Data
    7. Reducing Misses by Compiler Optimizations
  Remember danger of concentrating on just one parameter when evaluating performance

Reducing Miss Penalty Summary
  Five techniques
    Read priority over write on miss
    Subblock placement
    Early Restart and Critical Word First on miss
    Non-blocking Caches (Hit under Miss, Miss under Miss)
    Second Level Cache
  Can be applied recursively to Multilevel Caches
    Danger is that time to DRAM will grow with multiple levels in between
    First attempts at L2 caches can make things worse, since increased worst case is worse

Review: Improving Cache Performance
  1. Reduce the miss rate,
  2. Reduce the miss penalty, or
  3. Reduce the time to hit in the cache (hit time: read tag + compare)
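The three options in the review map onto the three terms of the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. The slide does not state the formula, so the following is a minimal sketch with illustrative numbers that are assumptions, not figures from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Illustrative values only (assumed, in clock cycles).
baseline = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0)
lower_miss_rate = amat(hit_time=1.0, miss_rate=0.03, miss_penalty=100.0)
lower_miss_penalty = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=60.0)
lower_hit_time = amat(hit_time=0.8, miss_rate=0.05, miss_penalty=100.0)

for name, value in [
    ("baseline", baseline),
    ("1. reduce miss rate", lower_miss_rate),
    ("2. reduce miss penalty", lower_miss_penalty),
    ("3. reduce hit time", lower_hit_time),
]:
    print(f"{name}: {value:.2f} cycles")
```

Each variant changes one parameter, which is the point of the slide's warning: a technique that improves one term can worsen another, so the full sum is what should be compared.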

1. Fast Hit Times via Small and Simple Caches
  Why does the Alpha 21164 have 8KB Instruction and 8KB data cache + 96KB second level cache?
    Small data cache (faster) and clock rate (on-chip)
  Direct Mapped, on chip
    Advantage: overlap tag check & data transfer

1. Fast Hit Times via Small and Simple Caches
  Index tag memory and then compare takes time
  Small cache can help hit time since smaller memory takes less time to index
    E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
    Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip
  Simple direct mapping
    Can overlap tag check with data transmission since no choice
  Access time estimate for 90 nm using CACTI model 4.0
    Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches

2. Fast Hits by Avoiding Address Translation
  Send virtual address to cache: called Virtually Addressed Cache or just Virtual Cache vs. Physical Cache
    Every time process is switched logically must flush the cache; otherwise get false hits
      Cost is time to flush + “compulsory” misses from empty cache
    Dealing with aliases (sometimes called synonyms): two different virtual addresses map to same physical address
    I/O must interact with cache, so need virtual address
  Solution to aliases
    HW guarantees that every cache block has unique physical address
    SW guarantee: lower n bits must have same address; as long as covers index field & direct mapped, they must be unique; called page coloring
  Solution to cache flush
    Add process identifier tag that identifies process as well as address within process: cannot get a hit if wrong process

Virtually Addressed Caches
  Conventional Organization: CPU -(VA)-> TB -(PA)-> $ -(PA)-> MEM
  Virtually Addressed Cache: CPU -(VA)-> $ -(VA)-> TB -(PA)-> MEM
    Translate only on miss
    Synonym Problem
  Overlap $ access with VA translation: requires $ index to remain invariant across translation
  [Figure labels for the third organization: CPU, $, TB, L2 $, MEM; VA, PA; PA Tags, VA Tags]

2'. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
  If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag
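The trick in 2' works only when the index and block-offset bits lie inside the page offset, since those bits are identical in the virtual and physical address. That bounds the cache at page size x associativity. A minimal sketch of the constraint, assuming 4 KB pages and 32-byte blocks; the function names and the example addresses are hypothetical.

```python
def max_physically_indexed_bytes(page_bytes, associativity):
    """Largest cache whose index + block offset fit inside the page offset."""
    return page_bytes * associativity


def index_is_untranslated(cache_bytes, associativity, page_bytes):
    """True if every index/block-offset bit lies within the page offset."""
    bytes_per_way = cache_bytes // associativity
    return bytes_per_way <= page_bytes


def split_address(addr, cache_bytes, associativity, block_bytes):
    """Return (tag, index, block_offset); all sizes are assumed powers of two."""
    sets = cache_bytes // (associativity * block_bytes)
    block_offset = addr % block_bytes
    index = (addr // block_bytes) % sets
    tag = addr // (block_bytes * sets)
    return tag, index, block_offset


PAGE = 4096   # assumed page size (12-bit page offset)
BLOCK = 32    # assumed block size

# A virtual and a physical address that share the same 12-bit page offset
# but differ in their page numbers (made-up values).
va, pa = 0x12345678, 0xABCDE678

for ways in (1, 2):
    ok = index_is_untranslated(8192, ways, PAGE)
    va_index = split_address(va, 8192, ways, BLOCK)[1]
    pa_index = split_address(pa, 8192, ways, BLOCK)[1]
    print(f"8 KB, {ways}-way: index untranslated={ok}, "
          f"VA index={va_index}, PA index={pa_index}")
```

With these assumed sizes, an 8 KB direct-mapped cache needs one index bit from the page number, so the virtual and physical index can differ; making the same 8 KB cache 2-way set associative pulls the index back inside the page offset.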

  Limits cache to page size: what if want bigger caches and use same trick?
    Higher associativity moves barrier to right
    Page coloring
  [Address diagram, bits 31 to 0: Page Address (31-12) | Page Offset (11-0); Address Tag | Index | Block Offset]

3. Fast Hit Times via Pipelined Writes
  Pipeline Tag Check and Update Cache as separate stages; current write tag check & previous write cache update
  Only STORES in the pipeline; empty during a miss

    Store r2, (r1)    Check r1
    Add               --
    Sub               --
    Store r4, (r3)    M[r1] <- r2 & check r3

  In shade is “Delayed Write Buffer”; must be checked on reads; either complete write or read from buffer
  [Figure: CPU, write buffer (in, out), DRAM (or lower mem)]
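The pipelined-write scheme can be sketched as a toy model: a store only checks its tag and parks its data in a one-entry delayed write buffer, the data array is updated when the next store arrives, and every read checks the buffer first. This is a simplified illustration under stated assumptions: a dictionary stands in for the cache data array, tag checks always hit, and the class and method names are hypothetical.

```python
class PipelinedWriteCache:
    """Toy model: a store's cache update is delayed until the next store."""

    def __init__(self):
        self.data = {}        # stands in for the cache data array
        self.pending = None   # delayed write buffer: (address, value) or None

    def store(self, address, value):
        # Second stage of the previous store: write its buffered data.
        if self.pending is not None:
            old_address, old_value = self.pending
            self.data[old_address] = old_value
        # First stage of the current store: tag check (assumed hit), then buffer.
        self.pending = (address, value)

    def load(self, address):
        # Reads must check the delayed write buffer before the data array.
        if self.pending is not None and self.pending[0] == address:
            return self.pending[1]
        return self.data.get(address)


cache = PipelinedWriteCache()
cache.store(0x100, 7)              # Store r2,(r1): check r1 only
in_array_after_first = dict(cache.data)
read_back = cache.load(0x100)      # served from the delayed write buffer
cache.store(0x200, 9)              # Store r4,(r3): M[r1] <- r2 & check r3
in_array_after_second = dict(cache.data)
print(in_array_after_first, read_back, in_array_after_second)
```

The model takes the "read from buffer" option for loads that match the pending store; the slide's other option, completing the write before the read proceeds, would update `data` inside `load` instead.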

4. Fast Writes on Misses via Small Subblocks
  If most writes are 1 word, subblock size is 1 word, & write through, then always write subblock & tag immediately
    Tag match and valid bit already set: Writing the block was proper, & nothing lost by setting valid bit on again.
    Tag match and valid bit not set: The tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
    Tag mismatch: This is a miss and will modify the data portion of the block. Since write-through cache, no harm was done; memory still has an up-to-date copy of the old value. Only the tag to the address of the write and the valid bits of the other subblocks need be changed.
  Doesn't work with write back due to last case

5. Fast Hit Times via Trace Cache (Pentium 4 only; and last time?)
  Find more instruction level parallelism?

  How avoid translation from x86 to micro-ops?
  Trace cache in Pentium 4
    Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
      Built-in branch predictor
    Cache the micro-ops vs. x86 instructions
      Decode/translate from x86 to micro-ops on trace cache miss
  + better utilize long blocks (don't exit in middle of block, don't enter at label in middle of block)
  - complicated address mapping since addresses no longer aligned to power-of-2 multiples of word size
  - instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

6. Increasing Cache Bandwidth by Pipelining
  Pipeline cache access to maintain bandwidth, but higher latency
  Instruction cache access pipeline stages:
    1: Pentium
    2: Pentium Pro through Pentium III
    4: Pentium 4
  => greater penalty on mispredicted branches
  => more clock cycles between the issue of the load and the use of the data

7. Increasing Cache Bandwidth via Multiple Banks
  Rather than treat the cache as a single monolithic block, divide into independent banks that can support simultaneous accesses
    E.g., T1 (“Niagara”) L2 has 4 banks
  Banking works best when accesses naturally spread themselves across banks => mapping of addresses to banks affects behavior of memory system
  Simple mapping that works well is “sequential interleaving”
    Spread block addresses sequentially across banks
    E.g., if there are 4 banks, Bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …

Cache Optimization Summary
  (MR = miss rate, MP = miss penalty, HT = hit time)

  Technique                             MR   MP   HT   Complexity
  Miss rate
    Larger Block Size                   +    -         0
    Higher Associativity                +         -    1
    Victim Caches                       +              2
    Pseudo-Associative Caches           +              2
    HW Prefetching of Instr/Data        +              2
    Compiler Controlled Prefetching     +              3
    Compiler Reduce Misses              +              0
  Miss penalty
    Priority to Read Misses                  +         1
    Subblock Placement                       +    +    1
    Early Restart & Critical Word 1st        +         2
    Non-Blocking Caches                      +         3
    Second Level Caches                      +         2
  Hit time
    Small & Simple Caches               -         +    0
    Avoiding Address Translation                  +    2
    Pipelining Writes                             +    1

Main Memory Background
  Performance of Main Memory:
    Latency: Cache Miss Penalty
      Access Time: time between request and word arrives
      Cycle Time: min time between requests to memory
    Bandwidth: I/O & Large Block Miss Penalty (L2)
  Main Memory is DRAM: Dynamic Random Access Memory
    Dynamic since needs to be refreshed periodically (8 ms, 1% time)
    Addresses divided into 2 halves (Memory as a 2D matrix):
      RAS or Row Access Strobe
      CAS or Column Access Strobe
  Cache uses SRAM: Static Random Access Memory
    No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X)
    Address not divided: Full address
  Size: DRAM/SRAM: 4-8,

  Cost/Cycle time: SRAM/DRAM: 8-16

DRAM Logical Organization (4 Mbit)
  Square root of bits per RAS/CAS
  [Figure: address lines A0…A10 (11 bits); Memory Array (2,048 x 2,048) with Word Line and Storage Cell; Column Decoder; Sense Amps & I/O; data pins D and Q]

DRAM Physical Organization (4 Mbit)
  [Figure: Block 0 … Block 3, each with a Block Row Dec. 9:512; Row; Column Address; I/O blocks; Address; D and Q; 8 I/Os]

4 Key DRAM Timing Parameters
  tRAC: minimum time from RAS line falling to the valid data output.
    Quoted as the speed of a DRAM when buy
    A typical 4 Mb DRAM tRAC = 60 ns
    Speed of DRAM since on purchase sheet?
  tRC: minimum time from the start of one row access to the start of the next.
    tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
  tCAC: minimum time from CAS line falling to valid data output.
    15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
  tPC: minimum time from the start of one column access to the start of the next.
    35 ns for a 4 Mbit DRAM with a tRAC of 60 ns

DRAM Read Timing
  [Figure: 256K x 8 DRAM with pins A (9), D (8), WE_L, CAS_L, RAS_L, OE_L; waveforms for RAS_L, CAS_L, A (Row Address, Col Address, Junk), WE_L, OE_L, and D (High Z, Junk, Data Out), marking DRAM Read Cycle Time, Read Access Time, and Output Enable Delay]
  Every DRAM access begins at: the assertion of the RAS_L
  2 ways to read: early or late v. CAS
    Early Read Cycle: OE_L asserted before CAS_L
    Late Read Cycle: OE_L asserted after CAS_L

DRAM Performance
  A 60 ns (tRAC) DRAM can
    perform a row access only every 110 ns (tRC)
    perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).
      In practice, external address delays and turning around buses make it 40 to 50 ns
  These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead!

DRAM History
  DRAMs: capacity +60%/yr, cost -30%/yr
    2.5X cells/area, 1.5X die size in 3 years
  '98 DRAM fab line costs $2B
    DRAM only: density, leakage v. speed
  Rely on increasing no. of computers & memory per computer (60% market)
    SIMM or DIMM is replaceable unit => computers use any generation DRAM
  Commodity, second source industry => high volume, low profit, conservative
    Little organization innovation in 20 years
  Order of importance: 1) Cost/bit 2) Capacity
    First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM

                  Mitsubishi       Samsung
  Blocks          512 x 2 Mbit     1024 x 1 Mbit
  Clock           200 MHz          250 MHz
  Data Pins       64               16
  Die Size        24 x 24 mm       31 x 21 mm
    Sizes will be much smaller in production
  Metal Layers    3                4
  Technology      0.15 micron      0.16 micron

Fast Memory Systems: DRAM specific
  Multiple CAS accesses: several names (page mode)
    Extended Data Out (EDO): 30% faster in page mode
  New DRAMs to address gap; what will they cost, will they survive?
    RAMBUS: startup company; reinvent DRAM interface
      Each Chip a module vs. slice of memory
      Short bus between CPU and chips
      Does own refresh
      Variable amount of data returned
      1 byte / 2 ns (500 MB/s per chip)
      20% increase in DRAM area
    Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
    Intel claims RAMBUS Direct (16 b wide) is future PC memory?
      Possibly not true! Intel to drop RAMBUS?
  Niche memory or main memory?
    e.g., Video RAM for frame buffers, DRAM + fast serial output

Main Memory Performance
  Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
  Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
  Interleaved: CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

Interleaving
  Access Pattern without Interleaving:
    [Figure: CPU and Memory; Start Access for D1, D1 available, Start Access for D2]
  Access Pattern with 4-way Interleaving:
    [Figure: CPU and Memory Banks 0 to 3; Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3]

Main Memory Performance
  Timing model (word size is 32 bits)
    1 to send address, 6 access time, 1 to send data
    Cache Block is 4 words
  Simple M.P. = 4 x (1 + 6 + 1) = 32
  Wide M.P. = 1 + 6 + 1 = 8
  Interleaved M.P. = 1 + 6 + 4 x 1 = 11

Independent Memory Banks
  Memory banks for independent accesses vs. faster sequential accesses
    Multiprocessor
    I/O
    CPU with Hit under n Misses, Non-blocking Cache
  Superbank: all memory active on one block transfer (or Bank)
  Bank: portion within a superbank that is word interleaved (or Subbank)
  [Figure labels: Superbank, Bank]

Independent Memory Banks
  How many banks?
    number banks >= number clocks to access word in bank
    For sequential accesses, otherwise will return to original bank before it has next word ready (like in vector case)
  Increasing DRAM => fewer chips => harder to have banks

DRAMs per PC over Time
  (rows: Minimum Memory Size; columns: DRAM Generation; entries: number of DRAM chips)

              '86     '89     '92     '96     '99     '02
              1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  4 MB        32      8
  8 MB                16      4
  16 MB                       8       2
  32 MB                               4       1
  64 MB                               8       2
  128 MB                                      4       1
  256 MB                                      8       2

DRAM Latency >> BW
  More App Bandwidth => Cache misses => DRAM RAS/CAS
  Application BW => Lower DRAM Latency
  RAMBUS, Synch DRAM increase BW but higher latency
  EDO DRAM < 5% in PC
  [Figure: Proc with I$ and D$, L2 $, Bus, four DRAMs]

Potential: DRAM Crossroads?
  After 20 years of 4X every 3 years, running into wall? (64Mb - 1 Gb)
  How can keep $1B fab lines full if buy fewer DRAMs per computer?
  Cost/bit -0%/yr if stop 4X/3 yr?
  What will happen to $40B/yr DRAM industry?

Main Memory Summary
  Wider Memory
  Interleaved Memory: for sequential or independent accesses
  Avoiding bank conflicts: SW & HW
  DRAM specific optimizations: page mode & Specialty DRAM

Cache Cross Cutting Issues
  Superscalar CPU & Number Cache Ports must match: number memory accesses/cycle?
  Speculative Execution and non-faulting option on memory/TLB
  Parallel Execution vs. Cache locality
    Want far separation to find independent operations vs. want reuse of data accesses to avoid misses
  I/O and consistency of data between cache and memory
    Caches => multiple copies of data
    Consistency by HW or by SW?
    Where connect I/O to computer?

Alpha 21064
  Separate Instr & Data TLB & Caches
  TLBs fully associative
  TLB updates in SW (“Priv Arch Libr”)
  Caches 8KB direct mapped, write thru
  Critical 8 bytes first
  Prefetch instr. st
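Returning to the Main Memory Performance timing model (1 cycle to send the address, 6 cycles of access time, 1 cycle to send a word, 4-word cache block): the three miss penalties can be reproduced with a short sketch. The cycle counts are the slide's; the function and parameter names are illustrative assumptions.

```python
def simple_miss_penalty(words, send_addr, access, send_data):
    """One-word-wide memory: each word pays address + access + transfer."""
    return words * (send_addr + access + send_data)


def wide_miss_penalty(send_addr, access, send_data):
    """Memory and bus as wide as the block: one access fetches every word."""
    return send_addr + access + send_data


def interleaved_miss_penalty(words, send_addr, access, send_data):
    """One bank per word: bank accesses overlap, words return one per cycle slot."""
    return send_addr + access + words * send_data


def enough_banks(banks, access_clocks):
    """Slide's rule of thumb: number of banks >= clocks to access a word in a bank."""
    return banks >= access_clocks


WORDS, SEND_ADDR, ACCESS, SEND_DATA = 4, 1, 6, 1

simple = simple_miss_penalty(WORDS, SEND_ADDR, ACCESS, SEND_DATA)
wide = wide_miss_penalty(SEND_ADDR, ACCESS, SEND_DATA)
interleaved = interleaved_miss_penalty(WORDS, SEND_ADDR, ACCESS, SEND_DATA)
print(f"simple={simple}, wide={wide}, interleaved={interleaved}")
print("4 banks enough for back-to-back sequential words:",
      enough_banks(4, ACCESS))
```

Under this model, interleaving recovers most of the benefit of a wide memory while keeping a one-word bus. The last line applies the "How many banks?" rule to the same numbers: with a 6-clock access, 4 banks fall short of the rule for a long sequential stream, even though a single 4-word block transfer is unaffected.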
