




Lecture 10: Memory Hierarchy: Reducing Hit Time, Main Memory, & Examples
Spring 2010
Super Computing Lab.

Review: Reducing Misses
- 3 Cs: Compulsory, Capacity, Conflict Misses
- Reducing Miss Rate
  1. Reduce Misses via Larger Block Size
  2. Reduce Misses via Higher Associativity
  3. Reducing Misses via Victim Cache
  4. Reducing Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Misses by Compiler Optimizations
- Remember danger of concentrating on just one parameter when evaluating performance

Reducing Miss Penalty Summary
- Five techniques
  - Read priority over write on miss
  - Subblock placement
  - Early Restart and Critical Word First on miss
  - Non-blocking Caches (Hit under Miss, Miss under Miss)
  - Second Level Cache
- Can be applied recursively to Multilevel Caches
  - Danger is that time to DRAM will grow with multiple levels in between
  - First attempts at L2 caches can make things worse, since increased worst case is worse

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
   - hit time: read tag + compare
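The three levers in the review (miss rate, miss penalty, hit time) are the three terms of the usual average memory access time relation, AMAT = hit time + miss rate x miss penalty. The slide does not spell the formula out, so the sketch below assumes that standard form. The function names and all numbers are illustrative, not taken from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time (e.g. cycles)."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_penalty):
    """Apply the formula recursively: the L1 miss penalty is the AMAT of the L2."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, mem_penalty)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


# Illustrative numbers only (assumptions, not lecture data).
baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)
lower_miss_rate = amat(hit_time=1, miss_rate=0.03, miss_penalty=40)
lower_miss_penalty = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
lower_hit_time = amat(hit_time=0.5, miss_rate=0.05, miss_penalty=40)
with_l2 = amat_two_level(1, 0.05, 10, 0.5, 100)

for name, value in [
    ("baseline", baseline),
    ("lower miss rate", lower_miss_rate),
    ("lower miss penalty", lower_miss_penalty),
    ("lower hit time", lower_hit_time),
    ("with L2", with_l2),
]:
    print(f"{name}: {value:.2f} cycles")
```

Each variant changes one term while holding the others fixed, which is the "one parameter at a time" view the review warns against relying on exclusively.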
1. Fast Hit Times via Small and Simple Caches
- Why Alpha 21164 has 8 KB Instruction and 8 KB data cache + 96 KB second level cache?
  - Small data cache (faster) and clock rate (on-chip)
- Direct Mapped, on chip
  - Advantage: overlap tag check & data transfer
- Index tag memory and then compare takes time
- Small cache can help hit time since smaller memory takes less time to index
  - E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  - Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip
- Simple: direct mapping
  - Can overlap tag check with data transmission since no choice
- Access time estimate for 90 nm using CACTI model 4.0
  - Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches

2. Fast Hits by Avoiding Address Translation
- Send virtual address to cache: called Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
  - Every time process is switched, logically must flush the cache; otherwise get false hits
    - Cost is time to flush + "compulsory" misses from empty cache
  - Dealing with aliases (sometimes called synonyms): two different virtual addresses map to same physical address
  - I/O must interact with cache, so need virtual address
- Solution to aliases
  - HW guarantees that every cache block has unique physical address
  - SW guarantee: lower n bits must have same address; as long as covers index field & direct mapped, they must be unique; called page coloring
- Solution to cache flush
  - Add process identifier tag that identifies process as well as address within process: cannot get a hit if wrong process

Virtually Addressed Caches
[Figure: three organizations of CPU, translation buffer (TB), cache ($), and memory (MEM).
 Conventional Organization: CPU, TB, $, MEM (VA, PA, PA).
 Virtually Addressed Cache: CPU, $, TB, MEM (VA, VA, PA); translate only on miss; synonym problem.
 Overlap $ access with VA translation: CPU, $ with PA tags, TB, L2 $ with VA tags, MEM; requires $ index to remain invariant across translation.]

2'. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
- If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag
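Whether the index really is a "physical part of the address" comes down to a bit count: the index and block-offset bits together must fit inside the page offset, which translation leaves unchanged. A minimal sketch of that check follows. The 4 KB page (12-bit page offset) and 32-byte block are assumptions for illustration, and the helper names are hypothetical.

```python
def log2_exact(n):
    """Return log2(n) for a power of two; reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def index_fits_in_page_offset(cache_bytes, block_bytes, ways, page_bytes):
    """True if index + block-offset bits lie entirely within the page offset."""
    sets = cache_bytes // (block_bytes * ways)
    index_bits = log2_exact(sets)
    block_offset_bits = log2_exact(block_bytes)
    page_offset_bits = log2_exact(page_bytes)
    return index_bits + block_offset_bits <= page_offset_bits


PAGE = 4096   # assumed 4 KB page, i.e. a 12-bit page offset
BLOCK = 32    # assumed 32-byte cache block

dm_4k = index_fits_in_page_offset(4 * 1024, BLOCK, ways=1, page_bytes=PAGE)
dm_8k = index_fits_in_page_offset(8 * 1024, BLOCK, ways=1, page_bytes=PAGE)
two_way_8k = index_fits_in_page_offset(8 * 1024, BLOCK, ways=2, page_bytes=PAGE)

print("4 KB direct mapped:", dm_4k)
print("8 KB direct mapped:", dm_8k)
print("8 KB 2-way:", two_way_8k)
```

Under these assumptions a direct-mapped cache can be at most one page in size, and doubling the associativity doubles the size that still passes the check.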
- Limits cache to page size: what if want bigger caches and use same trick?
  - Higher associativity moves barrier to right
  - Page coloring
[Figure: address fields. Page Address (bits 31-12) | Page Offset (bits 11-0); Address Tag | Index | Block Offset]

- Pipeline Tag Check and Update Cache as separate stages; current write tag check & previous write cache update
- Only STORES in the pipeline; empty during a miss

Store r2, (r1)    Check r1
Add               --
Sub               --
Store r4, (r3)    M[r1] <- r2 & check r3
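The store sequence in the table above can be traced with a toy model in which each store only checks its tag and parks its data, and the cache array is updated when the next store arrives. This is a sketch under simplifying assumptions: every access hits, addresses are plain integers, and the class and method names are hypothetical.

```python
class PipelinedWriteCache:
    """Toy model of pipelined writes with a one-entry delayed write buffer."""

    def __init__(self):
        self.data = {}        # cache array: address -> value
        self.pending = None   # delayed write buffer: (address, value) or None

    def store(self, addr, value):
        # The previous store's cache update overlaps this store's tag check.
        if self.pending is not None:
            prev_addr, prev_value = self.pending
            self.data[prev_addr] = prev_value
        self.pending = (addr, value)

    def load(self, addr):
        # Reads must check the delayed write buffer before the cache array.
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]
        return self.data.get(addr)


cache = PipelinedWriteCache()
r1, r3 = 0x100, 0x200            # illustrative addresses held in r1 and r3
cache.store(r1, "r2-value")      # Store r2, (r1): check r1 only
in_array_after_first = r1 in cache.data
cache.store(r3, "r4-value")      # Store r4, (r3): M[r1] <- r2 & check r3
in_array_after_second = r1 in cache.data
read_r3 = cache.load(r3)         # served from the delayed write buffer

print(in_array_after_first, in_array_after_second, read_r3)
```

The non-store instructions (Add, Sub) do not appear in the trace because only stores occupy this pipeline.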
- Shaded on the slide is the "Delayed Write Buffer"; it must be checked on reads: either complete the write or read from the buffer

3. Fast Hit Times via Pipelined Writes
[Figure: write buffer between the CPU (in/out) and DRAM (or lower mem)]

4. Fast Writes on Misses via Small Subblocks
- If most writes are 1 word, subblock size is 1 word, & write through, then always write subblock & tag immediately
  - Tag match and valid bit already set: Writing the block was proper, & nothing lost by setting valid bit on again.
  - Tag match and valid bit not set: The tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
  - Tag mismatch: This is a miss and will modify the data portion of the block. Since write-through cache, no harm was done; memory still has an up-to-date copy of the old value. Only the tag to the address of the write and the valid bits of the other subblocks need be changed.
- Doesn't work with write back due to last case

5. Fast Hit Times via Trace Cache (Pentium 4 only; and last time?)
- Find more instruction level parallelism?
- How avoid translation from x86 to micro-ops?
- Trace cache in Pentium 4
  - Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
    - Built-in branch predictor
  - Cache the micro-ops vs. x86 instructions
    - Decode/translate from x86 to micro-ops on trace cache miss
- (+) better utilize long blocks (don't exit in middle of block, don't enter at label in middle of block)
- (-) complicated address mapping since addresses no longer aligned to power-of-2 multiples of word size
- (-) instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

6: Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, but higher latency
- Instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- Greater penalty on mispredicted branches
- More clock cycles between the issue of the load and the use of the data

7: Increasing Cache Bandwidth via Multiple Banks
- Rather than treat the cache as a single monolithic block, divide into independent banks that can support simultaneous accesses
  - E.g., T1 ("Niagara") L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks; mapping of addresses to banks affects behavior of memory system
- Simple mapping that works well is "sequential interleaving"
  - Spread block addresses sequentially across banks
  - E.g., if there are 4 banks, Bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …

Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time)

Technique                           MR   MP   HT   Complexity
miss rate
  Larger Block Size                 +    -         0
  Higher Associativity              +         -    1
  Victim Caches                     +              2
  Pseudo-Associative Caches         +              2
  HW Prefetching of Instr/Data      +              2
  Compiler Controlled Prefetching   +              3
  Compiler Reduce Misses            +              0
miss penalty
  Priority to Read Misses                +         1
  Subblock Placement                     +    +    1
  Early Restart & Critical Word 1st      +         2
  Non-Blocking Caches                    +         3
  Second Level Caches                    +         2
hit time
  Small & Simple Caches             -         +    0
  Avoiding Address Translation                +    2
  Pipelining Writes                           +    1

Main Memory Background
- Performance of Main Memory:
  - Latency: Cache Miss Penalty
    - Access Time: time between request and word arrives
    - Cycle Time: min time between requests to memory
  - Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory
  - Dynamic since needs to be refreshed periodically (8 ms, 1% time)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X)
  - Address not divided: full address
- Size: DRAM/SRAM: 4-8,
  Cost/Cycle time: SRAM/DRAM: 8-16

DRAM Logical Organization (4 Mbit)
- Square root of bits per RAS/CAS
[Figure: Memory Array (2,048 x 2,048) with Column Decoder, Sense Amps & I/O, 11 address lines A0…A10, data pins D and Q, Word Line, Storage Cell]

DRAM Physical Organization (4 Mbit)
[Figure: Block 0 … Block 3, each with a Block Row Dec. 9:512; Row Block; Column Address; I/O blocks; D, Q, Address; 2; 8 I/Os, 8 I/Os]

4 Key DRAM Timing Parameters
- tRAC: minimum time from RAS line falling to the valid data output.
  - Quoted as the speed of a DRAM when buy
  - A typical 4 Mb DRAM tRAC = 60 ns
  - Speed of DRAM since on purchase sheet?
- tRC: minimum time from the start of one row access to the start of the next.
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next.
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns

DRAM Read Timing
- Every DRAM access begins at the assertion of RAS_L
- 2 ways to read: early or late v. CAS
  - Early Read Cycle: OE_L asserted before CAS_L
  - Late Read Cycle: OE_L asserted after CAS_L
[Figure: 256K x 8 DRAM with 9 address lines (A), 8 data lines (D), and control inputs RAS_L, CAS_L, WE_L, OE_L. Timing diagram: A carries Row Address, Col Address, Junk; D goes High Z, Junk, Data Out, High Z; annotated with Read Access Time, Output Enable Delay, DRAM Read Cycle Time]

DRAM Performance
- A 60 ns (tRAC) DRAM can
  - perform a row access only every 110 ns (tRC)
  - perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead!

DRAM History
- DRAMs: capacity +60%/yr, cost -30%/yr
  - 2.5X cells/area, 1.5X die size in 3 years
- '98 DRAM fab line costs $2B
  - DRAM only: density, leakage v. speed
- Rely on increasing no. of computers & memory per computer (60% market)
  - SIMM or DIMM is replaceable unit => computers use any generation DRAM
- Commodity, second source industry => high volume, low profit, conservative
  - Little organization innovation in 20 years
- Order of importance: 1) Cost/bit 2) Capacity
  - First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM
                Mitsubishi      Samsung
Blocks          512 x 2 Mbit    1024 x 1 Mbit
Clock           200 MHz         250 MHz
Data Pins       64              16
Die Size        24 x 24 mm      31 x 21 mm
  (Sizes will be much smaller in production)
Metal Layers    3               4
Technology      0.15 micron     0.16 micron

Fast Memory Systems: DRAM specific
- Multiple CAS accesses: several names (page mode)
  - Extended Data Out (EDO): 30% faster in page mode
- New DRAMs to address gap;
  what will they cost, will they survive?
  - RAMBUS: startup company; reinvent DRAM interface
    - Each Chip a module vs. slice of memory
    - Short bus between CPU and chips
    - Does own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
    - 20% increase in DRAM area
  - Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
  - Intel claims RAMBUS Direct (16 b wide) is future PC memory?
    - Possibly not true! Intel to drop RAMBUS?
- Niche memory or main memory?
  - e.g., Video RAM for frame buffers, DRAM + fast serial output

Main Memory Performance
- Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
- Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
- Interleaved: CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

Interleaving
[Figure: Access Pattern without Interleaving: CPU and Memory; Start Access for D1, D1 available, Start Access for D2.
 Access Pattern with 4-way Interleaving: CPU and Memory Banks 0-3; Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3.]

Main Memory Performance
- Timing model (word size is 32 bits)
  - 1 to send address,
  - 6 access time, 1 to send data
  - Cache Block is 4 words
- Simple M.P. = 4 x (1 + 6 + 1) = 32
- Wide M.P. = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4 x 1 = 11

Independent Memory Banks
- Memory banks for independent accesses
  vs. faster sequential accesses
  - Multiprocessor
  - I/O
  - CPU with Hit under n Misses, Non-blocking Cache
- Superbank: all memory active on one block transfer (or Bank)
- Bank: portion within a superbank that is word interleaved (or Subbank)
[Figure: address split into Superbank and Bank fields]

Independent Memory Banks
- How many banks? number banks >= number clocks to access word in bank
  - For sequential accesses, otherwise will return to original bank before it has next word ready
  - (like in vector case)
- Increasing DRAM => fewer chips => harder to have banks

DRAMs per PC over Time (number of DRAM chips per PC)

Minimum         DRAM Generation
Memory Size     '86     '89     '92     '96     '99     '02
                1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
4 MB            32      8
8 MB                    16      4
16 MB                           8       2
32 MB                                   4       1
64 MB                                   8       2
128 MB                                          4       1
256 MB                                          8       2

DRAM Latency >> BW
- More App Bandwidth => Cache misses => DRAM RAS/CAS
- Application BW => Lower DRAM Latency
- RAMBUS, Synch DRAM increase BW but higher latency
- EDO DRAM < 5% in PC
[Figure: Proc with I$ and D$, L2 $, Bus, and four DRAMs]

Potential: DRAM Crossroads?
- After 20 years of 4X every 3 years, running into wall? (64 Mb - 1 Gb)
- How can keep $1B fab lines full if buy fewer DRAMs per computer?
- Cost/bit -0%/yr if stop 4X/3 yr?
- What will happen to $40B/yr DRAM industry?

Main Memory Summary
- Wider Memory
- Interleaved Memory: for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM specific optimizations: page mode & Specialty DRAM

Cache Cross Cutting Issues
- Superscalar CPU & Number Cache Ports must match: number memory accesses/cycle?
- Speculative Execution and non-faulting option on memory/TLB
- Parallel Execution vs. Cache locality
  - Want far separation to find independent operations vs. want reuse of data accesses to avoid misses
- I/O and consistency of data between cache and memory
  - Caches => multiple copies of data
  - Consistency by HW or by SW?
  - Where connect I/O to computer?

Alpha 21064
- Separate Instr & Data TLB & Caches
- TLBs fully associative
- TLB updates in SW ("Priv Arch Libr")
- Caches 8 KB direct mapped, write thru
- Critical 8 bytes first
- Prefetch instr. st
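
Returning to the Main Memory Performance timing model from earlier in the lecture (1 clock to send the address, 6 clocks access time, 1 clock to send a word, 4-word cache block), the three miss penalties and the two bank rules can be recomputed with a short sketch. The function names are hypothetical, and the wide organization is assumed to be as wide as the whole block.

```python
def simple_miss_penalty(words, addr, access, xfer):
    """One-word-wide memory: every word pays address + access + transfer."""
    return words * (addr + access + xfer)


def wide_miss_penalty(addr, access, xfer):
    """Memory as wide as the block (assumption): one access fetches it all."""
    return addr + access + xfer


def interleaved_miss_penalty(words, addr, access, xfer):
    """Banks overlap their access time; words return one transfer at a time."""
    return addr + access + words * xfer


def bank_of(block_address, num_banks):
    """Sequential interleaving: bank number is the address modulo the bank count."""
    return block_address % num_banks


def enough_banks(num_banks, access_clocks):
    """Rule from the slide: number of banks >= clocks to access a word in a bank."""
    return num_banks >= access_clocks


WORDS, ADDR, ACCESS, XFER = 4, 1, 6, 1   # timing model from the slide

simple = simple_miss_penalty(WORDS, ADDR, ACCESS, XFER)
wide = wide_miss_penalty(ADDR, ACCESS, XFER)
interleaved = interleaved_miss_penalty(WORDS, ADDR, ACCESS, XFER)
banks_for_first_eight = [bank_of(a, 4) for a in range(8)]

print(simple, wide, interleaved)
print(banks_for_first_eight)
print(enough_banks(4, ACCESS), enough_banks(8, ACCESS))
```

With a 6-clock access time, the bank-count rule asks for at least 6 banks to sustain back-to-back sequential accesses, so the 4-bank example helps a single 4-word block fill but would not keep up with a longer sequential stream.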