Parallel Processing and Architecture courseware: hitsz-lec01

Uploaded: 2023-02-06. Format: PPTX. Pages: 86. Size: 2.14 MB.


Chapter 1: Fundamentals of Computer Design

Original slides created by: David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley
/~pattrsn
/~cs252

Outline:
  Introduction
  Quantitative Principles of Computer Design
  Classes of Computers
  Computer Architecture
  Trends in Technology
  Power in Integrated Circuits
  Trends in Cost
  Dependability
  Performance
  Fallacies and Pitfalls

What is Computer Architecture?
  Functional operation of the individual HW units within a computer system, and the flow of information and control among them.
  [Figure labels: Technology; Programming Language; Interface; Interface Design (ISA); Measurement & Evaluation; Parallelism; Computer Architecture; Applications; OS; Hardware Organization]

Abstraction Layers in Modern Systems
  Application
  Algorithm
  Programming Language
  Operating System / Virtual Machine
  Instruction Set Architecture (ISA)
  Microarchitecture
  Gates / Register-Transfer Level (RTL)
  Circuits
  Devices
  Physics
  Original domain of the computer architect ('50s-'80s)
  Domain of recent computer architecture ('90s)
  Reliability, power, ...
  Parallel computing, security, ...
  Reinvigoration of computer architecture, mid-2000s onward.

Computer Systems: Technology Trends
  1988: Supercomputers, Massively Parallel Processors, Mini-supercomputers, Minicomputers, Workstations, PCs
  2002: Powerful PCs and SMP Workstations, Network of SMP Workstations, Mainframes, Supercomputers, Embedded Computers

Crossroads: Conventional Wisdom in Comp. Arch
  Old Conventional Wisdom: Power is free, transistors expensive
  New Conventional Wisdom: "Power wall": power expensive, xtors free (can put more on chip than can afford to turn on)
  Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (out-of-order, speculation, ...)
  New CW: "ILP wall": law of diminishing returns on more HW for ILP
  Old CW: Multiplies are slow, memory access is fast
  New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
  Old CW: Uniprocessor performance 2X / 1.5 yrs
  New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
    Uniprocessor performance now 2X / 5(?) yrs
    Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
    More, simpler processors are more power efficient

Crossroads: Uniprocessor Performance
  VAX: 25%/year, 1978 to 1986
  RISC + x86: 52%/year, 1986 to 2002
  RISC + x86: ??%/year, 2002 to present (less than 20%)
  From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006

Change in Chip Design
  Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
  RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
  125 mm² chip, 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
    RISC II shrinks to ~0.02 mm² at 65 nm
    Caches via DRAM or 1 transistor SRAM?
    Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun/Berkeley)
  Processor is the new transistor?

Taking Advantage of Parallelism
  Increasing throughput of server computer via multiple processors or multiple disks
  Detailed HW design
    Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
    Multiple memory banks searched in parallel in set-associative caches
  Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
    Not every instruction depends on immediate predecessor => executing instructions completely/partially in parallel possible
    Classic 5-stage pipeline:
      1) Instruction Fetch (Ifetch),
      2) Register Read (Reg),
      3) Execute (ALU),
      4) Data Memory Access (Dmem),
      5) Register Write (Reg)

Pipelined Instruction Execution
  [Figure: pipeline diagram with axes Instr. Order and Time (clock cycles), Cycle 1 to Cycle 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

Limits to pipelining

  Hazards prevent next instruction from executing during its designated clock cycle
    Structural hazards: attempt to use the same hardware to do two different things at once
    Data hazards: instruction depends on result of prior instruction still in the pipeline
    Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
  [Figure: pipeline diagram with axes Instr. Order and Time (clock cycles); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

The Principle of Locality
  The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
  Two Different Types of Locality:
    Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
  Last 30 years, HW relied on locality for memory perf.
  [Figure labels: P, $, MEM]

Levels of the Memory Hierarchy
  Level: Capacity; Access Time; Cost
  CPU Registers: 100s Bytes; 300-500 ps (0.3-0.5 ns)
  L1 and L2 Cache: 10s-100s KBytes; ~1 ns - ~10 ns; $1000s/GByte
  Main Memory: GBytes; 80 ns - 200 ns; ~$100/GByte
  Disk: 10s TBytes; 10 ms (10,000,000 ns); ~$1/GByte
  Tape: infinite; sec-min; ~$1/GByte

  Staging / Xfer Unit (upper level is faster, lower level is larger)
  Registers to L1 Cache: instr. operands; prog./compiler; 1-8 bytes
  L1 Cache to L2 Cache: blocks; cache cntl; 32-64 bytes
  L2 Cache to Memory: blocks; cache cntl; 64-128 bytes
  Memory to Disk: pages; OS; 4K-8K bytes
  Disk to Tape: files; user/operator; Mbytes

What Computer Architecture brings to Table
  Other fields often borrow ideas from architecture
  Quantitative Principles of Design
    Take Advantage of Parallelism
    Principle of Locality
    Focus on the Common Case
    Amdahl's Law
    The Processor Performance Equation
  Careful, quantitative comparisons
    Define, quantify, and summarize relative performance
    Define and quantify relative cost
    Define and quantify dependability
    Define and quantify power
  Culture of anticipating and exploiting advances in technology
  Culture of well-defined interfaces that are carefully implemented and thoroughly checked

Comp. Arch. is an Integrated Approach
  What really matters is the functioning of the complete system: hardware, runtime system, compiler, operating system, and application
    In networking, this is called the "End to End argument"
  Computer architecture is not just about transistors, individual instructions, or particular implementations
    E.g., original RISC projects replaced complex instructions with a compiler + simple instructions

Computer Architecture is Design and Analysis
  Architecture is an iterative process:
    Searching the space of possible designs
    At all levels of computer systems
  [Figure labels: Creativity; Good Ideas; Mediocre Ideas; Bad Ideas; Cost/Performance Analysis]

Focus on the Common Case
  Common sense guides computer design
    Since it's engineering, common sense is valuable
  In making a design trade-off, favor the frequent case over the infrequent case
    E.g., instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st
    E.g., if database server has 50 disks/processor, storage dependability dominates system dependability, so optimize it 1st
  Frequent case is often simpler and can be done faster than the infrequent case
    E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more common case of no overflow
    May slow down overflow, but overall performance improved by optimizing for the normal case
  What is frequent case and how much performance improved by making case faster => Amdahl's Law
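Amdahl's Law can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_limit` are made up for illustration. It uses the standard form of the law, speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits from the enhancement and s is the speedup of that fraction. The numbers are those of the lecture's I/O-bound server case: a CPU that is 10X faster on a server that spends 60% of its time waiting for I/O, so only f = 0.4 benefits.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Best possible speedup: the enhanced part takes no time at all."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


# 60% of the time is I/O wait, so the faster CPU helps only the other 40%.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)
best_case = amdahl_limit(fraction_enhanced=0.4)
print(f"overall speedup: {overall:.2f}x, upper bound: {best_case:.2f}x")
```

Even an infinitely fast CPU could not push this server past about 1.67X, which is why the untouched fraction matters more than the headline 10X.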

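Amdahl's Law
  $$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
  Best you could ever hope to do:
  $$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$

Amdahl's Law example
  New CPU 10X faster
  I/O bound server, so 60% time waiting for I/O
  $$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$
  Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective it's just 1.6X faster

Processor performance equation
  $$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
  (inst count x CPI x cycle time)

  What affects each term:
                 Inst Count    CPI    Clock Rate
  Program            X
  Compiler           X         (X)
  Inst. Set          X          X
  Organization                  X         X
  Technology                              X

Relating Metrics
  CPU execution time
    Measured time for a running program
    Easy to measure
  Clock cycles
    The number of clock pulses for a running program
    Hard to measure
  Instruction count
    The number of instructions executed by the program
    Can be measured by using software tools that profile the execution or by using a simulator of the architecture
  CPI
    Clock cycles per instruction
    Computing the CPI needs the clock cycles and the instruction count for each instruction type

Clocks
  A digital circuit has a clock that runs at a constant rate (like a person's pulse); the clock is used for signal synchronization
  Cycle time = time for one full cycle (seconds per cycle)
  Clock rate = cycles per second (Hertz or Hz)
    Also known as clock frequency

Scientific prefixes used with cycle time and clock rate
  Prefix   Symbol   Multiple
  tera     T        10^12
  giga     G        10^9
  mega     M        10^6
  kilo     k        10^3
  milli    m        10^-3
  micro    µ        10^-6
  nano     n        10^-9
  pico     p        10^-12

What's a Clock Cycle?
  [Figure labels: latch or register; combinational logic]
  Old days: 10 levels of gates
  Today: determined by numerous time-of-flight issues + gate delays
    clock propagation, wire lengths, drivers

CPI (Clock cycles per instruction)
  The average number of clock cycles each instruction takes to execute
    A floating point intensive application might have a higher CPI
  CPU clock cycles = Instruction count x CPI
  CPU time = CPU clock cycles x Clock cycle time
  CPU time = Instruction count x CPI x Clock cycle time
  CPU time = (Instruction count x CPI) / Clock rate

CPI Example
  Suppose we have two implementations of the same instruction set architecture (ISA).
  For some program,
    Machine A has a clock cycle time of 10 ns and a CPI of 2.0
    Machine B has a clock cycle time of 20 ns and a CPI of 1.2
  Which machine is faster for this program, and by how much?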
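The question can be worked directly from the CPU time equation, CPU time = instruction count x CPI x clock cycle time. The sketch below is a hedged Python illustration: the helper name `cpu_time_ns` and the instruction count of one million are assumptions made only for the example (the ratio does not depend on the instruction count), and machine A's CPI is taken as 2.0, the value the worked answer uses.

```python
def cpu_time_ns(instruction_count: int, cpi: float, cycle_time_ns: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical instruction count; both machines run the same program on the
# same ISA, so the count is identical and cancels out of the ratio.
instruction_count = 1_000_000

time_a = cpu_time_ns(instruction_count, cpi=2.0, cycle_time_ns=10.0)
time_b = cpu_time_ns(instruction_count, cpi=1.2, cycle_time_ns=20.0)

# Performance is the reciprocal of execution time, so
# Performance A / Performance B = time B / time A.
speedup_a_over_b = time_b / time_a
print(f"A: {time_a:.0f} ns, B: {time_b:.0f} ns, A is {speedup_a_over_b:.2f}x faster")
```

Machine B has the lower CPI, but its clock cycle is twice as long, so the product of the two terms decides the comparison.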

CPI Example Answer:
  Machine A: clock cycle = 10 ns, CPI = 2.0
  Machine B: clock cycle = 20 ns, CPI = 1.2
  CPU clock cycles A = Instruction Count x 2.0
  CPU clock cycles B = Instruction Count x 1.2
  CPU time A = CPU clock cycles A x clock cycle time = Instruction Count x 2.0 x 10 ns = 20 ns x Instruction Count
  CPU time B = Instruction Count x 1.2 x 20 ns = 24 ns x Instruction Count
  Performance A / Performance B = Execution time B / Execution time A = (24 x I) / (20 x I) = 1.2
  Thus, A is 1.2 times faster than B

Three main classes of computers
  Desktop: personal computer
  Server: web servers, file servers, database servers
  Embedded: handheld devices (phones, cameras), dedicated parallel computers

  Feature: Desktop; Server; Embedded
  Price of system: $500-$5000; $5000-$5,000,000; $10-$100,000
  Price of microprocessor module: $50-$500; $200-$10,000; $.01-$100
  Critical system design issues: Price-performance, Graphics performance; Throughput, Availability, Scalability; Price, Power consumption, Application-specific performance

Instruction Set Architecture: Critical Interface
  [Figure labels: software; instruction set; hardware]
  Properties of a good abstraction
    Lasts through many generations (portability)
    Used in many different ways (generality)
    Provides convenient functionality to higher levels
    Permits an efficient implementation at lower levels

Example: MIPS architecture
  [Figure labels: r0 = 0, r1, ..., r31; PC; lo; hi]
  Programmable storage
    2^32 x bytes
    31 x 32-bit GPRs (R0 = 0)
    32 x 32-bit FP regs (paired DP)
    HI, LO, PC
  Data types? Format? Addressing modes?
  Arithmetic logical
    Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
    AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
    SLL, SRL, SRA, SLLV, SRLV, SRAV
  Memory Access
    LB, LBU, LH, LHU, LW, LWL, LWR
    SB, SH, SW, SWL, SWR
  Control
    J, JAL, JR, JALR
    BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
  32-bit instructions on word boundary

MIPS architecture instruction set format
  [Figure labels: Register to register; Transfer, branches; Jumps]

ISA vs. Computer Architecture
  Old definition of computer architecture = instruction set design
    Other aspects of computer design called implementation
    Insinuates implementation is uninteresting or less challenging
  Our view is computer architecture >> ISA
  Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
  Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
    We disagree on conclusion
    Agree that ISA not where action is (ISA in CA:AQA 4/e appendix)

Moore's Law: 2X transistors / "year"
  "Cramming More Components onto Integrated Circuits", Gordon Moore, Electronics, 1965
  # of transistors / cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

Tracking Technology Performance Trends
  Drill down into 4 technologies: Disks, Memory, Network, Processors
  Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)
    Performance milestones in each technology
  Compare for bandwidth vs. latency improvements in performance over time
  Bandwidth: number of events per unit time
    E.g., Mbits/second over network, Mbytes/second from disk
  Latency: elapsed time for a single event
    E.g., one-way network delay in microseconds, average disk access time in milliseconds

Disks: Archaic (Nostalgic) v. Modern (Newfangled)
  CDC Wren I, 1983
    3600 RPM
    0.03 GBytes capacity
    Tracks/Inch: 800
    Bits/Inch: 9550
    Three 5.25" platters
    Bandwidth: 0.6 MBytes/sec
    Latency: 48.3 ms
    Cache: none
  Seagate 373453, 2003
    15000 RPM (4X)
    73.4 GBytes (2500X)
    Tracks/Inch: 64000 (80X)
    Bits/Inch: 533,000 (60X)
    Four 2.5" platters (in 3.5" form factor)
    Bandwidth: 86 MBytes/sec (140X)
    Latency: 5.7 ms (8X)
    Cache: 8 MBytes

Latency Lags Bandwidth (for last ~20 years)
  Performance milestones (latency improvement, bandwidth improvement)
  Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

Memory: Archaic (Nostalgic) v. Modern (Newfangled)
  1980 DRAM (asynchronous)
    0.06 Mbits/chip
    64,000 xtors, 35 mm²
    16-bit data bus per module, 16 pins/chip
    13 Mbytes/sec
    Latency: 225 ns
    (no block transfer)
  2000 Double Data Rate Synchr. (clocked) DRAM
    256.00 Mbits/chip (4000X)
    256,000,000 xtors, 204 mm²
    64-bit data bus per DIMM, 66 pins/chip (4X)
    1600 Mbytes/sec (120X)
    Latency: 52 ns (4X)
    Block transfers (page mode)

Latency Lags Bandwidth (last ~20 years)
  Performance milestones
  Memory Module: 16 bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

LANs: Archaic (Nostalgic) v. Modern (Newfangled)
  Ethernet 802.3
    Year of Standard: 1978
    10 Mbits/s link speed
    Latency: 3000 µsec
    Shared media
    Coaxial cable
  Ethernet 802.3ae
    Year of Standard: 2003
    10,000 Mbits/s (1000X) link speed
    Latency: 190 µsec (15X)
    Switched media
    Category 5 copper wire
  [Figure labels: Coaxial Cable: copper core, insulator, braided outer conductor, plastic covering. Twisted Pair: copper, 1 mm thick, twisted to avoid antenna effect; "Cat 5" is 4 twisted pairs in bundle]

Latency Lags Bandwidth (last ~20 years)
  Performance milestones
  Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  Memory Module: 16 bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)
  1982 Intel 80286
    12.5 MHz
    2 MIPS (peak)
    Latency 320 ns
    134,000 xtors, 47 mm²
    16-bit data bus, 68 pins
    Microcode interpreter, separate FPU chip
    (no caches)
  2001 Intel Pentium 4
    1500 MHz (120X)
    4500 MIPS (peak) (2250X)
    Latency 15 ns (20X)
    42,000,000 xtors, 217 mm²
    64-bit data bus, 423 pins
    3-way superscalar, dynamic translate to RISC, superpipelined (22 stage), out-of-order execution
    On-chip 8 KB data caches, 96 KB instr. trace cache, 256 KB L2 cache

Latency Lags Bandwidth (last ~20 years)
  Performance milestones
  Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  Memory Module: 16 bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  CPU high, Memory low ("Memory Wall")

Rule of Thumb for Latency Lagging BW
  In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth)
  Stated alternatively: bandwidth improves by more than the square of the improvement in latency
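Both statements of the rule of thumb can be checked against the (latency, bandwidth) improvement pairs quoted on the milestone slides. The Python sketch below is only an illustration; the function names are invented here, and it assumes that each technology improved at a steady geometric rate, so the latency gain per bandwidth doubling is the latency gain raised to 1 / log2(bandwidth gain).

```python
import math

# (latency improvement, bandwidth improvement) as quoted on the milestone slides
MILESTONES = {
    "Processor": (21, 2250),
    "Ethernet": (16, 1000),
    "Memory module": (4, 120),
    "Disk": (8, 143),
}


def latency_gain_per_bandwidth_doubling(latency_gain: float, bandwidth_gain: float) -> float:
    """Latency improvement factor for each doubling of bandwidth (steady-rate assumption)."""
    doublings = math.log2(bandwidth_gain)
    return latency_gain ** (1.0 / doublings)


def bandwidth_beats_latency_squared(latency_gain: float, bandwidth_gain: float) -> bool:
    """True when bandwidth improved by more than the square of the latency improvement."""
    return bandwidth_gain > latency_gain ** 2


for name, (lat, bw) in MILESTONES.items():
    per_doubling = latency_gain_per_bandwidth_doubling(lat, bw)
    squared_rule = bandwidth_beats_latency_squared(lat, bw)
    print(f"{name}: latency x{per_doubling:.2f} per bandwidth doubling, "
          f"BW > latency^2: {squared_rule}")
```

Under that assumption all four technologies land inside the 1.2 to 1.4 band, and bandwidth exceeds the square of the latency gain in every row.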

6 Reasons Latency Lags Bandwidth
  1. Moore's Law helps BW more than latency
    Faster transistors, more transistors, more pins help bandwidth
      MPU Transistors: 0.130 vs. 42 M xtors (300X)
      DRAM Transistors: 0.064 vs. 256 M xtors (4000X)
      MPU Pins: 68 vs. 423 pins (6X)
      DRAM Pins: 16 vs. 66 pins (4X)
    Smaller, faster transistors but communicate over (relatively) longer lines: limits latency
      Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
      MPU Die Size: 47 vs. 217 mm² (ratio sqrt => 2X)
      DRAM Die Size: 35 vs. 204 mm² (ratio sqrt => 2X)

6 Reasons Latency Lags Bandwidth (cont'd)
  2. Distance limits latency
    Size of DRAM block => long bit and word lines => most of DRAM access time
    Speed of light and computers on network
    1. & 2. explains linear latency vs. square BW?
  3. Bandwidth easier to sell ("bigger = better")
    E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 µsec latency Ethernet
    4400 MB/s DIMM ("PC4400") vs. 50 ns latency
    Even if just marketing, customers now trained
    Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

6 Reasons Latency Lags Bandwidth (cont'd)
  4. Latency helps BW, but not vice versa
    Spinning disk faster improves both bandwidth and rotational latency
      3600 RPM => 15000 RPM = 4.2X
      Average rotational latency: 8.3 ms => 2.0 ms
      Things being equal, also helps BW by 4.2X
    Lower DRAM latency => more access/second (higher bandwidth)
    Higher linear density helps disk BW (and capacity), but not disk latency
      9,550 BPI => 533,000 BPI => 60X in BW

6 Reasons Latency Lags Bandwidth (cont'd)
  5. Bandwidth hurts latency
    Queues help bandwidth, hurt latency (Queuing Theory)
    Adding chips to widen a memory module increases bandwidth but higher fan-out on address lines may increase latency
  6. Operating System overhead hurts latency more than bandwidth
    Long messages amortize overhead; overhead bigger part of short messages

Define and quantify power (1/2)
  For CMOS chips, traditional dominant energy consumption has been in switching transistors, called dynamic power:
  $$\text{Power}_{\text{dynamic}} = \frac{1}{2} \times \text{CapacitiveLoad} \times \text{Voltage}^2 \times \text{FrequencySwitched}$$
  For mobile devices, energy better metric:
  $$\text{Energy}_{\text{dynamic}} = \text{CapacitiveLoad} \times \text{Voltage}^2$$
  For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
  Capacitive load a function of number of transistors connected to output and technology, which determines capacitance of wires and transistors
  Dropping voltage helps both, so went from 5V to 1V
  To save energy & dynamic power, most CPUs now turn off clock of inactive modules (e.g. Fl. Pt. Unit)

Example of quantifying power
  Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power?
  $$\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}} = \frac{(0.85 \times \text{Voltage})^2 \times (0.85 \times \text{FrequencySwitched})}{\text{Voltage}^2 \times \text{FrequencySwitched}} = 0.85^3 \approx 0.61$$
  Dynamic power drops to about 60% of its original value.

Define and quantify power (2/2)
  Because leakage current flows even when a transistor is off, now static power important too:
  $$\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$$
  Leakage current increases in processors with smaller transistor sizes
  Increasing the number of transistors increases power even if they are turned off
  In 2006, goal for leakage is 25% of total power consumption; high performance designs at 40%
  Very low power systems even gate voltage to inactive modules to control loss due to leakage

Cost of Integrated Circuits
  Depends on several factors:
    Time: the price drops with time as the learning curve increases
    Volume: the price drops with volume increase
    Commodities: many manufacturers produce the same product; competition brings prices down
  [Figure: the price of Intel Pentium 4 and Pentium M]
  [Figure: AMD Opteron microprocessor die]
  [Figure: a 300 mm silicon wafer contains 117 AMD Opteron microprocessor chips in a 90 nm process]

  $$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$
  $$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$
  $$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$
  Example:
    Wafer diameter = 300 mm
    Die area = 1.5 cm x 1.5 cm = 2.25 cm²
    Dies per wafer = 270

  $$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$
    Wafer yield: measures how many wafers are completely bad
    α = 4
    Empirical formula; α corresponds to masking levels in manufacturing process
  Example (defect density = 0.4 per cm²):
    Die area = 1.5 cm x 1.5 cm = 2.25 cm²: die yield = 0.44
    Die area = 1.0 cm x 1.0 cm = 1 cm²: die yield = 0.68
    Smaller die area gives more die yield
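The cost formulas can be exercised with the slide numbers. This is a minimal Python sketch under stated assumptions: the function names are invented for illustration, the wafer yield is taken as 100% (which is what reproduces the 0.44 and 0.68 figures), and the wafer cost of $5000 is a purely hypothetical value chosen only to show how die cost follows.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area over die area, minus the dies lost along the circular edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              wafer_yield: float = 1.0, alpha: float = 4.0) -> float:
    """Empirical yield model: wafer_yield * (1 + defects * area / alpha) ** -alpha."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def cost_of_die(cost_of_wafer: float, wafer_diameter_cm: float,
                die_area_cm2: float, defects_per_cm2: float) -> float:
    """Cost of wafer divided by the number of good dies it produces."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2))
    return cost_of_wafer / good_dies


# Slide values: 300 mm (30 cm) wafer, 0.4 defects per cm^2.
HYPOTHETICAL_WAFER_COST = 5000.0  # assumed price in dollars, not from the slides
for area in (2.25, 1.0):
    count = dies_per_wafer(30.0, area)
    good_fraction = die_yield(0.4, area)
    cost = cost_of_die(HYPOTHETICAL_WAFER_COST, 30.0, area, 0.4)
    print(f"die area {area} cm^2: {count:.0f} dies, yield {good_fraction:.2f}, "
          f"cost per good die ${cost:.2f}")
```

A smaller die wins twice: more dies fit on the wafer, and a larger share of them are free of defects.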

Define and quantify dependability (1/3)
  How decide when a system is operating properly?
  Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service would be dependable
  Systems alternate between 2 states of service with respect to an SLA:
    1. Service accomplishment, where the service is delivered as specified in SLA
    2. Service interruption, where the delivered service is different from the SLA
  Failure = transition from state 1 to state 2
  Restoration = transition from state 2 to state 1

Define and quantify dependability (2/3)
  Module reliability = measure of continuous service accomplishment (or time to failure).
  2 metrics:
    1. Mean Time To Failure (MTTF) measures reliability
    2. Failures In Time (FIT) = 1/MTTF, the rate of failures
       Traditionally reported as failures per billion hours of operation
  Mean Time To Repair (MTTR) measures service interruption
    Mean Time Between Failures (MTBF) = MTTF + MTTR
  Module availability measures service as alternate between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
  Module availability = MTTF / (MTTF + MTTR)

Example calculating reliability
  If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
  Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
  $$\text{FailureRate} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} = \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000} \text{ per hour} = 17{,}000 \text{ FIT}$$
  $$\text{MTTF} = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000 \text{ hours}$$

How to Quantify Performance?
  Plane: DC to Paris; Speed; Passengers; Throughput (pmph)
  Boeing 747: 6.5 hours; 610 mph; 470; 286,700
  BAC/Sud Concorde: 3 hours; 1350 mph; 132; 178,200
  Time to run the task (ExTime)
    Execution time, response time, latency
  Tasks per day, hour, week, sec, ns ... (Performance)
    Throughput, bandwidth

Definition: Performance
  Performance is in units of things per sec
    bigger is better
  If we are primarily concerned with response time:
  $$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
  "X is n times faster than Y" means:
  $$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$

Performance: What to measure
  Usually rely on benchmarks vs. real workloads
  To increase predictability, collections of benchmark applications, called benchmark suites, are popular
  SPECCPU: popular desktop benchmark suite
    CPU only, split between integer and floating point programs
    SPECint2000 has 12 integer, SPECfp2000 has 14 floating point pgms
    SPECCPU2006 to be announced Spring 2006
    SPECSFS (NFS file server) and SPECWeb (Web server) added as server benchmarks
  Transaction Processing Council measures server performance and cost-performance for databases
    TPC-C Complex query for Online Transaction Processing
    TPC-H models ad hoc decision support
    TPC-W a transactional web benchmark
    TPC-App application server and web services benchmark

SPEC: System Performance Evaluation Cooperative
  First Round 1989
    10 programs yielding a single number ("SPECmarks")
  Second Round 1992
    SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
    Compiler flags unlimited. March 93
  New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
    "benchmarks useful for 3 years"
    Single flag setting for all programs: SPECint_base95, SPECfp_base95
  SPEC CPU2000 (12 integer benchmarks: CINT2000, and 14 floating-point benchmarks: CFP2000)

Normalized Execution Time
  Normalize execution time to a reference machine
  Two common methods
    Arithmetic mean
    Geometric mean
  Comparison
    Arithmetic mean
      Use to predict performance
      May not be consistent
    Geometric mean
      Independent of the running times of the individual programs
      Cannot be used to predict relative execution time for a workload

Normalized Execution Time: Example
  Columns: Time on A; Time on B; Normalized to A (A, B); Normalized to B (A, B)
  Program 1: 1, 10, 1, 1…
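The contrast between the two means can be illustrated with a small Python sketch. It is an illustration only: Program 1's times (1 on A, 10 on B) come from the table row above, but the rest of the table is cut off in this copy, so Program 2's times (1000 on A, 100 on B) are assumed values chosen for the example, and the helper names are invented here.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean log is the n-th root of the product
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized(times, reference):
    """Divide every machine's time by the reference machine's time, per program."""
    ref = times[reference]
    return {machine: [t / r for t, r in zip(ts, ref)] for machine, ts in times.items()}


# Execution times per program. Program 2's values are assumed, not from the slides.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}

for reference in ("A", "B"):
    norm = normalized(times, reference)
    for machine in ("A", "B"):
        am = arithmetic_mean(norm[machine])
        gm = geometric_mean(norm[machine])
        print(f"normalized to {reference}, machine {machine}: "
              f"arithmetic mean {am:.2f}, geometric mean {gm:.2f}")
```

With these numbers the arithmetic mean of normalized times makes whichever machine is not the reference look about five times slower, while the geometric mean gives the same answer for either reference. That is the inconsistency the comparison refers to, and it is also why the geometric mean says nothing about total running time for the workload.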
