An Overview of High Performance Computing and Challenges for the Future
1/30/2024

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
Oak Ridge National Laboratory
University of Manchester

Outline
- Top500 results
- Four important concepts that will affect math software:
  - Effective use of many-core
  - Exploiting mixed precision in our numerical computations
  - Self-adapting / auto-tuning of software
  - Fault-tolerant algorithms

H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP

  Ax=b, dense problem
- Updated twice a year:
  - SC'xy in the States in November
  - Meeting in Germany in June
- All data available from
[Figure: TPP performance; axes: Size, Rate]

Performance Development
[Figure; annotations: "My Laptop", "6-8 years"]

29th List / June 2007

29th List: The TOP10

Rank | Prev. rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Proc
1 | | IBM | BlueGene/L, eServer Blue Gene | 280.6 | DOE/NNSA/LLNL | USA | 2005 | 131,072
2 | 10 | Cray | Jaguar, Cray XT3/XT4 | 101.7 | DOE/ORNL | USA | 2007 | 23,016
3 | 2 | Sandia/Cray | Red Storm, Cray XT3 | 101.4 | DOE/NNSA/Sandia | USA | 2006 | 26,544
4 | 3 | IBM | BGW, eServer Blue Gene | 91.29 | IBM Thomas Watson | USA | 2005 | 40,960
5 | | IBM | New York Blue, eServer Blue Gene | 82.16 | Stony Brook/BNL | USA | 2007 | 36,864
6 | 4 | IBM | ASC Purple, eServer pSeries p5 575 | 75.76 | DOE/NNSA/LLNL | USA | 2005 | 12,208
7 | | IBM | BlueGene/L, eServer Blue Gene | 73.03 | Rensselaer Polytechnic Institute/CCNI | USA | 2007 | 32,768
8 | | Dell | Abe, PowerEdge 1955, Infiniband | 62.68 | NCSA | USA | 2007 | 9,600
9 | 5 | IBM | MareNostrum, JS21 Cluster, Myrinet | 62.63 | Barcelona Supercomputing Center | Spain | 2006 | 12,240
10 | | SGI | HLRB-II, SGI Altix 4700 | 56.52 | LRZ | Germany | 2007 | 9,728

Performance Projection
[Figure]

Cores per System - June 2007
[Figure]
- 14 systems > 50 Tflop/s
- 88 systems > 10 Tflop/s
- 326 systems > 5 Tflop/s

Chips Used in Each of the 500 Systems
- 58% Intel
- 21% AMD
- 17% IBM
- Together: 96%

Interconnects / Systems
- GigE + Infiniband + Myrinet = 74%
[Figure; chart labels: (206), (46), (128)]

Countries / Systems
[Figure]

Rank | Site | Manufacturer | Computer | Procs | RMax | Segment | Interconnect Family
66 | CINECA | IBM | eServer 326 Opteron Dual | 5120 | 12608 | Academic | Infiniband
132 | SCS S.r.l. | HP | Cluster Platform 3000 Xeon | 1024 | 7987.2 | Research | Infiniband
271 | Telecom Italia | HP | SuperDome 875 MHz | 3072 | 5591 | Industry | Myrinet
295 | Telecom Italia | HP | Cluster Platform 3000 Xeon | 740 | 5239 | Industry | GigE
305 | Esprinet | HP | Cluster Platform 3000 Xeon | 664 | 5179 | Industry | GigE

Power is an Industry Wide Problem
- "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, June 14, 2006
- New Google plant in The Dalles, Oregon (from NYT, June 14, 2006)
- Google facilities:
  - leveraging hydroelectric power
  - old aluminum plants
  - >500,000 servers worldwide

Gflop/KWatt in the Top 20
[Figure]

IBM BlueGene/L #1

Level | Composition | Processors | Performance | Memory
Chip | 2 processors, 17 watts | 2 | 2.8/5.6 GF/s | 4 MB (cache)
Compute Card | 2 chips, 2x1x1 | 4 | 5.6/11.2 GF/s | 1 GB DDR
Node Board | 32 chips, 4x4x2; 16 compute cards | 64 | 90/180 GF/s | 16 GB DDR
Rack | 32 node boards, 8x8x16 | 2048 | 2.9/5.7 TF/s | 0.5 TB DDR
System | 64 racks, 64x32x32 | 131,072 | 180/360 TF/s | 32 TB DDR

131,072 cores
- Total of 33 systems in the Top500
- "Fastest Computer": BG/L, 700 MHz, 131K processors, 64 racks
- Peak: 367 Tflop/s
- Linpack: 281 Tflop/s, 77% of peak

BlueGene/L Compute ASIC
- Full system total of 131,072 processors
- The compute node ASICs include all networking and processor functionality.
- Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
- (13K sec, about 3.6 hours; n = 1.8M)
- 1.6 MWatts (1600 homes)
- 43,000 ops/s/person
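The Blue Gene/L figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the "(13K sec; n = 1.8M)" note refers to the LINPACK run, and it uses only the leading-order flop count for a dense LU solve, (2/3) n^3; both are assumptions, not statements from the slides. Because n is rounded, the estimate should only be expected to land in the same range as the quoted run time.

```python
# Figures quoted on the slides for the 64-rack Blue Gene/L system.
PEAK_TFLOPS = 367.0      # theoretical peak, Tflop/s
LINPACK_TFLOPS = 281.0   # measured LINPACK rate (Rmax), Tflop/s
N = 1.8e6                # matrix order quoted as "n = 1.8M" (rounded)


def lu_flops(n):
    """Leading-order flop count of a dense LU solve (assumption: 2/3 * n^3)."""
    return (2.0 / 3.0) * n ** 3


# Fraction of the theoretical peak actually delivered on LINPACK.
efficiency = LINPACK_TFLOPS / PEAK_TFLOPS

# Time to do that many flops at the measured rate.
run_seconds = lu_flops(N) / (LINPACK_TFLOPS * 1e12)

print(f"fraction of peak: {efficiency:.0%}")
print(f"estimated run time: {run_seconds:.0f} s ({run_seconds / 3600:.1f} h)")
```

The estimate is sensitive to n: a small change in the matrix order moves the run time noticeably, since the work grows with n cubed.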

Lower Voltage, Increase Clock Rate & Transistor Density
- We have seen an increasing number of gates on a chip and increasing clock speed.
- Heat is becoming an unmanageable problem; Intel processors exceed 100 Watts.
- We will not see the dramatic increases in clock speeds in the future.
- However, the number of gates on a chip will continue to increase.
- Increasing the number of gates into a tight knot and decreasing the cycle time of the processor
[Figure: chip layouts, from a single core with cache to several cores (C1-C4) sharing a cache]

Power Cost of Frequency
- Power ∝ Voltage² × Frequency (V²F)
- Frequency ∝ Voltage
- Power ∝ Frequency³
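The cubic relation is what makes several slower cores attractive compared with one faster core. A minimal sketch, assuming the proportionalities above hold exactly and that work divides perfectly across cores (an idealization); the 20% clock changes and the function names are illustrative choices, not figures from the talk:

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at nominal clock.

    Assumes power ~ V^2 * f and V ~ f, so power ~ f^3 per core.
    """
    return cores * freq_scale ** 3


def relative_throughput(freq_scale, cores=1):
    """Idealized throughput, assuming perfect scaling across cores."""
    return cores * freq_scale


scenarios = [
    ("1 core, nominal clock", 1.0, 1),
    ("1 core, clock +20%", 1.2, 1),    # illustrative value
    ("2 cores, clock -20%", 0.8, 2),   # illustrative value
]
for label, scale, cores in scenarios:
    print(f"{label}: power x{relative_power(scale, cores):.3f}, "
          f"throughput x{relative_throughput(scale, cores):.2f}")
```

Under these assumptions, overclocking one core by 20% costs about 73% more power for 20% more throughput, while two cores at 80% of the clock deliver 60% more throughput for roughly the original power budget.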

What's Next?
[Figure: candidate chip designs: all large cores; mixed large and small cores; all small cores; many small cores; many floating-point cores; SRAM + 3D stacked memory]
- Different classes of chips: home, games/graphics, business, scientific

Novel Opportunities in Multicores
- Don't have to contend with uniprocessors
- Not your same old multiprocessor problem
- How does going from multiprocessors to multicores impact programs?
- What changed? Where is the impact?
  - Communication bandwidth
  - Communication latency

Communication Bandwidth
- How much data can be communicated between two cores?
- What changed?
  - Number of wires
  - Clock rate
  - Multiplexing
- Impact on programming model?
  - Massive data exchange is possible
  - Data movement is not the bottleneck: processor affinity is not that important
- 32 Gigabits/sec -> ~300 Terabits/sec (10,000X)

Communication Latency
- How long does it take for a round-trip communication?
- What changed?
  - Length of wire
  - Pipeline stages
- Impact on programming model?
  - Ultra-fast synchronization
  - Can run real-time apps on multiple cores
- ~200 cycles -> ~4 cycles (50X)

80 Core
- Intel's 80-core chip
- 1 Tflop/s
- 62 Watts
- 1.2 TB/s internal BW

NSF Track 1 - NCSA/UIUC
- $200M
- 10 Pflop/s; 40K 8-core 4 GHz IBM Power7 chips; 1.2 PB memory; 5 PB/s global bandwidth; interconnect BW of 0.55 PB/s; 18 PB disk at 1.8 TB/s I/O bandwidth
- For use by a few people

NSF UTK/JICS Track 2 proposal
- $65M over 5 years for a 1 Pflop/s system
  - $30M over 5 years for equipment: 36 cabinets of a Cray XT5 (AMD 8-core/chip, 12 socket/board, 3 GHz, 4 flops/cycle/core)
  - $35M over 5 years for operations
    - Power cost: $1.1M/year
    - Cray maintenance: $1M/year
- To be used by the NSF community: 1000's of users
- Joins UCSD, PSC, TACC
- Last year's Track 2 award went to U of Texas

Major Changes to Software
- Must rethink the design of our software
  - Another disruptive technology
  - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
- Numerical libraries, for example, will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Algorithms follow hardware evolution in time:
- LINPACK (80's) (vector operations): relies on Level-1 BLAS operations
- LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations
- PLASMA (00's), new algorithms (many-core friendly): rely on a DAG/scheduler, block data layout, and some extra kernels

Those new algorithms
- have a very fine granularity, so they scale very well (multicore, petascale computing, ...)
- remove a lot of dependencies among the tasks (multicore, distributed computing)
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels
Those new algorithms need new kernels and rely on efficient scheduling algorithms.

Steps in the LAPACK LU
- DGETF2: factor a panel
- DLASWP (L): backward swap
- DLASWP (R): forward swap
- DTRSM: triangular solve
- DGEMM: matrix multiply

LU Timing Profile (4 processor system)
- 1D decomposition and SGI Origin
- Time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM
- Threads, no lookahead: bulk synchronous phases
[Figure]

Adaptive Lookahead - Dynamic
- Event-driven multithreading
- Reorganizing algorithms to use this approach

Fork-Join vs. Dynamic Execution
- Fork-join: parallel BLAS
- DAG-based: dynamic scheduling
- Experiments on Intel's quad-core Clovertown with 2 sockets w/ 8 threads
[Figure: task traces over time for the two approaches, with the time saved by dynamic scheduling marked]
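The difference between the two execution models can be illustrated with a toy simulation. The sketch below is not PLASMA's scheduler: it models a right-looking blocked LU on nb column panels with two hypothetical task types, ("F", k) for factoring panel k and ("U", k, j) for updating panel j with panel k, and it assumes made-up task costs. Fork-join runs each step as a serial factorization followed by a barrier-terminated parallel sweep; the dynamic version starts any task as soon as its prerequisites are done and a worker is free, which lets the next panel factorization overlap with remaining updates (lookahead).

```python
import heapq


def lu_tasks(nb):
    """Return {task: set of prerequisite tasks} for nb column panels."""
    deps = {}
    for k in range(nb):
        deps[("F", k)] = {("U", k - 1, k)} if k > 0 else set()
        for j in range(k + 1, nb):
            d = {("F", k)}
            if k > 0:
                d.add(("U", k - 1, j))
            deps[("U", k, j)] = d
    return deps


def fork_join_time(nb, workers, t_factor=2.0, t_update=1.0):
    """Serial panel factorization, then a parallel update sweep with a barrier."""
    total = 0.0
    for k in range(nb):
        updates = nb - 1 - k
        rounds = -(-updates // workers)  # ceiling division
        total += t_factor + rounds * t_update
    return total


def dynamic_time(nb, workers, t_factor=2.0, t_update=1.0):
    """Greedy list scheduling of the same task graph on `workers` workers."""
    deps = lu_tasks(nb)
    remaining = {t: len(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            children[p].append(t)
    ready = [t for t, n in remaining.items() if n == 0]
    running = []  # min-heap of (finish_time, task)
    now, idle, done = 0.0, workers, 0
    while done < len(deps):
        # Prefer panel factorizations: they lie on the critical path.
        ready.sort(key=lambda t: (t[0] != "F", t[1:]))
        while ready and idle:
            t = ready.pop(0)
            cost = t_factor if t[0] == "F" else t_update
            heapq.heappush(running, (now + cost, t))
            idle -= 1
        now, t = heapq.heappop(running)  # advance to the next completion
        idle += 1
        done += 1
        for c in children[t]:
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
    return now


print("fork-join:", fork_join_time(4, 2))
print("dynamic:  ", dynamic_time(4, 2))
```

With 4 panels, 2 workers, and the assumed costs, the fork-join schedule takes 12 time units and the dynamic one 11; the gap comes entirely from workers no longer idling at the barrier while the next panel is factored.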

With the Hype on Cell & PS3 We Became Interested
- The PlayStation 3's CPU is based on a "Cell" processor.
- Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine).
- An SPE is a self-contained vector processor which acts independently from the others.
  - 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz (SPE ~ 25 Gflop/s peak)
- 204.8 Gflop/s peak!
- The catch is that this is for 32 bit floating point (single precision, SP),
- and 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
  - Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

Performance of Single Precision on Conventional Processors
- We realized that we have a similar situation on our commodity processors; that is, SP is 2X as fast as DP on many systems.
- The Intel Pentium and AMD Opteron have SSE2:
  - 2 flops/cycle DP
  - 4 flops/cycle SP
- IBM PowerPC has AltiVec:
  - 8 flops/cycle SP
  - 4 flops/cycle DP
  - No DP on AltiVec
- Single precision is faster because of:
  - Higher parallelism in SSE/vector units
  - Reduced data motion
  - Higher locality in cache

Processor | Size | SGEMM/DGEMM | Size | SGEMV/DGEMV
AMD Opteron 246 | 3000 | 2.00 | 5000 | 1.70
UltraSparc-IIe | 3000 | 1.64 | 5000 | 1.66
Intel PIII Coppermine | 3000 | 2.03 | 5000 | 2.09
PowerPC 970 | 3000 | 2.04 | 5000 | 1.44
Intel Woodcrest | 3000 | 1.81 | 5000 | 2.18
Intel XEON | 3000 | 2.04 | 5000 | 1.82
Intel Centrino Duo | 3000 | 2.71 | 5000 | 2.21

32 or 64 bit Floating Point Precision?
- A long time ago 32 bit floating point was used
  - Still used in scientific apps, but limited
- Most apps use 64 bit floating point
  - Accumulation of round-off error
    - A 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops.
  - Ill-conditioned problems
  - IEEE SP exponent bits too few (8 bits, 10^±38)
  - Critical sections need higher precision
    - Sometimes need extended precision (128 bit fl pt)
  - However, some can get by with 32 bit fl pt in some parts
- Mixed precision is a possibility
  - Approximate in lower precision and then refine or improve the solution to high precision.

Idea Goes Something Like This...
- Exploit 32 bit floating point as much as possible, especially for the bulk of the computation.
- Correct or update the solution with selective use of 64 bit floating point to provide a refined result.
- Intuitively:
  - Compute a 32 bit result,
  - Calculate a correction to the 32 bit result using selected higher precision, and
  - Perform the update of the 32 bit result with the correction using high precision.

Mixed-Precision Iterative Refinement
Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                  SINGLE   O(n^3)
    x = L\(U\b)                  SINGLE   O(n^2)
    r = b - Ax                   DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)              SINGLE   O(n^2)
        x = x + z                DOUBLE   O(n^1)
        r = b - Ax               DOUBLE   O(n^2)
    END

- Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
  - Requires extra storage; total is 1.5 times normal
  - O(n^3) work is done in lower precision
  - O(n^2) work is done in high precision
  - Problems if the matrix is ill-conditioned in SP; O(10^8)
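The refinement loop above can be tried out in plain Python. The following is a minimal sketch, not the LAPACK implementation: single precision is emulated by rounding every intermediate value through a 4-byte float with the standard `struct` module, and the helper names (`f32`, `lu_single`, `solve_single`, `refine`), the stopping tolerance, and the small test matrix are all illustrative assumptions.

```python
import struct


def f32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_single(a):
    """LU with partial pivoting; every stored value is rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, piv


def solve_single(lu, piv, b):
    """Forward and back substitution in emulated single precision."""
    n = len(lu)
    y = [f32(b[p]) for p in piv]           # apply the row permutation
    for i in range(n):                     # L y = P b (unit lower triangle)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):           # U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def residual(a, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]


def refine(a, b, tol=1e-12, max_iter=20):
    lu, piv = lu_single(a)                      # O(n^3), SINGLE
    x = solve_single(lu, piv, b)                # O(n^2), SINGLE
    for it in range(max_iter):
        r = residual(a, x, b)                   # O(n^2), DOUBLE
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        z = solve_single(lu, piv, r)            # O(n^2), SINGLE
        x = [xi + zi for xi, zi in zip(x, z)]   # O(n),   DOUBLE
    return x, max_iter


# Small, well-conditioned example (values chosen for illustration only).
A = [[4.0, 1.0, 2.0, 0.5],
     [1.0, 5.0, 1.0, 2.0],
     [2.0, 1.0, 6.0, 1.0],
     [0.5, 2.0, 1.0, 7.0]]
x_true = [1.0, -2.0, 3.0, -4.0]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in A]

x, iters = refine(A, b)
print("iterations:", iters)
print("max error: ", max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

The factorization is computed once and reused for every correction solve, which is why only the O(n^2) steps are repeated. For an ill-conditioned matrix the single-precision factors may be too inaccurate for the loop to converge, matching the caveat in the last bullet above.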

Results for Mixed Precision Iterative Refinement for Dense Ax = b
Single precision is faster than DP because of:
- Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
- Reduced data motion: 32 bit data instead of 64 bit data
- Higher locality in cache: more data items in cache

Architecture (BLAS-MPI) | # procs | n | DP Solve / SP Solve | DP Solve / Iter Ref | # iter
AMD Opteron (Goto - OpenMPI MX) | 32 | 22627 | 1.85 | 1.79 | 6
AMD Opteron (Goto - OpenMPI MX) | 64 | 32000 | 1.90 | 1.83 | 6

What about the Cell?
- PowerPC at 3.2 GHz
  - DGEMM at 5 Gflop/s
  - Altivec peak at 25.6 Gflop/s
  - Achieved 10 Gflop/s SGEMM
- 8 SPUs
  - 204.8 Gflop/s peak!
  - The catch is that this is for 32 bit floating point (single precision, SP),
  - and 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
  - Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

Moving Data Around on the Cell
- Worst case: memory bound operations (no reuse of data)
  - 3 data movements (2 in and 1 out) with 2 ops (SAXPY)
  - For the Cell this would be 4.6 Gflop/s (25.6 GB/s * 2 ops / 12 B)
- Injection bandwidth: 25.6 GB/s
[Figure labels: 256 KB; injection bandwidth]

IBM Cell 3.2 GHz, Ax = b
[Figure; annotations: .30 secs, .47 secs, 3.9 secs, 8.3X; reference curve: 8 SGEMM (embarrassingly parallel)]

Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
- For the SPEs: standard C code and C language SIMD extensions (intrinsics)
- Single precision performance
- Mixed precision performance using iterative refinement
  - Method achieving 64 bit accuracy

Cholesky - Using 2 Cell Chips
[Figure]

Intriguing Potential
- Exploit lower precision as much as possible
  - Payoff in performance
    - Faster floating point
    - Less data to move
- Automatically switch between SP and DP to match the desired accuracy
  - Compute the solution in SP and then a correction to the solution in DP
- Potential for GPU, FPGA, special purpose processors
  - What about 16 bit floating point?
  - Use as little precision as you can get away with, then improve the accuracy
- Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used.
  - Correction = - A\(b - Ax)

IBM/Mercury Cell Blade
- From IBM or Mercury
- 2 Cell chips, each w/ 8 SPEs
- 512 MB/Cell
- ~$8K - 17K
- Some SW

Sony Playstation 3 Cluster PS3-T
- From WAL*MART
- PS3: 1 Cell chip w/ 6 SPEs
- 256 MB/PS3
- $600
- Download SW
- Dual boot

Cell Hardware Overview
[Figure: SIT CELL block diagram: PowerPC and 8 SPEs at 25.6 Gflop/s each; 200 GB/s between SPEs; 25 GB/s to 512 MiB memory]
- 3.2 GHz
- 25 GB/s injection bandwidth
- 200 GB/s between SPEs
- 32 bit peak perf: 8 * 25.6 Gflop/s = 204.8 Gflop/s peak
- 64 bit peak perf: 8 * 1.8 Gflop/s = 14.6 Gflop/s peak
- 512 MiB memory

PS3 Hardware Overview
[Figure: SIT CELL block diagram: PowerPC and SPEs; 200 GB/s between SPEs; 25 GB/s to 256 MiB memory; labels: "Game OS Hypervisor", "Disabled/Broken: Yield issues"]
- 3.2 GHz
- 25 GB/s injection bandwidth
- 200 GB/s between SPEs
- 32 bit peak perf: 6 * 25.6 Gflop/s = 153.6 Gflop/s peak
- 64 bit peak perf: 6 * 1.8 Gflop/s = 10.8 Gflop/s peak
- 1 Gb
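The Cell and PS3 numbers quoted in these slides follow from a few multiplications, which the sketch below reproduces. The constants are taken from the slides; the function names are illustrative. Two caveats, stated as observations rather than corrections: evaluating the SAXPY expression exactly as written (25.6 GB/s * 2 ops / 12 B) gives roughly 4.3 Gflop/s, slightly below the 4.6 Gflop/s quoted, and 8 * 1.8 gives 14.4 rather than 14.6, presumably because the per-SPE double-precision figure is rounded.

```python
SPE_SP_GFLOPS = 25.6   # single-precision peak per SPE (from the slides)
SPE_DP_GFLOPS = 1.8    # double-precision peak per SPE (from the slides, rounded)
INJECTION_GBS = 25.6   # memory injection bandwidth in GB/s (from the slides)


def peak(spes, per_spe_gflops):
    """Aggregate peak of `spes` SPEs in Gflop/s."""
    return spes * per_spe_gflops


def streaming_bound(bandwidth_gbs, flops_per_elem, bytes_per_elem):
    """Gflop/s ceiling for a kernel that never reuses data."""
    return bandwidth_gbs * flops_per_elem / bytes_per_elem


# SAXPY (y = a*x + y) in single precision: 2 flops per element,
# 2 loads + 1 store of 4 bytes each = 12 bytes moved per element.
saxpy_bound = streaming_bound(INJECTION_GBS, 2, 3 * 4)

print(f"Cell blade SP peak (8 SPEs): {peak(8, SPE_SP_GFLOPS):.1f} Gflop/s")
print(f"PS3 SP peak (6 SPEs):        {peak(6, SPE_SP_GFLOPS):.1f} Gflop/s")
print(f"PS3 DP peak (6 SPEs):        {peak(6, SPE_DP_GFLOPS):.1f} Gflop/s")
print(f"SAXPY bandwidth bound:       {saxpy_bound:.2f} Gflop/s")
```

The gap between the 204.8 Gflop/s compute peak and the few Gflop/s reachable by a streaming kernel is the reason data reuse matters so much on this architecture.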
