![Features of Professional English, page 1](http://file4.renrendoc.com/view10/M00/27/13/wKhkGWW4ke-AAJXFAAFfyHJYLD0209.jpg)
![Features of Professional English, page 2](http://file4.renrendoc.com/view10/M00/27/13/wKhkGWW4ke-AAJXFAAFfyHJYLD02092.jpg)
![Features of Professional English, page 3](http://file4.renrendoc.com/view10/M00/27/13/wKhkGWW4ke-AAJXFAAFfyHJYLD02093.jpg)
![Features of Professional English, page 4](http://file4.renrendoc.com/view10/M00/27/13/wKhkGWW4ke-AAJXFAAFfyHJYLD02094.jpg)
![Features of Professional English, page 5](http://file4.renrendoc.com/view10/M00/27/13/wKhkGWW4ke-AAJXFAAFfyHJYLD02095.jpg)
An Overview of High Performance Computing and Challenges for the Future

1/30/2024
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
Oak Ridge National Laboratory
University of Manchester

Outline

- Top500 Results
- Four Important Concepts that Will Affect Math Software
  - Effective Use of Many-Core
  - Exploiting Mixed Precision in Our Numerical Computations
  - Self Adapting / Auto Tuning of Software
  - Fault Tolerant Algorithms

H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem)
- Updated twice a year: SC'xy in the States in November; meeting in Germany in June
- All data available from [link not preserved in the source]
- Figure labels: Size, Rate, TPP performance

Performance Development

- Figure annotations: My Laptop; 6-8 years

29th List / June 2007

29th List: The TOP10

| Rank (previous list) | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Proc |
|---|---|---|---|---|---|---|---|
| 1 | IBM | BlueGene/L, eServer Blue Gene | 280.6 | DOE/NNSA/LLNL | USA | 2005 | 131,072 |
| 2 (10) | Cray | Jaguar, Cray XT3/XT4 | 101.7 | DOE/ORNL | USA | 2007 | 23,016 |
| 3 (2) | Sandia/Cray | Red Storm, Cray XT3 | 101.4 | DOE/NNSA/Sandia | USA | 2006 | 26,544 |
| 4 (3) | IBM | BGW, eServer Blue Gene | 91.29 | IBM Thomas Watson | USA | 2005 | 40,960 |
| 5 | IBM | New York Blue, eServer Blue Gene | 82.16 | Stony Brook/BNL | USA | 2007 | 36,864 |
| 6 (4) | IBM | ASC Purple, eServer pSeries p575 | 75.76 | DOE/NNSA/LLNL | USA | 2005 | 12,208 |
| 7 | IBM | BlueGene/L, eServer Blue Gene | 73.03 | Rensselaer Polytechnic Institute/CCNI | USA | 2007 | 32,768 |
| 8 | Dell | Abe, PowerEdge 1955, Infiniband | 62.68 | NCSA | USA | 2007 | 9,600 |
| 9 (5) | IBM | MareNostrum, JS21 Cluster, Myrinet | 62.63 | Barcelona Supercomputing Center | Spain | 2006 | 12,240 |
| 10 | SGI | HLRB-II, SGI Altix 4700 | 56.52 | LRZ | Germany | 2007 | 9,728 |

Performance Projection (figure)

Cores per System - June 2007 (figure)

- 14 systems > 50 Tflop/s
- 88 systems > 10 Tflop/s
- 326 systems > 5 Tflop/s

Chips Used in Each of the 500 Systems

- 58% Intel
- 17% IBM
- 21% AMD
- Together: 96%

Interconnects / Systems

- Chart labels: (206), (46), (128)
- GigE + Infiniband + Myrinet = 74%

Countries / Systems

| Rank | Site | Manufact | Computer | Procs | RMax | Segment | Interconnect Family |
|---|---|---|---|---|---|---|---|
| 66 | CINECA | IBM | eServer 326 Opteron Dual | 5120 | 12608 | Academic | Infband |
| 132 | SCS S.r.l. | HP | Cluster Platform 3000 Xeon | 1024 | 7987.2 | Research | Infband |
| 271 | Telecom Italia | HP | SuperDome 875 MHz | 3072 | 5591 | Industry | Myrinet |
| 295 | Telecom Italia | HP | Cluster Platform 3000 Xeon | 740 | 5239 | Industry | Gige |
| 305 | Esprinet | HP | Cluster Platform 3000 Xeon | 664 | 5179 | Industry | Gige |

Power is an Industry Wide Problem

- "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, June 14, 2006
- Photo: New Google plant in The Dalles, Oregon, from NYT, June 14, 2006
- Google facilities
  - leveraging hydroelectric power
  - old aluminum plants
- >500,000 servers worldwide

Gflop/KWatt in the Top 20 (figure)

IBM BlueGene/L #1

| Level | Contents | Processors | Peak | Memory |
|---|---|---|---|---|
| Chip | 2 processors | 2 | 2.8/5.6 GF/s | 4 MB (cache) |
| Compute Card | 2 chips, 2x1x1 | 4 | 5.6/11.2 GF/s | 1 GB DDR |
| Node Board | 32 chips, 4x4x2; 16 compute cards | 64 | 90/180 GF/s | 16 GB DDR |
| Rack | 32 node boards, 8x8x16 | 2,048 | 2.9/5.7 TF/s | 0.5 TB DDR |
| System | 64 racks, 64x32x32 | 131,072 | 180/360 TF/s | 32 TB DDR |

- Chip power: 17 watts
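As a rough check of the BlueGene/L packaging figures above, the sketch below multiplies the fan-out at each level and scales a per-processor peak. The fan-out values are read from the slide; the per-processor peak of 2.8 GF/s is an assumption obtained by halving the upper per-chip figure (5.6 GF/s), and the names `LEVELS` and `packaging_totals` are illustrative only.

```python
# Packaging hierarchy as read from the slide:
# (level name, number of units of the previous level it contains)
LEVELS = [
    ("chip", 2),          # 2 processors per chip
    ("compute card", 2),  # 2 chips per compute card
    ("node board", 16),   # 16 compute cards per node board
    ("rack", 32),         # 32 node boards per rack
    ("system", 64),       # 64 racks in the full system
]

# Assumption: upper per-chip figure (5.6 GF/s) divided by its 2 processors.
PEAK_PER_PROCESSOR_GFLOPS = 2.8


def packaging_totals(levels=LEVELS, per_proc=PEAK_PER_PROCESSOR_GFLOPS):
    """Return {level: (processor count, peak in GF/s)} for each level."""
    totals = {}
    processors = 1
    for name, fan_out in levels:
        processors *= fan_out
        totals[name] = (processors, processors * per_proc)
    return totals


for name, (procs, gflops) in packaging_totals().items():
    print(f"{name:13s} {procs:>7d} processors  {gflops:>10.1f} GF/s")
```

Under these assumptions the system level comes out at 131,072 processors and roughly 367 TF/s, which the table rounds to 360 TF/s; the node board (179.2 GF/s) and rack (about 5.7 TF/s) values match the table's upper figures.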
- 131,072 cores
- Total of 33 systems in the Top500
- "Fastest Computer": BG/L, 700 MHz, 131K processors, 64 racks
- Peak: 367 Tflop/s
- Linpack: 281 Tflop/s, 77% of peak (13K sec, about 3.6 hours; n = 1.8M)
- 1.6 MWatts (1600 homes)
- 43,000 ops/s/person

BlueGene/L Compute ASIC

- Full system total of 131,072 processors
- The compute node ASICs include all networking and processor functionality.
- Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).

Lower Voltage, Increase Clock Rate & Transistor Density

- We have seen an increasing number of gates on a chip and increasing clock speed.
- Heat is becoming an unmanageable problem: Intel processors > 100 Watts.
- We will not see the dramatic increases in clock speeds in the future.
- However, the number of gates on a chip will continue to increase.
- Increasing the number of gates into a tight knot and decreasing the cycle time of the processor
- Figure labels: Core, Cache; cores C1 to C4 sharing a Cache, repeated

Power Cost of Frequency

- $\text{Power} \propto \text{Voltage}^2 \times \text{Frequency}$ ($V^2F$)
- $\text{Frequency} \propto \text{Voltage}$
- $\text{Power} \propto \text{Frequency}^3$

What's Next?

- Figure labels: SRAM + 3D Stacked Memory; Many Floating-Point Cores; All Large Core; Mixed Large and Small Core; All Small Core; Many Small Cores
- Different classes of chips: Home, Games / Graphics, Business, Scientific

Novel Opportunities in Multicores

- Don't have to contend with uniprocessors
- Not your same old multiprocessor problem
- How does going from multiprocessors to multicores impact programs?
- What changed? Where is the impact?
  - Communication bandwidth
  - Communication latency

Communication Bandwidth

- How much data can be communicated between two cores?
- What changed?
  - Number of wires
  - Clock rate
  - Multiplexing
- Impact on programming model?
  - Massive data exchange is possible
  - Data movement is not the bottleneck, so processor affinity is not that important
- 32 Gigabits/sec versus ~300 Terabits/sec: 10,000X

Communication Latency

- How long does it take for a round trip communication?
- What changed?
  - Length of wire
  - Pipeline stages
- Impact on programming model?
  - Ultra-fast synchronization
  - Can run real-time apps on multiple cores
- ~200 cycles versus ~4 cycles: 50X

80 Core

- Intel's 80 Core chip
  - 1 Tflop/s
  - 62 Watts
  - 1.2 TB/s internal BW

NSF Track 1 - NCSA/UIUC

- $200M
- 10 Pflop/s
- 40K 8-core 4 GHz IBM Power7 chips
- 1.2 PB memory
- 5 PB/s global bandwidth
- Interconnect BW of 0.55 PB/s
- 18 PB disk at 1.8 TB/s I/O bandwidth
- For use by a few people

NSF UTK/JICS Track 2 proposal

- $65M over 5 years for a 1 Pflop/s system
  - $30M over 5 years for equipment: 36 cabinets of a Cray XT5 (AMD 8-core/chip, 12 socket/board, 3 GHz, 4 flops/cycle/core)
  - $35M over 5 years for operations
    - Power cost: $1.1M/year
    - Cray maintenance: $1M/year
- To be used by the NSF community: 1000's of users
- Joins UCSD, PSC, TACC
- Last year's Track 2 award went to U of Texas

Major Changes to Software

- Must rethink the design of our software
  - Another disruptive technology
  - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
- Numerical libraries, for example, will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
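To make the earlier "Power Cost of Frequency" relations concrete, and to show why they push chip designs toward the multicore parts that motivate these software changes, here is a minimal sketch. It assumes the slide's proportionalities hold exactly (so power scales as frequency cubed) and, as a further idealisation not taken from the slides, that throughput scales linearly with both clock rate and core count. The function names are illustrative.

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at nominal clock.

    Assumes Power ∝ Voltage^2 × Frequency and Frequency ∝ Voltage,
    which together give Power ∝ Frequency^3 per core.
    """
    return cores * freq_scale ** 3


def relative_throughput(freq_scale, cores=1):
    """Idealised throughput: perfect scaling with clock rate and core count."""
    return cores * freq_scale


configs = [
    ("1 core, nominal clock", 1.0, 1),
    ("1 core, clock +20%", 1.2, 1),
    ("2 cores, clock -20%", 0.8, 2),
]
for label, freq, cores in configs:
    print(f"{label:22s} power x{relative_power(freq, cores):.3f}  "
          f"throughput x{relative_throughput(freq, cores):.2f}")
```

Under these assumptions, raising the clock by 20% costs about 73% more power for 20% more throughput, while two cores at 80% of the clock deliver about 1.6 times the throughput for roughly the same power as the original core, provided the software can actually use both cores.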
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Algorithms follow hardware evolution in time:

- LINPACK (80's), vector operations: relies on Level-1 BLAS operations
- LAPACK (90's), blocking, cache friendly: relies on Level-3 BLAS operations
- PLASMA (00's), new algorithms, many-core friendly: relies on
  - a DAG/scheduler
  - block data layout
  - some extra kernels

Those new algorithms

- have a very low granularity, so they scale very well (multicore, petascale computing, ...)
- remove a lot of dependencies among the tasks (multicore, distributed computing)
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels

Those new algorithms need new kernels and rely on efficient scheduling algorithms.

Steps in the LAPACK LU

- Factor a panel
- Backward swap
- Forward swap
- Triangular solve
- Matrix multiply

LU Timing Profile (4 processor system)

- 1D decomposition and SGI Origin
- Time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM
- Threads, no lookahead
- Bulk synchronous phases

Adaptive Lookahead - Dynamic

- Event driven multithreading
- Reorganizing algorithms to use this approach

Fork-Join vs. Dynamic Execution

- Fork-Join: parallel BLAS
- DAG-based: dynamic scheduling
- Experiments on Intel's Quad Core Clovertown with 2 sockets w/ 8 threads
- Figure: timelines of tasks (labelled A, B, C, T) plotted against time
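The contrast between fork-join and DAG-based dynamic execution can be sketched with a toy scheduler. Everything in the example is hypothetical: the task names (`P1`, `U1a`, ...), their durations, and the dependency graph are invented to mimic two LU steps (a panel factorization followed by trailing updates) and are not taken from the slides or from PLASMA itself. The scheduler is a simple greedy list scheduler, not the runtime used in the experiments.

```python
import heapq


def dynamic_makespan(durations, deps, workers):
    """Event-driven list scheduling: a free worker starts any ready task."""
    waiting = {t: set(deps.get(t, ())) for t in durations}
    ready = [t for t in durations if not waiting[t]]
    running = []  # min-heap of (finish_time, task)
    now = 0.0
    while ready or running:
        # Fill idle workers with tasks whose inputs are all complete.
        while ready and len(running) < workers:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
        # Advance time to the next task completion.
        now, finished = heapq.heappop(running)
        for task, pending in waiting.items():
            if finished in pending:
                pending.remove(finished)
                if not pending:
                    ready.append(task)
    return now


def fork_join_makespan(durations, phases, workers):
    """Bulk-synchronous execution: a barrier closes every phase."""
    total = 0.0
    for phase in phases:
        total += dynamic_makespan({t: durations[t] for t in phase}, {}, workers)
    return total


# Hypothetical two-step LU: P = panel factorization, U = trailing update.
durations = {"P1": 2, "U1a": 1, "U1b": 1, "U1c": 1, "P2": 2, "U2a": 1, "U2b": 1}
deps = {
    "U1a": ["P1"], "U1b": ["P1"], "U1c": ["P1"],
    "P2": ["U1a"],            # lookahead: next panel needs only one update
    "U2a": ["P2", "U1b"], "U2b": ["P2", "U1c"],
}
phases = [["P1"], ["U1a", "U1b", "U1c"], ["P2"], ["U2a", "U2b"]]

print("fork-join:", fork_join_makespan(durations, phases, workers=2))
print("dynamic:  ", dynamic_makespan(durations, deps, workers=2))
```

In this toy case the dynamic schedule finishes earlier because the second panel starts as soon as the single update it depends on is done, overlapping with the remaining updates instead of waiting at a barrier.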
- Figure annotation: time saved

With the Hype on Cell & PS3 We Became Interested

- The PlayStation 3's CPU is based on a "Cell" processor.
- Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit; SPE: SPU + DMA engine).
  - An SPE is a self-contained vector processor which acts independently from the others.
  - 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz
  - 204.8 Gflop/s peak!
  - The catch is that this is for 32 bit floating point (single precision, SP),
  - and 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    - Divide SP peak by 14: factor of 2 because of DP and 7 because of latency issues
- Figure label: SPE ~25 Gflop/s peak

Performance of Single Precision on Conventional Processors

- We realized we have a similar situation on our commodity processors; that is, SP is 2X as fast as DP on many systems.
- Single precision is faster because of:
  - Higher parallelism in SSE/vector units
  - Reduced data motion
  - Higher locality in cache
- The Intel Pentium and AMD Opteron have SSE2
  - 2 flops/cycle DP
  - 4 flops/cycle SP
- IBM PowerPC has AltiVec
  - 8 flops/cycle SP
  - 4 flops/cycle DP (no DP on AltiVec)

| Architecture | Size | SGEMM/DGEMM | Size | SGEMV/DGEMV |
|---|---|---|---|---|
| AMD Opteron 246 | 3000 | 2.00 | 5000 | 1.70 |
| UltraSparc-IIe | 3000 | 1.64 | 5000 | 1.66 |
| Intel PIII Coppermine | 3000 | 2.03 | 5000 | 2.09 |
| PowerPC 970 | 3000 | 2.04 | 5000 | 1.44 |
| Intel Woodcrest | 3000 | 1.81 | 5000 | 2.18 |
| Intel XEON | 3000 | 2.04 | 5000 | 1.82 |
| Intel Centrino Duo | 3000 | 2.71 | 5000 | 2.21 |

32 or 64 bit Floating Point Precision?

- A long time ago 32 bit floating point was used
  - Still used in scientific apps but limited
- Most apps use 64 bit floating point
  - Accumulation of round off error
    - A 10 Tflop/s computer running for 4 hours performs roughly $1.4 \times 10^{17}$ operations, approaching the exaflop ($10^{18}$) scale.
  - Ill conditioned problems
  - IEEE SP exponent bits too few (8 bits, $10^{\pm 38}$)
  - Critical sections need higher precision
    - Sometimes need extended precision (128 bit fl pt)
  - However some can get by with 32 bit fl pt in some parts
- Mixed precision a possibility
  - Approximate in lower precision and then refine or improve the solution to high precision.

Idea Goes Something Like This...

- Exploit 32 bit floating point as much as possible, especially for the bulk of the computation.
- Correct or update the solution with selective use of 64 bit floating point to provide a refined result.
- Intuitively:
  - Compute a 32 bit result,
  - Calculate a correction to the 32 bit result using selected higher precision, and
  - Perform the update of the 32 bit result with the correction using high precision.

Mixed-Precision Iterative Refinement

Iterative refinement for dense systems, Ax = b, can work this way.

    L U = lu(A)                       SINGLE   O(n^3)
    x = U\(L\b)                       SINGLE   O(n^2)
    r = b - Ax                        DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = U\(L\r)                   SINGLE   O(n^2)
        x = x + z                     DOUBLE   O(n^1)
        r = b - Ax                    DOUBLE   O(n^2)
    END
- Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
  - Requires extra storage; total is 1.5 times normal
  - $O(n^3)$ work is done in lower precision
  - $O(n^2)$ work is done in high precision
  - Problems if the matrix is ill-conditioned in SP; $O(10^8)$
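The refinement loop described in this part of the talk can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation behind the reported results: the helper names (`lu_factor32`, `lu_solve32`, `mixed_precision_solve`), the relative-residual stopping test, the tolerance, and the iteration cap are all choices made here. The factorization is a plain, unblocked LU with partial pivoting written out by hand so that every step visibly runs in `float32`.

```python
import numpy as np


def lu_factor32(A):
    """LU with partial pivoting carried out entirely in float32."""
    LU = np.array(A, dtype=np.float32)
    n = LU.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + int(np.argmax(np.abs(LU[k:, k])))
        if p != k:                      # swap rows k and p
            LU[[k, p]] = LU[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        LU[k + 1:, k] /= LU[k, k]       # multipliers (column of L)
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    return LU, piv


def lu_solve32(LU, piv, b):
    """Forward and back substitution with the float32 factors."""
    n = LU.shape[0]
    y = np.array(b, dtype=np.float32)[piv]
    for i in range(n):                  # L y = P b (unit lower triangular)
        y[i] -= LU[i, :i] @ y[:i]
    for i in range(n - 1, -1, -1):      # U x = y
        y[i] = (y[i] - LU[i, i + 1:] @ y[i + 1:]) / LU[i, i]
    return y


def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: O(n^3) work in float32, O(n^2) refinement in float64."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    LU, piv = lu_factor32(A)                          # SINGLE, O(n^3)
    x = lu_solve32(LU, piv, b).astype(np.float64)     # SINGLE, O(n^2)
    for iteration in range(max_iter):
        r = b - A @ x                                 # DOUBLE, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, iteration
        z = lu_solve32(LU, piv, r)                    # SINGLE, O(n^2)
        x = x + z.astype(np.float64)                  # DOUBLE, O(n)
    return x, max_iter


rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true
x, iterations = mixed_precision_solve(A, b)
print("refinement iterations:", iterations)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

The test matrix is deliberately well conditioned; as the notes above warn, the loop may fail to converge when the matrix is ill-conditioned in single precision. The storage overhead is also visible here: the float32 copy `LU` sits alongside the float64 matrix `A`, giving the 1.5 times total mentioned above.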
Results for Mixed Precision Iterative Refinement for Dense Ax = b

- Single precision is faster than DP because of:
  - Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
  - Reduced data motion: 32 bit data instead of 64 bit data
  - Higher locality in cache: more data items in cache

| Architecture (BLAS-MPI) | # procs | n | DP Solve / SP Solve | DP Solve / Iter Ref | # iter |
|---|---|---|---|---|---|
| AMD Opteron (Goto - OpenMPI MX) | 32 | 22627 | 1.85 | 1.79 | 6 |
| AMD Opteron (Goto - OpenMPI MX) | 64 | 32000 | 1.90 | 1.83 | 6 |

What about the Cell?

- PowerPC at 3.2 GHz
  - DGEMM at 5 Gflop/s
  - Altivec peak at 25.6 Gflop/s; achieved 10 Gflop/s SGEMM
- 8 SPUs
  - 204.8 Gflop/s peak!
  - The catch is that this is for 32 bit floating point (single precision, SP),
  - and 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    - Divide SP peak by 14: factor of 2 because of DP and 7 because of latency issues
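The Cell peak figures quoted above follow from a short calculation, sketched below. The clock rate, SIMD width, SPE count, and the 14.6 Gflop/s double-precision figure come from the slides; the factor of 2 flops per lane per cycle is an assumption made here (one fused multiply-add per lane per cycle) to reproduce the 25.6 Gflop/s per-SPE number.

```python
CLOCK_GHZ = 3.2        # from the slide
SIMD_LANES = 4         # 4-way single-precision SIMD per SPE (from the slide)
FLOPS_PER_LANE = 2     # assumption: one fused multiply-add per lane per cycle
N_SPE = 8              # SPEs per Cell (from the slide)
CELL_DP_GFLOPS = 14.6  # double-precision total quoted on the slide

spe_sp_peak = CLOCK_GHZ * SIMD_LANES * FLOPS_PER_LANE   # Gflop/s per SPE
cell_sp_peak = N_SPE * spe_sp_peak                      # Gflop/s per Cell
sp_over_dp = cell_sp_peak / CELL_DP_GFLOPS              # SP/DP peak ratio

print(f"per-SPE SP peak: {spe_sp_peak:.1f} Gflop/s")
print(f"Cell SP peak:    {cell_sp_peak:.1f} Gflop/s")
print(f"SP/DP ratio:     {sp_over_dp:.1f}")
```

The ratio comes out at about 14, which is the "divide SP peak by 14" rule of thumb on the slide.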
Moving Data Around on the Cell

- Figure labels: 256 KB; injection bandwidth 25.6 GB/s
- Worst case: memory bound operations (no reuse of data)
  - 3 data movements (2 in and 1 out) with 2 ops (SAXPY)
  - For the Cell this would be about 4.3 Gflop/s (25.6 GB/s × 2 ops / 12 B)

IBM Cell 3.2 GHz, Ax = b

- Figure annotations: 0.30 secs; 0.47 secs; 3.9 secs; 8.3X; 8 SGEMM (embarrassingly parallel)

Cholesky on the Cell, Ax = b, $A = A^T$, $x^T A x > 0$

- For the SPEs: standard C code and C language SIMD extensions (intrinsics)
- Figure legend: single precision performance; mixed precision performance using iterative refinement (method achieving 64 bit accuracy)

Cholesky - Using 2 Cell Chips (figure)

Intriguing Potential

- Exploit lower precision as much as possible
  - Payoff in performance
    - Faster floating point
    - Less data to move
- Automatically switch between SP and DP to match the desired accuracy
  - Compute the solution in SP and then a correction to the solution in DP
- Potential for GPU, FPGA, special purpose processors
  - What about 16 bit floating point?
  - Use as little as you can get away with and improve the accuracy
- Applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems, where Newton's method is used.
  - Correction = - A\(b - Ax)

IBM/Mercury Cell Blade

- From IBM or Mercury
  - 2 Cell chips, each w/ 8 SPEs
  - 512 MB/Cell
  - ~$8K-17K
  - Some SW

Sony Playstation 3 Cluster PS3-T

- From WAL*MART: PS3
  - 1 Cell chip w/ 6 SPEs
  - 256 MB/PS3
  - $600
  - Download SW
  - Dual boot

Cell Hardware Overview

- Figure labels: SIT CELL; PowerPC and 8 PEs at 25.6 Gflop/s each
- 3.2 GHz
- 25 GB/s injection bandwidth
- 200 GB/s between SPEs
- 32 bit peak perf: 8 × 25.6 Gflop/s, 204.8 Gflop/s peak
- 64 bit peak perf: 8 × 1.8 Gflop/s, 14.6 Gflop/s peak
- 512 MiB memory

PS3 Hardware Overview

- Figure labels: SIT CELL; PowerPC and 6 PEs; Game OS; Hypervisor; Disabled/Broken: yield issues; 1 Gb
- 3.2 GHz
- 25 GB/s injection bandwidth
- 200 GB/s between SPEs
- 32 bit peak perf: 6 × 25.6 Gflop/s, 153.6 Gflop/s peak
- 64 bit peak perf: 6 × 1.8 Gflop/s, 10.8 Gflop/s peak
- 256 MiB memory
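Returning to the worst-case estimate on the "Moving Data Around on the Cell" slide, the bound for an operation with no data reuse is simply bandwidth times flops per byte moved. The sketch below evaluates the slide's expression (25.6 GB/s × 2 ops / 12 B) and compares it with the single-precision peaks of the two hardware overviews. The function name and the assumption of 4-byte single-precision operands are choices made here for illustration.

```python
def memory_bound_gflops(bandwidth_gb_s, flops_per_element, bytes_per_element):
    """Upper bound on Gflop/s when every operand must cross the memory link."""
    return bandwidth_gb_s * flops_per_element / bytes_per_element


INJECTION_BW_GB_S = 25.6   # from the slide
SAXPY_FLOPS = 2            # one multiply and one add per element
SAXPY_BYTES_SP = 3 * 4     # 2 loads + 1 store, assuming 4-byte floats

saxpy_bound = memory_bound_gflops(INJECTION_BW_GB_S, SAXPY_FLOPS, SAXPY_BYTES_SP)

PEAK_SP_GFLOPS = {"Cell (8 SPEs)": 8 * 25.6, "PS3 (6 SPEs)": 6 * 25.6}

print(f"SAXPY memory-bound rate: {saxpy_bound:.2f} Gflop/s")
for name, peak in PEAK_SP_GFLOPS.items():
    print(f"{name}: {100 * saxpy_bound / peak:.1f}% of {peak:.1f} Gflop/s SP peak")
```

Evaluated as written, the expression gives roughly 4.3 Gflop/s, only a few percent of the single-precision peak, which is why kernels with heavy data reuse such as SGEMM matter so much on this architecture.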