并行计算课件_第1页
并行计算课件_第2页
并行计算课件_第3页
并行计算课件_第4页
并行计算课件_第5页
已阅读5页,还剩52页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

GPUArchitectureGPUArchitectureindetailandBinZHOU1•Generalguideline••Generalguideline••Keplerin3~~~~peakinst/sComparetopeakGB/sFindGeneralOptimizationStrategies:GeneralOptimizationStrategies:•FindoutthelimitingfactorinkernelMemorybandwidthbound(memory•Measureeffectivememory/instructionMemory•IftheMemory•Ifthecodeismemory-boundandeffectivememorythroughputismuchlowerthanthepeak•Purpose:accessonlydatathatareabsolutely•MajorReduceredundantaccess:read-onlycache,shared•Ifyou•IfyoufindoutthecodeisinstructionCompute-intensivealgorithmcaneasilybecomememory-boundifnotcarefulTypically,worryaboutinstructionoptimizationaftermemoryandexecutionconfigurationoptimizations•–Uselessinstructionstogetthesamejob•MajorUsehighthroughputinstructions(ex.widerReducewastedinstructions:branchdivergence,reducereplay(conflict),•Whenthecode•Whenthecodeislatency–Boththememoryandinstructionthroughputsarefarfromthe•Latencyhiding:switching–Athreadblockswhenoneoftheoperandsisn’t•Purpose:haveenoughwarpstohide•Majortechniques:increaseactivewarps,increaseCPU-GPUCPU-GPU9MinimizeCPU-GPUdataHost<->devicedataMinimizeCPU-GPUdataHost<->devicedatatransferhasmuchlowerbandwidththanglobalmemoryaccess.16GB/s(PCIex16Gen3)vs250GB/s&3.95Tinst/sSometimesit’sevenbettertorecomputeonGPUMoveCPUcodestoGPUthatdonothaveperformancegainsifitcanreducedatatransferGroupOnelargetransfermuchbetterthanmanysmallonesOverlapmemorytransferwithcomputationRevisitGPURevisitGPUProcessing CopyinputdatafromCPUmemorytoRevisitGPURevisitGPUProcessingCopyinputdatafromCPUmemorytoLoadGPUcodeandexecuteRevisitGPURevisitGPUProcessingCopyinputdatafromCPUmemorytoLoadGPUcodeandexecuteCopyresultsfromGPUmemorytoCPU•𝑇𝑡𝑜𝑡𝑎𝑙•𝑇𝑡𝑜𝑡𝑎𝑙=𝑇𝐻𝑡𝑜𝐷+𝑇𝐸𝑥𝑒𝑐+•MoreStreamStreamStreamStreamStreamsrc1,size,kernel<<<grid,Streamsrc1,size,kernel<<<grid,0,src1,size,StreamStreamcudaMemcpyAsync(dst2,src2,size,kernel<<<grid,block,0,kernel<<<grid,block,0,cudaMemcpyAsync(dst1,src1,size,cudaMemcpyDeviceToHost,KEPLERKEPLERIN•NVIDIA1.31tflops•NVIDIA1.31tflopsdouble3.95tflopssingle250gb/secmemory2,688FunctionalUnits•#1onTop500inNVIDIAGK110-KeplerGK110SMXKeplerGK110SMXvsFermi3xPowergoesNewISAEncoding:NewISAEncoding:255Registersper•Fermilimit:63registersperAcommonFermiperformanceLeadstoexcessive•Kepler:Upto255registersper–EspeciallyhelpfulforFP64•FeatureofKeplerK20GPUstoincreaseapplication•FeatureofKeplerK20GPUstoincreaseapplicationthroughputbyenablingworktobescheduledontotheGPUinparallel•TwowaystotakeCUDAStreams–nowtheyreallyareCUDAProxyforMPI–concurrentCUDAMPIprocessesononeGPUWorkWorkKeplerGridManagementPending&SuspendedStreamQueueWorkWorkKeplerGridManagementPending&SuspendedStreamQueueyemFermiA<<<>>>;B<<<>>>StreamP<<<>>>FermiA<<<>>>;B<<<>>>StreamP<<<>>>;Q<<<>>>StreamHardwareWorkX<<<>>>;Y<<<>>>;StreamFermiallows16-way–––Upto16gridscanrunatButCUDAstreamsmultiplexintoasingleOverlaponlyatstreamA<<<>>>;BStreamP<<<>>>;Q<<<>>>;StreamX<<<>>>;YA<<<>>>;BStreamP<<<>>>;Q<<<>>>;StreamX<<<>>>;Y<<<>>>;MultipleHardwareWorkStreamKeplerallows32-wayOneworkqueueperConcurrencyatfull-streamlevelNointer-streamdependenciesABCDEFCPUSharedABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedABCDEFCPUABCDEFCPUSharedHyper-Q:SimultaneousCPUCUDASharedClientHyper-Q:SimultaneousCPUCUDASharedClient–ServerSoftwareABCDEFABCDEFDDCFACCBBEFFEBAAEDDDCFACCBBEFFEBAAEDWhatisDynamicTheabilitytolaunchnewWhatisDynamicTheabilitytolaunchnewkernelsfromtheDynamically-basedonrun-timeSimultaneously-frommultiplethreadsatIndependently-eachthreadcanlaunchadifferentFermi:OnlyCPUcangenerateGPUKepler:GPUcangenerateworkforWhatDoesItGPUasCo-Autonomous,DynamicWhatDoesItGPUasCo-Autonomous,DynamicNewTypesof••NewTypesof••RecursiveParallelAlgorithmslikeQuickAdaptiveMeshAlgorithmslikeComputationalallocatedtoregionsofCUDAonFamiliarProgrammingAXBYCZglobalFamiliarProgrammingAXBYCZglobalvoidB(float{X<<<...>>>Y<<<...>>>Z<<<...>>>}intmain()float*data;A<<<...>>>B<<<...>>>C<<<...>>>return0;}CodeLaunchisper-andCodeLaunchisper-anddevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=if(tid==0)launch<<<128,256>>>(buf);}}CodeLaunchisper-anddevicefloatCodeLaunchisper-anddevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=CUDAprimitivesareper-launchedkernelsandCUDAobjectslikestreamsarevisibletoallthreadsinathreadblockcannotbepassedtochildif(tid==0)launch<<<128,256>>>(buf);}}CodeLaunchisper-anddevicefloatCodeLaunchisper-anddevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=CUDAprimitivesareper-Syncincludesallbyanythreadintheif(tid==0)launch<<<128,256>>>(buf);}}CodeLaunchisper-anddevicefloatglobalCodeLaunchisper-anddevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=CUDAprimitivesareper-Syncincludesallbyanythreadintheif(tid==0)launch<<<128,256>>>(buf);}cudaDeviceSynchronize()imply}MemoryCodeLaunchimplies(childseesparentstateMemoryCodeLaunchimplies(childseesparentstateattimeofdevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=if(tid==0)launch<<<128,256>>>(buf);}}MemoryCodeLaunchimplies(childseesparentstateMemoryCodeLaunchimplies(childseesparentstateattimeofSyncimplies(parentseeschildwritesafterdevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=if(tid==0)launch<<<128,256>>>(buf);}}MemoryCodeLaunchimplies(childseesparentstateatMemoryCodeLaunchimplies(childseesparentstateattimeofSyncimplies(parentseeschildwritesafterdevicefloatglobalvoidcnp(float{inttid=threadIdx.x;if(tid%2)buf[tid/2]=Local&sharedmemoryConstantsareif(tid==0)launch<<<128,256>>>(buf);}syncthreads();if(tid==0){cudaMemcpyAsync(data,buf,}ProvidestechnologynecessarytoenablelowerlatencyProvidestechnologynecessarytoenablelowerlatencymemorytransfersbetweenGPUandotherPCIEdeviceswithoutrequiringcustomhardware.••APIanddocumentationfordevicedriver•AvailableonLinux•SupportedonKeplerQuadroandTelsaNVIDIAGPUDirect™NowSupportsNVIDIAGPUDirect™NowSupportsMorethreadsareMorethreadsare•••2-3xthroughputperclockperMemorybandwidthBiggerSMhavebiggerMorethreadare•Ifyoualreadylaunchedenoughthreads,thefollowingenhancementonkeplerMorethreadare•Ifyoualreadylaunchedenoughthreads,thefollowingenhancementonkeplerwillensureenoughactivewarpsonSMs.2xregisterfileoneach•–––E.g.63registersperthread,blockDimInFermi16activewarpsInKepler32active•2xsimultaneousblocksper–E.g.16registersperthread,blockDim–InFermi96*8/32=24active–Inkepler96*16/32=46active•Ifonekernelcan’tIfonekernelcan’tlaunchenough•ConcurrentGK110allowsupto32concurrentkernelstoHyper-Using

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论