Parallel Computing Courseware

Uploaded: 2024-01-17 · Format: DOCX · Pages: 57 · Size: 2.15 MB


GPU Architecture in Detail and [...]
Bin ZHOU

Outline
• General guideline
• [...]
• Kepler in [...]

General Optimization Strategies
• Find out the limiting factor in kernel performance
  – Memory bandwidth bound (memory optimization)
  – [...]
• Measure effective memory/instruction throughput
  – Compare to peak GB/s and peak inst/s

Memory Optimization
• If the code is memory-bound and effective memory throughput is much lower than the peak
• Purpose: access only data that are absolutely necessary
• Major techniques:
  – [...]
  – Reduce redundant access: read-only cache, shared memory

Instruction Optimization
• If you find out the code is instruction bound
  – A compute-intensive algorithm can easily become memory-bound if not careful
  – Typically, worry about instruction optimization after memory and execution configuration optimizations
• Purpose: use fewer instructions to get the same job done
• Major techniques:
  – Use high-throughput instructions (ex. wider [...])
  – Reduce wasted instructions: branch divergence, reduce replay (conflict), [...]

Latency Optimization
• When the code is latency bound
  – Both the memory and instruction throughputs are far from the peak
• Latency hiding: switching [...]
  – A thread blocks when one of the operands isn't ready
• Purpose: have enough warps to hide latency
• Major techniques: increase active warps, increase [...]

Minimize CPU-GPU Data Transfer
• Host<->device data transfer has much lower bandwidth than global memory access
  – 16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s
• Sometimes it's even better to recompute on the GPU
• Move CPU code to the GPU even if it brings no performance gain by itself, when doing so reduces data transfer
• Group transfers: one large transfer is much better than many small ones
• Overlap memory transfer with computation

Revisit GPU Processing [...]
1. Copy input data from CPU memory to GPU memory
2. Load GPU code and execute it
3. Copy results from GPU memory to CPU memory
• $T_{total} = T_{HtoD} + T_{Exec} + T_{DtoH}$
• More [...]

Streams
    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToHost, ...
    kernel<<<grid, block, 0, ...
    cudaMemcpyAsync(dst2, src2, size, ...
    kernel<<<grid, block, 0, ...

KEPLER IN [...]
NVIDIA [...]
• 1.31 TFLOPS double precision
• 3.95 TFLOPS single precision
• 250 GB/s memory bandwidth
• 2,688 functional units
• #1 on Top500 in [...]

NVIDIA GK110 - Kepler
Kepler GK110 SMX vs Fermi [...]
• 3x [...]
• Power goes [...]

New ISA Encoding: 255 Registers per Thread
• Fermi limit: 63 registers per thread
  – A common Fermi performance [...]
  – Leads to excessive spilling
• Kepler: up to 255 registers per thread
  – Especially helpful for FP64 [...]

Hyper-Q
• Feature of Kepler K20 GPUs to increase application throughput by enabling work to be scheduled onto the GPU in parallel
• Two ways to take advantage:
  – CUDA Streams: now they really are concurrent
  – CUDA Proxy for MPI: concurrent CUDA MPI processes on one GPU

[Figure labels: Work, Kepler Grid Management, Pending & Suspended, Stream Queue]

Fermi Concurrency
[Figure: streams "A<<<>>>; B<<<>>>; ...", "P<<<>>>; Q<<<>>>; ..." and "X<<<>>>; Y<<<>>>; ..." feeding one Hardware Work Queue]
• Fermi allows 16-way concurrency
  – Up to 16 grids can run at once
  – But CUDA streams multiplex into a single queue
  – Overlap only at stream [...]

Kepler Concurrency
[Figure: the same three streams feeding Multiple Hardware Work Queues]
• Kepler allows 32-way concurrency
  – One work queue per stream
  – Concurrency at full-stream level
  – No inter-stream dependencies

Hyper-Q: Simultaneous [...]
[Figure labels: A B C D E F, CPU, CUDA, Shared, Client–Server Software]

What is Dynamic Parallelism?
The ability to launch new kernels from the GPU
• Dynamically: based on run-time data
• Simultaneously: from multiple threads at once
• Independently: each thread can launch a different grid
Fermi: only the CPU can generate GPU work
Kepler: the GPU can generate work for itself

What Does It Mean?
• GPU as co-processor
• Autonomous, dynamic parallelism

New Types of Algorithms
• Recursive parallel algorithms like Quick[...]
• Adaptive mesh algorithms like [...]
• Computational [...] allocated to regions of [...]
CUDA on [...]

Familiar Programming Model
    __global__ void B(float ...)
    {
        X<<<...>>> ...
        Y<<<...>>> ...
        Z<<<...>>> ...
    }

    int main()
    {
        float *data;
        A<<<...>>> ...
        B<<<...>>> ...
        C<<<...>>> ...
        return 0;
    }

Code Example
    __device__ float buf[...];
    __global__ void cnp(float *data)
    {
        int tid = threadIdx.x;
        if (tid % 2)
            buf[tid/2] = ...;
        if (tid == 0) {
            launch<<<128, 256>>>(buf);
            ...
        }
        __syncthreads();
        if (tid == 0) {
            cudaMemcpyAsync(data, buf, ...);
        }
    }
• Launch is per-thread and [...]
• CUDA primitives are per-block
  – Launched kernels and CUDA objects like streams are visible to all threads in a thread block
  – They cannot be passed to a child [...]
• Sync includes all launches by any thread in the block
• cudaDeviceSynchronize() does not imply __syncthreads()

Memory [...]
• Launch implies [...] (child sees parent state at time of launch)
• Sync implies [...] (parent sees child writes after sync)
• Local & shared memory [...]
• Constants are [...]

NVIDIA GPUDirect™ Now Supports [...]
• Provides technology necessary to enable lower-latency memory transfers between the GPU and other PCIe devices without requiring custom hardware
• API and documentation for device driver [...]
• Available on Linux
• Supported on Kepler Quadro and Tesla

More threads are [...]
• 2-3x throughput per clock per [...]
• Memory bandwidth [...]
• Bigger SMs have bigger [...]
• If you already launched enough threads, the following enhancements on Kepler will ensure enough active warps on SMs.
• 2x register file on each SM
  – E.g. 63 registers per thread, blockDim [...]
  – In Fermi: 16 active warps
  – In Kepler: 32 active warps
• 2x simultaneous blocks per SM
  – E.g. 16 registers per thread, blockDim 96
  – In Fermi: 96*8/32 = 24 active warps
  – In Kepler: 96*16/32 = 48 active warps
• If one kernel can't launch enough [...]
  – Concurrent [...]: GK110 allows up to 32 concurrent kernels to [...]
  – Hyper-[...]
  – Using [...]
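
The active-warp figures quoted for Fermi and Kepler can be reproduced with a small host-side calculation. The following is a minimal sketch, not a definitive occupancy model. The names `SmLimits`, `kFermi`, `kKepler`, and `active_warps` are hypothetical, and the per-SM limits (32K registers, 8 blocks, 48 warps for Fermi; 64K registers, 16 blocks, 64 warps for Kepler GK110) are assumptions taken from the commonly documented values for those architectures. The sketch ignores register allocation granularity and shared memory usage, both of which can lower the real number.

```cpp
#include <algorithm>

// Per-SM resource limits (assumed values, see the note above).
struct SmLimits {
    int registers;   // 32-bit registers available per SM
    int max_blocks;  // maximum resident thread blocks per SM
    int max_warps;   // maximum resident warps per SM
};

const SmLimits kFermi  = {32768, 8, 48};   // Fermi SM (assumed limits)
const SmLimits kKepler = {65536, 16, 64};  // Kepler GK110 SMX (assumed limits)
const int kWarpSize = 32;

// Simplified estimate of active warps per SM: the number of resident
// blocks is the tightest of the block, register, and warp limits.
int active_warps(const SmLimits& sm, int regs_per_thread, int block_dim) {
    const int warps_per_block = (block_dim + kWarpSize - 1) / kWarpSize;
    const int blocks_by_regs  = sm.registers / (regs_per_thread * block_dim);
    const int blocks_by_warps = sm.max_warps / warps_per_block;
    const int blocks =
        std::min({sm.max_blocks, blocks_by_regs, blocks_by_warps});
    return blocks * warps_per_block;
}
```

With 16 registers per thread and a block size of 96, the block limit is the binding constraint: `active_warps(kFermi, 16, 96)` gives 8 blocks of 3 warps, or 24 warps, and `active_warps(kKepler, 16, 96)` gives 16 blocks of 3 warps, or 48 warps. With 63 registers per thread the register file becomes the limit instead. Assuming a block size of 256 (the slide's value is not recoverable, so this is only an illustration), the same function returns 16 warps for Fermi and 32 for Kepler.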
