




GPU Architecture in Detail and ...
Bin ZHOU

Outline
• General guideline
• Kepler in ...

General Optimization Strategies
• Find out the limiting factor in the kernel: memory bandwidth bound, instruction bound, or latency bound.
• Measure effective memory and instruction throughput and compare them to the peak GB/s and peak inst/s.

Memory Optimization
• Applies if the code is memory-bound and effective memory throughput is much lower than the peak.
• Purpose: access only data that are absolutely necessary.
• Major techniques include reducing redundant access: read-only cache, shared memory.

Instruction Optimization
• Applies if you find out the code is instruction bound.
  – A compute-intensive algorithm can easily become memory-bound if you are not careful.
  – Typically, worry about instruction optimization after memory and execution configuration optimizations.
• Purpose: use fewer instructions to get the same job done.
• Major techniques:
  – Use high-throughput instructions (ex. wider ...).
  – Reduce wasted instructions: branch divergence, replay (conflict), ...

Latency Optimization
• Applies when the code is latency bound: both the memory and instruction throughputs are far from the peak.
• Latency hiding: switching threads. A thread blocks when one of its operands isn't ready.
• Purpose: have enough warps to hide latency.
• Major techniques: increase active warps, increase ...

Minimize CPU-GPU Data Transfer
• Host<->device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s.
• Sometimes it's even better to recompute on the GPU.
• Move CPU code to the GPU even if it gains no performance by itself, if doing so reduces data transfer.
• Group transfers: one large transfer is much better than many small ones.
• Overlap memory transfer with computation.

Revisit GPU Processing
1. Copy input data from CPU memory to GPU memory.
2. Load GPU code and execute it.
3. Copy results from GPU memory to CPU memory.
• T_total = T_HtoD + T_Exec + T_DtoH

Streams
Surviving fragments of the stream example (lost arguments are marked "..."):
    cudaMemcpyAsync(dst2, src2, size, ...
    kernel<<<grid, block, 0, ...
    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToHost, ...

Kepler in ...
• NVIDIA ...: 1.31 TFLOPS double precision, 3.95 TFLOPS single precision, 250 GB/s memory bandwidth, 2,688 functional units.
• #1 on Top500 in ...

NVIDIA GK110 - Kepler
• Kepler GK110 SMX vs Fermi: 3x ...; power goes ...

New ISA Encoding: 255 Registers per Thread
• Fermi limit: 63 registers per thread.
  – A common Fermi performance ...
  – Leads to excessive ...
• Kepler: up to 255 registers per thread.
  – Especially helpful for FP64 ...

Hyper-Q
• Feature of Kepler K20 GPUs to increase application throughput by enabling work to be scheduled onto the GPU in parallel.
• Two ways to take advantage:
  – CUDA Streams: now they really are ...
  – CUDA Proxy for MPI: concurrent CUDA MPI processes on one GPU.

[Figure: Kepler work distribution: stream queue, grid management, pending & suspended grids]

Fermi Concurrency
    Stream 1: A<<<>>>; B<<<>>>; ...
    Stream 2: P<<<>>>; Q<<<>>>; ...
    Stream 3: X<<<>>>; Y<<<>>>; ...
    (all feeding a single hardware work queue)
• Fermi allows 16-way concurrency:
  – Up to 16 grids can run at once.
  – But CUDA streams multiplex into a single queue.
  – Overlap only at stream edges.

Kepler Concurrency
    Stream 1: A<<<>>>; B<<<>>>; ...
    Stream 2: P<<<>>>; Q<<<>>>; ...
    Stream 3: X<<<>>>; Y<<<>>>; ...
    (multiple hardware work queues)
• Kepler allows 32-way concurrency:
  – One work queue per stream.
  – Concurrency at full-stream level.
  – No inter-stream dependencies.

Hyper-Q: Simultaneous ...
[Figure: CPU processes A-F sharing one GPU through CUDA proxy client-server software]

What is Dynamic Parallelism?
The ability to launch new kernels from the GPU:
• Dynamically: based on run-time data.
• Simultaneously: from multiple threads at once.
• Independently: each thread can launch a different grid.
Fermi: only the CPU can generate GPU work.
Kepler: the GPU can generate work for itself.

What Does It Mean?
• GPU as co-processor -> autonomous, dynamic parallelism.

New Types of Algorithms
• Recursive parallel algorithms like quicksort.
• Adaptive mesh algorithms like ...
• Computational power allocated to regions of interest.

Familiar Programming Model
    __global__ void B(float *data)
    {
        X<<<...>>>(data);
        Y<<<...>>>(data);
        Z<<<...>>>(data);
    }

    int main()
    {
        float *data;
        A<<<...>>>(data);
        B<<<...>>>(data);
        C<<<...>>>(data);
        return 0;
    }

Code Example
    __device__ float buf[...];

    __global__ void cnp(float *data)
    {
        int tid = threadIdx.x;
        if (tid % 2)
            buf[tid/2] = ...;
        __syncthreads();
        if (tid == 0) {
            launch<<<128, 256>>>(buf);
            cudaDeviceSynchronize();
        }
        __syncthreads();
        if (tid == 0) {
            cudaMemcpyAsync(data, buf, ...);
        }
    }
• Launch is per-thread and ...
• CUDA primitives are per-block: launched kernels and CUDA objects like streams are visible to all threads in a thread block, but cannot be passed to a child kernel.
• Sync includes all launches by any thread in the block.
• cudaDeviceSynchronize() does not imply __syncthreads().

Memory Consistency
• Launch implies that the child sees the parent's state at the time of launch.
• Sync implies that the parent sees the child's writes after the sync.
• Local & shared memory are private.
• Constants are immutable.

NVIDIA GPUDirect™
• Provides technology necessary to enable lower-latency memory transfers between the GPU and other PCIe devices without requiring custom hardware.
• API and documentation for device driver ...
• Available on Linux.
• Supported on Kepler Quadro and Tesla.
• NVIDIA GPUDirect™ now supports ...

More Threads Are ...
• 2-3x throughput per clock per ...
• Memory bandwidth ...
• Bigger SMs have bigger ...
• If you already launched enough threads, the following enhancements on Kepler will ensure enough active warps on SMs.
• 2x register file on each SM:
  – E.g. 63 registers per thread, blockDim ...
  – In Fermi: 16 active warps.
  – In Kepler: 32 active warps.
• 2x simultaneous blocks per SM:
  – E.g. 16 registers per thread, blockDim ...
  – In Fermi: 96*8/32 = 24 active warps.
  – In Kepler: 96*16/32 = 48 active warps.

If One Kernel Can't Launch Enough ...
• Concurrent kernels: GK110 allows up to 32 concurrent kernels to ... (Hyper-Q).
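
The active-warp arithmetic above (96*8/32 on Fermi, 96*16/32 on Kepler) can be checked with a small host-side model. The sketch below is plain C++ that also compiles unchanged in a .cu file. It reads 96 as threads per block, 8 and 16 as resident blocks per SM, and 32 as the warp size. The names `SmLimits`, `kFermi`, `kKepler`, and `active_warps` are hypothetical, and the per-SM limits are the commonly documented values for Fermi and Kepler GK110 rather than numbers taken from the slides. The model is simplified: it ignores shared memory and register allocation granularity.

```cpp
#include <algorithm>

// Per-SM resource limits. These are the commonly documented values for
// Fermi (compute capability 2.x) and Kepler GK110 (3.5); they are
// assumptions of this sketch, not values stated in the slides.
struct SmLimits {
    int registers;   // 32-bit registers in the SM register file
    int max_blocks;  // maximum resident thread blocks per SM
    int max_warps;   // maximum resident warps per SM
};

const SmLimits kFermi  = {32 * 1024,  8, 48};
const SmLimits kKepler = {64 * 1024, 16, 64};
const int kWarpSize = 32;

// Active warps per SM for one kernel, limited by the register file, the
// resident-block limit, and the resident-warp limit. Simplified: ignores
// shared memory and the hardware's register allocation granularity.
int active_warps(const SmLimits& sm, int regs_per_thread, int threads_per_block) {
    // Warps needed by one block, rounded up to whole warps.
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    // Registers consumed by one resident block.
    int regs_per_block = regs_per_thread * warps_per_block * kWarpSize;
    // How many blocks each limit allows on one SM.
    int blocks_by_regs = sm.registers / regs_per_block;
    int blocks_by_warps = sm.max_warps / warps_per_block;
    int blocks = std::min(sm.max_blocks, std::min(blocks_by_regs, blocks_by_warps));
    return blocks * warps_per_block;
}
```

With 16 registers per thread and 96 threads per block, `active_warps(kFermi, 16, 96)` gives 24 and `active_warps(kKepler, 16, 96)` gives 48, because the resident-block limit (8 vs 16) is the binding constraint. With 63 registers per thread and an assumed block size of 256 (the slide's own value is lost), the register file becomes the limit and the model gives 16 warps on Fermi and 32 on Kepler.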