版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
CUDAProgramming
(GPUProgramming)Instructor:ZhangWeizhe(张伟哲)ComputerNetworkandInformationSecurityTechniqueResearchCenter,SchoolofComputerScienceandTechnology,HarbinInstituteofTechnologyMotivation动机GPUArchitectureGPU架构Threewaystoaccelerateapplications三种加速应用的方法CUDAProgrammingModelCUDA编程模型CUDAProgrammingBasicsCUDA编程基础Outline3ASimpleExampleThreeIntegerArrays
A[N]B[N]C[N]Wewanttocalculate
C[i]=A[i]+B[i]
4Traditionally,onthecpu(serial) for(i=0;i<N;++i)
C[i]=A[i]+B[i] T(N)=O(N)5Traditionally,onthecpu(parallel)createNthreads
C[threadid.i]=A[threadid.i]+B[threadid.i]
T(N)=O(1)6GPUComputingButthereisaproblem.ApplicationslikeNeedthousandsofthreadstoexecuteMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline8AsimplecomparisonbetweenCPUandGPU
9AdetaileddescriptionGraphicsProcessingClusters(GPCs)TextureProcessingClusters(TPCs)StreamingMultiprocessors(SM)10pascal-architecture-whitepaperMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline12ThreemethodsCUDAOptimizedLibrariesProgrammingLanguages13ThreemethodsCUDAOptimizedLibrariesTheselibrariesarewritteninCUDASimplyreplaceyourstandardlibraryfunctionswithcorrespondingCUDAlibrariesItsupportsmanymathlibrariesbutnotallasupportedlistcanbefoundat/gpu-accelerated-libraries
14ThreemethodsItisadirective-basedprogrammingmodelYouneedtoisnertsomedirectivesinyourcodeUseopenacccompilertocompilethecode15ThreemethodsProgrammingLanguagesMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline18PrerequestforCUDAProgrammingHardwareANvidiaGraphicscard:Itcanbeaspecializedcomputingcard,likeTeslaPascalGP100(tooexpensive),oranormalgamegraphiccard,likeGTorGTX.CheckwheteryourGPUsupportsCUDA:youcancheckoutthiswebsite/cuda-gpus
Clickon19PrerequestforCUDAProgrammingSoftwareCUDAToolkit:It’ssupportedonWindows,Mac,andmoststandardLinuxdistributions.Downloadfromhttps:///cuda-toolkitVisualStudio(ifonwindows):IfyouworkonWindows,forIknow,VSistheonlyIDEthatcanworkwithCUDA.Ifyoudon’twanttoinstallVS,youcanusetheCUDAcompilerNVCCdirectlyfromacommandline.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline21CUDAExecutionFlowCUDAApplicationHost=CPUDevice=GPUHost=CPUDevice=GPUParallelcodeSerialcodeSerialcodeParallelcode22CUDAExecutionFlow1.CopydatafromCPUmemorytoGPUmemory23CUDAExecutionFlowInstructtheGPUtostartcomputing24CUDAExecutionFlowCopytheresultsbacktoCPUmemoryCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline26CUDAThreadsParallelportionofanapplicatonfloatx=in[threadIdx.x];floaty=func(x);out[threadIdx.x]=y;
in[i]in[i+1]in[i+2]in[i+3]out[i]out[i+1]out[i+2]out[i+3]AkernelisafunctionexecutedontheGPUasanarrayofthreadsinparallelandcanbecalledfromCPUAllthreadsexecutethesamecode,cantakedifferentpathsEachthreadhasanID27CUDAThreads.....…..WarpWarpBlock28CUDAThreads.....Warp32ThreadsaregroupedintowarpsAwarpinCUDAistheminimumsizeofthedataprocessedinSIMDfashionbyaCUDAmultiprocessor.ThreadIDswithinawarpareconsecutiveandincreasingWarpisunitofthreadschedulinginSMs29CUDAThreads.....WarpOneorMorewarpsaregroupedintoblocksAthreadblockisbatchofthreadsthatcancooperatewitheachotherbysharingdatathroughsharedmemoryandsynchronizingtheirexecution.Ablockcanatmostcontain1024threadsbecasuseofthehardwaresourcelimitThethreadidisuniqueandstartsfromzeroinablockWarp30CUDAThreadsBlockBlockBlockBlockBlockBlockGridAkernelwillbeexecutedasagrid31CUDAThreadsKernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith2SMsSM0SM1Block0Block1Block2Block3Block4Block5Block6Block7CUDAThreads32KernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith4SMsSM0SM1SM2SM4Block0Block4Block1Block5Block2Block6Block3Block7CUDAThreads33CUDAThreads34Block0CUDAThreads35Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblem:branchdivergenceExamplewithdivergence:If(threadIdx.x>2){…}ThiscreatestwodifferentcontrolpathsforthreadsinablockToavoidthis:If(threadIdx.x/WARP_SIZE>2){…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Sodon’tusethiskindofcode,letthewholewarpdothesamework.CUDAThreads36Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblembranchdivergenceExamplewithdivergence:
If(threadIdx.x%2==0){…} else{…}ThiscreatestwodifferentcontrolpathsforthreadsinablockCUDAThreads37
If(threadIdx.x%2==0){…}else{…}CUDAThreads38If(threadIdx.x/WARP_SIZE==0){…}Else{…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Letthewholewarpdothesamework.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutlineGlobalMemory&SyntaxGlobalmemoryisthe“main”memoryoftheGPU.Ithasglobalscopeandlifetimeoftheallocatingprogram(oruntilcudaFreeiscalled)GlobalmemoryissimilartotheheapinaCprogram.GlobalmemorysyntaxAllocatewithcudaMalloc(void**devPtr,size_tsize)FreewithcudaMalloc(void*devPtr)40intblk_sz=64;float*Md;intsize=blk_sz*blk_sz*sizeof(float);cudaMalloc((void**)&Md,size);…cudaFree(Md);Host-DeviceDataTransfercudaMemcpy()MemorydatatransferRequiresfourparametersPointertodestinationPointertosourceNumberofbytescopiedType/DirectionoftransferHosttoHost,HosttoDevice,DevicetoHost,DevicetoDeviceTransfertodeviceisasynchronous41cudaMemcpy(Md,M.elements,size,cudaMemcpyHostToDevice);cudaMemcpy(M.elements,Md,size,cudaMemcpyDeviceToHost);CPUMemoryGPUGPUMemoryCPUPCI-E8GB/sGDDR5190GB/sConstantMemory&SyntaxConstantmemoryisaformofvirtualaddressingofglobalmemory.SpecialpropertiesCached&read-onlySupportsbroadcastingasinglevaluetoalltheelementswithinawarpConstantmemoryisrestrictedto64KB(kernelargumentsarepassedthroughconstantmemory)ConstantmemorysyntaxInglobalscope(outsideofkernel,attoplevelofprogram)__constant__int
foo[2014];InhostcodecudaMemcpyToSymbol(foo,h_src,sizeof(int)*1024);42TextureMemoryComplicatedandonlymarginallyusefulforgeneralpurposecomputationUsefulcharacteristics2Dor3Ddatalocalityforcachingpurposesthrough“CUDAarrays”.GoesintospecialtexturecacheFastinterpolationon1D,2D,or3DarrayConvertingintegersto“unitized”floatingpointnumbers43It’sacomplextopic,youcanlearneverythingyouwanttoknowaboutitfromCUDAHandbookSharedMemory&SyntaxSharedmemoryisusedtoexchangedatabetweenCUDAthreadswithinablock.VeryfastmemorylocatedintheSMOn-chipmemory,low-latency,user-controlledL1cacheSharedmemorysyntaxStaticallocation__shared__floatdata[1024];//declarationinkernel,nothinginhostcodeDynamicallocationHost:
kernel<<<grid,block,numBytesShMem>>>(arg);Device(inkernel):
extern__shared__float
s[];44RememberSM=StreamingmultiprocessorSM≠SharedmemoryComputationalIntensityComputationalintensity=FLOPS/IOMatrixmultiplication:n3/n2=nN-bodysimulation:n2/n=n45Ifcomputationalintensityis>1,thensamedatausedinmorethan1computation.Doasfewgloballoadsandasmanysharedloadsaspossible.Registers&LocalMemoryRegistersFastest“memory”possible,about10xfasterthansharedmemoryMoststackvariablesdeclaredinkernelsarestoredinregisters(example:floatx)StaticallyindexedarraysstoredonthestackaresometimesputinregistersLocalMemoryLocalmemoryiseverythingonthestackthatcan’tfitinregistersThescopeoflocalmemoryisjustthethread.Localmemoryisstoredinglobalmemory(muchslowerthanregisters!)46Non-Programmable!CUDAMemoryModelSummaryMemorySpaceManagedbyPhysicalImplementationScopeonGPUScopeonCPULifetimeRegistersCompilerOn-chipPerThreadNotvisibleLifetimeofathreadLocalCompilerDeviceMemoryPerThreadNotvisibleSharedProgrammerOn-chipBlockNotvisibleBlocklifetimeGlobalProgrammerDeviceMemoryAllThreadsRead/WriteApplicationoruntilexplicitlyfreedConstantProgrammerDeviceMemoryAllThreadsRead-onlyRead/WriteTextureProgrammerDeviceMemoryAllThreadsRead-onlyRead/Write47MotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline50ParallelProgramminginCUDACThefirstexample:
SummingVectorsImaginehavingtwolistsofnumberswherewewanttosumcorrespondingelementsofeachlistandstoretheresultinathirdlist.51ParallelProgramminginCUDACCPUcore1CPUcore252ParallelProgramminginCUDAC53ParallelProgramminginCUDAC__host____device__cudaError_tcudaMalloc(void**devPtr,size_tsize)__host__and__device__istypequafiler,whichmeasthisfunctioncanbecalledonthedeviceorthehost.AllcudafunctionstakesaerrorcodeasareturnvalueDifferentfromC’sMallocfunction,thisfunctionstakesapointertopointerasaparameter.54ParallelProgramminginCUDACExecutionconfiguration
<<<Dg,Db,Ns,S>>>Dgisoftypedim3andspecifiesthedimensionandsizeofthegrid,suchthatDg.x*Dg.y*Dg.zequalsthenumberofblocksbeinglaunched;Dbisoftypedim3andspecifiesthedimensionandsizeofeachblock,suchthatDb.x*Db.y*Db.zequalsthenumberofthreadsperblock;Nsistypeofsize_tandspecifiesthenumberofbytesinsharedmemorythatisdynamicallyallocatedperblockforthiscallinadditiontothestaticallyallocatedmemory;It’sanoptionalargumentwhichdefaultto0;SisoftypecudaStream_tandspecifiestheassociatedstream;It’sanoptionalargumentwhichdefaultsto055ParallelProgramminginCUDAC56ParallelProgramminginCUDAC__host__cudaError_tcudaMemcpy(void*dst,constvoid*src,size_t count,cudaMemcpyKindkind)CopiesdatabetweenhostanddevicecudaMemcpyKindspecifiesthedirectionofthecopy.ItisoneofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice57ParallelProgramminginCUDAC__global__functionsmushhavevoidreturntype.Anycalltoa__global__functionmustspecifyitsexecutionconfiguration.Acalltoa__global__functionisasynchronous,meaningitreturnsbeforethedevicehascompleteditsexecution.58ParallelProgramminginCUDACblockIdx.xcontainstheblockindexwithinthegridthreadIdx.xcontainsthethreadindexwithintheblockCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline60SharedMemoryandSynchronizationThreadswithinablockcancommunicatewitheachotherthroughsharedmemory.__shared__isusedtomakethevariableresidentinsharedmemory.Iftwothreadswanttowritethesamesharedvariable,theremustbeasynchronizationbetweentwothreads.EitherAwritesafterB,orBwritesafterA.Now,let’stakealookatanexamplethatusesthesefeatures61SharedMemoryandSynchronizationDOTPRODUCTEachthreadmultipliesapairofcorrespondingentries,andtheneverythreadmovesontoitsnextpair.Becausetheresultneedstobethesumofallthesepairwiseproducts,eachthreadkeepsarunningsumofthepairsithasadded.62SharedMemoryandSynchronizationSupposewehaveNthreads,andthearraysizeisN*M63SharedMemoryandSynchronization64SharedMemoryandSynchronization65SharedMemoryandSynchronization66SharedMemoryandSynchronization67SharedMemoryandSynchronizationCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline69ConstantMemoryandEventsConstantmemoryusedfordatathatwillnotchangeoverthecourseofakernelexecutionNVIDIAhardwareprovides64KBofconstantmemorythatittreatsdifferentlythanittreatsstandardglobalmemory.70ConstantMemoryandEventsProblemDescription:Produceanimageofathree-dimensionalseene.Asimpleidea:Imaginethelightsgothroughtheobjects,andcastashadowontheplate.Sincelightscancomefromanyplaceatanypointinourscene,itturnsourit’seasiertoworkbackward.Eachpixelintheimagewillshootarayintothescene.Wefigureoutwhatcolorisseenbyeachpixelbytracingarayfromthepixelinquestionthroughthesceneuntilithitsoneofourobjects.Wethensaythatthepixelwould“see”thisobjectandcanassignitscolorbasedonthecoloroftheobjectitsees.Mostofthecomputationrequiredbyraytracingisinhecomputationoftheseintersectionsoftheraywiththeobjextsinthescene.71ConstantMemoryandEvents72ConstantMemoryandEventsWhatwilltheraytracerdoItwillfirearayfromeachpixelComputewhichrayshitwhichspheres,andthedepthofeachofthesehits.Inthecasewherearaypassesthroughmultiplespheres,onlythespherefurthesttotheimagecanbeseen.Wewillmodeloursphereswithadatastructurethatstoresthesphere’scentercoordinateof(x,y,z),itsradius,anditscolorof(r,b,g).73ConstantMemoryandEvents74ConstantMemoryandEvents75ConstantMemoryandEvents76ConstantMemoryandEventsConstantmemoryalwaysdeclaredinFilescope(globalvariable)CUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline78TextureMemoryLikeconstantmemory,texturememoryisAnothervarietyofread-onlymemoryCachedonchip,soitwillprovidhighereffectivebandwidthDesignedforgraphicsapplicationswherememoryaccesspatternsexhibitagreatdealofspatialocality79TextureMemoryspatialocalityInacomputingapplication,thisroughlyimpliesthatathreadislikelytoreadfromanaddress“near”theaddressthatnearbythreadsread,asshowninthefigure.80TextureMemoryForCPUcachingscheme-------->thefouraddressesshownarenotconsecutiveandwouldnotbecached
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 永州师范高等专科学校《设计基础造型》2023-2024学年第一学期期末试卷
- 营口理工学院《男生极限飞盘》2023-2024学年第一学期期末试卷
- 鹰潭职业技术学院《物理化学(2-2)》2023-2024学年第一学期期末试卷
- 银川能源学院《化工热力学A》2023-2024学年第一学期期末试卷
- 2024印刷厂印刷材料采购与供应链优化承包合同3篇
- 益阳职业技术学院《环境设计与沟通》2023-2024学年第一学期期末试卷
- 银川科技学院《AM与嵌入式系统》2023-2024学年第一学期期末试卷
- 2024年度担保公司能源行业担保合同详细范本3篇
- 长期出租车合同3篇
- 高效系统维护服务合同3篇
- 2023年山东省烟台市中考英语试卷(含解析)
- A320 机型飞行人员理论考试题库-导出版
- 新媒体时代网络舆情应对技巧
- 2023年地下室基础换填专项施工方案
- 高中语文词汇表5000
- 高一语文必修一新闻和报告文学阅读复习题及答案解析
- 泛海三江JB-QGL-9100火灾报警控制器(联动型)使用手册
- 6077三菱帕杰罗v86v93v98w维修手册原厂
- 中华人民共和国史马工程课件01第一章
- 初中体育与健康人教七-九年级全一册球类足球脚内侧传接球 市赛获奖PPT
- 基于CAN通讯的储能变流器并机方案及应用分析报告-培训课件
评论
0/150
提交评论