并行程序设计 中文课件 07 CUDA Programming_第1页
并行程序设计 中文课件 07 CUDA Programming_第2页
并行程序设计 中文课件 07 CUDA Programming_第3页
并行程序设计 中文课件 07 CUDA Programming_第4页
并行程序设计 中文课件 07 CUDA Programming_第5页
已阅读5页,还剩82页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

CUDAProgramming

(GPUProgramming)Instructor:ZhangWeizhe(张伟哲)ComputerNetworkandInformationSecurityTechniqueResearchCenter,SchoolofComputerScienceandTechnology,HarbinInstituteofTechnologyMotivation动机GPUArchitectureGPU架构Threewaystoaccelerateapplications三种加速应用的方法CUDAProgrammingModelCUDA编程模型CUDAProgrammingBasicsCUDA编程基础Outline3ASimpleExampleThreeIntegerArrays

A[N]B[N]C[N]Wewanttocalculate

C[i]=A[i]+B[i]

4Traditionally,onthecpu(serial) for(i=0;i<N;++i)

C[i]=A[i]+B[i] T(N)=O(N)5Traditionally,onthecpu(parallel)createNthreads

C[threadid.i]=A[threadid.i]+B[threadid.i]

T(N)=O(1)6GPUComputingButthereisaproblem.ApplicationslikeNeedthousandsofthreadstoexecuteMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline8AsimplecomparisonbetweenCPUandGPU

9AdetaileddescriptionGraphicsProcessingClusters(GPCs)TextureProcessingClusters(TPCs)StreamingMultiprocessors(SM)10pascal-architecture-whitepaperMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline12ThreemethodsCUDAOptimizedLibrariesProgrammingLanguages13ThreemethodsCUDAOptimizedLibrariesTheselibrariesarewritteninCUDASimplyreplaceyourstandardlibraryfunctionswithcorrespondingCUDAlibrariesItsupportsmanymathlibrariesbutnotallasupportedlistcanbefoundat/gpu-accelerated-libraries

14ThreemethodsItisadirective-basedprogrammingmodelYouneedtoisnertsomedirectivesinyourcodeUseopenacccompilertocompilethecode15ThreemethodsProgrammingLanguagesMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline18PrerequestforCUDAProgrammingHardwareANvidiaGraphicscard:Itcanbeaspecializedcomputingcard,likeTeslaPascalGP100(tooexpensive),oranormalgamegraphiccard,likeGTorGTX.CheckwheteryourGPUsupportsCUDA:youcancheckoutthiswebsite/cuda-gpus

Clickon19PrerequestforCUDAProgrammingSoftwareCUDAToolkit:It’ssupportedonWindows,Mac,andmoststandardLinuxdistributions.Downloadfromhttps:///cuda-toolkitVisualStudio(ifonwindows):IfyouworkonWindows,forIknow,VSistheonlyIDEthatcanworkwithCUDA.Ifyoudon’twanttoinstallVS,youcanusetheCUDAcompilerNVCCdirectlyfromacommandline.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline21CUDAExecutionFlowCUDAApplicationHost=CPUDevice=GPUHost=CPUDevice=GPUParallelcodeSerialcodeSerialcodeParallelcode22CUDAExecutionFlow1.CopydatafromCPUmemorytoGPUmemory23CUDAExecutionFlowInstructtheGPUtostartcomputing24CUDAExecutionFlowCopytheresultsbacktoCPUmemoryCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline26CUDAThreadsParallelportionofanapplicatonfloatx=in[threadIdx.x];floaty=func(x);out[threadIdx.x]=y;

in[i]in[i+1]in[i+2]in[i+3]out[i]out[i+1]out[i+2]out[i+3]AkernelisafunctionexecutedontheGPUasanarrayofthreadsinparallelandcanbecalledfromCPUAllthreadsexecutethesamecode,cantakedifferentpathsEachthreadhasanID27CUDAThreads.....…..WarpWarpBlock28CUDAThreads.....Warp32ThreadsaregroupedintowarpsAwarpinCUDAistheminimumsizeofthedataprocessedinSIMDfashionbyaCUDAmultiprocessor.ThreadIDswithinawarpareconsecutiveandincreasingWarpisunitofthreadschedulinginSMs29CUDAThreads.....WarpOneorMorewarpsaregroupedintoblocksAthreadblockisbatchofthreadsthatcancooperatewitheachotherbysharingdatathroughsharedmemoryandsynchronizingtheirexecution.Ablockcanatmostcontain1024threadsbecasuseofthehardwaresourcelimitThethreadidisuniqueandstartsfromzeroinablockWarp30CUDAThreadsBlockBlockBlockBlockBlockBlockGridAkernelwillbeexecutedasagrid31CUDAThreadsKernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith2SMsSM0SM1Block0Block1Block2Block3Block4Block5Block6Block7CUDAThreads32KernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith4SMsSM0SM1SM2SM4Block0Block4Block1Block5Block2Block6Block3Block7CUDAThreads33CUDAThreads34Block0CUDAThreads35Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblem:branchdivergenceExamplewithdivergence:If(threadIdx.x>2){…}ThiscreatestwodifferentcontrolpathsforthreadsinablockToavoidthis:If(threadIdx.x/WARP_SIZE>2){…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Sodon’tusethiskindofcode,letthewholewarpdothesamework.CUDAThreads36Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblembranchdivergenceExamplewithdivergence:

If(threadIdx.x%2==0){…} else{…}ThiscreatestwodifferentcontrolpathsforthreadsinablockCUDAThreads37

If(threadIdx.x%2==0){…}else{…}CUDAThreads38If(threadIdx.x/WARP_SIZE==0){…}Else{…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Letthewholewarpdothesamework.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutlineGlobalMemory&SyntaxGlobalmemoryisthe“main”memoryoftheGPU.Ithasglobalscopeandlifetimeoftheallocatingprogram(oruntilcudaFreeiscalled)GlobalmemoryissimilartotheheapinaCprogram.GlobalmemorysyntaxAllocatewithcudaMalloc(void**devPtr,size_tsize)FreewithcudaMalloc(void*devPtr)40intblk_sz=64;float*Md;intsize=blk_sz*blk_sz*sizeof(float);cudaMalloc((void**)&Md,size);…cudaFree(Md);Host-DeviceDataTransfercudaMemcpy()MemorydatatransferRequiresfourparametersPointertodestinationPointertosourceNumberofbytescopiedType/DirectionoftransferHosttoHost,HosttoDevice,DevicetoHost,DevicetoDeviceTransfertodeviceisasynchronous41cudaMemcpy(Md,M.elements,size,cudaMemcpyHostToDevice);cudaMemcpy(M.elements,Md,size,cudaMemcpyDeviceToHost);CPUMemoryGPUGPUMemoryCPUPCI-E8GB/sGDDR5190GB/sConstantMemory&SyntaxConstantmemoryisaformofvirtualaddressingofglobalmemory.SpecialpropertiesCached&read-onlySupportsbroadcastingasinglevaluetoalltheelementswithinawarpConstantmemoryisrestrictedto64KB(kernelargumentsarepassedthroughconstantmemory)ConstantmemorysyntaxInglobalscope(outsideofkernel,attoplevelofprogram)__constant__int

foo[2014];InhostcodecudaMemcpyToSymbol(foo,h_src,sizeof(int)*1024);42TextureMemoryComplicatedandonlymarginallyusefulforgeneralpurposecomputationUsefulcharacteristics2Dor3Ddatalocalityforcachingpurposesthrough“CUDAarrays”.GoesintospecialtexturecacheFastinterpolationon1D,2D,or3DarrayConvertingintegersto“unitized”floatingpointnumbers43It’sacomplextopic,youcanlearneverythingyouwanttoknowaboutitfromCUDAHandbookSharedMemory&SyntaxSharedmemoryisusedtoexchangedatabetweenCUDAthreadswithinablock.VeryfastmemorylocatedintheSMOn-chipmemory,low-latency,user-controlledL1cacheSharedmemorysyntaxStaticallocation__shared__floatdata[1024];//declarationinkernel,nothinginhostcodeDynamicallocationHost:

kernel<<<grid,block,numBytesShMem>>>(arg);Device(inkernel):

extern__shared__float

s[];44RememberSM=StreamingmultiprocessorSM≠SharedmemoryComputationalIntensityComputationalintensity=FLOPS/IOMatrixmultiplication:n3/n2=nN-bodysimulation:n2/n=n45Ifcomputationalintensityis>1,thensamedatausedinmorethan1computation.Doasfewgloballoadsandasmanysharedloadsaspossible.Registers&LocalMemoryRegistersFastest“memory”possible,about10xfasterthansharedmemoryMoststackvariablesdeclaredinkernelsarestoredinregisters(example:floatx)StaticallyindexedarraysstoredonthestackaresometimesputinregistersLocalMemoryLocalmemoryiseverythingonthestackthatcan’tfitinregistersThescopeoflocalmemoryisjustthethread.Localmemoryisstoredinglobalmemory(muchslowerthanregisters!)46Non-Programmable!CUDAMemoryModelSummaryMemorySpaceManagedbyPhysicalImplementationScopeonGPUScopeonCPULifetimeRegistersCompilerOn-chipPerThreadNotvisibleLifetimeofathreadLocalCompilerDeviceMemoryPerThreadNotvisibleSharedProgrammerOn-chipBlockNotvisibleBlocklifetimeGlobalProgrammerDeviceMemoryAllThreadsRead/WriteApplicationoruntilexplicitlyfreedConstantProgrammerDeviceMemoryAllThreadsRead-onlyRead/WriteTextureProgrammerDeviceMemoryAllThreadsRead-onlyRead/Write47MotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline50ParallelProgramminginCUDACThefirstexample:

SummingVectorsImaginehavingtwolistsofnumberswherewewanttosumcorrespondingelementsofeachlistandstoretheresultinathirdlist.51ParallelProgramminginCUDACCPUcore1CPUcore252ParallelProgramminginCUDAC53ParallelProgramminginCUDAC__host____device__cudaError_tcudaMalloc(void**devPtr,size_tsize)__host__and__device__istypequafiler,whichmeasthisfunctioncanbecalledonthedeviceorthehost.AllcudafunctionstakesaerrorcodeasareturnvalueDifferentfromC’sMallocfunction,thisfunctionstakesapointertopointerasaparameter.54ParallelProgramminginCUDACExecutionconfiguration

<<<Dg,Db,Ns,S>>>Dgisoftypedim3andspecifiesthedimensionandsizeofthegrid,suchthatDg.x*Dg.y*Dg.zequalsthenumberofblocksbeinglaunched;Dbisoftypedim3andspecifiesthedimensionandsizeofeachblock,suchthatDb.x*Db.y*Db.zequalsthenumberofthreadsperblock;Nsistypeofsize_tandspecifiesthenumberofbytesinsharedmemorythatisdynamicallyallocatedperblockforthiscallinadditiontothestaticallyallocatedmemory;It’sanoptionalargumentwhichdefaultto0;SisoftypecudaStream_tandspecifiestheassociatedstream;It’sanoptionalargumentwhichdefaultsto055ParallelProgramminginCUDAC56ParallelProgramminginCUDAC__host__cudaError_tcudaMemcpy(void*dst,constvoid*src,size_t count,cudaMemcpyKindkind)CopiesdatabetweenhostanddevicecudaMemcpyKindspecifiesthedirectionofthecopy.ItisoneofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice57ParallelProgramminginCUDAC__global__functionsmushhavevoidreturntype.Anycalltoa__global__functionmustspecifyitsexecutionconfiguration.Acalltoa__global__functionisasynchronous,meaningitreturnsbeforethedevicehascompleteditsexecution.58ParallelProgramminginCUDACblockIdx.xcontainstheblockindexwithinthegridthreadIdx.xcontainsthethreadindexwithintheblockCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline60SharedMemoryandSynchronizationThreadswithinablockcancommunicatewitheachotherthroughsharedmemory.__shared__isusedtomakethevariableresidentinsharedmemory.Iftwothreadswanttowritethesamesharedvariable,theremustbeasynchronizationbetweentwothreads.EitherAwritesafterB,orBwritesafterA.Now,let’stakealookatanexamplethatusesthesefeatures61SharedMemoryandSynchronizationDOTPRODUCTEachthreadmultipliesapairofcorrespondingentries,andtheneverythreadmovesontoitsnextpair.Becausetheresultneedstobethesumofallthesepairwiseproducts,eachthreadkeepsarunningsumofthepairsithasadded.62SharedMemoryandSynchronizationSupposewehaveNthreads,andthearraysizeisN*M63SharedMemoryandSynchronization64SharedMemoryandSynchronization65SharedMemoryandSynchronization66SharedMemoryandSynchronization67SharedMemoryandSynchronizationCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline69ConstantMemoryandEventsConstantmemoryusedfordatathatwillnotchangeoverthecourseofakernelexecutionNVIDIAhardwareprovides64KBofconstantmemorythatittreatsdifferentlythanittreatsstandardglobalmemory.70ConstantMemoryandEventsProblemDescription:Produceanimageofathree-dimensionalseene.Asimpleidea:Imaginethelightsgothroughtheobjects,andcastashadowontheplate.Sincelightscancomefromanyplaceatanypointinourscene,itturnsourit’seasiertoworkbackward.Eachpixelintheimagewillshootarayintothescene.Wefigureoutwhatcolorisseenbyeachpixelbytracingarayfromthepixelinquestionthroughthesceneuntilithitsoneofourobjects.Wethensaythatthepixelwould“see”thisobjectandcanassignitscolorbasedonthecoloroftheobjectitsees.Mostofthecomputationrequiredbyraytracingisinhecomputationoftheseintersectionsoftheraywiththeobjextsinthescene.71ConstantMemoryandEvents72ConstantMemoryandEventsWhatwilltheraytracerdoItwillfirearayfromeachpixelComputewhichrayshitwhichspheres,andthedepthofeachofthesehits.Inthecasewherearaypassesthroughmultiplespheres,onlythespherefurthesttotheimagecanbeseen.Wewillmodeloursphereswithadatastructurethatstoresthesphere’scentercoordinateof(x,y,z),itsradius,anditscolorof(r,b,g).73ConstantMemoryandEvents74ConstantMemoryandEvents75ConstantMemoryandEvents76ConstantMemoryandEventsConstantmemoryalwaysdeclaredinFilescope(globalvariable)CUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline78TextureMemoryLikeconstantmemory,texturememoryisAnothervarietyofread-onlymemoryCachedonchip,soitwillprovidhighereffectivebandwidthDesignedforgraphicsapplicationswherememoryaccesspatternsexhibitagreatdealofspatialocality79TextureMemoryspatialocalityInacomputingapplication,thisroughlyimpliesthatathreadislikelytoreadfromanaddress“near”theaddressthatnearbythreadsread,asshowninthefigure.80TextureMemoryForCPUcachingscheme-------->thefouraddressesshownarenotconsecutiveandwouldnotbecached

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论