
Lesson 3 Utilisation of the GPU Architecture for HPC

(Lesson 3: GPUs for High Performance Computing)

Vocabulary · Important Sentences · Questions and Answers · Problems

1 Introduction

Graphics Processing Units (GPUs), which commonly accompany standard Central Processing Units (CPUs) in consumer PCs, are special purpose processors designed to efficiently perform the calculations necessary to generate visual output from program data. Video games have particularly high rendering demands, and this market has driven the development of GPUs, which, in comparison to CPUs, offer extremely high performance for the monetary cost.

Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations. [1] In particular, there is potential to use GPUs to boost the performance of the types of simulations commonly done on traditional HPC (High Performance Computing) systems such as HPCx. There are challenges to be overcome, however, to realise this potential.

The demands placed on GPUs by their native applications are, however, usually quite unique, and as such the GPU architecture is quite different from that of the CPU. Graphics processing is inherently extremely parallel, so it can be highly threaded and performed on the large numbers (typically hundreds) of processing cores found in the GPU chip. The GPU memory system is quite different from the standard CPU equivalent system. Furthermore, the GPU architecture reflects the fact that graphics processing typically does not require the same level of accuracy and precision as scientific simulation. Specialised software development is currently required to enable applications to efficiently utilise the GPU architecture.

This report first gives a discussion of scientific computing on GPUs. Then, we describe the porting of an HPC benchmark application to the NVIDIA TESLA GPU architecture, and give performance results compared with the use of a standard CPU.

2 Background

2.1 GPUs

The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, GPUs typically contain 100 or more basic cores. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The GPU clock frequency is typically lower than that of a CPU, but this gap has been closing over the last few years. Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development. [2]

This section introduces the architectural design of GPUs. NVIDIA's products are focused on here, but offerings from other GPU manufacturers, such as ATI, are similar. Fig. 1 illustrates the layout of a GPU. It can be seen that there are many processing cores (processors) to perform computation, grouped into multiprocessors. There are several levels of memory which differ in terms of access speed and scope. The Registers have processor scope; the Shared Memory, Constant Cache and Texture Cache have multiprocessor scope; and the Device (or Global) memory can be accessed by all cores on a chip. Note that the GPU memory address space is separate from that of the CPU, and copying of data between the devices must be managed in software. Typically, the CPU will run the program skeleton, and offload one or more computationally demanding code sections to the GPU. Thus, the GPU effectively accelerates the application. The CPU is referred to as the Host and the GPU as the Device. Functions that run on the Device are called kernels.

Fig. 1 Architectural layout of an NVIDIA GPU chip and memory

On the GPU, operations are performed by threads that are grouped into blocks, which are in turn arranged on a grid. Each block is executed by a single multiprocessor; however, if there are enough resources available, several blocks can be active at the same time on a multiprocessor. The multiprocessor will time-slice the blocks to improve performance, one block performing calculations while another is waiting for a memory read, for example. Some of the memory available to the GPU exhibits considerable latency; however, by using this method of time-slicing, that latency can be hidden for suitable applications.

A group of 32 threads is called a warp, and 16 threads a half-warp. GPUs achieve their best performance when half-warps of threads perform the same operation, because in this situation the threads can be executed in parallel. Conditionals can mean that threads do not perform the same operations, in which case they must be serialised. Such threads are said to be divergent. The same applies to Global Memory accesses: if the threads of a half-warp access Global Memory together and obey certain rules that qualify the access as coalesced, then they access the memory in parallel, and it takes only the time of a single access for all threads of the half-warp to access the memory.
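As a minimal sketch of these two effects (the kernel names are hypothetical), the first kernel below splits every half-warp with an even/odd conditional, so the two branches are executed one after the other, while the second performs the same operation in every thread and touches aligned, contiguous words, satisfying the coalescing rules:

/* Divergent: within each half-warp, even and odd threads take
   different branches, which the hardware must serialise. */
__global__ void divergentScale(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

/* Uniform and coalesced: thread k of a half-warp accesses word k of an
   aligned, contiguous segment, so the 16 accesses complete together. */
__global__ void uniformScale(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}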

Global Memory is located in the graphics card's GDDR3 memory. It can be accessed by all threads, although it is usually slower than on-chip memory. Memory access speed is significantly improved if accesses are coalesced, as this allows all the threads of a half-warp to access the memory simultaneously.

Shared Memory can only be accessed by threads in the same block. Because it is on chip, the Shared Memory space is much faster than the Local and Global Memory spaces. Approximately 16 KB of Shared Memory is available on each MP (multiprocessor); however, to permit each MP to have several blocks active at a time (which improves performance), it is advisable to use as little Shared Memory as possible per block. Slightly less than 16 KB is effectively available, due to the storage of internal variables.

Shared Memory consists of 16 memory banks. When Shared Memory is allocated, each consecutive 32-bit word is placed in a different memory bank. To achieve maximum memory performance, bank conflicts (two threads trying to access the same bank at the same time) must be avoided. In the case of a bank conflict, the conflicting memory accesses are serialised; otherwise, memory access by each half-warp is done in parallel.

Constant Memory is read-only memory that is cached. It is located in Global Memory; however, there is a cache located on each multiprocessor. If the requested memory is in the cache, then access is as fast as Shared Memory; if it is not, then the access will be the same as a Global Memory access.
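Returning to Shared Memory, the following sketch (all names hypothetical) shows a typical use: each block sums its 256 input elements in on-chip memory. The "sequential addressing" pattern keeps the threads of a half-warp on distinct, consecutive 32-bit words, avoiding the bank conflicts described above.

/* Each block reduces 256 elements of the input to one partial sum. */
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float s[256];                 /* on-chip, one word per thread */
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         /* all loads done before use */

    /* Tree reduction with sequential addressing: at every step, thread t
       touches s[t] and s[t + stride], so no two threads share a bank. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];              /* one result per block */
}

The kernel assumes it is launched with 256 threads per block, for example blockSum<<<numBlocks, 256>>>(dIn, dOut).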

Texture Memory is read-only memory that is cached and optimised for 2D spatial locality. This means that accessing [a][b] and [a+1][b], say, will probably be faster than accessing [a][b] and [a+54][b]. [3] The Texture Cache is 16 KB per multiprocessor. This is a different 16 KB from the Shared Memory, so using the Texture Cache does not reduce the available Shared Memory.
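A sketch of a texture read follows, using the legacy texture-reference style of early CUDA releases; the host-side step of binding the input field to fieldTex (for example with cudaBindTexture2D) is omitted, and all names are hypothetical.

/* 2D texture reference; bound to the input field by the Host before launch. */
texture<float, 2, cudaReadModeElementType> fieldTex;

__global__ void smoothRow(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width - 1 && y < height) {
        /* (x, y) and (x+1, y) are neighbours in 2D, so the second
           fetch is likely to be served from the Texture Cache. */
        float here  = tex2D(fieldTex, x + 0.5f, y + 0.5f);
        float right = tex2D(fieldTex, x + 1.5f, y + 0.5f);
        out[y * width + x] = 0.5f * (here + right);
    }
}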

Register memory also exists, with an access speed similar to that of Shared Memory. Each thread in a block has its own independent copy of the register variables declared. Variables that are too large will be placed in Local Memory, which is located in Global Memory. The Local Memory space is not cached, so accesses to it are as expensive as normal accesses to Global Memory.

2.2 CUDA

CUDA (Compute Unified Device Architecture) is a programming language developed by NVIDIA to facilitate writing programs that run on CUDA-enabled GPUs. It is an extension of C and is compiled using the nvcc compiler. The most commonly used extensions are: cudaMalloc* to allocate memory on the Device; cudaMemcpy* to copy data between the Host and Device, and between different locations on the Device; kernelname<<<grid dimensions, block dimensions>>>(parameters) to launch a kernel; and threadIdx.x, blockIdx.x, blockDim.x, and gridDim.x to identify the thread, block, block dimension, and grid dimension in the x direction.
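To make these extensions concrete, here is a minimal sketch of a complete CUDA program (the kernel addOne and all variable names are hypothetical). It exercises cudaMalloc, cudaMemcpy, the <<<...>>> launch syntax and the built-in index variables described above.

#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical kernel: each thread increments one array element. */
__global__ void addOne(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)                                      /* guard excess threads */
        a[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));                   /* Device allocation */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* Host -> Device */

    addOne<<<n / 256, 256>>>(d, n);                               /* 4 blocks of 256 threads */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* Device -> Host */
    cudaFree(d);
    printf("h[0] = %.1f, h[1023] = %.1f\n", h[0], h[n - 1]);
    return 0;
}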

CUDA addressed a number of issues that affected developing programs for GPUs, which previously required much specialist knowledge. CUDA is quite simple, so it will not take much time for a programmer already familiar with C to begin using it. CUDA also possesses a number of other benefits over previous methods of GPU programming. One of these is that it permits threads to access any location in the GPU memory, and to read and write to as many memory locations as necessary. These were previously quite limiting constraints, and so easing them represents a significant advantage for CUDA. Another major benefit is permitting access to Shared Memory, which was previously not possible.

To make adoption of CUDA as easy as possible, NVIDIA has created CUDA U, which contains a well-written tutorial with exercises, as well as links to course notes and videos of CUDA courses taught at the University of Illinois. A Reference Manual and Programming Guide are also available.

The CUDA SDK contains many example codes that can be used to test the installation of a GPU and, as the source codes are provided, to demonstrate CUDA programming techniques. One of the provided codes is a template, providing the basic structure on which programs can be based.

One of the main features of CUDA is the provision of a Linear Algebra library (CUBLAS) and an FFT library (CUFFT). These greatly ease the implementation of many scientific codes on a GPU.
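As a hedged sketch of how CUBLAS removes the need to write one's own kernels, the routine below multiplies two square matrices using the legacy CUBLAS interface that shipped with early CUDA releases (error checking is omitted, and the helper name gemmOnGpu is hypothetical).

#include <cublas.h>

/* C = A * B for n-by-n single-precision matrices (column-major). */
void gemmOnGpu(const float *A, const float *B, float *C, int n)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);   /* Host -> Device */
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
    /* C = 1.0 * A * B + 0.0 * C, no transposition. */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);   /* Device -> Host */
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}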

2.3 Review of GPU Successes

In this section, some recent work involving the use of GPUs for scientific computing is highlighted.

· The Theoretical and Computational Biophysics group at the University of Illinois at Urbana-Champaign has used GPUs to achieve accelerations of between 20 and 100 times for molecular modelling applications.

· Professor Mike Giles of Oxford University achieved a 100 times speed-up for a LIBOR Monte Carlo application and a 50 times speed-up for a 3D Laplace Solver. The Laplace Solver was implemented on the GPU using only Global and Shared Memory. It uses a Jacobi iteration of a Laplace discretisation on a uniform 3D grid (a sketch of such a kernel follows this list). The LIBOR Monte Carlo code used was quite similar to the original CPU code. It uses Global and Constant Memory.

· Many other UK researchers are also experimenting with GPUs. NVIDIA has a showcase of applications reported to them. GPGPU.org also maintains a list of researchers using GPUs.

· RapidMind achieved a 2.4 times speed-up for BLAS SGEMM, 2.7 times for FFT, and 32.2 times for Black-Scholes.
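The Jacobi iteration mentioned above can be sketched as follows (a minimal, hypothetical kernel using only Global Memory; each thread handles one (i, j) column and sweeps the slowest dimension of the uniform 3D grid).

/* One Jacobi step for Laplace's equation: each interior point becomes
   the average of its six neighbours on the uniform 3D grid. */
__global__ void jacobiStep(const float *u, float *uNew,
                           int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        for (int k = 1; k < nz - 1; ++k) {         /* sweep the k direction */
            int idx = (k * ny + j) * nx + i;
            uNew[idx] = (u[idx - 1]       + u[idx + 1] +
                         u[idx - nx]      + u[idx + nx] +
                         u[idx - nx * ny] + u[idx + nx * ny]) / 6.0f;
        }
    }
}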

2.4 GPU Disadvantages and Alternative Acceleration Technologies

In this section, some disadvantages of the GPU architecture are discussed, and some alternative acceleration technologies are briefly described. The key limitation of GPUs is the requirement for a high level of parallelism to be inherent in the application to enable exploitation of the many cores. Furthermore, graphics processing typically does not require the same level of accuracy and precision as scientific simulation, and this is reflected in the fact that GPUs typically lack both error correction functionality and double precision computational functionality. This is expected to improve with future GPU architectures.

Another common criticism of GPUs is their large power consumption. The NVIDIA Tesla C870 uses up to 170 W peak, and 120 W typical. The amount of heat produced would make it difficult to cluster large numbers of GPUs together.

GPUs also place greater constraints on programmers than CPUs do. To avoid significant performance degradation, it is necessary to avoid conditionals inside kernels. Avoiding non-coalesced Global Memory accesses, which can also severely degrade performance, is very difficult for many applications. The lack of any inter-block communication functionality means that it is not possible for threads in one block to determine when the threads in another block have completed their calculation. This means that if the results of computation from other blocks are required, the only solution is for the kernel to exit and another to be launched, which guarantees that all of the blocks have completed.
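In practice, iterative algorithms are therefore driven from the Host, with each kernel launch acting as a grid-wide barrier. A sketch follows (a code fragment reusing the hypothetical jacobiStep kernel from Section 2.3; the launch configuration and device buffers are assumed to have been set up beforehand).

/* Host-side loop: a new launch only begins after every block of the
   previous launch has completed, giving a grid-wide synchronisation. */
for (int iter = 0; iter < nIters; ++iter) {
    jacobiStep<<<grid, block>>>(dU, dUNew, nx, ny, nz);
    float *tmp = dU; dU = dUNew; dUNew = tmp;   /* swap buffers */
}
cudaThreadSynchronize();   /* wait for the final step to finish */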

Finally, GPUs suffer from large latency in CPU-GPU communication. This bottleneck can mean that unless the amount of processing done on the GPU is great enough, it may be faster to simply perform the calculations on the CPU. There are other alternative acceleration technologies available, some of which are briefly described below.

Clearspeed  One alternative to GPUs is processors designed especially for HPC applications, such as those offered by Clearspeed. These products are usually quite similar to GPUs, with a few modifications that make them more suitable for HPC applications. One of these differences is that all internal and external memory contains ECC (Error Correction Code) to detect and correct 'soft errors'. 'Soft errors' are random one-bit errors that are caused by external factors such as cosmic rays.

In the graphics market such errors are tolerable, and so GPUs do not contain ECC; however, for HPC applications it is often desirable or required. Clearspeed products also have more cores than GPUs, but they run at a slower clock speed to reduce heat. Double precision is also available.

Specialised products such as Clearspeed processors have a much smaller market than that of GPUs. This gives GPUs a number of advantages, such as economies of scale, greater availability, and more money spent on R&D.

Intel Larrabee  Another alternative that is likely to generate much interest when it is released in 2009-2010 is Intel's Larrabee processor. This will be a many-core x86 processor with vector capability. It has the significant advantage over GPUs of making inter-processor communication possible. It should also solve a number of other problems that affect GPUs, such as the latency of CPU-GPU communication. It will initially be aimed at the graphics market, although specialised HPC products based on it are possible in the future. It is likely that it will also contain ECC to minimise 'soft errors'. AMD is also developing a similar product, currently named 'AMD Fusion'; however, few details have been released yet.

Cell Processor  A Cell chip contains one Power Processor Element (PPE) and several Synergistic Processing Elements (SPEs). The PPE acts mainly to control the SPEs, which do most of the calculations. Cell processors are quite similar to GPUs. For some applications GPUs outperform Cell processors, while for others the opposite is true.

FPGAs  Field Programmable Gate Arrays (FPGAs) are programmable semiconductor devices based around a matrix of configurable logic blocks connected via programmable interconnects. As opposed to normal microprocessors, where the design of the device is fixed by the manufacturer, FPGAs can be programmed to compute the exact algorithm required by a given application. This makes them very powerful and versatile. The main disadvantages are that they are usually quite difficult to program, and they are also slow if high precision is required. For certain tasks they are popular, however. Several time-consuming algorithms in astronomy where only 4-bit precision is necessary are very suitable for FPGAs, for example.

3 GPU Acceleration of an HPC Benchmark (Omitted)

4 Conclusions

GPUs, originally designed to satisfy the rendering computational demands of video games, potentially offer performance benefits for more general purpose applications, including HPC simulations. The differences between the GPU and standard CPU architectures mean that significant effort must be invested to enable efficient use of the GPU architecture for such applications.

We described the GPU architecture and the methods used for software development, and reported that there is potential for the use of GPUs in HPC: there have been notable successes in several research areas. We described the porting of an HPC benchmark application to the GPU architecture, where several degrees of optimisation were performed, and benchmarked the resulting codes against code run on a standard CPU. The GPU was seen to offer up to a factor of 7.5 performance improvement.

Vocabulary

1. render vt. to give, provide or pay (what is due); to present, submit or hand in; to perform (music or a play); to translate, to express in another language (often with in or into); to cause to be or become, to put into a certain state; to deliver or hand over; to plaster (a wall); to melt down (fat); to express or depict; to give up, surrender (with up); to give back (with back); to pay (tribute); to provide (help or a service) vi. to give compensation; to melt fat n. in the field of computer graphics, a render(er) is the component that shades and draws the image.

2. harness n. harness (for a horse); a safety harness (to prevent a fall) vt. to put a harness on (a horse, etc.); to bring under control and utilise.

3. susceptible adj. easily influenced or affected; easily moved emotionally; allergic, easily infected by; able to undergo; impressionable, sensitive; admitting of, capable of.

4. scope n. room or opportunity (for action or ability); the range (of matters handled or studied); -scope (an observing instrument); field of view; breadth of knowledge or understanding; the range (of an activity or influence); capability, power; length.

5. thread n. thread, fine cord; a thread (of an argument), a train of thought; something thread-like; a thin strip; a screw thread; clothing vt. to pass (a needle, thread, etc.) through; to load (film) into a projector; to string together; to feed (film, rope) into; to sew with thread; to weave threads into.

6. warp n. a bend or twist, a distortion; the warp (lengthwise threads in weaving) vt. & vi. to bend or twist out of shape vt. to distort (behaviour, judgement), to make perverse.

7. divergent adj. differing, conflicting; branching apart; divergent, spreading out.

8. texture n. feel, texture, consistency (of a material); mouthfeel; (of music or literature) the harmonious interweaving of its parts.

9. coalesce vi. to unite, to merge.

Important Sentences

[1] Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.

Naturally, people have become interested in whether the processing power offered by GPUs can be harnessed for more general-purpose computation. "as to": concerning. "harness": to bring under control and direct the force of; here it refers to controlling and directing the GPU's graphics processing power so as to boost general purpose calculations.

[2] Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.
