
June 2023

NVIDIA DGX SuperPOD: Next Generation Scalable Infrastructure for AI Leadership

Reference Architecture

Featuring NVIDIA DGX H100 Systems

RA-11333-001 v6

BCM 3.23.05

Abstract

The NVIDIA DGX SuperPOD™ with NVIDIA DGX™ H100 systems is the next generation of data center architecture for artificial intelligence (AI). It is designed to provide the levels of computing performance required to solve advanced computational challenges in AI, high performance computing (HPC), and hybrid applications that combine the two to improve prediction performance and time-to-solution. The DGX SuperPOD is based upon the infrastructure built at NVIDIA for internal research purposes and is designed to solve the most challenging computational problems of today. Systems based on the DGX SuperPOD architecture have been deployed at customer data centers and cloud service providers around the world.

To achieve the most scalability, the DGX SuperPOD is powered by several key NVIDIA technologies, including:

> NVIDIA DGX H100 system: the most powerful computational building block for AI and HPC.
> NVIDIA NDR (400 Gbps) InfiniBand: the highest performance, lowest latency, and most scalable network interconnect.
> NVIDIA NVLink: networking technology that connects GPUs at the NVLink layer to provide unprecedented performance for the most demanding communication patterns.

The DGX SuperPOD architecture is managed by NVIDIA solutions including NVIDIA Base Command™, NVIDIA AI Enterprise, CUDA, and Magnum IO™. These technologies keep the system running at the highest levels of availability and performance, and NVIDIA Enterprise Support (NVES) keeps all components and applications running smoothly.

This reference architecture (RA) discusses the components that define the scalable and modular architecture of the DGX SuperPOD. The system is built upon building blocks of scalable units (SU), each containing 32 DGX H100 systems, which provides for rapid deployment of systems of multiple sizes. This RA includes details regarding the SU design and specifics of the InfiniBand, NVLink network, and Ethernet fabric topologies, storage system specifications, recommended rack layouts, and wiring guides.


Contents

Key Components of the DGX SuperPOD
  NVIDIA DGX H100 System
  NVIDIA InfiniBand Technology
  Runtime and System Management
Components
Design Requirements
  System Design
  InfiniBand Fabrics
    Compute Fabric
    Storage Fabric
  Ethernet Fabrics
    In-Band Management Network
    Out-of-Band Management Network
  Storage Requirements
    High-Performance Storage
    User Storage
DGX SuperPOD Architecture
Network Fabrics
  Compute—InfiniBand Fabric
  Storage—InfiniBand Fabric
  In-Band Management Network
  Out-of-Band Management Network
Storage Architecture
DGX SuperPOD Software
  NVIDIA Base Command
  NVIDIA NGC
  NVIDIA AI Enterprise
Summary
Appendix A. Major Components


Key Components of the DGX SuperPOD

The DGX SuperPOD architecture has been designed to maximize performance for state-of-the-art model training, scale to exaflops of performance, provide the highest performance to storage, and support all customers in the enterprise, higher education, research, and the public sector. It is a digital twin of the main NVIDIA research and development system, meaning the company's software, applications, and support structure are first tested and vetted on the same architecture. Using SUs, system deployment times are reduced from months to weeks. Leveraging the DGX SuperPOD designs reduces time-to-solution and time-to-market of next generation models and applications.

The DGX SuperPOD is the integration of key NVIDIA components, as well as storage solutions from partners certified to work in a DGX SuperPOD environment.

NVIDIA DGX H100 System

The NVIDIA DGX H100 system (Figure 1) is an AI powerhouse that enables enterprises to expand the frontiers of business innovation and optimization. The DGX H100 system, the fourth-generation NVIDIA DGX system, delivers AI excellence in an eight-GPU configuration. The NVIDIA Hopper GPU architecture provides the latest technologies, such as the Transformer Engine and fourth-generation NVLink, that bring months of computational effort down to days and hours on some of the largest AI/ML workloads.

Figure 1. DGX H100 system


Some of the key highlights of the DGX H100 system over the DGX A100 system include:

> Up to 9X more performance with 32 petaFLOPS at FP8 precision.
> Dual 56-core 4th Gen Intel® Xeon® Scalable processors with PCIe 5.0 support and DDR5 memory.
> 2X faster networking and storage at 400 Gbps InfiniBand/Ethernet with NVIDIA ConnectX®-7 smart network interface cards (SmartNICs).
> 1.5X higher bandwidth per GPU at 900 GBps with fourth-generation NVIDIA NVLink.
> 640 GB of aggregate HBM3 memory with 24 TB/s of aggregate memory bandwidth, 1.5X higher than the DGX A100 system.
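The aggregate memory figures in the last bullet follow directly from the per-GPU values; a quick consistency check (the per-GPU bandwidth below is simply the 24 TB/s aggregate divided by eight, not an independently quoted spec):

```python
# Sanity-check the aggregate DGX H100 memory figures from per-GPU values.
GPUS_PER_SYSTEM = 8
HBM3_PER_GPU_GB = 80                       # each H100 GPU carries 80 GB HBM3

aggregate_hbm3_gb = GPUS_PER_SYSTEM * HBM3_PER_GPU_GB
per_gpu_bw_tbps = 24 / GPUS_PER_SYSTEM     # implied by the 24 TB/s aggregate

print(aggregate_hbm3_gb, per_gpu_bw_tbps)  # 640 3.0
```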

NVIDIA InfiniBand Technology

InfiniBand is a high-performance, low-latency, RDMA-capable networking technology, proven over 20 years in the harshest compute environments to provide the best inter-node network performance. Driven by the InfiniBand Trade Association (IBTA), it continues to evolve and lead data center network performance.

The latest generation of InfiniBand, NDR, has a peak speed of 400 Gbps per direction. It is backward compatible with previous generations of the InfiniBand specification. InfiniBand is more than just peak performance: it provides additional features to optimize performance, including adaptive routing (AR), collective communication with SHARP™, dynamic network healing with SHIELD™, and support for several network topologies, including fat tree, Dragonfly, and multi-dimensional torus, to build the largest fabrics and compute systems possible.
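The link-rate figures translate into byte bandwidth with simple arithmetic; a minimal sketch, where the eight-links-per-node figure is the eight NDR400 compute connections per DGX H100 system described in this RA, and 1 GBps is taken as 10^9 bytes/s:

```python
# Back-of-the-envelope byte bandwidth for NDR InfiniBand links.

def gbps_to_gbyteps(gbps: float) -> float:
    """Convert a gigabit/s link rate to gigabytes/s (8 bits per byte)."""
    return gbps / 8

ndr_per_link = gbps_to_gbyteps(400)  # one NDR port, per direction
ndr_per_node = ndr_per_link * 8      # eight compute-fabric links per DGX H100
print(ndr_per_link, ndr_per_node)    # 50.0 400.0
```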

Runtime and System Management

The DGX SuperPOD RA represents the best practices for building high-performance data centers. There is flexibility in how these systems can be presented to customers and users. NVIDIA Base Command software is used to manage all DGX SuperPOD deployments.

The DGX SuperPOD can be deployed on-premises, meaning the customer owns and manages the hardware as a traditional system. This can be within a customer's data center or co-located at a commercial data center, but the customer owns the hardware. For on-premises solutions, the customer has the option to operate the system with a secure, cloud-native interface through NVIDIA NGC™.


Components

The components of the DGX SuperPOD are described in Table 1.

Table 1. Four SU, 127-node DGX SuperPOD components

| Component | Technology | Description |
|---|---|---|
| Compute nodes | 127× NVIDIA DGX H100 systems with eight 80 GB H100 GPUs | Fourth generation of the world's premier purpose-built AI systems, featuring NVIDIA H100 Tensor Core GPUs, fourth-generation NVIDIA NVLink®, and third-generation NVIDIA NVSwitch™ technologies |
| Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | Rail-optimized, full fat-tree network with eight NDR400 connections per system |
| Storage fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | Fabric optimized to match the peak performance of the configured storage array |
| Compute/storage fabric management | NVIDIA Unified Fabric Manager, Enterprise Edition | NVIDIA UFM combines enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to manage scale-out InfiniBand data centers |
| In-band management network | NVIDIA SN4600C switch | 64-port 100 Gbps Ethernet switch providing high port density with high performance |
| Out-of-band (OOB) management network | NVIDIA SN2201 switch | 48-port 1 Gbps Ethernet switch leveraging copper ports to minimize complexity |
| DGX SuperPOD software stack | NVIDIA Base Command Manager | Cluster management for the DGX SuperPOD |
| | NVIDIA AI Enterprise | Best-in-class development tools and frameworks for the AI practitioner, and reliable management and orchestration for IT professionals |
| | Magnum IO | NVIDIA Magnum IO enables increased performance for AI and HPC |
| | NVIDIA NGC | The NGC catalog provides a collection of GPU-optimized containers for AI and HPC |
| User environment | Slurm | Slurm is a classic workload manager used to manage complex workloads in a multi-node, batch-style compute environment |


Design Requirements

The DGX SuperPOD is designed to minimize system bottlenecks throughout the tightly coupled configuration to provide the best performance and application scalability. Each subsystem has been thoughtfully designed to meet this goal. In addition, the overall design remains flexible so that it can be tailored to better integrate into existing data centers.

System Design

The DGX SuperPOD is optimized for a customer's particular workload of multi-node AI, HPC, and hybrid applications:

> A modular architecture based on SUs of 32 DGX H100 systems each.
> A fully tested system that scales to four SUs, with larger deployments built based on customer requirements.
> A rack design that supports one, two, or four DGX H100 systems per rack, so that the rack layout can be modified to accommodate different data center requirements.
> Storage partner equipment that has been certified to work in DGX SuperPOD environments.
> Full system support (including compute, storage, network, and software) provided by NVIDIA Enterprise Support (NVES).

InfiniBand Fabrics

Compute Fabric

> The InfiniBand compute fabric is rail-optimized to the top layer of the fabric.
> The InfiniBand fabric is a balanced, full fat tree.
> Managed NDR switches are used throughout the design to provide better management of the fabric.
> The fabric is designed to support the latest SHARPv3 features.

Storage Fabric

The storage fabric provides high bandwidth to shared storage. It also has these characteristics:

> It is independent of the compute fabric, to maximize both storage and application performance.
> It provides single-node bandwidth of at least 40 GBps to each DGX H100 system.
> Storage is provided over InfiniBand and leverages RDMA to provide maximum performance and minimize CPU overhead.
> It is flexible and can be scaled to meet specific capacity and bandwidth requirements.
> User-accessible management nodes provide access to shared storage.

NVIDIADGXSuperPOD:NextGenerationScalableInfrastructureforAILeadershipRA-11333-001v6|5

Ethernet Fabrics

Multiple Ethernet fabrics are used to support management communications, Ethernet-based storage targets, Internet access, and other traditional TCP/IP-based services.

In-Band Management Network

> The in-band management network fabric is Ethernet-based and is used for node provisioning, data movement, Internet access, and other services that must be accessible by the users.
> The in-band management network connections for compute and management servers operate at 100 Gbps and are bonded for resiliency.

Out-of-Band Management Network

The OOB management network connects all the baseboard management controller (BMC) ports, as well as other devices that should be physically isolated from system users.

Storage Requirements

The DGX SuperPOD compute architecture must be paired with a high-performance, balanced storage system to maximize overall system performance. The DGX SuperPOD is designed to use two separate storage systems, high-performance storage (HPS) and user storage, optimized for the key operations of throughput and parallel I/O, as well as higher IOPS and metadata workloads.

High-Performance Storage

HPS must provide:

> A high-performance, resilient, POSIX-style file system optimized for multi-threaded read and write operations across multiple nodes.
> Native InfiniBand support.
> Transparent caching of data in local system RAM.
> Transparent use of local disk for caching of larger datasets.

User Storage

User storage must:

> Be designed for high metadata performance, IOPS, and key enterprise features such as checkpointing. This is different from the HPS, which is optimized for parallel I/O and large capacity.
> Communicate over Ethernet to provide a secondary path to storage so that, in the event of a failure of the storage fabric or HPS, nodes can still be accessed and managed by administrators in parallel.


DGX SuperPOD Architecture

The DGX SuperPOD architecture is a combination of DGX systems, InfiniBand and Ethernet networking, management nodes, and storage.

Figure 2 shows the rack layout of a single SU. In this example, power consumption per rack exceeds 40 kW. The rack layout can be adjusted to meet local data center requirements, such as maximum power per rack and the placement of DGX systems relative to supporting equipment, to meet local needs for power and cooling distribution.

Figure 2. Complete single SU rack layout


Figure 3 shows a typical management rack configuration with InfiniBand and Ethernet switches, management servers, storage arrays, and UFM appliances.

Figure 3. Typical management rack


Network Fabrics

Several networks are deployed on the DGX SuperPOD. The compute fabric is used for inter-node communication by the applications. A separate storage fabric is used to isolate storage traffic. There are two Ethernet fabrics, for in-band and OOB management. Requirements for each network are detailed below, followed by the network designs.

Figure 4 shows the different ports on the back of the DGX H100 CPU tray and the connectivity provided. The InfiniBand compute fabric ports in the middle use a two-port transceiver to access all eight GPUs. Each pair of in-band Ethernet management and InfiniBand storage ports provides parallel pathways into the DGX H100 system for increased performance. The OOB port is used for BMC access. There is an additional LAN port next to the BMC port, but it is not used in the DGX SuperPOD.

Figure 4. DGX H100 network ports


Compute—InfiniBand Fabric

Figure 5 shows the compute fabric layout for the full 127-node DGX SuperPOD. Each group of 32 nodes is rail-aligned. Traffic per rail of the DGX H100 systems is always one hop away from the other 31 nodes in an SU. Traffic between nodes, or between rails, traverses the spine layer.

Figure 5. Compute InfiniBand fabric for the full 127-node DGX SuperPOD

Table 2 shows the number of cables and switches required for the compute fabric for different SU sizes.

Table 2. Compute fabric component count

| SU Count | Cluster Size (# Nodes) | Cluster Size (# GPUs) | Leaf Switch Count | Spine Switch Count | Compute + UFM Node Cable Count | Spine-Leaf Cable Count |
|---|---|---|---|---|---|---|
| 1 | 31¹ | 248 | 8 | 4 | 252 | 256 |
| 2 | 63 | 504 | 16 | 8 | 508 | 512 |
| 3 | 95 | 760 | 24 | 16 | 764 | 768 |
| 4 | 127 | 1016 | 32 | 16 | 1020 | 1024 |

1. This is a 32-node-per-SU design; however, one DGX node must be removed to accommodate UFM connectivity.

Building systems by SU provides the most efficient designs. However, if a different node count is required due to budgetary constraints, data center constraints, or other needs, the fabric should be designed to support the full SU, including leaf switches and leaf-spine cables, leaving unused the portion of the fabric where these nodes would be located. This ensures optimal traffic routing and consistent performance across all portions of the fabric.
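Most of the Table 2 counts can be derived from the SU count; the sketch below reproduces them. The spine counts are copied from the table (they do not follow one closed-form rule), and the four extra node cables per cluster are inferred to be the UFM links:

```python
# Reproduce the compute-fabric component counts in Table 2 from the SU count.
SPINE_SWITCHES = {1: 4, 2: 8, 3: 16, 4: 16}  # taken from Table 2, not derived

def compute_fabric_counts(su_count: int) -> dict:
    nodes = 32 * su_count - 1        # one DGX node removed for UFM connectivity
    return {
        "nodes": nodes,
        "gpus": 8 * nodes,                    # eight H100 GPUs per system
        "leaf_switches": 8 * su_count,        # one leaf per rail per SU
        "spine_switches": SPINE_SWITCHES[su_count],
        "node_cables": 8 * nodes + 4,         # + 4 inferred UFM cables
        "spine_leaf_cables": 256 * su_count,  # 32 uplinks per leaf switch
    }
```

For the full four-SU system this yields 127 nodes, 1016 GPUs, and 1020 node cables, matching the last row of Table 2.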


Storage—InfiniBand Fabric

The storage fabric employs an InfiniBand network fabric because it is essential to maximizing bandwidth (Figure 6): the per-node I/O for the DGX SuperPOD must exceed 40 GBps. High-bandwidth requirements, combined with advanced fabric management features such as congestion control and AR, provide significant benefits for the storage fabric.

Figure 6. InfiniBand storage fabric logical design

The storage fabric uses MQM9700-NS2F switches (Figure 7). The storage devices are connected at a 1:1 port-to-uplink ratio. The DGX H100 system connections are slightly oversubscribed, with a ratio near 4:3, with adjustments as needed to allow for more storage flexibility regarding cost and performance.

Figure 7. MQM9700-NS2F switch
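Oversubscription here is simply downstream (node-facing) bandwidth over upstream (spine-facing) bandwidth on a leaf switch; a tiny illustration with hypothetical port counts (not taken from this document), assuming all ports run at the same rate:

```python
# Leaf-switch oversubscription: node-facing ports over uplink ports.
# Port counts below are illustrative only.
from fractions import Fraction

def oversubscription(node_ports: int, uplink_ports: int) -> Fraction:
    return Fraction(node_ports, uplink_ports)

print(oversubscription(16, 12))  # 4/3, the "near 4:3" ratio cited above
```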


In-Band Management Network

The in-band management network provides several key functions:

> Connects all the services that manage the cluster.
> Enables access to the home file system and storage pool.
> Provides connectivity for in-cluster services such as Base Command Manager and Slurm, and to services outside of the cluster such as the NGC registry, code repositories, and data sources.

Figure 8 shows the logical layout of the in-band Ethernet network. The in-band network connects the compute nodes and management nodes. In addition, the OOB network is connected to the in-band network to provide high-speed interfaces from the management nodes, supporting parallel operations to devices connected to the OOB fabric, such as storage.

Figure 8. In-band Ethernet network

The in-band management network uses SN4600C switches (Figure 9).

Figure 9. SN4600C switch


Out-of-Band Management Network

Figure 10 shows the OOB Ethernet fabric. It connects the management ports of all devices, including DGX systems and management servers, storage, networking gear, rack PDUs, and all other devices. These are separated onto their own fabric because there is no use case where users need access to these ports, and they are secured using logical network separation.

Figure 10. Logical OOB management network layout

The OOB management network uses SN2201 switches (Figure 11).

Figure 11. SN2201 switch


Storage Architecture

Data, lots of data, is the key to developing accurate deep learning (DL) models. Data volume continues to grow exponentially, and the data used to train individual models continues to grow as well. Data format, not just volume, can play a key factor in the rate at which data is accessed. The performance of the DGX H100 system is up to nine times faster than its predecessor; to achieve this in practice, storage system performance must scale commensurately.

The key I/O operation in DL training is re-read. It is not just that data is read: it must be reused again and again due to the iterative nature of DL training. Pure read performance is still important, as some model types can train in a fraction of an epoch (for example, some recommender models), and inference on existing models can be highly I/O intensive, much more so than training. Write performance can also be important. As DL models grow in size and time-to-train, writing checkpoints is necessary for fault tolerance. Checkpoint files can be terabytes in size and, while not written frequently, are typically written synchronously, which blocks the forward progress of the DL model.

Ideally, data is cached during the first read of the dataset, so that data does not have to be retrieved across the network again. Shared file systems typically use RAM as the first layer of cache. Reading files from cache can be an order of magnitude faster than from remote storage. In addition, the DGX H100 system provides local NVMe storage that can also be used for caching or staging data.
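The staging pattern described above can be sketched in a few lines. The paths and helper name are hypothetical, and a real data loader would add eviction, concurrency handling, and capacity checks:

```python
# First read copies a dataset file from shared storage to local NVMe;
# repeat reads (later epochs) are served from the local copy.
import shutil
from pathlib import Path

def cached_read(shared_path: Path, nvme_cache: Path) -> bytes:
    nvme_cache.mkdir(parents=True, exist_ok=True)
    local_copy = nvme_cache / shared_path.name
    if not local_copy.exists():          # first epoch: pull over the network
        shutil.copy2(shared_path, local_copy)
    return local_copy.read_bytes()       # later epochs: local NVMe speed
```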


The DGX SuperPOD is designed to support all workloads, but the storage performance required to maximize training performance can vary depending on the type of model and dataset. The guidelines in Table 3 and Table 4 are provided to help determine the I/O levels required for different types of models.

Table 3. Storage performance requirements

| Performance Level | Work Description | Dataset Size |
|---|---|---|
| Good | Natural language processing (NLP) | Datasets generally fit within local cache |
| Better | Image processing with compressed images (for example, ImageNet) | Many to most datasets can fit within the local system's cache |
| Best | Training with 1080p, 4K, or uncompressed images; offline inference; ETL | Datasets are too large to fit into cache; massive first-epoch I/O requirements; workflows that read the dataset only once |

Table 4. Guidelines for storage performance

| Performance Characteristic | Good (GBps) | Better (GBps) | Best (GBps) |
|---|---|---|---|
| Single-node read | 4 | 8 | 40 |
| Single-node write | 2 | 4 | 20 |
| Single SU aggregate system read | 15 | 40 | 125 |
| Single SU aggregate system write | 7 | 20 | 62 |
| 4 SU aggregate system read | 60 | 160 | 500 |
| 4 SU aggregate system write | 30 | 80 | 250 |

Even for the Best category above, it is desirable that single-node read performance be closer to the maximum network performance of 80 GBps.

Note: As datasets get larger, they may no longer fit in cache on the local system. Pairing large datasets that do not fit in cache with very fast GPUs can create a situation where it is difficult to achieve maximum training performance. NVIDIA GPUDirect® Storage (GDS) provides a way to read data from the remote file system or local NVMe directly into GPU memory, providing higher sustained I/O performance with lower latency. Using the storage fabric on the DGX SuperPOD, a GDS-enabled application should be able to read data at over 40 GBps directly into the GPUs.


High-speed storage provides a shared view of an organization's data to all nodes. It must be optimized for small, random I/O patterns, and provide both high peak node performance and high aggregate file system performance, to meet the variety of workloads an organization may encounter. High-speed storage should support efficient multi-threaded reads and writes from a single system, but most DL workloads will be read-dominant.

Use cases in automotive and other computer-vision-related tasks, where 1080p images are used for training (and in some cases are uncompressed), involve datasets that easily exceed 30 TB in size. In these cases, 4 GBps per GPU of read performance is needed.

While NLP cases often do not require as much read performance for training, peak read and write performance is needed for creating and reading checkpoint files. This is a synchronous operation, and training stops during this phase. If you are looking for the best end-to-end training performance, do not ignore I/O operations for checkpoints.

The preceding metrics assume a variety of workloads, datasets, and the need to train both locally and directly from the high-speed storage system. It is best to characterize workloads and organizational needs before finalizing performance and capacity requirements.
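One way to apply the checkpoint guidance above is a stall estimate: a synchronous checkpoint blocks training for roughly its size divided by the available write bandwidth. The checkpoint size below is hypothetical, not a measurement from this document:

```python
# Seconds of blocked training per synchronous checkpoint write.
# Uses decimal units (1 TB = 1000 GB), matching the GBps figures in Table 4.

def checkpoint_stall_seconds(checkpoint_tb: float, write_gbyteps: float) -> float:
    return checkpoint_tb * 1000 / write_gbyteps

# A hypothetical 2 TB checkpoint at the single-SU "Best" write rate of 62 GBps:
print(round(checkpoint_stall_seconds(2.0, 62), 1))  # ~32.3 seconds
```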


DGX SuperPOD Software

The DGX SuperPOD is an integrated hardware and software solution. The included software (Figure 12) is optimized for AI from top to bottom: from the accelerated frameworks and workflow management through to system management and low-level operating system (OS) optimizations, every part of the stack is designed to maximize the performance and value of the DGX SuperPOD.

Figure 12.
