Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA

Deep Learning with NVIDIA Architecture Guide

Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula

Abstract

There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming. Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA is a complete solution designed to support all phases of Deep Learning. It incorporates the latest CPU, GPU, memory, network, storage, and software technologies and delivers impressive performance for both the training and inference phases. This document presents the architecture of this Deep Learning solution.

August 2018

Dell EMC Reference Architecture

Revisions

Date          Description
August 2018   Initial release

publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

© August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.


Table of Contents

Revisions
Table of Contents
Executive summary
Solution Overview
Solution Architecture
Head Node Configuration
Shared Storage via NFS over InfiniBand
Compute Node Configuration
GPU
Processor recommendation for Head Node and Compute Nodes
Memory recommendation for Head Node and Compute Nodes
Isilon Storage
Network
Software
Deep Learning Training and Inference Performance and Analysis
Deep Learning Training
FP16 vs FP32
V100 vs P100
V100-SXM2 vs V100-PCIe
Scaling Performance with Multi-GPU
Storage Performance
Deep Learning Inference
NVIDIA DIGITS Tool and the Deep Learning Solution
Containers for Deep Learning
Singularity Containers
Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
The Data Scientist Portal
Creating and Running a Notebook
Tensorboard Integration
Slurm Scheduler
Conclusions and Future Work

Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA, an Architecture Guide | v1.0

Executive summary

Deep Learning techniques have enabled great success in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then to make corresponding predictions. The success is due to a combination of improved algorithms, access to large datasets and increased computational power. To be effective at enterprise scale, the computational intensity of Deep Learning neural network training requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for Deep Learning use-cases, can make the difference in the business outcomes of applying Deep Learning techniques. In addition to several options for processors, accelerators and storage technologies, there are multiple Deep Learning software frameworks and libraries that must be considered. These software components are under active development, updated frequently and cumbersome to manage. It is complicated to simply build and run Deep Learning applications successfully, leaving little time to focus on the actual business problem.

To resolve this complexity challenge, Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This document presents the architecture of this Deep Learning solution, including details on the design choice for each component. The performance aspects of this complete solution have also been characterized and are described here.

AUDIENCE

This document is intended for organizations interested in accelerating Deep Learning with advanced computing and data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.


Solution Overview

Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The solution includes fully integrated and optimized hardware, software, and services including deployment, integration and support, making it easier for organizations to start and grow their Deep Learning practice.

The high level overview of Dell EMC Ready Solutions for AI - Deep Learning is shown in Figure 1.

Figure 1: Overview of Dell EMC Ready Solutions for AI - Deep Learning

Data Scientist Portal: This is a new portal for data scientists created for this solution. It enables data scientists, who should not need to be experts in cluster technologies, to use a simple web portal to take advantage of the underlying technology. The scientists can write, train and run inference for different Deep Learning models within Jupyter Notebook, which includes Python 2, Python 3, R and other kernels.

Bright Cluster Manager and Bright Machine Learning: Bright Cluster Manager is used for the monitoring, deployment, management, and maintenance of the cluster. Bright Machine Learning (ML) includes the deep learning frameworks, libraries, compilers and so on.

Deep Learning Frameworks and Libraries: This category includes TensorFlow, MXNet, Caffe2, CUDA, cuDNN, and NCCL. The latest versions of these frameworks and libraries are integrated into the solution.

Infrastructure: The infrastructure comprises a cluster with a master node, compute nodes, shared storage and networks. In this instance of the solution, the master node is a Dell EMC PowerEdge R740xd, each compute node is a PowerEdge C4140 with NVIDIA Tesla GPUs, the storage includes Network File System (NFS) and Isilon, and the networks include Ethernet and Mellanox InfiniBand.


Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis on this solution was conducted in the HPC and AI Innovation Lab and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads. Section 4 then describes containerization techniques for Deep Learning. Section 5 has details on the Data Scientist Portal developed by Dell EMC. Conclusions and future directions complete the document in Section 6.


Solution Architecture

The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The roles of the master node, or head node, can include deploying the cluster of compute nodes, managing the compute nodes, handling user logins and access, providing a compilation environment, and submitting jobs to compute nodes. The compute nodes are the workhorses and execute the submitted jobs. Bright Cluster Manager, software from Bright Computing, is used to deploy and manage the whole cluster.

Figure 2 shows the high-level overview of the cluster, which includes one head node, n compute nodes, the local disks on the cluster head node exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch as it uses IPoIB to export the NFS share to the compute nodes. All compute nodes and the head node are also connected to a 1 Gigabit Ethernet management switch which is used by Bright Cluster Manager to administer the cluster. An Isilon storage solution is connected to the FDR-40GigE gateway switch so that it can be accessed by the head node and all compute nodes.

Figure 2: The overview of the cluster

Head Node Configuration

The Dell EMC PowerEdge R740xd is recommended for the role of the head node. This is a two-socket, 2U rack server that can support the memory capacities, I/O needs and network options required of the head node. The head node will perform the cluster administration, cluster management, NFS server, user login node and compilation node roles.

The suggested configuration of the PowerEdge R740xd is listed in Table 1. It includes 12 x 12TB NL SAS local disks that are formatted as an XFS file system and exported via NFS to the compute nodes over IPoIB. RAID 50 is used instead of RAID 6/RAID 60 because of the faster rebuild time and capacity advantages it provides. Details of each configuration choice are described in the following sections. For more information on this server model please refer to the PowerEdge R740/740xd Technical Guide.


Table 1: PowerEdge R740xd configurations

Component                Details
Server Model             PowerEdge R740xd
Processor                2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                   24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Disks                    12 x 12TB NL SAS RAID 50 (Recommended 10+ drives)
I/O & Ports              Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter          1 x InfiniBand EDR adapter
Out of Band Management   iDRAC9 Enterprise with Lifecycle Controller
Power Supplies           Titanium 1100W, Platinum
Storage Controllers      PowerEdge RAID Controller (PERC) H730p

Shared Storage via NFS over InfiniBand

The default shared storage system for the cluster is provided over NFS. It is built using 12 x 12TB NL SAS disks that are local to the head node and configured in RAID 50 with two parity disks. This provides a usable capacity of 120TB (109TiB). RAID 50 was chosen because it offers balanced performance and a shorter rebuild time than RAID 6 or RAID 60 (since RAID 50 has fewer parity disks than RAID 6 or RAID 60). This 120TB volume is formatted as an XFS file system and exported to the compute nodes via NFS over IPoIB.
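The capacity figures above can be checked with a few lines of Python. This is a minimal sketch, assuming the "two parity disks" correspond to a RAID 50 layout of two RAID 5 groups that each give up one disk's worth of capacity to parity; the function names are illustrative and are not part of any Dell EMC tool.

```python
def raid50_usable_tb(num_disks: int, disk_tb: float, num_groups: int = 2) -> float:
    """Usable capacity of RAID 50: each RAID 5 group loses one disk to parity.

    num_groups=2 is an assumption derived from the 'two parity disks' in the text.
    """
    if num_disks % num_groups != 0:
        raise ValueError("disks must divide evenly across the RAID 5 groups")
    data_disks = num_disks - num_groups
    return data_disks * disk_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


usable_tb = raid50_usable_tb(num_disks=12, disk_tb=12)
print(f"usable: {usable_tb:.0f} TB = {tb_to_tib(usable_tb):.0f} TiB")  # usable: 120 TB = 109 TiB
```

The same function shows the capacity trade-off mentioned in the text: a RAID 60 layout with two groups would give up four disks instead of two.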

In the default configuration, both home directories and shared application and library install locations are hosted on this NFS share. For solutions which require larger shared storage capacity, the Isilon F800 is an alternative option and is described in Section 2.5. A comparison between various storage subsystems is provided in Section 3.1.5, including this NL SAS NFS, the Isilon, and smaller test configurations using SSDs and NVMe devices.

Compute Node Configuration

Deep Learning methods would not have gained success without the computational power to drive the iterative training process. Therefore, a key component of Deep Learning solutions is highly capable nodes that can support compute intensive workloads. State-of-the-art neural network models in Deep Learning have more than 100 layers, which requires the computation to scale across many compute nodes to produce timely results. The Dell EMC PowerEdge C4140, an accelerator-optimized, high density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, in either the V100-SXM2 or the V100-PCIe model. Figure 3 shows the CPU-GPU and GPU-GPU connection topology of a compute node.

The detailed configuration of each PowerEdge C4140 compute node is listed in Table 2.


Figure 3: The topology of a compute node

Table 2: PowerEdge C4140 configurations

Component                Details
Server Model             PowerEdge C4140
Processor                2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                   24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Local Disks              120GB SSD, 1.6TB NVMe
I/O & Ports              Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter          1 x InfiniBand EDR adapter
GPU                      4 x V100-SXM2 16GB
Out of Band Management   iDRAC9 Enterprise with Lifecycle Controller
Power Supplies           2000W hot-plug Redundant Power Supply Unit (PSU)

GPU

The NVIDIA Tesla V100 is the latest data center GPU available to accelerate Deep Learning. Powered by

engineers to tackle challenges that were once difficult. With 640 Tensor Cores, Tesla V100 is the first GPU to break the 100 teraflops (TFLOPS) barrier of Deep Learning performance.

Table 3: V100-SXM2 vs V100-PCIe

Description                               V100-PCIe   V100-SXM2
CUDA Cores                                5120        5120
GPU Max Clock Rate (MHz)                  1380        1530
Tensor Cores                              640         640
Memory Bandwidth (GB/s)                   900         900
NVLink Bandwidth (GB/s) (bi-direction)    N/A         300
Deep Learning (Tensor TFLOPS)             112         120
TDP (Watts)                               250         300

The Tesla V100 product line includes two variants, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. In the V100-PCIe, all GPUs communicate with each other over PCIe buses. With the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink. In use-cases where multiple GPUs are required, the V100-SXM2 models provide the advantage of faster GPU-to-GPU communication over the NVLink interconnect when compared to PCIe. V100-SXM2 GPUs provide six NVLinks per GPU for bi-directional communication. The bandwidth of each NVLink is 25GB/s in each direction and all four GPUs within a node can communicate at the same time, therefore the theoretical peak bandwidth is 6*25*4=600GB/s in bi-direction. However, the theoretical peak bandwidth using PCIe is only 16*2=32GB/s, as the GPUs can only communicate in order, which means the communication cannot be done in parallel. So in theory the data communication with NVLink could be up to 600/32, or roughly 18x, faster than PCIe. The evaluation of this performance advantage in real models is discussed in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 instead of V100-PCIe GPUs.
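The bandwidth arithmetic in the paragraph above can be reproduced as follows. This is only a sketch of the theoretical peak calculation using the numbers quoted in the text; reading the PCIe figure of 16 as the per-direction bandwidth of one x16 link is an assumption, and real workloads will not reach these peaks.

```python
NVLINKS_PER_GPU = 6        # links per V100-SXM2 GPU (from the text)
NVLINK_GBPS_PER_LINK = 25  # GB/s per link, one direction (from the text)
GPUS_PER_NODE = 4          # GPUs in one PowerEdge C4140 (from the text)
PCIE_GBPS_ONE_WAY = 16     # GB/s, assumed to be one direction of an x16 link


def nvlink_peak_gbps() -> int:
    """Aggregate NVLink peak when all GPUs in a node communicate at once."""
    return NVLINKS_PER_GPU * NVLINK_GBPS_PER_LINK * GPUS_PER_NODE


def pcie_peak_gbps() -> int:
    """PCIe peak: transfers are serialized, so only one link's two directions count."""
    return PCIE_GBPS_ONE_WAY * 2


ratio = nvlink_peak_gbps() / pcie_peak_gbps()
print(nvlink_peak_gbps(), pcie_peak_gbps(), ratio)  # 600 32 18.75
```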

Processor recommendation for Head Node and Compute Nodes

The processor chosen for the head node and compute nodes is the Intel® Xeon® Gold 6148 CPU. This is the latest Intel® Xeon® Scalable processor, with 20 physical cores which support 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for state-of-the-art convolutional neural networks, so the Gold 6148 CPU is a reasonable choice. Using the same CPU model for the head node and the compute nodes also makes this a consistent choice across the cluster.

Memory recommendation for Head Node and Compute Nodes

The recommended memory for the head node is 24 x 16GB 2666MT/s DIMMs, for a total memory size of 384GB. This was chosen based on the following considerations:

Capacity: An ideal configuration must support a system memory capacity that is larger than the total size of GPU memory. Each compute node has 4 GPUs and each GPU has 16GB of memory, so the system memory must be at least 16GB x 4 = 64GB. The head node memory also affects I/O performance. For the NFS service, larger memory reduces disk read operations since the NFS service needs to send out data from memory. 16GB DIMMs demonstrate the best performance/dollar value.

DIMM configuration: Choices like 24 x 16GB or 12 x 32GB provide the same capacity of 384GB of system memory, but according to our studies, as shown in Figure 4, the combination of 24 x 16GB DIMMs provides 11% better performance than 12 x 32GB. The results shown here were obtained on the Intel Xeon Platinum 8180 processor, but the same trends will apply across other models in the Intel Scalable Processor Family including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.

Serviceability: The memory configurations of the head node and compute nodes are designed to be similar to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be stocked for replacement, and in urgent cases, if a memory module in the head node needs to be replaced immediately, a DIMM from a compute node can be used temporarily to restore the head node until replacement modules arrive.

Figure 4: Relative memory bandwidth for different system capacities
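The capacity reasoning above reduces to two small checks: the DIMM population must add up to the stated system memory, and that total must exceed the combined GPU memory of a node. The sketch below uses only the numbers quoted in the text; the function names are illustrative.

```python
def system_memory_gb(dimm_count: int, dimm_gb: int) -> int:
    """Total system memory for a uniform DIMM population."""
    return dimm_count * dimm_gb


def min_system_memory_gb(num_gpus: int, gpu_memory_gb: int) -> int:
    """Lower bound from the text: system memory must cover all GPU memory in a node."""
    return num_gpus * gpu_memory_gb


recommended = system_memory_gb(24, 16)   # 24 x 16GB DIMMs
alternative = system_memory_gb(12, 32)   # 12 x 32GB DIMMs, same capacity
required = min_system_memory_gb(num_gpus=4, gpu_memory_gb=16)

print(recommended, alternative, required)  # 384 384 64
print(recommended >= required)             # True
```

Both DIMM layouts satisfy the capacity bound, which is why the choice between them rests on the measured bandwidth difference rather than on capacity.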

Isilon Storage

Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics. As performance and capacity demands increase, both can be scaled simply and non-disruptively, allowing applications and users to continue working.

The Dell EMC Isilon OneFS operating system powers all Dell EMC Isilon scale-out NAS storage solutions and has the following features.

A high degree of scalability, with grow-as-you-go flexibility
High efficiency to reduce costs
Multi-protocol support such as SMB, NFS, HTTP and HDFS to maximize operational flexibility
Enterprise data protection and resiliency
Robust security options

The recommended Isilon storage is the Isilon F800 all-flash scale-out NAS storage. Dell EMC Isilon F800 all-flash scale-out NAS storage is uniquely suited for modern Deep Learning applications, delivering the flexibility to deal with any data type, scalability for data sets ranging in the PBs, and concurrency to support the massive

-out architecture eliminates the I/O bottleneck between

can scale-out up to 68PB with up to 540GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training, provide more accurate insights with deeper data sets, and deliver a

The Isilon storage can be used if the local NFS storage capacity is insufficient for the environment. If the Isilon is used in conjunction with the local NFS storage, user home directories and project results can be stored on the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other storage solutions is shown in Section 3.1.5. The specifications of the Isilon F800 are listed in Table 4.

Table 4: Specification of Isilon F800

Storage
External storage
Bandwidth
IOPS
Chassis Capacity (4RU)
Cluster Capacity
Network

Before deep learning model training, if a user wants to move very large data from outside the cluster described in Section 2 to Isilon, the user can connect the server which stores the data to the FDR-40GigE gateway in Figure 2, so that the data can be moved onto Isilon without having to route it through the head node.

To monitor and analyze the performance and file system of Isilon storage, the tool InsightIQ can be used. InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the InsightIQ web-based application. The user can customize these reports to provide information about storage cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5, InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.

Network

The solution comprises three network fabrics. The head node and all compute nodes are connected with a 1 Gigabit Ethernet fabric. The Ethernet switch recommended for this is the Dell Networking S3048-ON, which has 48 ports. This connection is primarily used by Bright Cluster Manager for deployment, maintenance and monitoring of the solution.

The second fabric connects the head node and all compute nodes through 100Gb/s EDR InfiniBand. The EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand. This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system memory, then that data is sent to the system memory of another node over the network, and finally the data is copied from the system memory of the second node to the receiving GPU memory. With GPUDirect, however, the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node, without going through the system memory on either node. Therefore GPUDirect decreases the GPU-GPU communication latency significantly.
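The two data paths described above can be laid out side by side. The sketch below is purely a descriptive model of the transfer steps named in the text; it does not call any GPUDirect or InfiniBand API, and the function name and step labels are illustrative.

```python
from typing import List


def transfer_steps(gpudirect: bool) -> List[str]:
    """Return the copies needed to move data between GPUs on two different nodes."""
    if gpudirect:
        # With GPUDirect RDMA the data goes straight from GPU memory to GPU memory.
        return ["GPU memory (node A) -> GPU memory (node B) over InfiniBand"]
    # Without GPUDirect the data is staged through system memory on both nodes.
    return [
        "GPU memory (node A) -> system memory (node A)",
        "system memory (node A) -> system memory (node B) over the network",
        "system memory (node B) -> GPU memory (node B)",
    ]


for enabled in (False, True):
    steps = transfer_steps(enabled)
    print(f"GPUDirect={enabled}: {len(steps)} transfer step(s)")
    for step in steps:
        print("  ", step)
```

Removing the two staging copies is what reduces latency; the model says nothing about how large that reduction is on real hardware.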


The third switch in the solution is called a gateway switch in Figure 2 and connects the Isilon F800 to the head

nal interfaces are 40 Gigabit Ethernet. Hence, a switch which can serve as the gateway between the 40GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.

Software

The software portion of the solution is provided by Dell EMC and Bright Computing. The software includes several pieces.

The first piece is Bright Cluster Manager, which is used to easily deploy and manage the clustered infrastructure and provides all cluster software including the operating system, GPU drivers and libraries, InfiniBand drivers and libraries, MPI middleware, the Slurm scheduler, etc.

The second piece is Bright Machine Learning (ML), which includes any deep learning library dependencies to the base operating system, deep learning frameworks including Caffe/Caffe2, PyTorch, Torch7, Theano, TensorFlow, Horovod, Keras, DIGITS, CNTK and MXNet, and deep learning libraries including cuDNN, NCCL, and the CUDA toolkit.

The third piece is the Data Scientist Portal, which was developed by Dell EMC. The portal was created to abstract the complexity of the deep learning ecosystem by providing a single pane of glass which gives users an interface to get started with their models. The portal includes a spawner for JupyterHub and integrates with:

Resource managers and schedulers (Slurm)
LDAP for user management
Deep Learning framework environments (TensorFlow, Keras, MXNet, PyTorch, etc.) module environment, Python 2, Python 3 and R kernel support
TensorBoard
Terminal CLI environments

It also provides templates to get started with various DL environments and adds support for Singularity containers. For more details about how to use the Data Scientist Portal, refer to Section 5.


Deep Learning Training and Inference Performance and Analysis

In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed testbed description is provided in the following section.

Deep Learning Training

The well-known ILSVRC 2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images in 140GB. All images are grouped into 1000 categories or classes. The overall size of ILSVRC 2012 leads to non-trivial training times and thus makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. Resnet50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter in Deep Learning, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed TensorFlow framework, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured in images/sec, a throughput metric of how fast the system can complete training on the dataset.

The images/sec result was averaged across all iterations to take into account the deviations. The total number of iterations is equal to num_epochs*num_images/(batch_size*num_gpus), where num_epochs means the number of passes over all images of a dataset, num_images means the total number of images in the dataset, batch_size means the number of images that are processed in parallel by one GPU, and num_gpus means the total number of GPUs involved in the training.
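The iteration formula and the averaging of images/sec can be sketched as below. The dataset size and the TensorFlow batch size come from the text; the choice of a single 4-GPU node, the rounding up of the last partial batch, and the step times passed to the averaging function are assumptions made only for illustration and are not measured results.

```python
import math
from statistics import mean


def total_iterations(num_epochs: int, num_images: int, batch_size: int, num_gpus: int) -> float:
    """Iterations as defined in the text: epochs * images / (per-GPU batch * GPUs)."""
    return num_epochs * num_images / (batch_size * num_gpus)


def mean_images_per_sec(step_seconds, batch_size: int, num_gpus: int) -> float:
    """Average the per-iteration throughput over all iterations."""
    images_per_step = batch_size * num_gpus
    return mean(images_per_step / t for t in step_seconds)


# One epoch of ILSVRC 2012 with the TensorFlow batch size on one 4-GPU node.
iters = total_iterations(num_epochs=1, num_images=1_281_167, batch_size=128, num_gpus=4)
print(f"{iters:.1f} iterations, {math.ceil(iters)} if the last partial batch is run")

# Hypothetical step times in seconds, for illustration only (not measured data).
print(mean_images_per_sec([0.20, 0.25, 0.20], batch_size=128, num_gpus=4))
```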

Before running any benchmark, the cache on the head node and compute node(s) were cleared with the

The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent through epochs for MXNet and TensorFlow tests. Consistent throughput means that the performance variation was not significant across iterations; the tests measured less than 2% variation in performance.

However, two epochs were used for Caffe2 as it needs two epochs to stabilize the performance. This is because

(throughput or images/sec) is not stable (the performance variation between iterations is large) when the dataset is not fully loaded in memory.

For the MXNet framework, 16 CPU threads were used for dataset decoding; the reason is explained in Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads.

For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and s
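The thread calculation described for TensorFlow can be written out as a short sketch. The core and GPU counts are taken from the compute node configuration in Table 2 (two 20-core CPUs, four GPUs); the function name is illustrative and is not a TensorFlow setting.

```python
def decode_threads(physical_cores: int, num_gpus: int, reserved_per_gpu: int = 4) -> int:
    """CPU threads left for dataset decoding after reserving threads for each GPU."""
    threads = physical_cores - reserved_per_gpu * num_gpus
    if threads <= 0:
        raise ValueError("not enough physical cores for the requested GPU count")
    return threads


# PowerEdge C4140: 2 sockets x 20 physical cores, 4 GPUs.
print(decode_threads(physical_cores=2 * 20, num_gpus=4))  # 24
```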
