




Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA
Deep Learning with NVIDIA Architecture Guide
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula
Abstract
There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming. Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA is a complete solution, designed to support all phases of Deep Learning, that incorporates the latest CPU, GPU, memory, network, storage, and software technologies with impressive performance for both training and inference phases. The architecture of this Deep Learning solution is presented in this document.
August 2018
Dell EMC Reference Architecture
Revisions
Date           Description
August 2018    Initial release
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
© August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
Table of Contents
Revisions
Table of Contents
Executive summary
Solution Overview
Solution Architecture
Head Node Configuration
Shared Storage via NFS over InfiniBand
Compute Node Configuration
GPU
Processor recommendation for Head Node and Compute Nodes
Memory recommendation for Head Node and Compute Nodes
Isilon Storage
Network
Software
Deep Learning Training and Inference Performance and Analysis
Deep Learning Training
FP16 vs FP32
V100 vs P100
V100-SXM2 vs V100-PCIe
Scaling Performance with Multi-GPU
Storage Performance
Deep Learning Inference
NVIDIA DIGITS Tool and the Deep Learning Solution
Containers for Deep Learning
Singularity Containers
Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
The Data Scientist Portal
Creating and Running a Notebook
TensorBoard Integration
Slurm Scheduler
Conclusions and Future Work
Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA Architecture Guide | v1.0
Executive summary
Deep Learning techniques have enabled great success in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then to make corresponding predictions. The success is due to a combination of improved algorithms, access to large datasets and increased computational power. To be effective at enterprise scale, the computational intensity of Deep Learning neural network training requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for Deep Learning use-cases, can make the difference in the business outcomes of applying Deep Learning techniques. In addition to several options for processors, accelerators and storage technologies, there are multiple Deep Learning software frameworks and libraries that must be considered. These software components are under active development, updated frequently and cumbersome to manage. It is complicated to simply build and run Deep Learning applications successfully, leaving little time to focus on the actual business problem.
To resolve this complexity challenge, Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This document presents the architecture of this Deep Learning solution, including details on the design choice for each component. The performance aspects of this complete solution have also been characterized and are described here.
AUDIENCE
This document is intended for organizations interested in accelerating Deep Learning with advanced computing and data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.
Solution Overview
Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The solution includes fully integrated and optimized hardware, software, and services, including deployment, integration and support, making it easier for organizations to start and grow their Deep Learning practice.
The high-level overview of Dell EMC Ready Solutions for AI - Deep Learning is shown in Figure 1.
Figure 1: Overview of Dell EMC Ready Solutions for AI - Deep Learning
Data Scientist Portal: This is a new portal for data scientists created for this solution. It enables data scientists, who should not need to be experts in cluster technologies, to use a simple web portal to take advantage of the underlying technology. The scientists can write, train and run inference for different Deep Learning models within Jupyter Notebook, which includes Python 2, Python 3, R and other kernels.
Bright Cluster Manager and Bright Machine Learning: Bright Cluster Manager is used for the monitoring, deployment, management, and maintenance of the cluster. Bright Machine Learning (ML) includes the deep learning frameworks, libraries, compilers and so on.
Deep Learning Frameworks and Libraries: This category includes TensorFlow, MXNet, Caffe2, CUDA, cuDNN, and NCCL. The latest versions of these frameworks and libraries are integrated into the solution.
Infrastructure: The infrastructure comprises a cluster with a master node, compute nodes, shared storage and networks. In this instance of the solution, the master node is a Dell EMC PowerEdge R740xd, each compute node is a PowerEdge C4140 with NVIDIA Tesla GPUs, the storage includes Network File System (NFS) and Isilon, and the networks include Ethernet and Mellanox InfiniBand.
Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis of this solution was conducted in the HPC and AI Innovation Lab, and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads. Section 4 then describes containerization techniques for Deep Learning. Section 5 has details on the Data Scientist Portal developed by Dell EMC. Conclusions and future directions complete the document in Section 6.
Solution Architecture
The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The roles of the master node, or head node, can include deploying the cluster of compute nodes, managing the compute nodes, handling user logins and access, providing a compilation environment, and submitting jobs to compute nodes. The compute nodes are the workhorses and execute the submitted jobs. Bright Cluster Manager, software from Bright Computing, is used to deploy and manage the whole cluster.
Figure 2 shows the high-level overview of the cluster, which includes one head node, n compute nodes, the local disks on the cluster head node exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch as it uses IPoIB to export the NFS share to the compute nodes. All compute nodes and the head node are also connected to a 1 Gigabit Ethernet management switch which is used by Bright Cluster Manager to administer the cluster. An Isilon storage solution is connected to the FDR-40GigE Gateway switch so that it can be accessed by the head node and all compute nodes.
Figure 2: The overview of the cluster
Head Node Configuration
The Dell EMC PowerEdge R740xd is recommended for the role of the head node. This is a two-socket, 2U rack server that can support the memory capacities, I/O needs and network options required of the head node. The head node will perform the cluster administration, cluster management, NFS server, user login node and compilation node roles.
The suggested configuration of the PowerEdge R740xd is listed in Table 1. It includes 12 x 12TB NL SAS local disks that are formatted as an XFS file system and exported via NFS to the compute nodes over IPoIB. RAID 50 is used instead of RAID 6/RAID 60 for its faster rebuild time and capacity advantages. Details of each configuration choice are described in the following sections. For more information on this server model, please refer to the PowerEdge R740/740xd Technical Guide.
Table 1: PowerEdge R740xd configurations
Component                  Details
Server Model               PowerEdge R740xd
Processor                  2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                     24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Disks                      12 x 12TB NL SAS RAID 50 (Recommended 10+ drives)
I/O & Ports                Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter            1 x InfiniBand EDR adapter
Out of Band Management     iDRAC9 Enterprise with Lifecycle Controller
Power Supplies             Titanium 1100W, Platinum
Storage Controllers        PowerEdge RAID Controller (PERC) H730p
Shared Storage via NFS over InfiniBand
The default shared storage system for the cluster is provided over NFS. It is built using 12 x 12TB NL SAS disks that are local to the head node, configured in RAID 50 with two parity disks. This provides a usable capacity of 120TB (109TiB). RAID 50 was chosen because it offers balanced performance and a shorter rebuild time than RAID 6 or RAID 60 (since RAID 50 has fewer parity disks than RAID 6 or RAID 60). This 120TB volume is formatted as an XFS file system and exported to the compute nodes via NFS over IPoIB.
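The usable-capacity figure quoted above can be checked with a little arithmetic. The following is a minimal sketch, not part of the solution's tooling: the function names are hypothetical, and it assumes that "two parity disks" means two RAID 5 spans of six disks each, with one disk's worth of parity per span.

```python
def raid50_usable_tb(num_disks, disk_tb, num_spans):
    """Usable capacity (decimal TB) of a RAID 50 set.

    RAID 50 stripes across several RAID 5 spans, and each span gives up
    one disk's worth of capacity to parity.
    """
    if num_spans < 2 or num_disks % num_spans != 0:
        raise ValueError("disks must divide evenly across at least two spans")
    if num_disks // num_spans < 3:
        raise ValueError("each RAID 5 span needs at least three disks")
    return (num_disks - num_spans) * disk_tb


def tb_to_tib(tb):
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


# Values from the text: 12 x 12TB disks with two disks' worth of parity,
# assumed here to mean two RAID 5 spans of six disks each.
usable_tb = raid50_usable_tb(num_disks=12, disk_tb=12, num_spans=2)
usable_tib = tb_to_tib(usable_tb)
print(f"{usable_tb} TB usable = {usable_tib:.0f} TiB")
```

Under these assumptions the sketch reproduces the 120TB (109TiB) figure. The same function with four spans shows how capacity would drop if more parity were reserved.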
In the default configuration, both home directories and shared application and library install locations are hosted on this NFS share. For solutions which require a larger-capacity shared storage solution, the Isilon F800 is an alternative option and is described in Section 2.5. A comparison between various storage subsystems is provided in Section 3.1.5, including this NL SAS NFS, the Isilon, and smaller test configurations using SSDs and NVMe devices.
Compute Node Configuration
Deep Learning methods would not have gained success without the computational power to drive the iterative training process. Therefore, a key component of Deep Learning solutions is highly capable nodes that can support compute-intensive workloads. State-of-the-art neural network models in Deep Learning have more than 100 layers, which requires the computation to scale across many compute nodes to produce timely results. The Dell EMC PowerEdge C4140, an accelerator-optimized, high-density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, in either the V100-SXM2 or the V100-PCIe model. Figure 3 shows the CPU-GPU and GPU-GPU connection topology of a compute node.
The detailed configuration of each PowerEdge C4140 compute node is listed in Table 2.
Figure 3: The topology of a compute node
Table 2: PowerEdge C4140 configurations
Component                  Details
Server Model               PowerEdge C4140
Processor                  2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                     24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Local Disks                120GB SSD, 1.6TB NVMe
I/O & Ports                Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter            1 x InfiniBand EDR adapter
GPU                        4 x V100-SXM2 16GB
Out of Band Management     iDRAC9 Enterprise with Lifecycle Controller
Power Supplies             2000W hot-plug Redundant Power Supply Unit (PSU)
GPU
The NVIDIA Tesla V100 is the latest data center GPU available to accelerate Deep Learning. Powered by
engineers to tackle challenges that were once difficult. With 640 Tensor Cores, Tesla V100 is the first GPU to break the 100 teraflops (TFLOPS) barrier of Deep Learning performance.
Table 3: V100-SXM2 vs V100-PCIe
Description                                V100-PCIe    V100-SXM2
CUDA Cores                                 5120         5120
GPU Max Clock Rate (MHz)                   1380         1530
Tensor Cores                               640          640
Memory Bandwidth (GB/s)                    900          900
NVLink Bandwidth (GB/s) (uni-direction)    N/A          300
Deep Learning (Tensor OPS)                 112          120
TDP (Watts)                                250          300
The Tesla V100 product line includes two variants, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. In the V100-PCIe, all GPUs communicate with each other over PCIe buses. With the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink. In use-cases where multiple GPUs are required, the V100-SXM2 models provide the advantage of faster GPU-to-GPU communication over the NVLink interconnect when compared to PCIe. V100-SXM2 GPUs provide six NVLinks per GPU for bi-directional communication. The bandwidth of each NVLink is 25GB/s in each direction and all four GPUs within a node can communicate at the same time, therefore the theoretical peak bandwidth is 6*25*4 = 600GB/s in bi-direction. However, the theoretical peak bandwidth using PCIe is only 16*2 = 32GB/s as the GPUs can only communicate in order, which means the communication cannot be done in parallel. So in theory the data communication with NVLink could be up to 600/32 = 18.75 times, or roughly 18x, faster than PCIe. The evaluation of this performance advantage in real models will be discussed in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 instead of V100-PCIe GPUs.
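The bandwidth comparison above is plain arithmetic and can be reproduced as follows. This is an illustrative sketch only: the constant and function names are hypothetical, and the numbers are the theoretical figures quoted in the text, not measurements.

```python
NVLINKS_PER_GPU = 6        # links per V100-SXM2, from the text
NVLINK_GBPS_PER_LINK = 25  # GB/s per link in one direction, from the text
GPUS_PER_NODE = 4          # GPUs in one PowerEdge C4140
PCIE_GBPS_ONE_WAY = 16     # GB/s in one direction, the figure used in the text


def nvlink_peak_gbps(links_per_gpu, gbps_per_link, num_gpus):
    """Aggregate peak when every GPU drives all of its links at once."""
    return links_per_gpu * gbps_per_link * num_gpus


def pcie_peak_gbps(gbps_one_way):
    """Bi-directional peak for one transfer at a time (no parallelism)."""
    return gbps_one_way * 2


nvlink = nvlink_peak_gbps(NVLINKS_PER_GPU, NVLINK_GBPS_PER_LINK, GPUS_PER_NODE)
pcie = pcie_peak_gbps(PCIE_GBPS_ONE_WAY)
ratio = nvlink / pcie
print(f"NVLink {nvlink} GB/s, PCIe {pcie} GB/s, ratio {ratio:.2f}x")
```

The ratio is a theoretical upper bound. Real models spend only part of each iteration on communication, so the measured advantage is expected to be smaller.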
Processor recommendation for Head Node and Compute Nodes
The processor chosen for the head node and compute nodes is the Intel® Xeon® Gold 6148 CPU. This is the latest Intel® Xeon® Scalable processor, with 20 physical cores which support 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for state-of-the-art convolutional neural networks, so the Gold 6148 CPU is a reasonable choice. Using the same CPU model for the head node and the compute nodes also makes this a consistent choice across the cluster.
Memory recommendation for Head Node and Compute Nodes
The recommended memory for the head node is 24 x 16GB 2666MT/s DIMMs, for a total memory size of 384GB. This was chosen based on the following considerations:
Capacity: An ideal configuration must support system memory capacity that is larger than the total size of GPU memory. Each compute node has 4 GPUs and each GPU has 16GB memory, so the system memory must be at least 16GB x 4 = 64GB. The head node memory also affects I/O performance. For NFS service, larger memory will reduce disk read operations since the NFS service needs to send out data from memory. 16GB DIMMs demonstrate the best performance/dollar value.
DIMM configuration: Choices like 24 x 16GB or 12 x 32GB will provide the same capacity of 384GB system memory, but according to our studies as shown in Figure 4, the combination of 24 x 16GB DIMMs provides 11% better performance than using 12 x 32GB. The results shown here were obtained on the Intel Xeon Platinum 8180 processor, but the same trends will apply across other models in the Intel Scalable Processor Family including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.
Serviceability: The head node and compute node memory configurations are designed to be similar to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be stocked for replacement, and in urgent cases, if a memory module in the head node needs to be replaced immediately, a DIMM from a compute node can be temporarily used to restore the head node until replacement modules arrive.
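The capacity rule in the list above (system memory larger than total GPU memory) can be written as a small check. This is a hedged sketch with hypothetical function names. It covers capacity only and does not model the memory bandwidth difference between DIMM layouts shown in Figure 4.

```python
def total_memory_gb(dimm_count, dimm_gb):
    """Total system memory for a given DIMM population."""
    return dimm_count * dimm_gb


def min_system_memory_gb(num_gpus, gpu_memory_gb):
    """Capacity floor from the text: at least the sum of all GPU memory."""
    return num_gpus * gpu_memory_gb


# Values from the text: 4 GPUs with 16GB each, and the two DIMM layouts
# that both reach 384GB.
required_gb = min_system_memory_gb(num_gpus=4, gpu_memory_gb=16)
layouts = {"24 x 16GB": (24, 16), "12 x 32GB": (12, 32)}

for label, (count, size_gb) in layouts.items():
    total_gb = total_memory_gb(count, size_gb)
    meets_floor = total_gb >= required_gb
    print(f"{label}: {total_gb}GB total, meets {required_gb}GB floor: {meets_floor}")
```

Both layouts clear the 64GB floor by a wide margin, which is why the choice between them rests on performance and serviceability rather than capacity.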
Figure 4: Relative memory bandwidth for different system capacities
Isilon Storage
Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics. As performance and capacity demands increase, both can be scaled simply and non-disruptively, allowing applications and users to continue working.
The Dell EMC Isilon OneFS operating system powers all Dell EMC Isilon scale-out NAS storage solutions and has the following features:
A high degree of scalability, with grow-as-you-go flexibility
High efficiency to reduce costs
Multi-protocol support such as SMB, NFS, HTTP and HDFS to maximize operational flexibility
Enterprise data protection and resiliency
Robust security options
The recommended Isilon storage is the Isilon F800 all-flash scale-out NAS storage. Dell EMC Isilon F800 all-flash scale-out NAS storage is uniquely suited for modern Deep Learning applications, delivering the flexibility to deal with any data type, scalability for datasets ranging in the PBs, and concurrency to support the massive
-out architecture eliminates the I/O bottleneck between
can scale-out up to 68PB with up to 540GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training, provide more accurate insights with deeper datasets, and deliver a
The Isilon storage can be used if the local NFS storage capacity is insufficient for the environment. If the Isilon is used in conjunction with the local NFS storage, user home directories and project results can be stored on the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other storage solutions is shown in Section 3.1.5. The specifications of the Isilon F800 are listed in Table 4.
Table 4: Specification of Isilon F800
Storage
External storage
Bandwidth
IOPS
Chassis Capacity (4RU)
Cluster Capacity
Network
Before doing deep learning model training, if a user wants to move very large data from outside the cluster described in Section 2 to Isilon, the user can connect the server which stores the data to the FDR-40GigE gateway in Figure 2, so that the data can be moved onto Isilon without having to route it through the head node.
To monitor and analyze the performance and file system of Isilon storage, the tool InsightIQ can be used. InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the InsightIQ web-based application. The user can customize these reports to provide information about storage cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5, InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.
Network
The solution comprises three network fabrics. The head node and all compute nodes are connected with a 1 Gigabit Ethernet fabric. The Ethernet switch recommended for this is the Dell Networking S3048-ON, which has 48 ports. This connection is primarily used by Bright Cluster Manager for deployment, maintenance and monitoring of the solution.
The second fabric connects the head node and all compute nodes through 100Gb/s EDR InfiniBand. The EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand. This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system memory, then that data is sent to the system memory of another node over the network, and finally the data is copied from the system memory of the second node to the receiving GPU memory. With GPUDirect, however, the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node, without going through the system memory of either node. Therefore GPUDirect decreases the GPU-GPU communication latency significantly.
The third switch in the solution is called a gateway switch in Figure 2 and connects the Isilon F800 to the head
nal interfaces are 40 Gigabit Ethernet. Hence, a switch which can serve as the gateway between the 40GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.
Software
The software portion of the solution is provided by Dell EMC and Bright Computing. The software includes several pieces.
The first piece is Bright Cluster Manager, which is used to easily deploy and manage the clustered infrastructure and provides all cluster software including the operating system, GPU drivers and libraries, InfiniBand drivers and libraries, MPI middleware, the Slurm scheduler, etc.
The second piece is Bright Machine Learning (ML), which includes any deep learning library dependencies to the base operating system, deep learning frameworks including Caffe/Caffe2, PyTorch, Torch7, Theano, TensorFlow, Horovod, Keras, DIGITS, CNTK and MXNet, and deep learning libraries including cuDNN, NCCL, and the CUDA toolkit.
The third piece is the Data Scientist Portal, which was developed by Dell EMC. The portal was created to abstract the complexity of the deep learning ecosystem by providing a single pane of glass which gives users an interface to get started with their models. The portal includes a spawner for JupyterHub and integrates with:
Resource managers and schedulers (Slurm)
LDAP for user management
Deep Learning framework environments (TensorFlow, Keras, MXNet, PyTorch, etc.), module environment, Python 2, Python 3 and R kernel support
TensorBoard
Terminal CLI environments
It also provides templates to get started with for various DL environments and adds support for Singularity containers. For more details about how to use the Data Scientist Portal, refer to Section 5.
Deep Learning Training and Inference Performance and Analysis
In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The experiment test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed test bed description is provided in the following section.
Deep Learning Training
The well-known ILSVRC 2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images in 140GB. All images are grouped into 1000 categories or classes. The overall size of ILSVRC 2012 leads to non-trivial training times and thus makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. Resnet50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter in Deep Learning, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed TensorFlow framework, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured using a metric of images/sec, which is a measure of throughput of how fast the system can complete training the dataset.
The images/sec result was averaged across all iterations to take into account the deviations. The total number of iterations is equal to num_epochs * num_images / (batch_size * num_gpus), where num_epochs is the number of passes over all images of a dataset, num_images is the total number of images in the dataset, batch_size is the number of images that are processed in parallel by one GPU, and num_gpus is the total number of GPUs involved in the training.
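The iteration formula above can be evaluated directly. The sketch below is illustrative: the function name is hypothetical, the choice of four GPUs is an example rather than a configuration reported here, and rounding a final partial batch up to a whole iteration is an assumption about framework behavior.

```python
import math


def total_iterations(num_epochs, num_images, batch_size, num_gpus):
    """Iteration count as defined in the text (may be fractional)."""
    return num_epochs * num_images / (batch_size * num_gpus)


# Values from the text: one epoch over the 1,281,167 ILSVRC 2012 training
# images with a per-GPU batch size of 64. Four GPUs (one compute node) is
# an example choice, not a configuration reported in the text.
iters = total_iterations(num_epochs=1, num_images=1_281_167, batch_size=64, num_gpus=4)

# Assumption: a final partial batch still counts as one iteration.
whole_iters = math.ceil(iters)
print(f"{iters:.2f} iterations, rounded up to {whole_iters}")
```

Doubling either the per-GPU batch size or the GPU count halves the iteration count for the same number of epochs, which is why images/sec rather than iterations/sec is used as the throughput metric.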
Before running any benchmark, the cache on the head node and compute node(s) was cleared with the
The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent through epochs for MXNet and TensorFlow tests. Consistent throughput means that the performance variation was not significant across iterations; the tests measured less than 2% variation in performance.
However, two epochs were used for Caffe2 as it needs two epochs to stabilize the performance. This is because
(throughput or images/sec) is not stable (the performance variation between iterations is large) when the dataset is not fully loaded in memory.
For the MXNet framework, 16 CPU threads were used for dataset decoding; the reason is explained in Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads.
For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and s