




Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA
Deep Learning with NVIDIA Architecture Guide
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula
Abstract
There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming. Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA is a complete solution designed to support all phases of Deep Learning. It incorporates the latest CPU, GPU, memory, network, storage, and software technologies, with impressive performance for both the training and inference phases. The architecture of this Deep Learning solution is presented in this document.
August 2018
Dell EMC Reference Architecture
Revisions
Date | Description
August 2018 | Initial release
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
© August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
Table of Contents
Revisions
Table of Contents
Executive summary
Solution Overview
Solution Architecture
Head Node Configuration
Shared Storage via NFS over InfiniBand
Compute Node Configuration
GPU
Processor recommendation for Head Node and Compute Nodes
Memory recommendation for Head Node and Compute Nodes
Isilon Storage
Network
Software
Deep Learning Training and Inference Performance and Analysis
Deep Learning Training
FP16 vs FP32
V100 vs P100
V100-SXM2 vs V100-PCIe
Scaling Performance with Multi-GPU
Storage Performance
Deep Learning Inference
NVIDIA DIGITS Tool and the Deep Learning Solution
Containers for Deep Learning
Singularity Containers
Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
The Data Scientist Portal
Creating and Running a Notebook
Tensorboard Integration
Slurm Scheduler
Conclusions and Future Work
Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA: An Architecture Guide | v1.0
Executive summary
Deep Learning techniques have enabled great success in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then to make corresponding predictions. The success is due to a combination of improved algorithms, access to large datasets and increased computational power. To be effective at enterprise scale, the computational intensity of Deep Learning neural network training requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for Deep Learning use-cases, can make the difference in the business outcomes of applying Deep Learning techniques. In addition to several options for processors, accelerators and storage technologies, there are multiple Deep Learning software frameworks and libraries that must be considered. These software components are under active development, updated frequently and cumbersome to manage. It is complicated to simply build and run Deep Learning applications successfully, leaving little time to focus on the actual business problem.
To resolve this complexity challenge, Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This document presents the architecture of this Deep Learning solution, including details on the design choice for each component. The performance aspects of this complete solution have also been characterized and are described here.
AUDIENCE
This document is intended for organizations interested in accelerating Deep Learning with advanced computing and data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.
Solution Overview
Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The solution includes fully integrated and optimized hardware, software, and services including deployment, integration and support, making it easier for organizations to start and grow their Deep Learning practice.
The high-level overview of Dell EMC Ready Solutions for AI - Deep Learning is shown in Figure 1.
Figure 1: Overview of Dell EMC Ready Solutions for AI - Deep Learning
Data Scientist Portal: This is a new portal for data scientists created for this solution. It enables data scientists, who should not need to be experts in cluster technologies, to use a simple web portal to take advantage of the underlying technology. The scientists can write, train and run inference for different Deep Learning models within Jupyter Notebook, which includes Python 2, Python 3, R and other kernels.
Bright Cluster Manager and Bright Machine Learning: Bright Cluster Manager is used for the monitoring, deployment, management, and maintenance of the cluster. Bright Machine Learning (ML) includes the deep learning frameworks, libraries, compilers and so on.
Deep Learning Frameworks and Libraries: This category includes TensorFlow, MXNet, Caffe2, CUDA, cuDNN, and NCCL. The latest versions of these frameworks and libraries are integrated into the solution.
Infrastructure: The infrastructure comprises a cluster with a master node, compute nodes, shared storage and networks. In this instance of the solution, the master node is a Dell EMC PowerEdge R740xd, each compute node is a PowerEdge C4140 with NVIDIA Tesla GPUs, the storage includes Network File System (NFS) and Isilon, and the networks include Ethernet and Mellanox InfiniBand.
Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis on this solution was conducted in the HPC and AI Innovation Lab and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads. That is followed by Section 4, which describes containerization techniques for Deep Learning. Section 5 has details on the Data Scientist Portal developed by Dell EMC. Conclusions and future directions complete the document in Section 6.
Solution Architecture
The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The roles of the master node, or head node, can include deploying the cluster of compute nodes, managing the compute nodes, handling user logins and access, providing a compilation environment, and submitting jobs to compute nodes. The compute nodes are the workhorses and execute the submitted jobs. Software from Bright Computing called Bright Cluster Manager is used to deploy and manage the whole cluster.
Figure 2 shows the high-level overview of the cluster, which includes one head node, n compute nodes, the local disks on the cluster head node exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch as it uses IPoIB to export the NFS share to the compute nodes. All compute nodes and the head node are also connected to a 1 Gigabit Ethernet management switch which is used by Bright Cluster Manager to administer the cluster. An Isilon storage solution is connected to the FDR-40GigE gateway switch so that it can be accessed by the head node and all compute nodes.
Figure 2: The overview of the cluster
Head Node Configuration
The Dell EMC PowerEdge R740xd is recommended for the role of the head node. This is a two-socket, 2U rack server that can support the memory capacities, I/O needs and network options required of the head node. The head node will perform the cluster administration, cluster management, NFS server, user login node and compilation node roles.
The suggested configuration of the PowerEdge R740xd is listed in Table 1. It includes 12 x 12 TB NL SAS local disks that are formatted as an XFS file system and exported via NFS to the compute nodes over IPoIB. RAID 50 is used instead of RAID 6/RAID 60 because of its faster rebuild time and capacity advantages. Details of each configuration choice are described in the following sections. For more information on this server model, please refer to the PowerEdge R740/740xd Technical Guide.
Table 1: PowerEdge R740xd configurations
Component | Details
Server Model | PowerEdge R740xd
Processor | 2 x Intel Xeon Gold 6148 CPU @ 2.40 GHz
Memory | 24 x 16 GB DDR4 2666 MT/s DIMMs - 384 GB
Disks | 12 x 12 TB NL SAS RAID 50 (Recommended 10+ drives)
I/O & Ports | Network daughter card with 2 x 10 GE + 2 x 1 GE
Network Adapter | 1 x InfiniBand EDR adapter
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller
Power Supplies | Titanium 1100 W, Platinum
Storage Controllers | PowerEdge RAID Controller (PERC) H730p
Shared Storage via NFS over InfiniBand
The default shared storage system for the cluster is provided over NFS. It is built using 12 x 12 TB NL SAS disks that are local to the head node, configured in RAID 50 with two parity disks. This provides a usable capacity of 120 TB (109 TiB). RAID 50 was chosen because it offers balanced performance and a shorter rebuild time than RAID 6 or RAID 60 (since RAID 50 has fewer parity disks than RAID 6 or RAID 60). This 120 TB volume is formatted as an XFS file system and exported to the compute nodes via NFS over IPoIB.
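The usable-capacity figure can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the capacity of two disks goes to parity and that TB means 10^12 bytes while TiB means 2^40 bytes.

```python
def raid50_usable_tb(num_disks: int, disk_tb: float, parity_disks: int) -> float:
    """Usable capacity in TB once the parity disks' share is subtracted."""
    if not 0 < parity_disks < num_disks:
        raise ValueError("parity_disks must be between 1 and num_disks - 1")
    return (num_disks - parity_disks) * disk_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


# Head node layout from the text: 12 disks of 12 TB each, two of them for parity.
usable_tb = raid50_usable_tb(num_disks=12, disk_tb=12, parity_disks=2)
print(f"usable: {usable_tb:.0f} TB ({tb_to_tib(usable_tb):.0f} TiB)")
# prints: usable: 120 TB (109 TiB)
```

The same helper shows why RAID 60 would yield less space on these disks: with four parity disks instead of two, the usable capacity would drop to 96 TB.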
In the default configuration, both home directories and shared application and library install locations are hosted on this NFS share. In addition, for solutions which require a larger-capacity shared storage solution, the Isilon F800 is an alternative option and is described in Section 2.5. A comparison between various storage subsystems is provided in Section 3.1.5, including this NL SAS NFS, the Isilon, and smaller test configurations using SSDs and NVMe devices.
Compute Node Configuration
Deep Learning methods would not have gained success without the computational power to drive the iterative training process. Therefore, a key component of Deep Learning solutions is highly capable nodes that can support compute-intensive workloads. State-of-the-art neural network models in Deep Learning have more than 100 layers, which requires the computation to scale across many compute nodes to deliver timely results. The Dell EMC PowerEdge C4140, an accelerator-optimized, high-density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, in either the V100-SXM2 or the V100-PCIe model. Figure 3 shows the CPU-GPU and GPU-GPU connection topology of a compute node.
The detailed configuration of each PowerEdge C4140 compute node is listed in Table 2.
Figure 3: The topology of a compute node
Table 2: PowerEdge C4140 Configurations
Component | Details
Server Model | PowerEdge C4140
Processor | 2 x Intel Xeon Gold 6148 CPU @ 2.40 GHz
Memory | 24 x 16 GB DDR4 2666 MT/s DIMMs - 384 GB
Local Disks | 120 GB SSD, 1.6 TB NVMe
I/O & Ports | Network daughter card with 2 x 10 GE + 2 x 1 GE
Network Adapter | 1 x InfiniBand EDR adapter
GPU | 4 x V100-SXM2 16 GB
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller
Power Supplies | 2000 W hot-plug Redundant Power Supply Unit (PSU)
GPU
The NVIDIA Tesla V100 is the latest data center GPU available to accelerate Deep Learning. Powered by
engineers to tackle challenges that were once difficult. With 640 Tensor Cores, Tesla V100 is the first GPU to break the 100 teraflops (TFLOPS) barrier of Deep Learning performance.
Table 3: V100-SXM2 vs V100-PCIe
Description | V100-PCIe | V100-SXM2
CUDA Cores | 5120 | 5120
GPU Max Clock Rate (MHz) | 1380 | 1530
Tensor Cores | 640 | 640
Memory Bandwidth (GB/s) | 900 | 900
NVLink Bandwidth (GB/s) (uni-direction) | N/A | 300
Deep Learning (Tensor OPS) | 112 | 120
TDP (Watts) | 250 | 300
The Tesla V100 product line includes two variations, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. In the V100-PCIe, all GPUs communicate with each other over PCIe buses. With the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink. In use-cases where multiple GPUs are required, the V100-SXM2 models provide the advantage of faster GPU-to-GPU communication over the NVLink interconnect when compared to PCIe. V100-SXM2 GPUs provide six NVLinks per GPU for bi-directional communication. The bandwidth of each NVLink is 25 GB/s in each direction and all four GPUs within a node can communicate at the same time; therefore the theoretical peak bandwidth is 6 * 25 * 4 = 600 GB/s bi-directionally. However, the theoretical peak bandwidth using PCIe is only 16 * 2 = 32 GB/s, as the GPUs can only communicate in order, which means the communication cannot be done in parallel. So in theory, data communication with NVLink could be up to 600 / 32 ≈ 18x faster than PCIe. The evaluation of this performance advantage in real models is discussed in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 instead of V100-PCIe GPUs.
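The bandwidth arithmetic in the preceding paragraph can be reproduced in a few lines. This is only a sketch of the document's own back-of-the-envelope estimate; the function and parameter names are hypothetical, and the figures are the theoretical peaks quoted in the text, not measurements.

```python
def nvlink_peak_gbs(links_per_gpu: int, gbs_per_link: float, gpus: int) -> float:
    """Aggregate NVLink peak: every link on every GPU active at the same time."""
    return links_per_gpu * gbs_per_link * gpus


def pcie_peak_gbs(gbs_per_direction: float) -> float:
    """PCIe peak for one transfer at a time, counting both directions."""
    return gbs_per_direction * 2


nvlink = nvlink_peak_gbs(links_per_gpu=6, gbs_per_link=25, gpus=4)
pcie = pcie_peak_gbs(gbs_per_direction=16)
print(f"NVLink: {nvlink} GB/s, PCIe: {pcie} GB/s, ratio: {nvlink / pcie:.2f}x")
# prints: NVLink: 600 GB/s, PCIe: 32 GB/s, ratio: 18.75x
```

The exact ratio is 18.75, which the text rounds down to 18x. It is an upper bound under the stated assumptions; real models are not expected to see this full speedup.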
Processor recommendation for Head Node and Compute Nodes
The processor chosen for the head node and compute nodes is the Intel® Xeon® Gold 6148 CPU. This is the latest Intel® Xeon® Scalable processor, with 20 physical cores which support 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for state-of-the-art convolutional neural networks, so the Gold 6148 CPU is a reasonable choice. Additionally, this CPU model is recommended for the compute nodes as well, making this a consistent choice across the cluster.
Memory recommendation for Head Node and Compute Nodes
The recommended memory for the head node is 24 x 16 GB 2666 MT/s DIMMs. Therefore the total size of memory is 384 GB. This is chosen based on the following facts:
Capacity: An ideal configuration must support system memory capacity that is larger than the total size of GPU memory. Each compute node has 4 GPUs and each GPU has 16 GB memory, so the system memory must be at least 16 GB x 4 = 64 GB. The head node memory also affects I/O performance. For NFS service, larger memory will reduce disk read operations since the NFS service needs to send out data from memory. 16 GB DIMMs demonstrate the best performance/dollar value.
DIMM configuration: Choices like 24 x 16 GB or 12 x 32 GB will provide the same capacity of 384 GB system memory, but according to our studies as shown in Figure 4, the combination of 24 x 16 GB DIMMs provides 11% better performance than using 12 x 32 GB. The results shown here were obtained on the Intel Xeon Platinum 8180 processor, but the same trends will apply across other models in the Intel Scalable Processor Family including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.
Serviceability: The head node and compute node memory configurations are designed to be similar to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be stocked for replacement, and in urgent cases, if a memory module in the head node needs to be replaced immediately, a DIMM from a compute node can be used temporarily to restore the head node until replacement modules arrive.
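The capacity reasoning in the list above amounts to two small sums and a comparison. The following sketch makes them explicit; the helper names are hypothetical, and the check only covers capacity, not the bandwidth difference between DIMM layouts.

```python
def system_memory_gb(num_dimms: int, dimm_gb: int) -> int:
    """Total system memory for a uniform DIMM population."""
    return num_dimms * dimm_gb


def total_gpu_memory_gb(num_gpus: int, gpu_gb: int) -> int:
    """Combined memory of all GPUs in one node."""
    return num_gpus * gpu_gb


def capacity_ok(system_gb: int, gpu_gb_total: int) -> bool:
    """System memory should be at least as large as the total GPU memory."""
    return system_gb >= gpu_gb_total


gpu_total = total_gpu_memory_gb(num_gpus=4, gpu_gb=16)   # 64 GB per compute node
for dimms, size in ((24, 16), (12, 32)):
    total = system_memory_gb(dimms, size)
    print(f"{dimms} x {size} GB = {total} GB, sufficient: {capacity_ok(total, gpu_total)}")
```

Both layouts reach 384 GB and pass the capacity check, so the preference for 24 x 16 GB rests on the measured performance difference rather than on capacity.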
Figure 4: Relative memory bandwidth for different system capacities
Isilon Storage
Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics. As performance and capacity demands increase, both can be scaled simply and non-disruptively, allowing applications and users to continue working.
The Dell EMC Isilon OneFS operating system powers all Dell EMC Isilon scale-out NAS storage solutions and has the following features.
A high degree of scalability, with grow-as-you-go flexibility
High efficiency to reduce costs
Multi-protocol support such as SMB, NFS, HTTP and HDFS to maximize operational flexibility
Enterprise data protection and resiliency
Robust security options
The recommended Isilon storage is Isilon F800 all-flash scale-out NAS storage. Dell EMC Isilon F800 all-flash scale-out NAS storage is uniquely suited for modern Deep Learning applications, delivering the flexibility to deal with any data type, scalability for data sets ranging in the PBs, and concurrency to support the massive
-out architecture eliminates the I/O bottleneck between
can scale-out up to 68 PB with up to 540 GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training, provide more accurate insights with deeper data sets, and deliver a
The Isilon storage can be used if the local NFS storage capacity is insufficient for the environment. If the Isilon is used in conjunction with the local NFS storage, user home directories and project results can be stored on the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.
Table 4: Specification of Isilon F800
Storage
External storage
Bandwidth
IOPS
Chassis Capacity (4RU)
Cluster Capacity
Network
Before doing deep learning model training, if a user wants to move very large data from outside the cluster described in Section 2 to Isilon, the user can connect the server which stores the data to the FDR-40GigE gateway in Figure 2, so that the data can be moved onto Isilon without having to route it through the head node.
To monitor and analyze the performance and file system of Isilon storage, the tool InsightIQ can be used. InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the InsightIQ web-based application. The user can customize these reports to provide information about storage cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5, InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.
Network
The solution comprises three network fabrics. The head node and all compute nodes are connected with a 1 Gigabit Ethernet fabric. The Ethernet switch recommended for this is the Dell Networking S3048-ON, which has 48 ports. This connection is primarily used by Bright Cluster Manager for deployment, maintenance and monitoring of the solution.
The second fabric connects the head node and all compute nodes through 100 Gb/s EDR InfiniBand. The EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand. This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system memory, then that data is sent to the system memory of another node over the network, and finally the data is copied from the system memory of the second node to the receiving GPU memory. With GPUDirect, however, the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node, without going through the system memory in either node. Therefore GPUDirect decreases the GPU-GPU communication latency significantly.
The third switch in the solution is called a gateway switch in Figure 2 and connects the Isilon F800 to the head
nal interfaces are 40 Gigabit Ethernet. Hence, a switch which can serve as the gateway between the 40 GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.
Software
The software portion of the solution is provided by Dell EMC and Bright Computing. The software includes several pieces.
The first piece is Bright Cluster Manager, which is used to easily deploy and manage the clustered infrastructure and provides all cluster software including the operating system, GPU drivers and libraries, InfiniBand drivers and libraries, MPI middleware, the Slurm scheduler, etc.
The second piece is Bright Machine Learning (ML), which includes any deep learning library dependencies to the base operating system, deep learning frameworks including Caffe/Caffe2, PyTorch, Torch7, Theano, TensorFlow, Horovod, Keras, DIGITS, CNTK and MXNet, and deep learning libraries including cuDNN, NCCL, and the CUDA toolkit.
The third piece is the Data Scientist Portal, which was developed by Dell EMC. The portal was created to abstract the complexity of the deep learning ecosystem by providing a single pane of glass which gives users an interface to get started with their models. The portal includes a spawner for JupyterHub and integrates with
Resource managers and schedulers (Slurm)
LDAP for user management
Deep Learning framework environments (TensorFlow, Keras, MXNet, PyTorch, etc.), module environment, Python 2, Python 3 and R kernel support
Tensorboard
Terminal CLI environments
It also provides templates to get started with for various DL environments and adds support for Singularity containers. For more details about how to use the Data Scientist Portal, refer to Section 5.
Deep Learning Training and Inference Performance and Analysis
In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The experiment test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed test bed description is provided in the following section.
Deep Learning Training
The well-known ILSVRC 2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images in 140 GB. All images are grouped into 1000 categories or classes. The overall size of ILSVRC 2012 leads to non-trivial training times and thus makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. Resnet50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter in Deep Learning, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed TensorFlow framework, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured using a metric of images/sec, which is a measure of throughput: how fast the system can complete training on the dataset.
The images/sec result was averaged across all iterations to take into account the deviations. The total number of iterations is equal to num_epochs * num_images / (batch_size * num_gpus), where num_epochs means the number of passes over all images of a dataset, num_images means the total number of images in the dataset, batch_size means the number of images that are processed in parallel by one GPU, and num_gpus means the total number of GPUs involved in the training.
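As a worked instance of the iteration formula, the sketch below plugs in the ILSVRC 2012 training set size and the per-GPU batch sizes quoted above. The function name is hypothetical, and the choice of four GPUs (one compute node) and a single epoch is an assumption made only for illustration.

```python
def total_iterations(num_epochs: int, num_images: int,
                     batch_size: int, num_gpus: int) -> float:
    """Iterations = num_epochs * num_images / (batch_size * num_gpus)."""
    if batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch_size and num_gpus must be positive")
    return num_epochs * num_images / (batch_size * num_gpus)


ILSVRC2012_TRAIN_IMAGES = 1_281_167  # training images, as stated in the text

# Per-GPU batch sizes from the text; 4 GPUs and 1 epoch are illustrative assumptions.
for framework, batch_size in (("TensorFlow", 128), ("MXNet / Caffe2", 64)):
    iters = total_iterations(1, ILSVRC2012_TRAIN_IMAGES, batch_size, num_gpus=4)
    print(f"{framework}: about {iters:.0f} iterations per epoch on 4 GPUs")
```

The formula yields a fractional value because the dataset size is not a multiple of the global batch size; a real training loop would round this to a whole number of steps. Doubling the number of GPUs halves the iteration count for the same per-GPU batch size.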
Before running any benchmark, the cache on the head node and compute node(s) was cleared with the
The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent through epochs for MXNet and TensorFlow tests. Consistent throughput means that the performance variation was not significant across iterations; the tests measured less than 2% variation in performance.
However, two epochs were used for Caffe2 as it needs two epochs to stabilize the performance. This is because
(throughput or images/sec) is not stable (the performance variation between iterations is large) when the dataset is not fully loaded in memory.
For the MXNet framework, 16 CPU threads were used for dataset decoding and the reason was explained in the Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads.
For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and s