NSCC AI System
NSCC Training, 17 December 2020
Expectations
The DGX-1 nodes are most suited to large, batch workloads
e.g. training complex models with large datasets
We encourage users to do development and preliminary testing on local resources
Users are encouraged to use the optimized NVIDIA GPU Cloud Docker images
Utilisation
Access is through the PBS job scheduler
We encourage workloads which can scale up to utilise all 8 GPUs on a node or run across multiple nodes
Users can request fewer than 8 GPUs
Multiple jobs will then run on a node with GPU resource isolation (using cgroups)
You will only see the number of GPUs you request
System Overview
[Diagram: login nodes connect via the InfiniBand network to the DGX-1 nodes, the PBS job scheduler and storage]
Login nodes:
aspire.nscc.sg (NSCC VPN): external outgoing access
astar.nscc.sg: external outgoing access
ntu.nscc.sg (ntu0[1-4]): no internet access
nus.nscc.sg (nus0[1-4]): no internet access
NSCC networks: nscc0[1-2], nscc0[3-4]
On NUS and NTU login nodes, for external outgoing access: ssh nscc04-ib0
DGX-1 nodes:
dgx410[1-6]: no direct incoming access, external outgoing access
Project ID
Project IDs provide access to computational resources and project storage.
AI project allocations are in GPU hours
Only AI project codes can run on the dgx queues
In the following material, where you see $PROJECT replace it with the code for your project; for example, the stakeholder pilot project code was 41000001
Filesystems
There are multiple filesystems available on the NSCC systems:
/home     GPFS filesystem exported to the DGX nodes as an NFS filesystem
/scratch  high-performance Lustre filesystem
/raid     local SSD filesystem on each of the DGX nodes
I/O-intensive workloads should use either the Lustre /scratch filesystem or the local SSD /raid filesystem
Path                      Login nodes  DGX host O/S  DGX containers  Description
/home/users/ORG/USER      YES          YES           YES             Home directory ($HOME), 50 GB limit
/home/projects/$PROJECT   YES          YES           YES             Project directory, larger storage limits
/scratch/users/ORG/USER   YES          YES           YES             High-performance Lustre filesystem, soft-linked to $HOME/scratch. No quota; will be purged when the filesystem is full.
/raid/users/ORG/USER      NO           YES           YES             Local SSD filesystem on each DGX node. 7 TB filesystem, only visible on that specific node. No quota; will be purged when the filesystem is full.
Filesystems
The /home filesystem (home and project directories) is mounted and visible on all login and DGX nodes and inside Docker containers. This filesystem should be used for storing job scripts and logs and for archival of inactive datasets. Active datasets which are being used in calculations should be placed on either the Lustre /scratch filesystem or the local SSD /raid filesystems.
Intensive I/O workloads on large datasets should use the Lustre filesystem. The Lustre /scratch directory is now mounted directly on the DGX nodes and automatically mounted inside Docker containers (previously it was only visible on login nodes and mounted in Docker containers).
The local SSD /raid filesystem is fast but only visible on a specific DGX node. It can be used for temporary files during a run or for static long-term datasets.
Datasets with very large numbers of small files (e.g. 100,000 files which are approx. 1 kB in size) MUST use the local SSD (/raid) filesystem or the Lustre (/scratch) filesystem.
Network filesystems (/home & /scratch) are not suited to datasets which have very large numbers of small files because metadata operations on network filesystems are slow.
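To illustrate how /raid can be used, below is a minimal job-script sketch (not an official recipe) that stages a dataset onto the node-local SSD before training and copies results back afterwards; ORG, dataset.tar and results/ are placeholders.
# Sketch: stage data onto the node-local SSD, run the workload, copy results back.
# Replace ORG with your organisation and $PROJECT with your project code;
# dataset.tar and results/ are placeholders for illustration.
STAGE_DIR=/raid/users/ORG/$USER/$PBS_JOBID
mkdir -p "$STAGE_DIR"
tar -xf /home/projects/$PROJECT/dataset.tar -C "$STAGE_DIR"
# ... run the training commands against the data in $STAGE_DIR ...
cp -r "$STAGE_DIR/results" /home/projects/$PROJECT/
rm -rf "$STAGE_DIR"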
PBS Queue Configuration
User queues: dgx-dev, dgx-03g-04h, dgx-03g-24h, dgx-48g-04h, dgx-48g-24h
These route jobs to the dgx execution queues (…)
Per-user run limits, per-queue run limits and node assignment are used to control utilisation
Shorter queues have higher priority
Half of a node is set aside for shared interactive testing & development (dgx-dev)
Typical PBS Node Configuration
[Diagram: mapping of queues (dgx-48g-*, dgx-03g-*, dgx-dev) to nodes dgx4101-dgx4106; dgx4106 is split into two halves of 4 GPUs, one of which serves dgx-dev]
Different queues can access different sets of nodes
Shorter queues have been given higher priority
Queue limits on the 48 hour queue are very strict so wait times in that queue are extremely long (throughput is much better in the 4 hour and 24 hour queues)
Configuration may be changed to match requirements based on the load in the queues
Interactive Use – Access
Shared access to half of a DGX node (4 GPUs) is available for testing of workflows before submission to the batch queues
To open an interactive session use the following qsub command from a login node:
user@nscc:~$ qsub -I -q dgx-dev -l walltime=8:00:00 -P $PROJECT
# $PROJECT = 41000001 or 22270170
Resources are shared between all users, check activity before use
Usage of the dgx-dev queue is not charged against your project quota
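For example, once the interactive session starts on the shared dgx-dev node you can check what is already running before starting your own work:
user@dgx:~$ nvidia-smi    # show current GPU utilisation and memory use on the shared node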
Interactive Use – Docker
To run an interactive session in a Docker container, add the "-t" flag to the "nscc-docker run" command:
user@dgx:~$ nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
$ ls
README.md  docker-examples  nvidia-examples
$ tty
/dev/pts/0
The -t flag will cause the job to fail if used in a batch script; only use it for interactive sessions:
user@dgx:~$ echo tty | nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
the input device is not a TTY
Batch scheduler
Accessing the batch scheduler generally involves 3 commands:
Submitting a job:             qsub
Querying the status of a job: qstat
Killing a job:                qdel
qsub job.pbs       # submit a PBS job script to the scheduler
qstat              # query the status of your jobs
qdel 11111.wlm01   # terminate the job with id 11111.wlm01
See https://help.nscc.sg/user-guide/ for more information on how to use the PBS scheduler
Introductory workshops are held regularly, more information at https://www.nscc.sg/hpc-calendar/
Example PBS Job Script (Headers)
#!/bin/sh
## Lines which start with #PBS are directives for the scheduler
## Directives in job scripts are superseded by command line options passed to qsub
## The following line requests the resources for 1 DGX node
#PBS -l select=1:ncpus=40:ngpus=8
## Run for 1 hour, modify as required
#PBS -l walltime=1:00:00
## Submit to the correct queue for DGX access
#PBS -q dgx
## Specify project ID
## Replace $PROJECT with a project ID such as 41000001 or 22270170
#PBS -P $PROJECT
## Job name
#PBS -N mxnet
## Merge standard output and error from the PBS script
#PBS -j oe
Example PBS Script (Commands)
# Change to the directory where the job was submitted
cd "$PBS_O_WORKDIR" || exit $?
# Specify which Docker image to use for the container
image="nvcr.io/nvidia/tensorflow:latest"
# Pass the commands that you wish to run inside the container to the standard input of "nscc-docker run"
nscc-docker run $image < stdin > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID
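Putting the header and command slides together, a complete minimal job script looks like the following sketch (here stdin is a text file in the submission directory containing the commands to run inside the container):
#!/bin/sh
#PBS -l select=1:ncpus=40:ngpus=8
#PBS -l walltime=1:00:00
#PBS -q dgx
#PBS -P $PROJECT
#PBS -N mxnet
#PBS -j oe
cd "$PBS_O_WORKDIR" || exit $?
image="nvcr.io/nvidia/tensorflow:latest"
# stdin contains the commands to execute inside the container
nscc-docker run $image < stdin > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID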
Hands-on
/home/projects/ai/examples
Example PBS job scripts to demonstrate how to:
submit a job to run on a DGX-1 node
start a container
run a standard MXNet training job
install a python package inside a container
See https://help.nscc.sg/user-guide/ for more information on how to use the NSCC systems
Hands-on
Step 1: Log on to an NSCC machine
Step 2: Run the following commands and confirm that they work:
cp -a /home/projects/ai/examples .
# submit the first basic example
cd examples/1-basic-job && \
qsub submit.pbs
# run a training job
cd ../../examples/2-mxnet-training && \
qsub train.pbs
# install a python package inside a container
cd ../../examples/3-pip-install && \
qsub pip.pbs
Use qstat to check job status; when the jobs have finished, examine the output files to confirm everything is working correctly
Partial Node Job Submission
Specify the required ngpus resource in the job script:
#PBS -l select=1:ngpus=N:ncpus=5N
where N is the number of GPUs required
e.g. "-l select=1:ngpus=4:ncpus=20"
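For example, the header of a job script requesting half a node (4 GPUs) would contain (a sketch; adjust walltime and project code to your requirements):
#PBS -l select=1:ngpus=4:ncpus=20
#PBS -l walltime=1:00:00
#PBS -q dgx
#PBS -P $PROJECT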
Examples showing that only the requested GPUs are visible inside the job:
$ echo nvidia-smi | qsub -l select=1:ncpus=5:ngpus=1 -l walltime=0:05:00 -q fj5 -P 41000001
7590401.wlm01
$ grep Tesla STDIN.o7590401
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
$ echo nvidia-smi | qsub -l select=1:ncpus=10:ngpus=2 -l walltime=0:05:00 -q fj5 -P 41000001
7590404.wlm01
$ grep Tesla STDIN.o7590404
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
$ echo nvidia-smi | qsub -l select=1:ncpus=20:ngpus=4 -l walltime=0:05:00 -q fj5 -P 41000001
7590408.wlm01
$ grep Tesla STDIN.o7590408
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
|   2  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
|   3  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
NOTE THAT THE INTERACTIVE QUEUE (dgx-dev) WILL STILL GIVE SHARED ACCESS TO A SET OF GPUS ON THE TEST & DEV NODE
Checking where a job is running
4 available options to see which host a job is running on:
$ qstat -f JOBID
Job Id: 7008432.wlm01
<snip>
    comment = Job run at Wed May 30 at 13:25 on (dgx4106:ncpus=40:ngpus=8)
<snip>
$ qstat -wan JOBID
wlm01:
                                                         Req'd  Req'd    Elap
Job ID          Username Queue Jobname SessID NDS TSK    Memory Time   S Time
--------------- -------- ----- ------- ------ --- ------ ------ ------ - --------
7008432.wlm01   fsg3     fj5   STDIN    67452   1     40     -- 01:00  R 00:05:09
   dgx4106/0*40
$ pbsnodes -Sj dgx410{1..6}
                                          mem          ncpus  nmics  ngpus
vnode    state     njobs  run  susp       f/t          f/t    f/t    f/t    jobs
-------  --------  -----  ---  ----  -----------  -----  -----  -----  -------
dgx4101  free          0    0     0  504gb/504gb  40/40    0/0    8/8  --
dgx4102  free          0    0     0  504gb/504gb  40/40    0/0    8/8  --
dgx4103  free          0    0     0  504gb/504gb  40/40    0/0    8/8  --
dgx4104  free          0    0     0  504gb/504gb  40/40    0/0    8/8  --
dgx4105  free          0    0     0  504gb/504gb  40/40    0/0    8/8  --
dgx4106  job-busy      1    1     0  504gb/504gb   0/40    0/0    0/8  7008432
$ gstat -dgx
# similar information to the above commands, but shows information on jobs from all users and is
# cached so has a quicker response (data may be up to 5 minutes old)
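Another quick check (a sketch; exec_host is the standard PBS job attribute recording the execution host):
$ qstat -f 7008432.wlm01 | grep exec_host
    exec_host = dgx4106/0*40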
Attaching ssh Session to PBS Job
If you ssh to a node where you are running a job, the ssh session will be attached to the cgroup for your job.
If you have multiple jobs running on a node, you can select which job to be attached to with the command "pbs-attach":
$ pbs-attach -l    # list available jobs
7590741.wlm01 7590751.wlm01
$ pbs-attach 7590751.wlm01
executing: cgclassify -g devices:/7590751.wlm01 43840
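For example, to monitor the GPUs of a running job from a login node (a sketch, using the job on dgx4106 from the earlier qstat example):
user@nscc:~$ ssh dgx4106      # only works while you have a job running on that node
user@dgx4106:~$ nvidia-smi    # the session is attached to the job's cgroup, so only that job's GPUs are shown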
Available workflows
Docker containers (recommended)
Optimized DL frameworks from NVIDIA GPU Cloud (fully supported)
Singularity containers (best effort support)
https://sylabs.io/docs/
Applications installed by the user in their home directory (e.g. Anaconda) (best effort support)
Docker Images
The "nscc-docker images" command shows all images currently in the repository
Currently installed images include:
nvcr.io/nvidia/{pytorch,tensorflow,mxnet}:*
nvcr.io/nvidia/cuda:*
Older images will be removed if they have not been used recently; if you need a specific version it can be pulled on request
Contact help@nscc.sg or https://servicedesk.nscc.sg
NVIDIA GPU Cloud
To see which optimised DL frameworks are available from NVIDIA, create an account on the NVIDIA GPU Cloud (NGC) website
Using Docker on the DGX-1
Direct access to the docker command or docker group is not possible for technical reasons
Utilities provide pre-defined templated Docker commands:
nscc-docker run image
  runs: nvidia-docker run -u $UID:$GID \
          -v /home:/home -v /scratch:/scratch -v /raid:/raid \
          --rm -i --shm-size=1g --ulimit memlock=-1 \
          --ulimit stack=67108864 image /bin/sh
nscc-docker images
  runs: docker images
nscc-docker ps
  runs: docker ps
Docker wrapper
$ nscc-docker run -h
Usage: nscc-docker run [--net=host] [--ipc=host] [--pid=host] [-t] [-h] IMAGE
  --net=host   adds docker option --net=host
  --ipc=host   adds docker option --ipc=host
  --pid=host   adds docker option --pid=host
  -t           adds docker option -t
  -h           display this help and exit
  --help       display this help and exit
  --usage      display this help and exit
The following options are added to the docker command by default:
  -u UID:GID --group-add GROUP \
  -v /home:/home -v /raid:/raid -v /scratch:/scratch \
  --rm -i --ulimit memlock=-1 --ulimit stack=67108864
If --ipc=host is not specified then the following option is also added:
  --shm-size=1g
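As an illustration (a sketch, not an official recipe): frameworks such as PyTorch can need more shared memory than the default --shm-size=1g when data-loader worker processes are used, which is one case where --ipc=host is useful; train.py is a placeholder:
# run the training command in the PyTorch image with the host IPC namespace,
# so the container is not limited to the default 1 GB /dev/shm
echo "python train.py" | nscc-docker run --ipc=host nvcr.io/nvidia/pytorch:latest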
Singularity
Singularity is an alternative container technology
Can be used as a normal user
Commonly used at HPC sites
Images are flat files (or directories) rather than layers
Latest NGC Docker images converted to Singularity images and available in:
/home/projects/ai/singularity
Example job script in:
/home/projects/ai/examples/singularity
https://www.sylabs.io/docs/
/docker-compatibility-singularity-hpc/
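A minimal sketch of using one of the converted images (the image filename and train.py are placeholders; see the example job script above for the supported workflow):
# run a script inside a converted NGC image with GPU support (--nv)
singularity exec --nv /home/projects/ai/singularity/tensorflow_latest.simg python train.py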
Multinode Training with Horovod
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch.
Can be used for:
multi-GPU parallelization in a single node
multi-node parallelization across multiple nodes
Uses NCCL and MPI
/uber/horovod
Example job script for multi-node Horovod using Singularity to run across multiple nodes:
/home/projects/ai/examples/horovod
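As a rough sketch only (the mpirun flags, process counts and file names below are assumptions; the example script above shows the exact launch mechanism used on this system), a two-node Horovod job requests both nodes from PBS and starts one process per GPU via MPI:
#PBS -l select=2:ncpus=40:ngpus=8
#PBS -l walltime=1:00:00
#PBS -q dgx
#PBS -P $PROJECT
cd "$PBS_O_WORKDIR" || exit $?
# 16 = 8 GPUs on each of 2 nodes; exact mpirun options depend on the MPI installation
mpirun -np 16 --hostfile $PBS_NODEFILE \
    singularity exec --nv /home/projects/ai/singularity/tensorflow_latest.simg \
    python train.py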
Custom Images (Method 1)
User creates a Docker image locally and sends the Dockerfile to the NSCC admin:
1. User creates and tests a Dockerfile on a local resource
2. User sends the Dockerfile to the NSCC admin
3. NSCC admin performs "docker build" and synchronizes the image on all DGX nodes
4. User logs in to NSCC and performs "nscc-docker run" on the DGX-1 nodes
Custom Images (Method 2)
User creates a Docker image locally and pushes the image to Docker Hub:
1. User creates a Dockerfile on a local resource
2. User performs "docker build"
3. User performs "docker push" to Docker Hub
4. User requests NSCC to pull the image
5. NSCC admin performs "docker pull" on all DGX nodes
6. User performs "nscc-docker run" on NSCC
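For example, on the local machine (the image name and tag are placeholders):
docker build -t mydockerhubuser/my-image:v1 .   # build from your Dockerfile
docker push mydockerhubuser/my-image:v1         # push to Docker Hub
# then request NSCC (help@nscc.sg) to pull mydockerhubuser/my-image:v1 onto the DGX nodes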
Custom python packages
# "pip install" fails due to a permissions error
# "pip install --user" installs into ~/.local
#   This is not best practice as it is external to the container
#   It can also cause unexpected conflicts
# Use PYTHONUSERBASE to install packages inside the container
nscc-docker run nvcr.io/nvidia/tensorflow:latest <<EOF
mkdir /workspace/.local
export PYTHONUSERBASE=/workspace/.local
pip install --user scikit-learn
EOF
# Packages installed will be wiped out when the container stops
# For a permanent solution build a custom image
Custom python packages (virtualenv)
# Install into a virtualenv (not installed in the default image)
nscc-docker run nscc/local/tensorflow:latest <<EOF
virtualenv $HOME/mypython
. $HOME/mypython/bin/activate
pip install scikit-learn
EOF
# The virtualenv is in the home directory so it persists after the container stops
# Therefore the virtualenv can be reused
# Not best practice as it affects portability and replicability
nscc-docker run nscc/local/tensorflow:latest <<EOF
. $HOME/mypython/bin/activate
python script.py
EOF
ssh miscellany
# ProxyCommand can make a 2-hop ssh connection appear direct
# On the local machine do:
cat <<EOF >> .ssh/config
host dgx410?
    ProxyCommand ssh aspire.nscc.sg nc %h %p
    user myusername
host aspire.nscc.sg
    user myusername
EOF
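With this in place, commands on the local machine go through aspire.nscc.sg transparently (the DGX hop only works while you have a job running on that node); results.tar.gz is a placeholder:
ssh dgx4101                      # shell on the node, proxied via aspire.nscc.sg
scp dgx4101:results.tar.gz .     # copy a file back through the same proxy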