




NSCC AI System
NSCC Training, 17 December 2020
Expectations
The DGX-1 nodes are most suited to large, batch workloads,
e.g. training complex models with large datasets
We encourage users to do development and preliminary testing on local resources
Users are encouraged to use the optimized NVIDIA GPU Cloud Docker images
Utilisation
Access is through the PBS job scheduler
We encourage workloads which can scale up to utilise all 8 GPUs on a node or run across multiple nodes
Users can request fewer than 8 GPUs (see the sketch below)
Multiple jobs will then run on a node with GPU resource isolation (using cgroups)
You will only see the number of GPUs you request
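For example, a minimal sketch of requesting 2 of the 8 GPUs in a job script; the ncpus-to-ngpus ratio follows the rule described in the Partial Node Job Submission section later in this document:
#PBS -l select=1:ngpus=2:ncpus=10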
System Overview
Components: login nodes, DGX-1 nodes, InfiniBand network, PBS job scheduler, storage
Login addresses (NSCC networks):
aspire.nscc.sg (NSCC VPN): external outgoing access
astar.nscc.sg: external outgoing access
ntu.nscc.sg: no internet access
nus.nscc.sg: no internet access
Login node hostnames: nscc0[1-2], nscc0[3-4], ntu0[1-4], nus0[1-4]
On NUS and NTU login nodes, for external outgoing access: ssh nscc04-ib0
DGX nodes dgx410[1-6]: no direct incoming access, external outgoing access
Project ID
Project IDs provide access to computational resources and project storage.
AI project allocations are in GPU hours
Only AI project codes will run on the dgx queues
In the following material, wherever you see $PROJECT, replace it with the code for your project; for example, the stakeholder pilot project code was 41000001
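For example, submitting a job script and charging it to the stakeholder pilot project would look like the following sketch; substitute your own project code and job script name:
qsub -P 41000001 job.pbs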
Filesystems
There are multiple filesystems available on the NSCC systems:
/home     GPFS filesystem exported to the DGX nodes as an NFS filesystem
/scratch  high-performance Lustre filesystem
/raid     local SSD filesystem on each of the DGX nodes
I/O intensive workloads should use either the Lustre /scratch filesystem or the local SSD /raid filesystem
Path                     | Visible on login nodes | Visible on DGX host O/S | Visible in DGX containers | Description
/home/users/ORG/USER     | YES                    | YES                     | YES                       | Home directory: $HOME, 50 GB limit
/home/projects/$PROJECT  | YES                    | YES                     | YES                       | Project directory, larger storage limits
/scratch/users/ORG/USER  | YES                    | YES                     | YES                       | High-performance Lustre filesystem. Soft-linked to $HOME/scratch. No quota; will be purged when the filesystem is full.
/raid/users/ORG/USER     | NO                     | YES                     | YES                       | Local SSD filesystem on each DGX node. 7 TB filesystem only visible on that specific node. No quota; will be purged when the filesystem is full.
Filesystems
The /home filesystem (home and project directories) is mounted and visible on all login and DGX nodes and inside Docker containers. This filesystem should be used for storing job scripts, logs and archival of inactive datasets. Active datasets which are being used in calculations should be placed on either the Lustre /scratch filesystem or the local SSD /raid filesystems.
Intensive I/O workloads on large datasets should use the Lustre filesystem. The Lustre /scratch directory is now mounted directly on the DGX nodes and automatically mounted inside Docker containers (previously it was only visible on login nodes and mounted in Docker containers).
The local SSD /raid filesystem is fast but only visible on a specific DGX node. It can be used for temporary files during a run or for static long-term datasets.
Datasets with very large numbers of small files (e.g. 100,000 files which are approx. 1 kB in size) MUST use the local SSD (/raid) filesystem or the Lustre (/scratch) filesystem.
Network filesystems (/home & /scratch) are not suited to datasets which have very large numbers of small files because metadata operations on network filesystems are slow.
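One way to handle such a dataset is to keep it archived on /home and unpack it onto the node-local SSD at the start of the job. The following is a minimal sketch, assuming a hypothetical archive dataset.tar; ORG/USER is your organisation and username as in the table above:
# unpack a many-small-files dataset onto the node-local SSD
mkdir -p /raid/users/ORG/USER/dataset
tar -xf $HOME/dataset.tar -C /raid/users/ORG/USER/dataset
# ... run training against /raid/users/ORG/USER/dataset ...
# remove the temporary copy when the job finishes
rm -rf /raid/users/ORG/USER/dataset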
PBS Queue Configuration
User queues (dgx, …) route jobs to the execution queues: dgx-dev, dgx-03g-04h, dgx-03g-24h, dgx-48g-04h, dgx-48g-24h
Per-user run limits, per-queue run limits and node assignment are used to control utilisation
Shorter queues have higher priority
Half of a node is used for shared interactive testing & development
TypicalPBSNodeConfiguration
dgx-48g-*
dgx-03g-*
dgx-dev
dgx4101
dgx4102
dgx4103
dgx4104
dgx4105
()
()
dgx4106(4GPUS)
dgx4106(4GPUS)
Different queues can access different sets of nodes
Shorter queues have been given higher priority
Queue limits on the 48 hour queue are very strict so wait times in that queue are extremely long (throughput is much better in the 4 hour and 24 hour queues)
Configuration may be changed to match requirements based on the load in the queues
Interactive Use – Access
Shared access to half of a DGX node (4 GPUs) is available for testing of workflows before submission to the batch queues
To open an interactive session use the following qsub command from a login node:
user@nscc:~$ qsub -I -q dgx-dev -l walltime=8:00:00 -P $PROJECT
# $PROJECT = 41000001 or 22270170
Resources are shared between all users, so check activity before use
Usage of the dgx-dev queue is not charged against your project quota
Interactive Use – Docker
To run an interactive session in a Docker container, add the "-t" flag to the "nscc-docker run" command:
user@dgx:~$ nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
$ ls
README.md  docker-examples  nvidia-examples
$ tty
/dev/pts/0
The -t flag will cause the job to fail if used in a batch script; only use it for interactive sessions:
user@dgx:~$ echo tty | nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
the input device is not a TTY
Batch scheduler
Accessing the batch scheduler generally involves 3 commands:
Submitting a job:              qsub
Querying the status of a job:  qstat
Killing a job:                 qdel
qsub job.pbs      # submit a PBS job script to the scheduler
qstat             # query the status of your jobs
qdel 11111.wlm01  # terminate the job with id 11111.wlm01
See https://help.nscc.sg/user-guide/ for more information on how to use the PBS scheduler
Introductory workshops are held regularly, more information at https://www.nscc.sg/hpc-calendar/
Example PBS Job Script (Headers)
#!/bin/sh
## Lines which start with #PBS are directives for the scheduler
## Directives in job scripts are superseded by command line options passed to qsub
## The following line requests the resources for 1 DGX node
#PBS -l select=1:ncpus=40:ngpus=8
## Run for 1 hour, modify as required
#PBS -l walltime=1:00:00
## Submit to the correct queue for DGX access
#PBS -q dgx
## Specify project ID
## Replace $PROJECT with a project ID such as 41000001 or 22270170
#PBS -P $PROJECT
## Job name
#PBS -N mxnet
## Merge standard output and error from the PBS script
#PBS -j oe
Example PBS Script (Commands)
# Change to the directory where the job was submitted
cd "$PBS_O_WORKDIR" || exit $?
# Specify which Docker image to use for the container
image="nvcr.io/nvidia/tensorflow:latest"
# Pass the commands that you wish to run inside the container to the standard input of "nscc-docker run"
nscc-docker run $image < stdin > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID
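Putting the headers and commands together, a complete job script might look like the following minimal sketch; the project code, job name and the commands inside the here-document are placeholders to adapt to your own workload:
#!/bin/sh
#PBS -l select=1:ncpus=40:ngpus=8
#PBS -l walltime=1:00:00
#PBS -q dgx
#PBS -P 41000001
#PBS -N example-job
#PBS -j oe

cd "$PBS_O_WORKDIR" || exit $?
image="nvcr.io/nvidia/tensorflow:latest"
# run the commands between the EOF markers inside the container
nscc-docker run $image > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID <<EOF
nvidia-smi
python train.py
EOF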
Hands-on
/home/projects/ai/examples
Example PBS job scripts to demonstrate how to:
submit a job to run on a DGX-1 node
start a container
run a standard MXNet training job
install a python package inside a container
See https://help.nscc.sg/user-guide/ for more information on how to use the NSCC systems
Hands-on
Step 1: Log on to an NSCC machine
Step 2: Run the following commands and confirm that they work:
cp -a /home/projects/ai/examples .
# submit the first basic example
cd examples/1-basic-job && \
qsub submit.pbs
# run a training job
cd ../../examples/2-mxnet-training && \
qsub train.pbs
# install a python package inside a container
cd ../../examples/3-pip-install && \
qsub pip.pbs
Use qstat to check job status and, when the jobs have finished, examine the output files to confirm everything is working correctly
Partial Node Job Submission
Specify the required ngpus resource in the job script:
#PBS -l select=1:ngpus=N:ncpus=5N
where N is the number of GPUs required,
e.g. "-l select=1:ngpus=4:ncpus=20"
$ echo nvidia-smi | qsub -l select=1:ncpus=5:ngpus=1 -l walltime=0:05:00 -q fj5 -P 41000001
7590401.wlm01
$ grep Tesla STDIN.o7590401
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |

$ echo nvidia-smi | qsub -l select=1:ncpus=10:ngpus=2 -l walltime=0:05:00 -q fj5 -P 41000001
7590404.wlm01
$ grep Tesla STDIN.o7590404
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |

$ echo nvidia-smi | qsub -l select=1:ncpus=20:ngpus=4 -l walltime=0:05:00 -q fj5 -P 41000001
7590408.wlm01
$ grep Tesla STDIN.o7590408
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
|   2  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
|   3  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
Note that the interactive queue (dgx-dev) will still give shared access to a set of GPUs on the test & dev node.
Checking where a job is running
There are 4 available options to see which host a job is running on:
$ qstat -f JOBID
Job Id: 7008432.wlm01
<snip>
    comment = Job run at Wed May 30 at 13:25 on (dgx4106:ncpus=40:ngpus=8)
<snip>
$ qstat -wan JOBID
wlm01:
                                                            Req'd   Req'd      Elap
Job ID         Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time   S  Time
-------------- --------- ------ -------- ------- ---- ---- ------- ------ -- --------
7008432.wlm01  fsg3      fj5    STDIN    67452   1    40   --      01:00  R  00:05:09
   dgx4106/0*40
$ pbsnodes -Sj dgx410{1..6}
                                        mem          ncpus  nmics  ngpus
vnode    state     njobs  run  susp     f/t          f/t    f/t    f/t    jobs
-------  --------  -----  ---  ----     -----------  -----  -----  -----  -------
dgx4101  free      0      0    0        504gb/504gb  40/40  0/0    8/8    --
dgx4102  free      0      0    0        504gb/504gb  40/40  0/0    8/8    --
dgx4103  free      0      0    0        504gb/504gb  40/40  0/0    8/8    --
dgx4104  free      0      0    0        504gb/504gb  40/40  0/0    8/8    --
dgx4105  free      0      0    0        504gb/504gb  40/40  0/0    8/8    --
dgx4106  job-busy  1      1    0        504gb/504gb  0/40   0/0    0/8    7008432
$ gstat -dgx
# similar information to the above commands, but shows information on jobs from all users; the output is cached so it has a quicker response (the data may be up to 5 minutes old)
Attaching an ssh Session to a PBS Job
If you ssh to a node where you are running a job, the ssh session will be attached to the cgroup for your job.
If you have multiple jobs running on a node you can select which job to be attached to with the command "pbs-attach":
$ pbs-attach -l          # list available jobs
7590741.wlm01 7590751.wlm01
$ pbs-attach 7590751.wlm01
executing: cgclassify -g devices:/7590751.wlm01 43840
Available workflows
Docker containers (recommended)
Optimized DL frameworks from NVIDIA GPU Cloud (fully supported)
Singularity containers (best effort support)
https://sylabs.io/docs/
Applications installed by the user in their home directory (e.g. Anaconda) (best effort support)
Docker Images
The "nscc-docker images" command shows all images currently in the repository
Currently installed images include:
nvcr.io/nvidia/{pytorch,tensorflow,mxnet}:*
nvcr.io/nvidia/cuda:*
Older images will be removed if they have not been used recently; if you need a specific version it can be pulled on request
Contact help@nscc.sg or https://servicedesk.nscc.sg
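As a quick check of which versions are installed, the wrapper output can be filtered like any docker listing; a minimal illustration (the grep pattern is only an example):
nscc-docker images | grep tensorflow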
NVIDIA GPU Cloud
To see which optimised DL frameworks are available from NVIDIA, create an account on the NVIDIA GPU Cloud (NGC) website
Using Docker on the DGX-1
Direct access to the docker command or docker group is not possible for technical reasons
Utilities provide pre-defined templated Docker commands:
nscc-docker run image
    runs: nvidia-docker run -u $UID:$GID \
              -v /home:/home -v /scratch:/scratch -v /raid:/raid \
              --rm -i --shm-size=1g --ulimit memlock=-1 \
              --ulimit stack=67108864 image /bin/sh
nscc-docker images
    runs: docker images
nscc-docker ps
    runs: docker ps
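Because the wrapper starts /bin/sh reading from standard input, a quick way to run a single command inside a container is to pipe it in; a minimal illustration (any shell command works in place of nvidia-smi):
user@dgx:~$ echo nvidia-smi | nscc-docker run nvcr.io/nvidia/tensorflow:latest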
Docker wrapper
$ nscc-docker run -h
Usage: nscc-docker run [--net=host] [--ipc=host] [--pid=host] [-t] [-h] IMAGE
--net=host   adds docker option --net=host
--ipc=host   adds docker option --ipc=host
--pid=host   adds docker option --pid=host
-t           adds docker option -t
-h           display this help and exit
--help       display this help and exit
--usage      display this help and exit
The following options are added to the docker command by default:
-u UID:GID --group-add GROUP \
-v /home:/home -v /raid:/raid -v /scratch:/scratch \
--rm -i --ulimit memlock=-1 --ulimit stack=67108864
If --ipc=host is not specified then the following option is also added:
--shm-size=1g
Singularity
Singularity is an alternative container technology
Can be used as a normal user
Commonly used at HPC sites
Images are flat files (or directories) rather than layers
Latest NGC Docker images converted to Singularity images are available in:
/home/projects/ai/singularity
Example job script in:
/home/projects/ai/examples/singularity
https://www.sylabs.io/docs/
/docker-compatibility-singularity-hpc/
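As an illustration, running a script inside one of these images might look like the sketch below; the image file name and train.py are placeholders, so check /home/projects/ai/singularity and the example job script for the actual names:
# --nv makes the host GPUs and driver libraries visible inside the container
singularity exec --nv /home/projects/ai/singularity/tensorflow_latest.sif python train.py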
Multinode Training with Horovod
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch.
Can be used for:
multi-GPU parallelization in a single node
multi-node parallelization across multiple nodes
Uses NCCL and MPI
/uber/horovod
Example job script for multi-node Horovod using Singularity to run across multiple nodes:
/home/projects/ai/examples/horovod
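For orientation, an MPI-based Horovod launch across two DGX nodes typically looks something like the sketch below; the node names, image file and training script are placeholders, and the job script in /home/projects/ai/examples/horovod shows the supported invocation for this system:
# 16 processes in total, 8 per node (one per GPU)
mpirun -np 16 -H dgx4101:8,dgx4102:8 \
    singularity exec --nv /home/projects/ai/singularity/tensorflow_latest.sif \
    python train.py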
Custom Images (Method 1)
User creates a Docker image locally and sends the Dockerfile to the NSCC admin:
1. On a local resource, the user creates and tests a Dockerfile
2. The user sends the Dockerfile to the NSCC admin
3. The NSCC admin performs "docker build" and synchronizes the image on all DGX nodes
4. The user logs in to NSCC and performs "nscc-docker run"
Custom Images (Method 2)
User creates a Docker image locally and pushes the image to Docker Hub:
1. On a local resource, the user creates a Dockerfile and performs "docker build"
2. The user performs "docker push" to Docker Hub
3. The user requests NSCC to pull the image
4. The NSCC admin performs "docker pull" on all DGX nodes
5. The user performs "nscc-docker run"
Custom python packages
# "pip install" fails due to a permissions error
# "pip install --user" installs into ~/.local
#   This is not best practice as it is external to the container
#   It can also cause unexpected conflicts
# Use PYTHONUSERBASE to install packages inside the container
nscc-docker run nvcr.io/nvidia/tensorflow:latest <<EOF
mkdir /workspace/.local
export PYTHONUSERBASE=/workspace/.local
pip install --user scikit-learn
EOF
# Packages installed will be wiped out when the container stops
# For a permanent solution build a custom image
Custom python packages (virtualenv)
# Install into a virtualenv (not installed in the default image)
nscc-docker run nscc/local/tensorflow:latest <<EOF
virtualenv $HOME/mypython
. $HOME/mypython/bin/activate
pip install scikit-learn
EOF
# The virtualenv is in the home directory so it persists after the container stops
# Therefore the virtualenv can be reused
# Not best practice as it affects portability and replicability
nscc-docker run nscc/local/tensorflow:latest <<EOF
. $HOME/mypython/bin/activate
python script.py
EOF
ssh miscellany
# ProxyCommand can make a 2-hop ssh connection appear direct
# On the local machine do:
cat <<EOF >> ~/.ssh/config
host dgx410?
    ProxyCommand ssh aspire.nscc.sg nc %h %p
    user myusername
host aspire.nscc.sg
    user myusername
EOF