Distributed Machine Learning with Python
Accelerating model training and serving with distributed systems
Guanhua Wang
BIRMINGHAM—MUMBAI
Distributed Machine Learning with Python
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editors: Roshan Kumar, Nathanya Diaz
Content Development Editors: Tazeen Shaikh, Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Alishon Mendonca
Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari
First published: May 2022
Production reference: 1040422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-569-7
To my parents, Ying Han and Xin Wang
To my girlfriend, Jing Yuan
– Guanhua Wang
Contributors
About the author
Guanhua Wang is a final-year computer science Ph.D. student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has gained lots of attention from both academia and industry. He was invited to give talks at top-tier universities (MIT, Stanford, CMU, Princeton) and big tech companies (Facebook/Meta, Microsoft). He received his master's degree from HKUST and a bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run multiple half-marathons in the Bay Area of California.
About the reviewers
Jamshaid Sohail is passionate about data science, machine learning, computer vision, and natural language processing and has more than 2 years of experience in the industry. He previously worked as a data scientist at FunnelBeam, a Silicon Valley-based start-up whose founders are from Stanford University. Currently, he is working as a data scientist at Systems Limited. He has completed over 66 online courses from different platforms. He authored the book Data Wrangling with Python 3.X for Packt Publishing and has reviewed multiple books and courses. He is also developing a comprehensive course on data science at Educative and is in the process of writing books for multiple publishers.
Hitesh Hinduja is an ardent AI enthusiast working as a senior manager in AI at Ola Electric, where he leads a team of 20+ people in the areas of ML, statistics, CV, NLP, and reinforcement learning. He has filed 14+ patents in India and the US and has numerous research publications to his name. Hitesh has been involved in research roles at India's top business schools: the Indian School of Business, Hyderabad, and the Indian Institute of Management, Ahmedabad. He is also actively involved in training and mentoring and has been invited to be a guest speaker by various corporations and associations across the globe.
Table of Contents
Preface
Section 1 – Data Parallelism
1
Splitting Input Data
Single-node training is too slow
The mismatch between data loading bandwidth and model training bandwidth
Single-node training time on popular datasets
Accelerating the training process with data parallelism
Data parallelism – the high-level bits
Stochastic gradient descent
Model synchronization
Hyperparameter tuning
Global batch size
Learning rate adjustment
Model synchronization schemes
Summary
2
Parameter Server and All-Reduce
Technical requirements
Parameter server architecture
Communication bottleneck in the parameter server architecture
Sharding the model among parameter servers
Implementing the parameter server
Defining model layers
Defining the parameter server
Defining the worker
Passing data between the parameter server and worker
Issues with the parameter server
The parameter server architecture introduces a high coding complexity for practitioners
All-Reduce architecture
Reduce
All-Reduce
Ring All-Reduce
Collective communication
Broadcast
Gather
All-Gather
Summary
3
Building a Data Parallel Training and Serving Pipeline
Technical requirements
The data parallel training pipeline in a nutshell
Input pre-processing
Input data partition
Data loading
Training
Model synchronization
Model update
Single-machine multi-GPUs and multi-machine multi-GPUs
Single-machine multi-GPU
Multi-machine multi-GPU
Checkpointing and fault tolerance
Model checkpointing
Load model checkpoints
Model evaluation and hyperparameter tuning
Model serving in data parallelism
Summary
4
Bottlenecks and Solutions
Communication bottlenecks in data parallel training
Analyzing the communication workloads
Parameter server architecture
The All-Reduce architecture
The inefficiency of state-of-the-art communication schemes
Leveraging idle links and host resources
Tree All-Reduce
Hybrid data transfer over PCIe and NVLink
On-device memory bottlenecks
Recomputation and quantization
Recomputation
Quantization
Summary
Section 2 – Model Parallelism
5
Splitting the Model
Technical requirements
Single-node training error – out of memory
Fine-tuning BERT on a single GPU
Trying to pack a giant model inside one state-of-the-art GPU
ELMo, BERT, and GPT
Basic concepts
RNN
ELMo
BERT
GPT
Pre-training and fine-tuning
State-of-the-art hardware
P100, V100, and DGX-1
NVLink
A100 and DGX-2
NVSwitch
Summary
6
Pipeline Input and Layer Split
Vanilla model parallelism is inefficient
Forward propagation
Backward propagation
GPU idle time between forward and backward propagation
Pipeline input
Pros and cons of pipeline parallelism
Advantages of pipeline parallelism
Disadvantages of pipeline parallelism
Layer split
Notes on intra-layer model parallelism
Summary
7
Implementing Model Parallel Training and Serving Workflows
Technical requirements
Wrapping up the whole model parallelism pipeline
A model parallel training overview
Implementing a model parallel training pipeline
Specifying communication protocol among GPUs
Model parallel serving
Fine-tuning transformers
Hyperparameter tuning in model parallelism
Balancing the workload among GPUs
Enabling/disabling pipeline parallelism
NLP model serving
Summary
8
Achieving Higher Throughput and Lower Latency
Technical requirements
Freezing layers
Freezing layers during forward propagation
Reducing computation cost during forward propagation
Freezing layers during backward propagation
Exploring memory and storage resources
Understanding model decomposition and distillation
Model decomposition
Model distillation
Reducing bits in hardware
Summary
Section 3 – Advanced Parallelism Paradigms
9
A Hybrid of Data and Model Parallelism
Technical requirements
Case study of Megatron-LM
Layer split for model parallelism
Row-wise trial-and-error approach
Column-wise trial-and-error approach
Cross-machine for data parallelism
Implementation of Megatron-LM
Case study of Mesh-TensorFlow
Implementation of Mesh-TensorFlow
Pros and cons of Megatron-LM and Mesh-TensorFlow
Summary
10
Federated Learning and Edge Devices
Technical requirements
Sharing knowledge without sharing data
Recapping the traditional data parallel model training paradigm
No input sharing among workers
Communicating gradients for collaborative learning
Case study: TensorFlow Federated
Running edge devices with TinyML
Case study: TensorFlow Lite
Summary
11
Elastic Model Training and Serving
Technical requirements
Introducing adaptive model training
Traditional data parallel training
Adaptive model training in data parallelism
Adaptive model training (AllReduce-based)
Adaptive model training (parameter server-based)
Traditional model-parallel model training paradigm
Adaptive model training in model parallelism
Implementing adaptive model training in the cloud
Elasticity in model inference
Serverless
Summary
12
Advanced Techniques for Further Speed-Ups
Technical requirements
Debugging and performance analytics
General concepts in the profiling results
Communication results analysis
Computation results analysis
Job migration and multiplexing
Job migration
Job multiplexing
Model training in a heterogeneous environment
Summary
Index
Other Books You May Enjoy
Preface
Reducing time costs in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.
You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.
By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.
Who this book is for
This book is for data scientists, machine learning engineers, and machine learning practitioners in both academia and industry. A fundamental understanding of machine learning concepts and working knowledge of Python programming are assumed. Prior experience implementing machine learning/deep learning models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.
What this book covers
Chapter 1, Splitting Input Data, shows how to distribute the machine learning training or serving workload on the input data dimension, which is called data parallelism.
Chapter 2, Parameter Server and All-Reduce, describes two widely-adopted model synchronization schemes in the data parallel training process.
Chapter 3, Building a Data Parallel Training and Serving Pipeline, illustrates how to implement data parallel training and the serving workflow.
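As a flavor of the data parallel training workflow covered in Chapters 1 to 3, the following is a minimal sketch using PyTorch's DistributedDataParallel; the model, data, and hyperparameters below are placeholders for illustration, not code from the book:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each worker joins the same process group so gradients can be all-reduced
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(128, 10).cuda(rank)        # placeholder model
    ddp_model = DDP(model, device_ids=[rank])    # wraps the model for gradient synchronization
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(10):                          # placeholder loop over each worker's data shard
        inputs = torch.randn(32, 128).cuda(rank)
        labels = torch.randint(0, 10, (32,)).cuda(rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()                          # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)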
Chapter 4, Bottlenecks and Solutions, describes how to improve data parallelism performance with some advanced techniques, such as more efficient communication protocols and reducing the memory footprint.
Chapter 5, Splitting the Model, introduces the vanilla model parallel approach in general.
Chapter 6, Pipeline Input and Layer Split, shows how to improve system efficiency with pipeline parallelism.
Chapter 7, Implementing Model Parallel Training and Serving Workflows, discusses how to implement model parallel training and serving in detail.
Chapter 8, Achieving Higher Throughput and Lower Latency, covers advanced schemes to reduce computation and memory consumption in model parallelism.
Chapter 9, A Hybrid of Data and Model Parallelism, combines data and model parallelism together as an advanced in-parallel model training/serving scheme.
Chapter 10, Federated Learning and Edge Devices, talks about federated learning and how edge devices are involved in this process.
Chapter 11, Elastic Model Training and Serving, describes a more efficient scheme that can change the number of accelerators used on the fly.
Chapter 12, Advanced Techniques for Further Speed-Ups, summarizes several useful tools, such as a performance debugging tool, job multiplexing, and heterogeneous model training.
To get the most out of this book
You will need to install PyTorch/TensorFlow successfully on your system. For distributed workloads, we suggest you have at least four GPUs in hand.
We assume you have Linux/Ubuntu as your operating system. We assume you use NVIDIA GPUs and have installed the proper NVIDIA driver as well. We also assume you have basic knowledge of machine learning in general and are familiar with popular deep learning models.
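Before running the distributed examples, it can help to verify the GPU setup; the following is a minimal sketch using standard PyTorch calls (adapt it accordingly if you use TensorFlow):
import torch

# Quick sanity check of the local GPU environment before launching distributed workloads
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
if torch.cuda.device_count() < 4:
    print("Fewer than four GPUs detected; some multi-GPU examples may need adjusting.")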
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: /downloads/9781801815697_ColorImages.pdf
xviPreface
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Replace YOUR_API_KEY_HERE with the subscription key of your Cognitive Services resource. Leave the quotation marks!"
A block of code is set as follows:
# Azure SDK imports for the Text Analytics client
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Connect to API through subscription key and endpoint
subscription_key = "<your-subscription-key>"
endpoint = "https://<your-cognitive-service>.cognitiveservices.azure.com/"

# Authenticate
credential = AzureKeyCredential(subscription_key)
cog_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select Review + Create."
Tips or Important Notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@ and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit /support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name.