
Distributed Machine Learning with Python

Accelerating model training and serving with distributed systems

Guanhua Wang

BIRMINGHAM—MUMBAI

Distributed Machine Learning with Python

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editors: Roshan Kumar, Nathanya Diaz

Content Development Editors: Tazeen Shaikh, Shreya Moharir

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Alishon Mendonca

Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari

First published: May 2022

Production reference: 1040422

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80181-569-7

To my parents, Ying Han and Xin Wang

To my girlfriend, Jing Yuan

– Guanhua Wang

Contributors

About the author

Guanhua Wang is a final-year computer science Ph.D. student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has gained a lot of attention from both academia and industry. He was invited to give talks at top-tier universities (MIT, Stanford, CMU, and Princeton) and big tech companies (Facebook/Meta and Microsoft). He received his master's degree from HKUST and a bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run multiple half-marathons in the Bay Area of California.

About the reviewers

Jamshaid Sohail is passionate about data science, machine learning, computer vision, and natural language processing, and has more than 2 years of experience in the industry. He previously worked as a data scientist at FunnelBeam, a Silicon Valley-based start-up whose founders are from Stanford University. Currently, he is working as a data scientist at Systems Limited. He has completed over 66 online courses from different platforms. He authored the book Data Wrangling with Python 3.X for Packt Publishing and has reviewed multiple books and courses. He is also developing a comprehensive course on data science at Educative and is in the process of writing books for multiple publishers.

Hitesh Hinduja is an ardent AI enthusiast working as a senior manager in AI at Ola Electric, where he leads a team of 20+ people in the areas of ML, statistics, CV, NLP, and reinforcement learning. He has filed 14+ patents in India and the US and has numerous research publications to his name. Hitesh has been involved in research roles at India's top business schools: the Indian School of Business, Hyderabad, and the Indian Institute of Management, Ahmedabad. He is also actively involved in training and mentoring and has been invited to be a guest speaker by various corporations and associations across the globe.

Table of Contents

Preface

Section 1 – Data Parallelism

Chapter 1: Splitting Input Data
  Single-node training is too slow
    The mismatch between data loading bandwidth and model training bandwidth
    Single-node training time on popular datasets
    Accelerating the training process with data parallelism
  Data parallelism – the high-level bits
    Stochastic gradient descent
    Model synchronization
  Hyperparameter tuning
    Global batch size
    Learning rate adjustment
    Model synchronization schemes
  Summary

Chapter 2: Parameter Server and All-Reduce
  Technical requirements
  Parameter server architecture
    Communication bottleneck in the parameter server architecture
    Sharding the model among parameter servers
  Implementing the parameter server
    Defining model layers
    Defining the parameter server
    Defining the worker
    Passing data between the parameter server and worker
  Issues with the parameter server
    The parameter server architecture introduces high coding complexity for practitioners
  All-Reduce architecture
    Reduce
    All-Reduce
    Ring All-Reduce
  Collective communication
    Broadcast
    Gather
    All-Gather
  Summary

Chapter 3: Building a Data Parallel Training and Serving Pipeline
  Technical requirements
  The data parallel training pipeline in a nutshell
    Input pre-processing
    Input data partition
    Data loading
    Training
    Model synchronization
    Model update
  Single-machine multi-GPUs and multi-machine multi-GPUs
    Single-machine multi-GPU
    Multi-machine multi-GPU
  Checkpointing and fault tolerance
    Model checkpointing
    Load model checkpoints
  Model evaluation and hyperparameter tuning
  Model serving in data parallelism
  Summary

Chapter 4: Bottlenecks and Solutions
  Communication bottlenecks in data parallel training
    Analyzing the communication workloads
    Parameter server architecture
    The All-Reduce architecture
    The inefficiency of state-of-the-art communication schemes
  Leveraging idle links and host resources
    Tree All-Reduce
    Hybrid data transfer over PCIe and NVLink
  On-device memory bottlenecks
  Recomputation and quantization
    Recomputation
    Quantization
  Summary

Section 2 – Model Parallelism

Chapter 5: Splitting the Model
  Technical requirements
  Single-node training error – out of memory
    Fine-tuning BERT on a single GPU
    Trying to pack a giant model inside one state-of-the-art GPU
  ELMo, BERT, and GPT
    Basic concepts
    RNN
    ELMo
    BERT
    GPT
  Pre-training and fine-tuning
  State-of-the-art hardware
    P100, V100, and DGX-1
    NVLink
    A100 and DGX-2
    NVSwitch
  Summary

Chapter 6: Pipeline Input and Layer Split
  Vanilla model parallelism is inefficient
    Forward propagation
    Backward propagation
    GPU idle time between forward and backward propagation
  Pipeline input
  Pros and cons of pipeline parallelism
    Advantages of pipeline parallelism
    Disadvantages of pipeline parallelism
  Layer split
  Notes on intra-layer model parallelism
  Summary

Chapter 7: Implementing Model Parallel Training and Serving Workflows
  Technical requirements
  Wrapping up the whole model parallelism pipeline
    A model parallel training overview
    Implementing a model parallel training pipeline
    Specifying communication protocol among GPUs
    Model parallel serving
  Fine-tuning transformers
  Hyperparameter tuning in model parallelism
    Balancing the workload among GPUs
    Enabling/disabling pipeline parallelism
  NLP model serving
  Summary

Chapter 8: Achieving Higher Throughput and Lower Latency
  Technical requirements
  Freezing layers
    Freezing layers during forward propagation
    Reducing computation cost during forward propagation
    Freezing layers during backward propagation
  Exploring memory and storage resources
  Understanding model decomposition and distillation
    Model decomposition
    Model distillation
  Reducing bits in hardware
  Summary

Section 3 – Advanced Parallelism Paradigms

Chapter 9: A Hybrid of Data and Model Parallelism
  Technical requirements
  Case study of Megatron-LM
    Layer split for model parallelism
    Row-wise trial-and-error approach
    Column-wise trial-and-error approach
    Cross-machine for data parallelism
  Implementation of Megatron-LM
  Case study of Mesh-TensorFlow
  Implementation of Mesh-TensorFlow
  Pros and cons of Megatron-LM and Mesh-TensorFlow
  Summary

Chapter 10: Federated Learning and Edge Devices
  Technical requirements
  Sharing knowledge without sharing data
    Recapping the traditional data parallel model training paradigm
    No input sharing among workers
    Communicating gradients for collaborative learning
  Case study: TensorFlow Federated
  Running edge devices with TinyML
  Case study: TensorFlow Lite
  Summary

Chapter 11: Elastic Model Training and Serving
  Technical requirements
  Introducing adaptive model training
    Traditional data parallel training
    Adaptive model training in data parallelism
    Adaptive model training (AllReduce-based)
    Adaptive model training (parameter server-based)
    Traditional model-parallel model training paradigm
    Adaptive model training in model parallelism
  Implementing adaptive model training in the cloud
  Elasticity in model inference
  Serverless
  Summary

Chapter 12: Advanced Techniques for Further Speed-Ups
  Technical requirements
  Debugging and performance analytics
    General concepts in the profiling results
    Communication results analysis
    Computation results analysis
  Job migration and multiplexing
    Job migration
    Job multiplexing
  Model training in a heterogeneous environment
  Summary

Index

Other Books You May Enjoy

Preface

Reducing time costs in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.

You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.

By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.

Who this book is for

This book is for data scientists, machine learning engineers, and machine learning practitioners in both academia and industry. A fundamental understanding of machine learning concepts and working knowledge of Python programming is assumed. Prior experience implementing machine learning/deep learning models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.


What this book covers

Chapter 1, Splitting Input Data, shows how to distribute the machine learning training or serving workload on the input data dimension, which is called data parallelism.

Chapter 2, Parameter Server and All-Reduce, describes two widely adopted model synchronization schemes in the data parallel training process.

Chapter 3, Building a Data Parallel Training and Serving Pipeline, illustrates how to implement data parallel training and the serving workflow.

Chapter 4, Bottlenecks and Solutions, describes how to improve data parallelism performance with some advanced techniques, such as more efficient communication protocols and reducing the memory footprint.

Chapter 5, Splitting the Model, introduces the vanilla model parallel approach in general.

Chapter 6, Pipeline Input and Layer Split, shows how to improve system efficiency with pipeline parallelism.

Chapter 7, Implementing Model Parallel Training and Serving Workflows, discusses how to implement model parallel training and serving in detail.

Chapter 8, Achieving Higher Throughput and Lower Latency, covers advanced schemes to reduce computation and memory consumption in model parallelism.

Chapter 9, A Hybrid of Data and Model Parallelism, combines data and model parallelism together as an advanced in-parallel model training/serving scheme.

Chapter 10, Federated Learning and Edge Devices, talks about federated learning and how edge devices are involved in this process.

Chapter 11, Elastic Model Training and Serving, describes a more efficient scheme that can change the number of accelerators used on the fly.

Chapter 12, Advanced Techniques for Further Speed-Ups, summarizes several useful tools, such as a performance debugging tool, job multiplexing, and heterogeneous model training.


To get the most out of this book

You will need to install PyTorch/TensorFlow successfully on your system. For distributed workloads, we suggest that you have at least four GPUs at hand.

We assume you have Linux/Ubuntu as your operating system. We assume you use NVIDIA GPUs and have installed the proper NVIDIA driver as well. We also assume you have basic knowledge about machine learning in general and are familiar with popular deep learning models.
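
Before moving on, it can be worth confirming that the framework can actually see your GPUs and driver. The following is a minimal PyTorch sketch for that check (the analogous TensorFlow call is tf.config.list_physical_devices('GPU')); it is only an illustration, not part of the book's code bundle:

# Minimal environment check (assumes PyTorch is installed with CUDA support)
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; check your NVIDIA driver installation.")

gpu_count = torch.cuda.device_count()
print(f"Detected {gpu_count} GPU(s):")
for i in range(gpu_count):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

# Most of the distributed examples assume at least four devices.
if gpu_count < 4:
    print("Fewer than four GPUs found; some distributed examples may need to be scaled down.")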

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801815697_ColorImages.pdf


Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Replace YOUR_API_KEY_HERE with the subscription key of your Cognitive Services resource. Leave the quotation marks!"

A block of code is set as follows:

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Connect to API through subscription key and endpoint
subscription_key = "<your-subscription-key>"
endpoint = "https://<your-cognitive-service>.cognitiveservices.azure.com/"

# Authenticate
credential = AzureKeyCredential(subscription_key)
cog_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select Review + Create."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.


Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name.
