LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Xiaonan Nie¹  Qibin Liu¹  Fangcheng Fu¹  Shenhan Zhu¹  Xupeng Miao²

Xiaoyang Li³  Yang Zhang³  Shouda Liu³  Bin Cui¹

arXiv:2411.08446v1 [cs.DC] 13 Nov 2024

¹Peking University  ²Purdue University  ³ByteDance

¹{xiaonan.nie, 2101212782, ccchengff, shenhan.zhu, bin.cui}@.cn
²xupeng@  ³{lixiaoyang.x, zhangyang.elfin, liushouda}@

Abstract

Larger transformer models consistently perform better on various tasks but require greater costs to scale up the model size. To enlarge models efficiently, the mixture-of-experts (MoE) architecture is widely adopted: it consists of a gate network and a set of experts, and keeps the training cost constant by routing each input to a fixed number of experts rather than to all of them. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, so the input data requires additional all-to-all communications to reach the target experts and perform the corresponding computations. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the efficiency and scalability of training MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which utilizes cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our method, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving a speedup of 1.28×-2.2×.

1 Introduction

In recent years, large-scale pre-trained models have significantly advanced the performance of deep learning across various complex tasks, including computer vision [8, 20], natural language processing [3, 7, 28], and multi-modal learning [19]. Commonly referred to as foundation models, these pre-trained models are primarily built on Transformer architectures [34] and undergo extensive pre-training on large datasets, utilizing substantial GPU resources. OpenAI has validated the scaling law for large language models [15] and suggests that increasing the model's parameter size, the volume of training data, and the duration of training can significantly enhance the model's performance. However, this approach results in a considerable rise in training costs, making the development of foundation models extremely expensive.

Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, and Bin Cui are with the School of Computer Science and Key Lab of High Confidence Software Technologies (MOE), Peking University. Bin Cui is also with the Institute of Computational Social Science, Peking University (Qingdao).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).


To reduce the high computational costs, the sparse mixture-of-experts (MoE) architecture is often adopted, which comprises a sparse gate network and a series of expert networks. This architecture routes input data to only a subset of the experts, resulting in sparse activation of the experts and thereby reducing the model's computational FLOPs (floating-point operations) as well as training costs.

Prominent models such as Google's Switch-Transformer [9] and ST-MoE [41], Meta's Hash Layer [31], and Mistral AI's Mixtral models [14] have successfully implemented this design, demonstrating improvements in both performance and efficiency with MoE models.

Meanwhile, effectively scaling the training of MoE models across hundreds or even thousands of GPUs remains a significant challenge. Researchers from Google have proposed the expert parallelism approach [17], which replicates the gating network on each GPU and distributes different experts across multiple GPUs for parallel processing. Specifically, each input token is initially processed by the gating network to select the appropriate expert, after which it is routed to the designated experts via peer-to-peer (P2P) network communication. Once the designated experts complete their computation, the token is returned to the original GPU for further processing through an additional P2P communication. Since each GPU typically needs to exchange data with many other GPUs, these P2P transmissions result in an all-to-all communication pattern. Moreover, because the computation of the expert network relies on the outcomes of these communications, the communications cannot be effectively overlapped with ongoing computations. This dependency creates a significant performance bottleneck in model training across most commonly used GPU clusters. We conducted experiments on three widely used MoE models, including RoBERTa-MoE, GPT-MoE, and Swin-MoE, on four A100 servers, each with a cross-machine bandwidth of 200 Gb/s. The results, as shown in Figure 3, reveal that the time cost of all-to-all communication constitutes an average of 45% and can reach up to 67% of the total model training time.

Existing methods to improve distributed MoE training on bandwidth-limited clusters tackle communication challenges in various ways. TA-MoE [4] reduces cross-machine communication by adjusting the gating network to favor experts on the same server, while Pre-gated MoE [13] reduces the dependency between communication and computation through a pre-gating mechanism that plans token routing in advance. However, both approaches require modifications to the gating mechanism and model structure, limiting their universal applicability. DeepSpeed-MoE [29] introduces PR-MoE, which selects one expert plus a shared expert, halving the all-to-all communication load. SCoMoE [40] organizes all-to-all communication by structuring data transfers along different dimensions and controlling data volumes across network levels, and also clusters tokens to improve routing. However, none of these works considers reducing the all-to-all communication volume in MoE training by compressing the forward activations. They can therefore be integrated with our method for further improvement.

In this paper, we present LSH-MoE, a communication-efficient MoE training framework that leverages locality-sensitive hashing to group similar tokens. Our key contributions are as follows:

• We begin by identifying key challenges in scaling MoE training in existing systems, noting that all-to-all communication constitutes an average of 45% of the total training time. Additionally, we investigate the potential of using token similarity to facilitate data compression to reduce communication costs.

• We propose an efficient LSH-based compression technique that employs cross-polytope hashing for rapid clustering. This approach transmits only the clustering centroids, significantly reducing communication costs. To further enhance accuracy, we implement a residual-based error compensation scheme to mitigate the negative effects of compression.

• Through extensive experiments with language models (RoBERTa-MoE, GPT-MoE, and T5-MoE) and vision models (Swin-MoE), across both pre-training and fine-tuning tasks, we demonstrate that our method maintains model quality while achieving a speedup of 1.28×-2.2× in end-to-end training time.

2 Background

2.1 Mixture-of-Experts Architecture

To enhance the training efficiency of Transformer models, Fedus et al. (2022) [9] introduced an innovative paradigm, the sparse mixture-of-experts (MoE) architecture, illustrated in Figure 1.

Figure 1: Mixture-of-Experts on a single GPU.

Figure 2: Training Mixture-of-Experts on multiple GPUs as expert parallelism (gating networks replicated per GPU, experts distributed across GPUs, with intra-node and inter-node communication).

This architecture effectively balances parameter capacity and training costs, and comprises two key components: an expert network (E) and a sparse gate network (G). It is evident that MoE models, with an equal number of active parameters per input, can significantly surpass the performance of dense models. This breakthrough has also catalyzed further research and their application across various industries, as highlighted by numerous subsequent studies [5, 14, 22, 23, 25, 30, 39].

The expert network E is composed of multiple specialized and separate networks, commonly referred to as experts, denoted as {E_i}_{i=1}^{N}, where N represents the number of experts. Additionally, E_i(x) denotes the output produced when the input x is processed by the i-th expert. Each expert is trained to excel in a specific sub-task, such as in multi-task learning, or to handle specific segments of data, as seen in language modeling and multi-modal learning, thereby increasing the overall model capacity. In foundation models, the MoE layer often serves as a substitute for the traditional feed-forward network (FFN) layer. Within each MoE layer, each FFN functions as an individual expert, significantly enhancing the model's capability to process diverse and complex data inputs.

The gating network G plays a crucial role in the sparse MoE architecture. For example, in a K-way gated MoE system, the gating network outputs a set of integers, as in Equation 1, to determine which experts should be activated. This decision is based on the characteristics of the input itself, allowing for a dynamic and efficient allocation of computational resources. By processing each input token with only a selected subset of the expert network, the MoE model achieves computational sparsity, effectively decoupling parameter capacity from training costs.

G : R^M → [1, N]^K    (1)

Through the integration of multiple specialized experts, as described by Equation 2, the sparse MoE model is capable of delivering more accurate and efficient predictions as f(x). This is achieved by leveraging the specialized knowledge embedded within each expert, combined with the strategic input allocation managed by the gating network.

f(x) = Σ_{i ∈ G(x)} E_i(x)    (2)

While MoE's primary advantage is decoupling parameter capacity from network cost, a key challenge lies in learning the gating parameters effectively, as the output's sparsity makes it non-differentiable. Consequently, much of the research in the MoE field has centered on developing methods for learning gating functions.

These methods fall into three main categories, as outlined in [6]: routing via learnable weighting [9, 24, 30], deterministic hash routing [31], and reinforcement learning-based routing [2, 32, 33]. These approaches primarily differ in the design of the gating network G rather than the expert network E, and therefore all encounter similar scaling challenges.
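To make the notation above concrete, the following minimal PyTorch sketch implements a top-K gated MoE layer in the spirit of Equations (1) and (2). The module names, sizes, and the softmax-weighted combination of expert outputs are illustrative choices, not a specific gating design from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=16, d_ffn=32, num_experts=4, k=2):
        super().__init__()
        self.k = k
        # Sparse gate network G: maps each token to scores over N experts.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert network E: N independent FFNs, as in a Transformer MoE layer.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # G(x): K expert ids per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                routed = topk_idx[:, slot] == expert_id  # tokens whose slot-th choice is this expert
                if routed.any():
                    out[routed] += topk_scores[routed, slot:slot + 1] * expert(x[routed])
        return out  # f(x): combination of the selected experts' outputs per token

tokens = torch.randn(8, 16)
print(TinyMoELayer()(tokens).shape)  # torch.Size([8, 16])
```

Only the K selected experts run for each token, which is what decouples parameter capacity from per-token compute.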

2.2 Challenges of Scaling MoE Model Training

While MoE models were initially developed to facilitate efficient scaling during training, deploying these large-scale models in practical GPU-intensive environments poses significant challenges in distributed computing. Specifically, the MoE layer harbors a considerably higher number of parameters and requires additional memory, yet it maintains almost the same computational demands as the dense layer. This leads to an unusually low compute density, defined as the ratio of the layer's FLOPs (floating-point operations) to its number of parameters. Therefore, traditional parallelism methods such as tensor parallelism and pipeline parallelism are insufficient for achieving effective parallelism in MoE training scenarios.

Figure 3: Proportion of all-to-all communication time relative to total training duration across different configurations: (a) 16 GPUs; (b) 32 GPUs (double the number of GPUs); (c) 16 GPUs with double the number of experts. Each panel compares RoBERTa-MoE, GPT-MoE, and Swin-MoE-L, splitting time into all-to-all communication and other operations.

To improve the efficiency and scalability of training large-scale MoE models, expert parallelism [17] has been introduced as a specialized model parallelism strategy. This approach distributes the experts within an MoE layer across multiple GPUs, while leveraging data parallelism for replicating the non-MoE layers, thus efficiently managing the training workload of MoE models. The workflow of distributed training for an MoE layer is depicted in Figure 2. Once the target expert for each token is determined, an all-to-all communication process is triggered to distribute tokens to their corresponding target experts for computation, denoted as E_i(x). Subsequently, another round of all-to-all communication is executed to gather the outputs from all experts, which produces the MoE layer's output (represented as f(x) in Equation 2). Subsequent operations involve executing the data-parallel non-MoE layers.

We first profiled the training process of three popular MoE models employing expert parallelism (detailed in Table 1) on a cluster comprised of four A100 machines, each equipped with an interconnect RDMA bandwidth of 200 Gb/s. The proportion of all-to-all communication time relative to the total training duration is illustrated in Figure 3(a). We then double the number of machines and the number of experts to increase the model scale; the results are shown in Figure 3(b) and 3(c), respectively.

Our findings reveal that all-to-all communication accounted for a substantial portion of the total time: approximately 30% in GPT-MoE (15B), 40% in RoBERTa-MoE, and 70% in Swin-MoE-L. This overhead remains nearly constant in larger models and at larger machine scales. These results highlight a significant bottleneck that hampers the scalability of the training process. Consequently, the duration of all-to-all communication substantially constrains training with expert parallelism, leading to reduced overall throughput and limiting the potential to scale up the number of experts effectively.

2.3 Locality-Sensitive Hashing Algorithms

Locality-Sensitive Hashing (LSH) is a probabilistic method primarily used for approximate nearest-neighbor search in high-dimensional spaces. It reduces the dimensionality of data by mapping similar data points to the same "buckets" with high probability using hash functions. This approach offers a substantial reduction in computational complexity, which is particularly beneficial for large-scale data applications. The key operations in LSH include:

Mapping Data into Buckets: The core of LSH is a family of hash functions that maximize the probability of nearby points in the original space staying close in the hashed space, while distant points are likely to end up in different buckets. Each hash function h is characterized by the property P[h(x) = h(y)] = 1 − d(x, y)/D, where d(x, y) is the distance between points x and y, and D denotes the diameter of the space. To map similar data into the same bucket, multiple hash functions from this family are selected based on the specific attributes of the data (e.g., Euclidean distance, cosine similarity) and the desired granularity of the buckets. Data points are then hashed by these functions, and each point is assigned to buckets according to its hash values, effectively categorizing similar items together for clustering.

Figure 5: Schematic of MoE training with Locality-Sensitive Hashing (LSH-MoE): (1) LSH-based clustering of tokens into centroids, (2) all-to-all dispatch of centroids to experts, (3) all-to-all return of E(centroids), and (4) residual-based error compensation to recover E(tokens).

Calculating Cluster Centroids: By grouping data points into buckets as determined by their hash values, data points are effectively clustered. Each bucket represents a cluster of data points, and the centroid of each cluster is then calculated as the mean of all points within that cluster, formulated as C_j = (1/n_j) Σ_{x_i ∈ bucket_j} x_i, where C_j is the centroid of the j-th bucket, n_j is the number of points in the j-th bucket, and x_i are the data points in the bucket.
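As a concrete illustration of these two operations, the sketch below buckets vectors with a simple sign-random-projection hash family and then averages each bucket. It is a generic LSH example (not the cross-polytope scheme used later in the paper), and all names and sizes are placeholders.

```python
import torch

def lsh_bucket_and_centroids(tokens, num_hashes=8, seed=0):
    # tokens: [n, d]. Hash each token against `num_hashes` random hyperplanes;
    # the resulting sign pattern forms the bucket id, so nearby tokens tend to collide.
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(tokens.shape[1], num_hashes, generator=g)
    bits = (tokens @ planes) > 0                                   # [n, num_hashes] sign bits
    powers = 2 ** torch.arange(num_hashes)
    bucket_ids = (bits.long() * powers).sum(dim=-1)                # integer bucket id per token

    # One centroid per non-empty bucket: C_j = (1/n_j) * sum of its tokens.
    uniq, inverse = torch.unique(bucket_ids, return_inverse=True)  # inverse: token -> bucket index
    centroids = torch.zeros(len(uniq), tokens.shape[1])
    counts = torch.zeros(len(uniq), 1)
    centroids.index_add_(0, inverse, tokens)
    counts.index_add_(0, inverse, torch.ones(len(tokens), 1))
    return centroids / counts, inverse

tokens = torch.randn(32, 16)
centroids, assignment = lsh_bucket_and_centroids(tokens)
print(centroids.shape, assignment.shape)   # e.g. torch.Size([m, 16]) torch.Size([32])
```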

3 Methodology

3.1 The Motivation of Token Similarity

To explore the potential optimization of all-to-all communications in MoE training, we conducted an in-depth analysis of the data involved in these all-to-all communications, identifying a high degree of similarity, termed token similarity. Specifically, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the input tokens of all-to-all communications and observed a distinct clustering phenomenon, as illustrated in Figure 4.

Our analysis suggests that the observed similarity among tokens may stem from two primary factors:

• Data-Related Influences: The similarity is partially due to the nature of real-world data, which often adheres to Zipf's Law [18]. This results in a skewed distribution, with certain data elements appearing more frequently than others.

• Model-Structure-Related Influences: The design of the Transformer architecture [34], especially its attention mechanisms, significantly impacts token similarity. In models like BERT [7], attention layers are designed to capture and integrate context information across tokens, thus homogenizing token representations and emphasizing their shared semantic relationships at the sentence level.

Figure 4: Principal Component Analysis (PCA) visualization of input tokens involved in all-to-all communication.

3.2 LSH-MoE

Motivated by the token similarity observed in Section 3.1, we introduce LSH-MoE, a novel MoE training framework that integrates locality-sensitive hashing (LSH) for rapid clustering of input tokens. Our method transmits only the clustering centroids, significantly reducing communication volumes. To counteract the negative effects of compression, we also implement a residual-based error compensation scheme.

As depicted in Figure 5, LSH-MoE initially employs (1) an LSH-based clustering method to compress tokens into centroids for subsequent processing, effectively reducing communication overhead. It then sequentially executes (2) all-to-all communication, expert computation, and another (3) all-to-all communication to produce the processed outputs E(centroids). Finally, it introduces (4) a residual-based error compensation method to approximate the expert-processed results E(tokens) by integrating E(centroids) with the residuals. We also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1.

The key components of our LSH-MoE framework include an efficient LSH-based clustering algorithm for rapid processing and a residual-based error compensation scheme to minimize quality degradation.

Efficient LSH-based Clustering Algorithm. Since the data to be compressed (the input data for all-to-all communication) is generated dynamically and in real time, pre-compressing it or overlapping compression time with other processing tasks is not feasible. Consequently, selecting an efficient online compression algorithm is crucial. Traditional clustering algorithms, such as K-Means, often encounter computational challenges and efficiency limitations. Locality-sensitive hashing (LSH) addresses these issues by hashing similar data points into the same buckets, enabling faster similarity detection in high-dimensional spaces.

Numerous LSH algorithms have been developed, each employing a unique hashing approach for mapping data onto buckets. We conducted experiments to evaluate several popular hashing algorithms, including cross-polytope hashing and spherical hashing. Based on our evaluations in Section 4.5, we selected cross-polytope hashing as the optimal algorithm for our application. Cross-polytope hashing stands out for its method of mapping input vectors to the nearest vertex of a cross-polytope. This process is facilitated by applying randomly rotated cross-polytopes, which effectively segment the surface of the unit sphere. The algorithm can be mathematically represented as follows:

LSH(x) = argmax_{i ∈ {±1, ±2, ..., ±d}} |Rx|_i    (3)

where R is a random rotation matrix, d is the dimensionality of the space, and |Rx|_i denotes the absolute value of the i-th component of the rotated vector Rx.

This formula encapsulates how the input vector x is transformed by the rotation matrix R and then mapped to the nearest vertex of the cross-polytope by selecting the dimension i that maximizes the absolute value of the components of Rx. This method effectively segments the high-dimensional space and enhances clustering efficiency by rapidly identifying similar data points.
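A minimal sketch of this hash is given below, assuming a single explicit random rotation obtained from a QR decomposition; production cross-polytope LSH implementations typically use fast pseudo-random rotations and several independent hash tables, so this is illustrative only.

```python
import torch

def cross_polytope_hash(x, seed=0):
    # x: [n, d] input vectors, assumed roughly normalized onto the unit sphere.
    n, d = x.shape
    g = torch.Generator().manual_seed(seed)
    # Orthogonal factor Q of a Gaussian matrix serves as the random rotation R.
    rot, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    rx = x @ rot.T                                # rotated vectors Rx
    idx = rx.abs().argmax(dim=-1)                 # dimension i maximizing |Rx|_i
    sign = torch.sign(torch.gather(rx, 1, idx.unsqueeze(1))).squeeze(1)
    return (sign * (idx + 1)).long()              # signed bucket id in {±1, ..., ±d}

x = torch.nn.functional.normalize(torch.randn(10, 16), dim=-1)
print(cross_polytope_hash(x))                     # e.g. tensor([ 3, -7, 12, ...])
```

Tokens that land on the same signed vertex share a bucket and are represented by that bucket's centroid before communication.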

Residual-based Error Compensation Scheme. In our LSH-MoE framework, we compress the intermediate activation values within the model network. Unlike gradient compression, this process does not tolerate errors well. Therefore, it is essential to minimize compression-induced errors to ensure minimal impact on model performance. To address this, we implement a residual-based error compensation strategy, outlined as follows (a minimal code sketch of both steps is given after the list):

1. We first capture the residual of each data point relative to its cluster centroid, defined by the equation:

∆Cluster_jk ← x_jk − cluster_j,  k = 1, 2, ..., N_j,    (4)

where x_jk denotes the k-th token assigned to the j-th cluster and N_j is the number of tokens in that cluster.

2. After the expert network computes outputs for the cluster centroids, the final step is to restore the processed result for each token by adding back the previously recorded residual:

Y_j ← {E(cluster_j) + ∆Cluster_jk | k = 1, 2, ..., N_j}.    (5)
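A minimal sketch of these two steps follows, with `expert_fn` standing in for the all-to-all transfers plus expert computation applied to the centroids; all names here are illustrative.

```python
import torch

def lsh_moe_with_compensation(tokens, centroids, assignment, expert_fn):
    # tokens: [n, d]; centroids: [m, d]; assignment[i] = cluster index of token i.
    residuals = tokens - centroids[assignment]     # step 1: Delta_jk = x_jk - cluster_j
    expert_out = expert_fn(centroids)              # only m centroids are communicated/processed
    return expert_out[assignment] + residuals      # step 2: E(cluster_j) + Delta_jk per token

# Toy usage: with an identity "expert", the compensation recovers the tokens exactly.
tokens = torch.randn(32, 16)
centroids, assignment = torch.randn(4, 16), torch.randint(0, 4, (32,))
out = lsh_moe_with_compensation(tokens, centroids, assignment, expert_fn=lambda c: c)
print(torch.allclose(out, tokens))                 # True
```

In general, the input-space residual is added back to the centroid's expert output as an approximation of the token's own expert output.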

This error compensation scheme effectively mitigates the potential accuracy loss caused by data compression in all-to-all communication, ensuring the fidelity and robustness of the LSH-MoE framework. The experimental results in Section 4 show that implementing this compensation mechanism enables the model trained with LSH-MoE to achieve accuracy comparable to that of a model trained without compression. This outcome highlights the effectiveness of our proposed error compensation strategy in preserving model performance despite the challenges posed by data compression in all-to-all communication.

Table 1: Models for evaluation, where "-" indicates that the values differ across layers.

Model         | #Layers | d_model | d_ffn | #Experts | #Params (MoE) | #Params (Total)
RoBERTa-MoE   | 12      | 768     | 3072  | 16       | 302M          | 394M
T5-MoE        | 16      | 1024    | 16384 | 16       | 8594M         | 9288M
GPT-MoE (15B) | 12      | 768     | 3072  | 512      | 14507M        | 14629M
GPT-MoE (52B) | 24      | 1024    | 4096  | 512      | 51539M        | 51740M
Swin-MoE-L    | 24      | -       | -     | 32       | -             | 946M

3.3 Scalability Analysis of LSH-MoE

To effectively demonstrate the scalability of our approach, particularly in terms of its applicability to both larger models and larger computational clusters, we conducted a theoretical analysis. This analysis primarily focuses on the computation overhead and the communication costs associated with Mixture-of-Experts (MoE), specifically the all-to-all communication overhead. We derived the ratio of communication time to computation time, highlighting how this ratio evolves as both the scale of the servers and the model size increase. This relationship is crucial for understanding scalability and can be formally expressed as follows:

where k represents the number of experts activated per token, FLOPs and B_inter denote the GPU's computation ability and the network performance, w is the number of GPU servers, and h is the hidden size of the model. Notably, the first term [...]. Additionally, scaling MoE models typically emphasizes increasing the number of layers and experts, while the growth in hidden size (h) tends to be gradual, as seen in models like Switch-Transformer [9]. Consequently, when both the model scale and the number of training servers grow, the proportion of all-to-all communication time remains nearly unchanged. This insight underpins the scalability of the LSH-MoE method, demonstrating its robustness in larger-scale settings and supporting its potential in future large-scale applications. For a detailed derivation, please refer to Appendix A.2.

4 Experiment

4.1 Implementation

Our LSH-MoE implementation comprises a data compression/restoration component and a communication component. We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Additionally, our method is framework-independent and can be easily applied to other MoE training frameworks such as Hetu-MoE [21, 26], DeepSpeed-MoE [29], and Tutel [12].

4.2 Benchmarks and Datasets

Our evaluations are conducted by scaling pre-trained models equipped with the MoE architecture across various application domains. This includes models like RoBERTa-MoE, T5-MoE, and GPT-MoE in natural language processing (NLP), as well as Swin-MoE in computer vision (CV). Among these models, RoBERTa-MoE and T5-MoE are evaluated on pre-training tasks, while GPT-MoE and Swin-MoE undergo fine-tuning evaluation based on their official open-sourced model checkpoints1,2. We also evaluated the zero-shot accuracy of the pre-trained T5-MoE. Model configurations are detailed in Table 1.

1/facebookr
