LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Xiaonan Nie¹  Qibin Liu¹  Fangcheng Fu¹  Shenhan Zhu¹  Xupeng Miao²

Xiaoyang Li³  Yang Zhang³  Shouda Liu³  Bin Cui¹

arXiv:2411.08446v1 [cs.DC] 13 Nov 2024

¹Peking University  ²Purdue University  ³ByteDance

¹{xiaonan.nie, 2101212782, ccchengff, shenhan.zhu, bin.cui}@.cn
²xupeng@  ³{lixiaoyang.x, zhangyang.elfin, liushouda}@

Abstract

Larger transformer models consistently perform better on various tasks but require greater costs to scale up the model size. To enlarge models efficiently, the mixture-of-experts (MoE) architecture is widely adopted: it consists of a gate network and a set of experts, and keeps the training cost constant by routing each input to a fixed number of experts rather than to all of them. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, so the input data requires additional all-to-all communications to reach the target experts and perform the corresponding computations. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the efficiency and scalability of training MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which utilizes cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our method, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving a speedup of 1.28×-2.2×.

1 Introduction

In recent years, large-scale pre-trained models have significantly advanced the performance of deep learning across various complex tasks, including computer vision [8, 20], natural language processing [3, 7, 28], and multi-modal learning [19]. Commonly referred to as foundation models, these pre-trained models are primarily built on Transformer architectures [34] and undergo extensive pre-training on large datasets, utilizing substantial GPU resources. OpenAI has validated the scaling law for large language models [15] and suggests that increasing the model's parameter size, the volume of training data, and the duration of training can significantly enhance the model's performance. However, this approach results in a considerable rise in training costs, making the development of foundation models extremely expensive.

Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, and Bin Cui are with the School of Computer Science and Key Lab of High Confidence Software Technologies (MOE), Peking University. Bin Cui is also with the Institute of Computational Social Science, Peking University (Qingdao).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).


To reduce the high computational costs, the sparse mixture-of-experts (MoE) architecture is often adopted, which comprises a sparse gate network and a series of expert networks. This architecture routes input data to only a subset of the experts, resulting in sparse activation of the experts and thereby reducing the model's computational FLOPs (floating-point operations) as well as training costs.

Prominent models such as Google's Switch-Transformer [9] and ST-MoE [41], Meta's Hash Layer [31], and Mistral AI's Mixtral models [14] have successfully implemented this design, demonstrating improvements in both performance and efficiency with MoE models.

Meanwhile, effectively scaling the training of MoE models across hundreds or even thousands of GPUs remains a significant challenge. Researchers from Google have proposed the expert parallelism approach [17], which replicates the gating network on each GPU and distributes different experts across multiple GPUs for parallel processing. Specifically, each input token is initially processed by the gating network to select the appropriate expert, after which it is routed to the designated experts via peer-to-peer (P2P) network communication. Once the designated experts complete their computation, the token is returned to the original GPU for further processing through an additional P2P communication. Since each GPU typically needs to exchange data with many other GPUs, these P2P transmissions result in an all-to-all communication pattern. Moreover, because the computation of the expert network relies on the outcomes of these communications, the communications cannot be effectively overlapped with ongoing computations. This dependency creates a significant performance bottleneck in model training across most commonly used GPU clusters. We conducted experiments on three widely used MoE models, including RoBERTa-MoE, GPT-MoE, and Swin-MoE, on four A100 servers, each with a cross-machine bandwidth of 200 Gb/s. The results, as shown in Figure 3, reveal that the time cost of all-to-all communication constitutes an average of 45% and can reach up to 67% of the total model training time.

Existing methods to improve distributed MoE training on bandwidth-limited clusters tackle communication challenges in various ways. TA-MoE [4] reduces cross-machine communication by adjusting the gating network to favor experts on the same server, while Pre-gated MoE [13] reduces the dependency between communication and computation through a pre-gating mechanism that plans token routing in advance. However, both approaches require modifications to the gating mechanism and model structure, limiting their universal applicability. DeepSpeed-MoE [29] introduces PR-MoE, which selects one expert plus a shared expert, halving the all-to-all communication load. SCoMoE [40] organizes all-to-all communication by structuring data transfers along different dimensions and controlling data volumes across network levels, and also clusters tokens to improve routing. However, none of these works considers reducing the all-to-all communication volume in MoE training by compressing the forward activations. They can therefore be integrated with our method for further improvement.

In this paper, we present LSH-MoE, a communication-efficient MoE training framework that leverages locality-sensitive hashing to group similar tokens. Our key contributions are as follows:

• We begin by identifying key challenges in scaling MoE training in existing systems, noting that all-to-all communication constitutes an average of 45% of the total training time. Additionally, we investigate the potential of using token similarity to facilitate data compression to reduce communication costs.

• We propose an efficient LSH-based compression technique that employs cross-polytope hashing for rapid clustering. This approach transmits only the clustering centroids, significantly reducing communication costs. To further enhance accuracy, we implement a residual-based error compensation scheme to mitigate the negative effects of compression.

• Through extensive experiments with language models (RoBERTa-MoE, GPT-MoE, and T5-MoE) and vision models (Swin-MoE), across both pre-training and fine-tuning tasks, we demonstrate that our method maintains model quality while achieving a speedup of 1.28×-2.2× in end-to-end training time.

2 Background

2.1 Mixture-of-Experts Architecture

To enhance the training efficiency of Transformer models, Fedus et al. (2022) [9] introduced an innovative paradigm, the sparse mixture-of-experts (MoE) architecture, illustrated in Figure 1.

Figure 1: Mixture-of-Experts on a single GPU.

Figure 2: Training Mixture-of-Experts on multiple GPUs as expert parallelism (gating networks replicated per GPU, experts distributed across GPUs, with intra-node and inter-node communication).

This architecture effectively balances parameter capacity and training costs, and comprises two key components: an expert network (E) and a sparse gate network (G). It is evident that MoE models, with an equal number of active parameters per input, can significantly surpass the performance of dense models. This breakthrough has also catalyzed further research and their application across various industries, as highlighted by numerous subsequent studies [5, 14, 22, 23, 25, 30, 39].

The expert network E is composed of multiple specialized and separate networks, commonly referred to as experts, denoted as {E_i}_{i=1}^{N}, where N represents the number of experts. Additionally, E_i(x) denotes the output produced when the input x is processed by the i-th expert. Each expert is trained to excel in a specific sub-task, such as in multi-task learning, or to handle specific segments of data, as seen in language modeling and multi-modal learning, thereby increasing the overall model capacity. In foundation models, the MoE layer often serves as a substitute for the traditional feed-forward network (FFN) layer. Within each MoE layer, each FFN functions as an individual expert, significantly enhancing the model's capability to process diverse and complex data inputs.

The gating network G plays a crucial role in the sparse MoE architecture. For example, in a K-way gated MoE system, the gating network outputs a set of integers, as in Equation 1, to determine which experts should be activated. This decision is based on the characteristics of the input itself, allowing for a dynamic and efficient allocation of computational resources. By processing each input token with only a selected subset of the expert network, the MoE model achieves computational sparsity, effectively decoupling parameter capacity from training costs.

G : R^M → [1, N]^K    (1)

Through the integration of multiple specialized experts, as described by Equation 2, the sparse MoE model is capable of delivering more accurate and efficient predictions as f(x). This is achieved by leveraging the specialized knowledge embedded within each expert, combined with the strategic input allocation managed by the gating network.

f(x) = Σ_{i ∈ G(x)} E_i(x)    (2)

While MoE's primary advantage is decoupling parameter capacity from network cost, a key challenge lies in learning the gating parameters effectively, as the output's sparsity makes it non-differentiable. Consequently, much of the research in the MoE field has centered on developing methods for learning gating functions.

These methods fall into three main categories, as outlined in [6]: routing via learnable weighting [9, 24, 30], deterministic hash routing [31], and reinforcement learning-based routing [2, 32, 33]. These approaches primarily differ in the design of the gating network G rather than the expert network E, and therefore all encounter similar scaling challenges.
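To make the notation above concrete, the following minimal PyTorch sketch implements a top-K gated MoE layer in the spirit of Equations (1) and (2). The module names, sizes, and the softmax-weighted combination of expert outputs are illustrative choices, not a specific gating design from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=16, d_ffn=32, num_experts=4, k=2):
        super().__init__()
        self.k = k
        # Sparse gate network G: maps each token to scores over N experts.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert network E: N independent FFNs, as in a Transformer MoE layer.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # G(x): K expert ids per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                routed = topk_idx[:, slot] == expert_id  # tokens whose slot-th choice is this expert
                if routed.any():
                    out[routed] += topk_scores[routed, slot:slot + 1] * expert(x[routed])
        return out  # f(x): combination of the selected experts' outputs per token

tokens = torch.randn(8, 16)
print(TinyMoELayer()(tokens).shape)  # torch.Size([8, 16])
```

Only the K selected experts run for each token, which is what decouples parameter capacity from per-token compute.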

2.2 Challenges of Scaling MoE Model Training

While MoE models were initially developed to facilitate efficient scaling during training, deploying these large-scale models in practical GPU-intensive environments poses significant challenges in distributed computing. Specifically, the MoE layer harbors a considerably higher number of parameters and requires additional memory, yet it maintains almost the same computational demands as the dense layer. This leads to an unusually low compute density, defined as the ratio of the layer's FLOPs (floating-point operations) to its number of parameters. Therefore, traditional parallelism methods such as tensor parallelism and pipeline parallelism are insufficient for achieving effective parallelism in MoE training scenarios.

Figure 3: Proportion of all-to-all communication time relative to total training duration across different configurations: (a) 16 GPUs; (b) 32 GPUs (double the number of GPUs); (c) 16 GPUs with double the number of experts. Each panel compares RoBERTa-MoE, GPT-MoE, and Swin-MoE-L, splitting time into all-to-all communication and other operations.

To improve the efficiency and scalability of training large-scale MoE models, expert parallelism [17] has been introduced as a specialized model parallelism strategy. This approach distributes the experts within an MoE layer across multiple GPUs, while leveraging data parallelism for replicating the non-MoE layers, thus efficiently managing the training workload of MoE models. The workflow of distributed training for an MoE layer is depicted in Figure 2. Once the target expert for each token is determined, an all-to-all communication process is triggered to distribute tokens to their corresponding target experts for computation, denoted as E_i(x). Subsequently, another round of all-to-all communication is executed to gather the outputs from all experts, which produces the MoE layer's output (represented as f(x) in Equation 2). Subsequent operations involve executing the data-parallel non-MoE layers.

We first profiled the training process of three popular MoE models employing expert parallelism (detailed in Table 1) on a cluster comprised of four A100 machines, each equipped with an interconnect RDMA bandwidth of 200 Gb/s. The proportion of all-to-all communication time relative to the total training duration is illustrated in Figure 3(a). We then double the number of machines and the number of experts to increase the model scale; the results are shown in Figure 3(b) and 3(c), respectively.

Our findings reveal that all-to-all communication accounted for a substantial portion of the total time: approximately 30% in GPT-MoE (15B), 40% in RoBERTa-MoE, and 70% in Swin-MoE-L. This overhead remains nearly constant in larger models and at larger machine scales. These results highlight a significant bottleneck that hampers the scalability of the training process. Consequently, the duration of all-to-all communication substantially constrains training with expert parallelism, leading to reduced overall throughput and limiting the potential to scale up the number of experts effectively.

2.3 Locality-Sensitive Hashing Algorithms

Locality-Sensitive Hashing (LSH) is a probabilistic method primarily used for approximate nearest-neighbor search in high-dimensional spaces. It reduces the dimensionality of data by mapping similar data points to the same "buckets" with high probability using hash functions. This approach offers a substantial reduction in computational complexity, which is particularly beneficial for large-scale data applications. The key operations in LSH include:

Mapping Data into Buckets: The core of LSH is a family of hash functions that maximize the probability of nearby points in the original space staying close in the hashed space, while distant points are likely to end up in different buckets. Each hash function h is characterized by the property P[h(x) = h(y)] = 1 − d(x, y)/D, where d(x, y) is the distance between points x and y, and D denotes the diameter of the space. To map similar data into the same bucket, multiple hash functions from this family are selected based on the specific attributes of the data (e.g., Euclidean distance, cosine similarity) and the desired granularity of the buckets. Data points are then hashed by these functions, and each point is assigned to buckets according to its hash values, effectively categorizing similar items together for clustering.

Figure 5: Schematic of MoE training with Locality-Sensitive Hashing (LSH-MoE): (1) LSH-based clustering of tokens into centroids, (2) all-to-all dispatch of centroids to experts, (3) all-to-all return of E(centroids), and (4) residual-based error compensation to recover E(tokens).

Calculating Cluster Centroids: By grouping data points into buckets as determined by their hash values, data points are effectively clustered. Each bucket represents a cluster of data points, and the centroid of each cluster is then calculated as the mean of all points within that cluster, formulated as C_j = (1/n_j) Σ_{x_i ∈ bucket_j} x_i, where C_j is the centroid of the j-th bucket, n_j is the number of points in the j-th bucket, and x_i are the data points in the bucket.
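As a concrete illustration of these two operations, the sketch below buckets vectors with a simple sign-random-projection hash family and then averages each bucket. It is a generic LSH example (not the cross-polytope scheme used later in the paper), and all names and sizes are placeholders.

```python
import torch

def lsh_bucket_and_centroids(tokens, num_hashes=8, seed=0):
    # tokens: [n, d]. Hash each token against `num_hashes` random hyperplanes;
    # the resulting sign pattern forms the bucket id, so nearby tokens tend to collide.
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(tokens.shape[1], num_hashes, generator=g)
    bits = (tokens @ planes) > 0                                   # [n, num_hashes] sign bits
    powers = 2 ** torch.arange(num_hashes)
    bucket_ids = (bits.long() * powers).sum(dim=-1)                # integer bucket id per token

    # One centroid per non-empty bucket: C_j = (1/n_j) * sum of its tokens.
    uniq, inverse = torch.unique(bucket_ids, return_inverse=True)  # inverse: token -> bucket index
    centroids = torch.zeros(len(uniq), tokens.shape[1])
    counts = torch.zeros(len(uniq), 1)
    centroids.index_add_(0, inverse, tokens)
    counts.index_add_(0, inverse, torch.ones(len(tokens), 1))
    return centroids / counts, inverse

tokens = torch.randn(32, 16)
centroids, assignment = lsh_bucket_and_centroids(tokens)
print(centroids.shape, assignment.shape)   # e.g. torch.Size([m, 16]) torch.Size([32])
```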

3 Methodology

3.1 The Motivation of Token Similarity

To explore the potential optimization of all-to-all communications in MoE training, we conducted an in-depth analysis of the data involved in these all-to-all communications, identifying a high degree of similarity, termed token similarity. Specifically, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the input tokens of all-to-all communications and observed a distinct clustering phenomenon, as illustrated in Figure 4.

Our analysis suggests that the observed similarity among tokens may stem from two primary factors:

• Data-Related Influences: The similarity is partially due to the nature of real-world data, which often adheres to Zipf's Law [18]. This results in a skewed distribution, with certain data elements appearing more frequently than others.

• Model-Structure-Related Influences: The design of the Transformer architecture [34], especially its attention mechanisms, significantly impacts token similarity. In models like BERT [7], attention layers are designed to capture and integrate context information across tokens, thus homogenizing token representations and emphasizing their shared semantic relationships at the sentence level.

Figure 4: Principal Component Analysis (PCA) visualization of input tokens involved in all-to-all communication.

3.2 LSH-MoE

Motivated by the token similarity observed in Section 3.1, we introduce LSH-MoE, a novel MoE training framework that integrates locality-sensitive hashing (LSH) for rapid clustering of input tokens. Our method transmits only the clustering centroids, significantly reducing communication volumes. To counteract the negative effects of compression, we also implement a residual-based error compensation scheme.

As depicted in Figure 5, LSH-MoE initially employs (1) an LSH-based clustering method to compress tokens into centroids for subsequent processing, effectively reducing communication overhead. It then sequentially executes (2) all-to-all communication, expert computation, and another (3) all-to-all communication to produce the processed outputs E(centroids). Finally, it introduces (4) a residual-based error compensation method to approximate the expert-processed results E(tokens) by integrating E(centroids) with the residuals. We also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1.

The key components of our LSH-MoE framework include an efficient LSH-based clustering algorithm for rapid processing and a residual-based error compensation scheme to minimize quality degradation.

Efficient LSH-based Clustering Algorithm. Since the data to be compressed (the input data for all-to-all communication) is generated dynamically and in real time, pre-compressing it or overlapping compression time with other processing tasks is not feasible. Consequently, selecting an efficient online compression algorithm is crucial. Traditional clustering algorithms, such as K-Means, often encounter computational challenges and efficiency limitations. Locality-sensitive hashing (LSH) addresses these issues by hashing similar data points into the same buckets, enabling faster similarity detection in high-dimensional spaces.

Numerous LSH algorithms have been developed, each employing a unique hashing approach for mapping data onto buckets. We conducted experiments to evaluate several popular hashing algorithms, including cross-polytope hashing and spherical hashing. Based on our evaluations in Section 4.5, we selected cross-polytope hashing as the optimal algorithm for our application. Cross-polytope hashing stands out for its method of mapping input vectors to the nearest vertex of a cross-polytope. This process is facilitated by applying randomly rotated cross-polytopes, which effectively segment the surface of the unit sphere. The algorithm can be mathematically represented as follows:

LSH(x) = argmax_{i ∈ {±1, ±2, ..., ±d}} |Rx|_i    (3)

where R is a random rotation matrix, d is the dimensionality of the space, and |Rx|_i denotes the absolute value of the i-th component of the rotated vector Rx.

This formula encapsulates how the input vector x is transformed by the rotation matrix R and then mapped to the nearest vertex of the cross-polytope by selecting the dimension i that maximizes the absolute value of the components of Rx. This method effectively segments the high-dimensional space and enhances clustering efficiency by rapidly identifying similar data points.
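A minimal sketch of this hash is given below, assuming a single explicit random rotation obtained from a QR decomposition; production cross-polytope LSH implementations typically use fast pseudo-random rotations and several independent hash tables, so this is illustrative only.

```python
import torch

def cross_polytope_hash(x, seed=0):
    # x: [n, d] input vectors, assumed roughly normalized onto the unit sphere.
    n, d = x.shape
    g = torch.Generator().manual_seed(seed)
    # Orthogonal factor Q of a Gaussian matrix serves as the random rotation R.
    rot, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    rx = x @ rot.T                                # rotated vectors Rx
    idx = rx.abs().argmax(dim=-1)                 # dimension i maximizing |Rx|_i
    sign = torch.sign(torch.gather(rx, 1, idx.unsqueeze(1))).squeeze(1)
    return (sign * (idx + 1)).long()              # signed bucket id in {±1, ..., ±d}

x = torch.nn.functional.normalize(torch.randn(10, 16), dim=-1)
print(cross_polytope_hash(x))                     # e.g. tensor([ 3, -7, 12, ...])
```

Tokens that land on the same signed vertex share a bucket and are represented by that bucket's centroid before communication.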

Residual-based Error Compensation Scheme. In our LSH-MoE framework, we compress the intermediate activation values within the model network. Unlike gradient compression, this process does not tolerate errors well. Therefore, it is essential to minimize compression-induced errors to ensure minimal impact on model performance. To address this, we implement a residual-based error compensation strategy, outlined as follows (a minimal code sketch of both steps is given after the list):

1. We first capture the residual of each data point relative to its cluster centroid, defined by the equation:

∆Cluster_jk ← x_jk − cluster_j,  k = 1, 2, ..., N_j,    (4)

where x_jk denotes the k-th token assigned to the j-th cluster and N_j is the number of tokens in that cluster.

2. After the expert network computes outputs for the cluster centroids, the final step is to restore the processed result for each token by adding back the previously recorded residual:

Y_j ← {E(cluster_j) + ∆Cluster_jk | k = 1, 2, ..., N_j}.    (5)
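A minimal sketch of these two steps follows, with `expert_fn` standing in for the all-to-all transfers plus expert computation applied to the centroids; all names here are illustrative.

```python
import torch

def lsh_moe_with_compensation(tokens, centroids, assignment, expert_fn):
    # tokens: [n, d]; centroids: [m, d]; assignment[i] = cluster index of token i.
    residuals = tokens - centroids[assignment]     # step 1: Delta_jk = x_jk - cluster_j
    expert_out = expert_fn(centroids)              # only m centroids are communicated/processed
    return expert_out[assignment] + residuals      # step 2: E(cluster_j) + Delta_jk per token

# Toy usage: with an identity "expert", the compensation recovers the tokens exactly.
tokens = torch.randn(32, 16)
centroids, assignment = torch.randn(4, 16), torch.randint(0, 4, (32,))
out = lsh_moe_with_compensation(tokens, centroids, assignment, expert_fn=lambda c: c)
print(torch.allclose(out, tokens))                 # True
```

In general, the input-space residual is added back to the centroid's expert output as an approximation of the token's own expert output.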

This error compensation scheme effectively mitigates the potential accuracy loss caused by data compression in all-to-all communication, ensuring the fidelity and robustness of the LSH-MoE framework. The experimental results in Section 4 show that implementing this compensation mechanism enables the model trained with LSH-MoE to achieve accuracy comparable to that of a model trained without compression. This outcome highlights the effectiveness of our proposed error compensation strategy in preserving model performance despite the challenges posed by data compression in all-to-all communication.

Table 1: Models for evaluation, where "-" indicates that the values differ across layers.

Model         | #Layers | d_model | d_ffn | #Experts | #Params (MoE) | #Params (Total)
RoBERTa-MoE   | 12      | 768     | 3072  | 16       | 302M          | 394M
T5-MoE        | 16      | 1024    | 16384 | 16       | 8594M         | 9288M
GPT-MoE (15B) | 12      | 768     | 3072  | 512      | 14507M        | 14629M
GPT-MoE (52B) | 24      | 1024    | 4096  | 512      | 51539M        | 51740M
Swin-MoE-L    | 24      | -       | -     | 32       | -             | 946M

3.3 Scalability Analysis of LSH-MoE

To effectively demonstrate the scalability of our approach, particularly in terms of its applicability to both larger models and larger computational clusters, we conducted a theoretical analysis. This analysis primarily focuses on the computation overhead and the communication costs associated with Mixture-of-Experts (MoE), specifically the all-to-all communication overhead. We derived the ratio of communication time to computation time, highlighting how this ratio evolves as both the scale of the servers and the model size increase. This relationship is crucial for understanding scalability and can be formally expressed as follows:

where k represents the number of experts activated per token, FLOPs and B_inter denote the GPU's computation ability and the network performance, w is the number of GPU servers, and h is the hidden size of the model. Notably, the first term [...]. Additionally, scaling MoE models typically emphasizes increasing the number of layers and experts, while the growth in hidden size (h) tends to be gradual, as seen in models like Switch-Transformer [9]. Consequently, when both the model scale and the number of training servers grow, the proportion of all-to-all communication time remains nearly unchanged. This insight underpins the scalability of the LSH-MoE method, demonstrating its robustness in larger-scale settings and supporting its potential in future large-scale applications. For a detailed derivation, please refer to Appendix A.2.

4 Experiment

4.1 Implementation

Our LSH-MoE implementation comprises a data compression/restoration component and a communication component. We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Additionally, our method is framework-independent and can be easily applied to other MoE training frameworks such as Hetu-MoE [21, 26], DeepSpeed-MoE [29], and Tutel [12].

4.2 Benchmarks and Datasets

Our evaluations are conducted by scaling pre-trained models equipped with the MoE architecture across various application domains. This includes models like RoBERTa-MoE, T5-MoE, and GPT-MoE in natural language processing (NLP), as well as Swin-MoE in computer vision (CV). Among these models, RoBERTa-MoE and T5-MoE are evaluated on pre-training tasks, while GPT-MoE and Swin-MoE undergo fine-tuning evaluation based on their official open-sourced model checkpoints1,2. We also evaluated the zero-shot accuracy of the pre-trained T5-MoE. Model configurations are detailed in Table 1.

1/facebookr
