LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Xiaonan Nie1 Qibin Liu1 Fangcheng Fu1 Shenhan Zhu1 Xupeng Miao2
Xiaoyang Li3 Yang Zhang3 Shouda Liu3 Bin Cui1

arXiv:2411.08446v1 [cs.DC] 13 Nov 2024

1Peking University 2Purdue University 3ByteDance
1{xiaonan.nie, 2101212782, ccchengff, shenhan.zhu, bin.cui}@.cn
2xupeng@ 3{lixiaoyang.x, zhangyang.elfin, liushouda}@
Abstract

Larger transformer models always perform better on various tasks but require more cost to scale up the model size. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widely adopted, which consists of a gate network and a series of experts and keeps the training cost constant by routing the input data to a fixed number of experts instead of all of them. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, and thus input data requires additional all-to-all communications to access the target experts and conduct the corresponding computations. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the efficiency and scalability of training MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which utilizes cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our methods, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving a speedup of 1.28×-2.2×.
1 Introduction
In recent years, large-scale pre-trained models have significantly advanced the performance of deep learning across various complex tasks, including computer vision [8, 20], natural language processing [3, 7, 28], and multi-modal learning [19]. Commonly referred to as foundation models, these pre-trained models are primarily built on Transformer architectures [34] and undergo extensive pre-training on large datasets, utilizing substantial GPU resources. OpenAI has validated the scaling law for large language models [15] and suggests that increasing the model's parameter size, the volume of training data, and the duration of training can significantly enhance the model's performance. However, this approach results in a considerable rise in training costs, making the development of foundation models extremely expensive.
Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, and Bin Cui are with the School of Computer Science and Key Lab of High Confidence Software Technologies (MOE), Peking University. Bin Cui is also with the Institute of Computational Social Science, Peking University (Qingdao).
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
To reduce the high computational costs, the sparse mixture-of-experts (MoE) architecture is often adopted, which comprises a sparse gate network and a series of expert networks. This architecture routes input data to only a subset of experts, resulting in sparse activation of the experts and thereby reducing the model's computational FLOPs (floating point operations) as well as training costs. Prominent models such as Google's Switch-Transformer [9], ST-MoE [41], Meta's Hash Layer [31], and Mistral-AI's Mixtral models [14] have successfully implemented this design, demonstrating improvements in both performance and efficiency with MoE models.
Meanwhile, effectively scaling the training of MoE models across hundreds or even thousands of GPUs remains a significant challenge. Researchers from Google have proposed the expert parallelism approach [17], which replicates the gating network on each GPU and distributes different experts across multiple GPUs for parallel processing. Specifically, each input token is initially processed by the gating network to select the appropriate expert, after which it is routed to the designated experts via peer-to-peer (P2P) network communication. Once the designated experts complete their computation, the token is returned to the original GPU for further processing through an additional P2P communication. Since each GPU typically needs to exchange data with many other GPUs, these P2P transmissions result in an all-to-all communication pattern. Moreover, because the computation of the expert network relies on the outcomes of these communications, the communications cannot be effectively overlapped with ongoing computations. This dependency creates a significant performance bottleneck in model training across most commonly used GPU clusters. We conducted experiments on three widely-used MoE models, including RoBERTa-MoE, GPT-MoE, and Swin-MoE, on four A100 servers, each with a cross-machine bandwidth of 200 Gb/s. The results, as shown in Figure 3, reveal that the time cost of all-to-all communication constitutes an average of 45% and can reach up to 67% of the total model training time.
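As a rough, hedged illustration of how such a ratio can be measured, the sketch below wraps the two all-to-all exchanges of a single MoE layer with CUDA events and reports their share of that layer's time; the dispatch_input tensor, the expert_fn callable, and the equal-split communication are illustrative assumptions rather than the exact profiling setup used in the paper.

```python
# Hedged sketch: measuring how much of an MoE layer's time goes to all-to-all.
# Assumes torch.distributed is initialized with the NCCL backend; `dispatch_input`
# (routed tokens, equally split across ranks) and `expert_fn` are illustrative.
import torch
import torch.distributed as dist

def timed(fn):
    """Run fn() and return (result, elapsed GPU milliseconds)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn()
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)

def profile_moe_layer(dispatch_input, expert_fn):
    recv = torch.empty_like(dispatch_input)
    # First all-to-all: send tokens to the GPUs holding their target experts.
    _, t_a2a_1 = timed(lambda: dist.all_to_all_single(recv, dispatch_input))
    expert_out, t_expert = timed(lambda: expert_fn(recv))
    combined = torch.empty_like(expert_out)
    # Second all-to-all: return expert outputs to the GPUs that own the tokens.
    _, t_a2a_2 = timed(lambda: dist.all_to_all_single(combined, expert_out))
    comm_share = (t_a2a_1 + t_a2a_2) / (t_a2a_1 + t_expert + t_a2a_2)
    return combined, comm_share  # share of this MoE layer's time spent in communication
```

Accumulating such measurements over full training steps (MoE and non-MoE parts alike) yields overall ratios like those reported in Figure 3.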
Existing methods to improve distributed MoE training on bandwidth-limited clusters tackle communication challenges in various ways. TA-MoE [4] reduces cross-machine communication by adjusting the gating network to favor experts on the same server, while Pre-gated MoE [13] reduces the dependency between communication and computation through a pre-gating mechanism that plans token routing in advance. However, both approaches require modifications to the gating mechanism and model structure, limiting their universal applicability. DeepSpeed-MoE [29] introduces PR-MoE, which selects one expert plus a shared expert, halving the all-to-all communication load. SCoMoE [40] organizes all-to-all communication by structuring data transfers along different dimensions and controlling data volumes across network levels, and also clusters tokens to improve routing. However, none of these works consider reducing the all-to-all communication volume in MoE training by compressing the forward activations. Therefore, they can be integrated with our method for further improvement.
In this paper, we present LSH-MoE, a communication-efficient MoE training framework that leverages locality-sensitive hashing to group similar tokens. Our key contributions are as follows:

• We begin by identifying key challenges in scaling MoE training in existing systems, noting that all-to-all communication constitutes an average of 45% of the total training time. Additionally, we investigate the potential of using token similarity to facilitate data compression to reduce communication costs.

• We propose an efficient LSH-based compression technique that employs cross-polytope hashing for rapid clustering. This approach transmits only the clustering centroids, significantly reducing communication costs. To further enhance accuracy, we implement a residual-based error compensation scheme to mitigate the negative effects of compression.

• Through extensive experiments with language models (RoBERTa-MoE, GPT-MoE, and T5-MoE) and vision models (Swin-MoE), across both pre-training and fine-tuning tasks, we demonstrate that our method maintains model quality while achieving a speedup of 1.28×-2.2× in end-to-end training time.
2 Background

2.1 Mixture-of-Experts Architecture

To enhance the training efficiency of Transformer models, Fedus et al. (2022) [9] introduced an innovative paradigm, the sparse mixture-of-experts (MoE) architecture, illustrated in Figure 1.
Figure 1: Mixture-of-Experts on a single GPU.
Figure 2: Training Mixture-of-Experts on multiple GPUs as expert parallelism.
This architecture effectively balances parameter capacity and training costs, and comprises two key components: an expert network (E) and a sparse gate network (G). It is evident that MoE models, with an equal number of active parameters per input, can significantly surpass the performance of dense models. This breakthrough has also catalyzed further research and their application across various industries, as highlighted by numerous subsequent studies [5, 14, 22, 23, 25, 30, 39].
The expert network E is composed of multiple specialized and separate networks, commonly referred to as experts, denoted as {E_i}_{i=1}^N, where N represents the number of experts. Additionally, E_i(x) denotes the output produced when the input x is processed by the i-th expert. Each expert is trained to excel in a specific sub-task, such as in multi-task learning, or to handle specific segments of data, as seen in language modeling and multi-modal learning, thereby increasing the overall model capacity. In foundational models, the MoE layer often serves as a substitute for the traditional feed-forward network (FFN) layer. Within each MoE layer, each FFN function works as an individual expert, significantly enhancing the model's capability to process diverse and complex data inputs.
The gating network G plays a crucial role in the sparse MoE architecture. For example, in a K-way gated MoE system, the gating network outputs a set of integers, as in Equation 1, to determine which experts should be activated. This decision is based on the characteristics of the input itself, allowing for a dynamic and efficient allocation of computational resources. By only processing each input token with a selected subset of the expert network, the MoE model achieves computational sparsity, effectively decoupling parameter capacity from training costs.

G: R^M → [1, N]^K    (1)
Through the integration of multiple specialized experts, as described by Equation 2, the sparse MoE model is capable of delivering more accurate and efficient predictions as f(x). This is achieved by leveraging the specialized knowledge embedded within each expert, combined with the strategic input allocation managed by the gating network.

f(x) = Σ_{i ∈ G(x)} E_i(x)    (2)
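To make Equations 1 and 2 concrete, here is a minimal single-GPU sketch of a sparse MoE layer in PyTorch. The top-k routing, the module names, and the decision to sum the selected experts' outputs without gate weights follow Equation 2 as written; all hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Minimal sketch of Equations 1-2: route each token to K experts and sum their outputs."""
    def __init__(self, d_model=768, d_ffn=3072, num_experts=16, k=1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # sparse gate network G
        self.experts = nn.ModuleList(                 # expert network E = {E_i}, i = 1..N
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: [num_tokens, d_model]
        # Equation 1: G(x) picks K expert indices per token.
        topk_idx = self.gate(x).topk(self.k, dim=-1).indices  # [num_tokens, K]
        out = torch.zeros_like(x)
        # Equation 2: f(x) = sum over i in G(x) of E_i(x).
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).any(dim=-1)                # tokens routed to expert i
            if mask.any():
                out[mask] += expert(x[mask])
        return out

# Usage: tokens = torch.randn(8, 768); y = SparseMoELayer()(tokens)
```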
While MoE's primary advantage is decoupling parameter capacity from network cost, a key challenge lies in learning the gating parameters effectively, as the output's sparsity makes it non-differentiable. Consequently, much of the research in the MoE field has centered on developing methods for learning gating functions. These methods fall into three main categories, as outlined in [6]: routing via learnable weighting [9, 24, 30], deterministic hash routing [31], and reinforcement learning-based routing [2, 32, 33]. These approaches primarily differ in the design of the gating network G rather than the expert network E, and therefore all encounter similar scaling challenges.
2.2 Challenges of Scaling MoE Model Training
Figure 3: Proportion of all-to-all communication time relative to total training duration for RoBERTa-MoE, GPT-MoE, and Swin-MoE-L across different configurations: (a) 16 GPUs; (b) 32 GPUs, i.e., scaling the number of training servers; and (c) 16 GPUs with the -Wide variants (double the number of experts), i.e., scaling the parameter size of models.

While MoE models were initially developed to facilitate efficient scaling during training, deploying these large-scale models in practical GPU-intensive environments poses significant challenges in distributed computing. Specifically, the MoE layer harbors a considerably higher number of parameters and requires additional memory, yet it maintains almost the same computational demands as the dense layer. This leads to a uniquely low compute density, defined as the ratio of the layer's FLOPs (floating point operations) to its number of parameters. Therefore, traditional parallelism methods such as tensor parallelism and pipeline parallelism are insufficient for achieving effective parallelism in the scenarios of MoE training.
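As a back-of-envelope illustration of this compute-density gap (not a calculation from the paper), the snippet below compares a dense FFN with a 16-expert top-1 MoE layer that performs the same per-token FLOPs; the dimensions are the RoBERTa-MoE-scale values from Table 1, and the FLOP count uses the standard two-FLOPs-per-parameter (multiply plus add) approximation.

```python
# Hedged back-of-envelope: compute density = per-token FLOPs / parameter count.
d_model, d_ffn, num_experts = 768, 3072, 16   # RoBERTa-MoE-scale dimensions (Table 1)

ffn_params = 2 * d_model * d_ffn              # two weight matrices, biases ignored
ffn_flops_per_token = 2 * ffn_params          # ~2 FLOPs per parameter (multiply + add)

dense_density = ffn_flops_per_token / ffn_params                # = 2.0
moe_density = ffn_flops_per_token / (num_experts * ffn_params)  # top-1 routing: same FLOPs,
                                                                # 16x the parameters -> 0.125
print(dense_density, moe_density)
```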
To improve the efficiency and scalability of training large-scale MoE models, expert parallelism [17] has been introduced as a specialized model parallelism strategy. This approach distributes the experts within an MoE layer across multiple GPUs, while leveraging data parallelism for replicating non-MoE layers, thus efficiently managing the training workload of MoE models. The workflow of distributed training for an MoE layer is depicted in Figure 2. Once the target expert for each token is determined, an all-to-all communication process is triggered to distribute tokens to their corresponding target experts for computation, denoted as E_i(x). Subsequently, another round of all-to-all communication is executed to gather the outputs from all experts, which produces the MoE layer's output (represented as f(x), Equation 2). Subsequent operations involve executing the data-parallel non-MoE layers.
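A schematic of this dispatch-compute-combine workflow with torch.distributed is sketched below, assuming one expert per rank and an equal number of tokens routed to every rank; real expert-parallel systems additionally handle uneven splits, expert capacity, and top-k routing, so treat this only as an illustration of where the two all-to-all exchanges sit, not as the paper's implementation.

```python
import torch
import torch.distributed as dist

def expert_parallel_moe_forward(x, gate, local_expert, group=None):
    """Schematic expert parallelism: gate -> all-to-all dispatch -> expert -> all-to-all combine.
    Assumes one expert per rank and that each rank sends an equal-sized chunk of tokens
    to every other rank (real systems handle uneven splits and capacity limits)."""
    # 1) Route tokens locally; here we simply sort them by their target expert id.
    expert_ids = gate(x).argmax(dim=-1)          # [num_tokens], top-1 routing
    order = torch.argsort(expert_ids)
    dispatched = x[order]                        # tokens grouped by destination rank
    # 2) First all-to-all: each rank receives the tokens assigned to its local expert.
    recv = torch.empty_like(dispatched)
    dist.all_to_all_single(recv, dispatched, group=group)
    # 3) Local expert computation E_i(x).
    expert_out = local_expert(recv)
    # 4) Second all-to-all: send expert outputs back to the ranks that own the tokens.
    combined = torch.empty_like(expert_out)
    dist.all_to_all_single(combined, expert_out, group=group)
    # 5) Undo the local sort so outputs line up with the original token order.
    out = torch.empty_like(combined)
    out[order] = combined
    return out
```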
We first profiled the training process of three popular MoE models employing expert parallelism (detailed in Table 1) on a cluster comprised of four A100 machines, each equipped with an interconnect RDMA bandwidth of 200 Gb/s. The proportion of all-to-all communication time relative to the total training duration is illustrated in Figure 3(a). We then doubled the number of machines and the number of experts to increase the model scale. The results are shown in Figure 3(b) and 3(c), respectively.

Our findings reveal that all-to-all communication accounted for a substantial portion of the total time: approximately 30% in GPT-MoE (15B), 40% in RoBERTa-MoE, and 70% in Swin-MoE-L. This overhead remains nearly constant in larger models and at larger machine scales. These results highlight a significant bottleneck that hampers the scalability of the training process. Consequently, the duration of all-to-all communication substantially constrains training with expert parallelism, leading to reduced overall throughput and limiting the potential to scale up the number of experts effectively.
2.3 Locality-Sensitive Hashing Algorithms

Locality-Sensitive Hashing (LSH) is a probabilistic method primarily used for approximate nearest neighbor search in high-dimensional spaces. It reduces the dimensionality of data by mapping similar data points to the same "buckets" with high probability using hash functions. This approach offers a substantial reduction in computational complexity, which is particularly beneficial for large-scale data applications. The key operations in LSH include:
Mapping Data into Buckets: The core of LSH is a family of hash functions that maximize the probability of nearby points in the original space staying close in the hashed space, while distant points are likely to end up in different buckets. Each hash function h is characterized by the property P[h(x) = h(y)] = 1 − d(x, y)/D, where d(x, y) is the distance between points x and y, and D denotes the diameter of the space. To map similar data into the same bucket, multiple hash functions from this family are selected based on the specific attributes of the data (e.g., Euclidean distance, cosine similarity) and the desired granularity of the buckets. Data points are then hashed by these functions, and each point is assigned to buckets according to its hash values, effectively categorizing similar items together for clustering.

Figure 5: Schematic of MoE training with Locality-Sensitive Hashing (LSH-MoE).
Calculating Cluster Centroids: By grouping data points into buckets as determined by their hash values, data points are effectively clustered. Each bucket represents a cluster of data points, and the centroid of each cluster is then calculated as the mean of all points within that cluster, formulated as C_j = (1/n_j) Σ_{x_i ∈ bucket_j} x_i, where C_j is the centroid of the j-th bucket, n_j is the number of points in the j-th bucket, and x_i are the data points in the bucket.
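As a minimal sketch of these two operations (not the paper's implementation), the snippet below uses a random-hyperplane sign hash as the LSH family and plain tensor ops for the bucket means; the helper names and the choice of hash family are illustrative assumptions, and the cross-polytope hash actually adopted by LSH-MoE is sketched in Section 3.2.

```python
import torch

def lsh_bucket_ids(x, num_hashes=8, seed=0):
    """Map each row of x to a bucket id using random-hyperplane (sign) hashes."""
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(x.shape[-1], num_hashes, generator=g)  # one hyperplane per hash bit
    bits = (x @ planes > 0).long()                               # [num_points, num_hashes]
    powers = 2 ** torch.arange(num_hashes)
    return (bits * powers).sum(dim=-1)                           # integer bucket id per point

def bucket_centroids(x, bucket_ids):
    """Compute C_j = (1/n_j) * sum of the points in bucket j, for the buckets that occur."""
    uniq, inverse = torch.unique(bucket_ids, return_inverse=True)
    sums = torch.zeros(len(uniq), x.shape[-1]).index_add_(0, inverse, x)
    counts = torch.bincount(inverse, minlength=len(uniq)).unsqueeze(-1)
    return sums / counts, inverse   # centroids and, for each point, its bucket index

# Usage: ids = lsh_bucket_ids(tokens); centroids, assign = bucket_centroids(tokens, ids)
```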
3 Methodology

3.1 The Motivation of Token Similarity

To explore the potential optimization for all-to-all communications in MoE training, we conducted an in-depth analysis of the data involved in these all-to-all communications, identifying a high degree of similarity, termed token similarity. Specifically, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the input tokens of all-to-all communications and observed a distinct clustering phenomenon, as illustrated in Figure 4.
Our analysis suggests that the observed similarity among tokens may stem from two primary factors:

• Data-Related Influences: The similarity is partially due to the nature of real-world data, which often adheres to Zipf's Law [18]. This results in a skewed distribution, with certain data elements appearing more frequently than others.

• Model-Structure-Related Influences: The design of the Transformer architecture [34], especially its attention mechanisms, significantly impacts token similarity. In models like BERT [7], attention layers are designed to capture and integrate context information across tokens, thus homogenizing token representations and emphasizing their shared semantic relationships at the sentence level.

Figure 4: Principal Component Analysis (PCA) visualization of input tokens involved in all-to-all communication.
3.2 LSH-MoE

Motivated by the token similarity observed in Section 3.1, we introduce LSH-MoE, a novel MoE training framework that integrates locality-sensitive hashing (LSH) for rapid clustering of input tokens. Our method transmits only the clustering centroids, significantly reducing communication volumes. To counteract the negative effects of compression, we also implement a residual-based error compensation scheme.

As depicted in Figure 5, LSH-MoE initially employs (1) an LSH-based clustering method to compress tokens into centroids for subsequent processing, effectively reducing communication overhead. It then sequentially executes (2) all-to-all communication, expert computation, and another (3) all-to-all communication to produce the processed outputs E(centroids). Finally, it introduces (4) a residual-based error compensation method to approximate the expert-processed results E(tokens) by integrating E(centroids) with the residuals. We also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1.

The key components of our LSH-MoE framework include an efficient LSH-based clustering algorithm for rapid processing and a residual-based error compensation scheme to minimize quality degradation.
Efficient LSH-based Clustering Algorithm. Since the data to be compressed (the input data for all-to-all communication) is generated dynamically and in real time, pre-compressing it or overlapping compression time with other processing tasks is not feasible. Consequently, selecting an efficient online compression algorithm is crucial. Traditional clustering algorithms, such as K-Means, often encounter computational challenges and efficiency limitations. Locality-sensitive hashing (LSH) addresses these issues by hashing similar data points into the same buckets, enabling faster similarity detection in high-dimensional spaces.
Numerous LSH algorithms have been developed, each employing a unique hashing approach for mapping data onto buckets. We conducted experiments to evaluate several popular hashing algorithms, including cross-polytope hashing and spherical hashing. Based on our evaluations in Section 4.5, we selected cross-polytope hashing as the optimal algorithm for our application. Cross-polytope hashing stands out for its method of mapping input vectors to the nearest vertex of a cross-polytope. This process is facilitated by applying randomly rotated cross-polytopes, which effectively segment the surface of the unit sphere. The algorithm can be mathematically represented as follows:

LSH(x) = argmax_{i ∈ {±1, ±2, ..., ±d}} |Rx|_i    (3)

where R is a random rotation matrix, d is the dimensionality of the space, and |Rx|_i denotes the absolute value of the i-th component of the rotated vector Rx. This formula encapsulates how the input vector x is transformed by the rotation matrix R and then mapped to the nearest vertex of the cross-polytope by selecting the dimension i that maximizes the absolute value of the components of Rx. This method effectively segments the high-dimensional space and enhances clustering efficiency by rapidly identifying similar data points.
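Below is a small sketch of one cross-polytope hash in PyTorch, following Equation 3: a (pseudo-)random rotation is applied, and each token is assigned to the signed coordinate axis closest to its rotated direction. The use of a random orthogonal matrix from torch.linalg.qr and the signed-index encoding are implementation assumptions; production LSH libraries typically use faster structured pseudo-rotations.

```python
import torch

def cross_polytope_hash(x, seed=0):
    """Equation 3: LSH(x) = the signed axis i in {±1, ..., ±d} maximizing |Rx|_i."""
    d = x.shape[-1]
    g = torch.Generator().manual_seed(seed)
    # Random rotation R: orthogonalize a Gaussian matrix (a simple stand-in for the
    # structured pseudo-rotations used by fast cross-polytope LSH implementations).
    R, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    rotated = x @ R.T                                   # Rx for every token (rows of x)
    idx = rotated.abs().argmax(dim=-1)                  # dimension with the largest |component|
    sign = torch.sign(torch.gather(rotated, -1, idx.unsqueeze(-1))).squeeze(-1)
    # Encode the vertex as a signed index in {±1, ..., ±d}; tokens with the same
    # hash value fall into the same bucket.
    return (sign * (idx + 1)).long()

# Usage: bucket_ids = cross_polytope_hash(torch.randn(32, 768))
```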
Residual-based Error Compensation Scheme. In our LSH-MoE framework, we compress the intermediate activation values within the model network. Unlike gradient compression, this process does not tolerate errors well. Therefore, it is essential to minimize compression-induced errors to ensure minimal impact on model performance. To address this, we implement a novel residual-based error compensation strategy, outlined as follows:

1. We first capture the residual for each data point relative to its cluster centroid, defined by the equation:

∆Cluster_jk ← x_jk − cluster_j,  k = 1, 2, ..., N_j,    (4)

where x_jk denotes the k-th token assigned to cluster j.

2. After the expert network computes outputs for the cluster centers, the final step is to restore the processed result for each token by adding back the previously recorded residual:

Y_j ← {E(cluster_j) + ∆Cluster_jk | k = 1, 2, ..., N_j}.    (5)
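The following sketch ties Equations 4 and 5 together in a single-process simulation: tokens are compressed to bucket centroids and per-token residuals, a stand-in expert runs on the centroids only, and each token's output is approximated by adding its stored input-space residual back to its centroid's output. The helper names, the toy expert, and the random bucket assignments are illustrative assumptions, meant only to show the shape of the compensation scheme.

```python
import torch

def compress(tokens, bucket_ids):
    """Equation 4: keep one centroid per bucket and each token's residual to its centroid."""
    uniq, assign = torch.unique(bucket_ids, return_inverse=True)
    sums = torch.zeros(len(uniq), tokens.shape[-1]).index_add_(0, assign, tokens)
    counts = torch.bincount(assign, minlength=len(uniq)).unsqueeze(-1)
    centroids = sums / counts
    residuals = tokens - centroids[assign]        # Delta-Cluster_jk = x_jk - cluster_j
    return centroids, residuals, assign

def decompress(expert_centroid_out, residuals, assign):
    """Equation 5: approximate E(tokens) by E(centroid of each token) + its residual."""
    return expert_centroid_out[assign] + residuals

# Usage with a toy "expert"; in LSH-MoE the centroids, not the full token set,
# would travel through the two all-to-all exchanges.
tokens = torch.randn(64, 16)
bucket_ids = torch.randint(0, 8, (64,))           # stand-in for LSH bucket assignments
expert = torch.nn.Linear(16, 16)
centroids, residuals, assign = compress(tokens, bucket_ids)
approx = decompress(expert(centroids), residuals, assign)   # approximates expert(tokens)
```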
This error compensation scheme effectively mitigates the potential accuracy loss caused by data compression in all-to-all communication, ensuring the fidelity and robustness of the LSH-MoE framework. The experimental results in Section 4 show that implementing this compensation mechanism enables the model trained with LSH-MoE to achieve an accuracy comparable to that of a model trained without compression. This outcome highlights the effectiveness of our proposed error compensation strategy in preserving model performance despite the challenges posed by data compression in all-to-all communication.

Table 1: Models for evaluation, where "-" indicates that the values are different across layers.

Model         | #Layer | d_model | d_ffn | #Experts | #Params. (MoE) | #Params. (Total)
RoBERTa-MoE   | 12     | 768     | 3072  | 16       | 302M           | 394M
T5-MoE        | 16     | 1024    | 16384 | 16       | 8594M          | 9288M
GPT-MoE (15B) | 12     | 768     | 3072  | 512      | 14507M         | 14629M
GPT-MoE (52B) | 24     | 1024    | 4096  | 512      | 51539M         | 51740M
Swin-MoE-L    | 24     | -       | -     | 32       | -              | 946M
3.3 Scalability Analysis of LSH-MoE

To effectively demonstrate the scalability of our approach, particularly in terms of its applicability to both larger models and larger computational clusters, we conducted a theoretical analysis. This analysis primarily focuses on the computation overhead and the communication costs associated with Mixture of Experts (MoE), specifically considering the all-to-all communication overhead. We derived the ratio of communication time to computation time, highlighting how this ratio evolves as both the scale of the servers and the model size increase. This relationship is crucial for understanding scalability and is expressed in terms of k, the number of experts activated per token; FLOPs and B_inter, the GPU's computation ability and the network performance; w, the number of GPU servers; and h, the hidden size of the model. Additionally, scaling MoE models typically emphasizes increasing the number of layers and experts, while the growth in hidden size (h) tends to be gradual, as seen in models like Switch-Transformer [9]. Consequently, when both the model scale and the number of training servers grow, the proportion of all-to-all communication time remains nearly unchanged. This insight underpins the scalability of the LSH-MoE method, demonstrating its robustness in larger-scale settings and supporting its potential in future large-scale applications. For a detailed derivation, please refer to Appendix A.2.
4 Experiment

4.1 Implementation

Our LSH-MoE comprises a data compression/restoration component and a communication component. We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Additionally, our method is framework-independent and can be easily applied to other MoE training frameworks such as Hetu-MoE [21, 26], DeepSpeed-MoE [29], and Tutel [12].
4.2BenchmarksandDatasets
Ourevaluationsareconductedbyscalingpre-trainedmodelsequippedwithMoEarchitectureacrossvariousapplicationdomains.ThisincludesmodelslikeRoBERTa-MoE,T5-MoEandGPT-MoEinnaturallanguageprocessing(NLP),aswellasSwin-MoEincomputervision(CV).Amongthesemodels,RoBERTa-MoEandT5-MoEareevaluatedonpre-trainingtask,whileGPT-MoEandSwin-MoEundergofine-tuningevaluationbasedontheirofficialopen-sourcedmodelcheckpoints
1
2.
Wealsoevaluatedthezero-shotaccuracyofthepre-trainedT5-MoE.ModelconfigurationsaredetailedinTable
1.
1/facebookr