Current Best Practices for Training LLMs from Scratch
Authors: Rebecca Li, Andrea Parker, Justin Tenuto
Weights & Biases
Table of Contents
Introduction
Build vs. Buy Pre-trained LLM Models
The Scaling Laws
Hardware
Memory vs. Compute Efficiency
Techniques for Parallelization
Dataset Collection
Dataset Pre-processing
Dataset Handling
Tokenization
Pre-training Steps
Model Evaluation
Bias and Toxicity
Instruction Tuning
Reinforcement Learning through Human Feedback (RLHF)
Conclusion
References
Appendix
LLM Overview
Transformer Model Architecture
The Original LLM Scaling Laws
Introduction

Although we're only a few years removed from the transformer breakthrough, LLMs have already grown massively in performance, cost, and promise. At W&B, we've been fortunate to see more teams try to build LLMs than anyone else. But many of the critical details and key decision points are often passed down by word of mouth.

The goal of this whitepaper is to distill the best practices for training your own LLM from scratch. We'll cover everything from scaling and hardware to dataset selection and model training, letting you know which tradeoffs to consider and flagging some potential pitfalls along the way. This is meant to be a fairly exhaustive look at the key steps and considerations you'll make when training an LLM from scratch.

The first question you should ask yourself is whether training one from scratch is right for your organization. As such, we'll start there:
BUILD VS. BUY PRE-TRAINED LLM MODELS

Before starting LLM pre-training, the first question you need to ask is whether you should pre-train an LLM by yourself or use an existing one. There are three basic approaches:

• Option 1: Use the API of a commercial LLM, e.g. GPT-3 (OpenAI, 2020), Cohere APIs, AI21 J-1
• Option 2: Use an existing open-sourced LLM, e.g. GPT-J (EleutherAI, 2021), GPT-NeoX (EleutherAI, 2022), Galactica (Meta AI), UL2 (Google, 2022), OPT (Meta AI, 2022), BLOOM (BigScience, 2022), Megatron-LM (NVIDIA, 2021), CodeGen (Salesforce, 2022)
• Option 3: Pre-train an LLM by yourself or with consultants: you can either manage your own training or hire LLM consultants & platforms. For example, MosaicML provides training services focusing on LLMs.

That said, there are a lot of details to consider when making your choice. Here are the pros, cons, and applicable scenarios for each option:
Pros

Option 1: Use the API of a commercial LLM
• Requires the least LLM training technical skills.
• Minimum upfront training/exploration cost, given the main cost is incurred at inference time.
• The least data-demanding option. Only a few examples (or no examples) are needed for models to perform inference.
• Can leverage the best-performing LLMs in the market and build a superior experience.
• Reduces time-to-market of your apps and de-risks your project with a working LLM model.

Option 2: Use an existing open-sourced LLM
• A good way to leverage what LLMs have learned from a vast amount of internet data and build on top of it without paying for the IP at inference.
• Compared to option one, you are less dependent on the future direction of LLM service providers and thus have more control regarding roadmap & backwards compatibility.
• Compared to option three, you have a much faster time-to-value given you are not building LLMs from scratch, also leading to less data, training time, and training budget needed.

Option 3: Pre-train an LLM by yourself or with consultants
• Compared to options one and two, you have the most control of your LLM's performance and future direction, giving you lots of flexibility to innovate on techniques and/or customize to your downstream tasks.
• Gain full control of the training datasets used for pre-training, which directly impacts model quality, bias, and toxicity issues. In comparison, those issues are less controllable in option one or two.
• Training your own LLM also gives you a deep moat: superior LLM performance either across horizontal use cases or tailored to your vertical, allowing you to build a sustaining advantage, especially if you create a positive data/feedback loop with LLM deployments.
Cons

Option 1: Use the API of a commercial LLM
• Commercial LLM services can get expensive with a high volume of fine-tuning or inference tasks. It comes down to LLM total cost of ownership (TCO) amortized to each inference.
• Many industries/use cases forbid the use of commercial LLM services, as sensitive data/PII data cannot be seen by the service for compliance reasons (healthcare use cases, for example).
• If building external apps, you'll need to find other moats and de-risk your business if you're highly reliant on external LLM service technology.
• Less flexible downstream: doesn't support edge inference, limited ability to customize the model (fine-tuning gets expensive), limited ability for ongoing model improvements.

Option 2: Use an existing open-sourced LLM
• Not as demanding as building your own, but still requires lots of domain expert skills to train, fine-tune, and host an open-sourced LLM. LLM reproducibility is still a significant issue, so the amount of time and work needed cannot be underestimated.
• Slower time-to-market and less agile if you are building downstream apps, due to a more vertical tech stack.
• Open-sourced models typically lag commercial models in performance by months or years. If your competitor leverages commercial models, they have an advantage on LLM tech and you'll need to find other competitive advantages.

Option 3: Pre-train an LLM by yourself or with consultants
• Very expensive endeavor with high risks. Needs cross-domain knowledge spanning NLP/ML, subject matter expertise, and software and hardware expertise. If not done well, you could end up having spent thousands or even millions of dollars on a suboptimal model. Mistakes, especially late in the training stages, are hard to fix or unwind.
• Less efficient than option two. Option two leverages existing LLMs that have learned from an entire internet's worth of data, which can provide a solid starting point. With option three, you start from scratch and need lots of high-quality, diverse datasets for your model to gain generalized capabilities.
When to consider each option

Option 1: Use the API of a commercial LLM
• Best if you either have less technical teams but want to leverage LLM techniques to build downstream apps, or you want to leverage the best-in-class LLMs for performance reasons (outsourcing the LLM tech).
• Good if you have very limited training datasets and want to leverage an LLM's capability to do zero/few-shot learning.
• Good for prototyping apps and exploring what is possible with LLMs.

Option 2: Use an existing open-sourced LLM
• Between options two and three, if you aren't trying to change the model architecture, it is almost always better to either directly take an existing pre-trained LLM and fine-tune it, or take the weights of an existing pre-trained LLM as a starting point and continue pre-training. The reason is that a good pre-trained LLM like GPT-NeoX has already seen a vast amount of data and thus has learned general capabilities from that data. You can leverage that learning, especially if your training dataset is not huge or diverse.
• Another typical scenario is that you operate in a regulated environment or have user/sensitive data that cannot be fed to commercial LLM services. Or you need edge deployment of the model for latency or locational reasons.

Option 3: Pre-train an LLM by yourself or with consultants
• Best if you need to change the model architecture or training dataset from existing pre-trained LLMs. For example, if you want to use a different tokenizer, change the vocabulary size, or change the number of hidden dimensions, attention heads, or layers.
• Typically, in this case the LLM is a core part of your business strategy and technological moat. You are taking on some or a lot of innovations in LLM training, and have a large investment appetite to train and maintain expensive models on an ongoing basis.
• Typically, you have or will have lots of proprietary data associated with your LLM to create a continuous model improvement loop for sustainable competitive advantage.
It is also worth mentioning that if you only have a very targeted set of use cases and don't need the general-purpose or generative capabilities of LLMs, you might want to consider training or fine-tuning a much smaller transformer or another, much simpler deep learning model. That could result in much less complexity, less training time, and lower ongoing costs.
THE SCALING LAWS

Before you dive into training, it's important to cover how LLMs scale. Understanding scaling lets you effectively balance the size and complexity of your model against the size of the data you'll use to train it.

Some relevant history here: OpenAI originally introduced "the LLM scaling laws" in 2020. They suggested that increasing model size was more important than scaling data size. This held for about two years before DeepMind suggested almost the polar opposite: that previous models were significantly undertrained and that increasing your foundational training datasets actually leads to better performance.

That changed in 2022. Specifically, DeepMind put forward an alternative approach in their Training Compute-Optimal Large Language Models paper. They found that current LLMs are actually significantly undertrained. Put simply: these large models weren't trained on nearly enough data.

DeepMind showcased this with a model called Chinchilla, which is a fourth the size of the Gopher model but trained on 4.6x more data. At that reduced size but with far more training data, Chinchilla outperformed Gopher and other LLMs.

DeepMind claims that model size and the number of training tokens* should instead increase at roughly the same rate to achieve optimal performance. If you get a 10x increase in compute, you should make your model about 3.1x bigger and the data you train on about 3.1x bigger; if you get a 100x increase in compute, you should make your model 10x bigger and your data 10x bigger.
*Note: Tokenization in NLP is an essential step of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. The number of training tokens is the size of the training data in token form after tokenization. We will dive into detailed tokenization methods a little later.
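As a quick illustration of what "training tokens" means in practice, the snippet below counts tokens with an off-the-shelf subword tokenizer. It assumes the Hugging Face transformers package and uses the GPT-2 tokenizer purely as an example; choosing your own tokenizer is covered later in the tokenization section.

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer, used here only as an illustrative example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization splits text into smaller units called tokens."
token_ids = tokenizer.encode(text)

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(tokenizer.convert_ids_to_tokens(token_ids))

# The "number of training tokens" for a corpus is simply this count
# summed over every document in the training set.
```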
DeepMind provides the following chart showing how much training data and compute you'd need to optimally train models of various sizes.

[Figure: Estimated optimal training FLOPs and training tokens for various model sizes, from Training Compute-Optimal Large Language Models]

That said, most existing LLMs are still undertrained:

[Figure: Data/compute-optimal (Chinchilla) heatmap, from "Chinchilla data-optimal scaling laws: In plain English"]
In summary, the current best practices in choosing the size of your LLM models are largely based on two rules:

• Decide on your dataset and find the Chinchilla-optimal model size based on data size (or close to Chinchilla-optimal, within the boundary of your data collection limitations)
• Determine the data and model size combination that's best for your model, based on your training compute budget and inference latency requirements

[Figure: To the left of the minima on each curve, models are too small -- a larger model trained on less data would be an improvement. To the right of the minima on each curve, models are too large -- a smaller model trained on more data would be an improvement. The best models are at the minima.]
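To make these rules concrete, here is a minimal sketch of the arithmetic behind them. It assumes the common approximations C ≈ 6·N·D for training FLOPs and roughly 20 tokens per parameter at the Chinchilla-optimal point; both are rules of thumb drawn from the Chinchilla results, not exact prescriptions.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough Chinchilla-style split of a compute budget.

    Uses the approximation C ~= 6 * N * D (training FLOPs for N parameters
    over D tokens) plus the rule of thumb D ~= 20 * N at the optimum, so
    N ~= sqrt(C / (6 * 20)) and D ~= 20 * N.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):
        n, d = chinchilla_optimal(budget)
        print(f"C = {budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
    # A 10x compute increase scales both N and D by sqrt(10) ~= 3.16x,
    # matching the "3.1x bigger model, 3.1x more data" guidance above.
```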
HARDWARE

It should come as no surprise that pre-training LLMs is a hardware-intensive effort. The following examples of current models are a good guide here:

• PaLM (540B, Google): 6144 TPU v4 chips used in total, made of two TPU v4 Pods connected over data center network (DCN), using a combination of model and data parallelism
• OPT (175B, Meta AI): 992 80GB A100 GPUs, utilizing fully sharded data parallelism with Megatron-LM tensor parallelism
• GPT-NeoX (20B, EleutherAI): 96 40GB A100 GPUs in total
• Megatron-Turing NLG (530B, NVIDIA & MSFT): 560 DGX A100 nodes, each cluster node having 8 NVIDIA 80GB A100 GPUs

Training LLMs is challenging from an infrastructure perspective for two big reasons. For starters, it is simply no longer possible to fit all the model parameters in the memory of even the largest GPU (e.g. the NVIDIA 80GB A100), so you'll need some parallel architecture here. The other challenge is that the large number of compute operations can result in unrealistically long training times if you aren't concurrently optimizing your algorithms, software, and hardware stack (e.g. training GPT-3 with 175B parameters would require about 288 years with a single V100 NVIDIA GPU).
Memory vs. Compute Efficiency

To achieve the full potential of thousands of distributed GPUs, it is crucial to design parallelism into your architecture to balance memory and compute efficiency.

Memory efficiency

Training an LLM requires terabytes of aggregate memory for model weights, gradients, and optimizer states - far beyond what is available on a single GPU. One typical mitigation strategy is gradient accumulation, in which the full training batch is split into micro-batches that are processed in sequence, with their resulting gradients accumulated before updating the model weights. That means your training batch size can scale without increasing the peak resident activation memory.

Compute efficiency

While large GPU clusters can have thousands of high-throughput GPUs, achieving high compute efficiency at this scale is challenging. A large batch size can be an effective way to increase compute efficiency, because it increases the arithmetic intensity of a GPU kernel and helps amortize the time spent stalled on communication and synchronization. However, using too large of a batch size can have negative effects on model quality.

While parallelization is paramount, there are many different ways to do it. We'll get into the most common in our next section.

Techniques for Parallelization

Parallelization refers to splitting up tasks and distributing them across multiple processors or devices, such as GPUs, so that they can be completed simultaneously. This allows for more efficient use of compute resources and faster completion times compared to running on a single processor or device. Parallelized training across multiple GPUs is an effective way to reduce the overall time needed for the training process.

There are several different strategies that can be used to parallelize training, including gradient accumulation, micro-batching, data parallelization, tensor parallelization, pipeline parallelization, and more. Typical LLM pre-training employs a combination of these methods. Let's define each:

Data Parallelism

Data parallelism is the best and most common approach for dealing with large datasets that cannot fit into a single machine in a deep learning workflow.

More specifically, data parallelism divides the training data into multiple shards (partitions) and distributes them to various nodes. Each node first works with its local data to train its sub-model, and then communicates with the other nodes to combine their results at certain intervals in order to obtain the global model. The parameter updates for data parallelism can be either asynchronous or synchronous.

The advantage of this method is that it increases compute efficiency and that it is relatively easy to implement. The biggest downside is that during the backward pass you have to pass the whole gradient to all other GPUs. It also replicates the model and optimizer across all workers, which is rather memory inefficient.
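As a concrete reference point, here is a minimal sketch of synchronous data parallelism using PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are placeholders; the point is the pattern of wrapping the model and sharding the data so each worker sees a different partition while gradients are averaged across workers.

```python
# Minimal synchronous data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")       # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; a real run would use a transformer and a text dataset.
    model = torch.nn.Linear(512, 512).cuda()
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    sampler = DistributedSampler(dataset)         # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                  # reshuffle shards each epoch
        for x, y in loader:
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()                       # DDP all-reduces gradients here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```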
Tensor Parallelism

Tensor parallelism divides large matrix multiplications into smaller submatrix calculations, which are then executed simultaneously using multiple GPUs.

This allows for faster training times due to its asynchronous nature and the ability to reduce communication overhead between nodes. The benefit of this method is that it is memory-efficient. The downside, however, is that it introduces additional communication of activations in each forward and backward propagation, and therefore requires high communication bandwidth to be efficient.
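To make the idea concrete, here is a small self-contained sketch of the underlying math: a single linear layer's weight matrix is split column-wise across two "devices" (here just two tensors on CPU), the partial results are computed independently, and concatenating them reproduces the full output. A real implementation such as Megatron-LM does the same thing across GPUs with collective communication.

```python
import torch

torch.manual_seed(0)

batch, d_in, d_out = 4, 8, 6
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)          # full weight matrix of one linear layer

# Column-parallel split: each "device" owns half of the output columns.
w_shard_0, w_shard_1 = w.chunk(2, dim=1)

# Each shard computes its partial output independently (in parallel on real GPUs).
y_shard_0 = x @ w_shard_0             # shape: (batch, d_out // 2)
y_shard_1 = x @ w_shard_1

# Gathering (concatenating) the shards reproduces the full result.
y_parallel = torch.cat([y_shard_0, y_shard_1], dim=1)
y_reference = x @ w

print(torch.allclose(y_parallel, y_reference))   # True
```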
Pipeline parallelism and model parallelism

Pipeline parallelism improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel.

This helps significantly with overall throughput while adding the smallest communication overhead. You can think of pipeline parallelism as "inter-layer parallelism" (where tensor parallelism can be thought of as "intra-layer parallelism").

Similar to pipeline parallelism, model parallelism is when you split the model among GPUs and use the same data for each model; each GPU works on a part of the model rather than a part of the data. The downside of pipeline and model parallelism is that it cannot scale infinitely, given that the degree of pipeline parallelism is bounded by the depth of the model.
As mentioned at the start of this section, it's not uncommon for teams to leverage a combination of parallelism techniques during training. For example, PaLM (Google Brain, 2022) and OPT (Meta AI, 2022) both used a combination of tensor model parallelism and data parallelism.

NVIDIA approached things a little differently in their Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM paper. They proposed a PTD-P technique that combines pipeline, tensor, and data parallelism to achieve state-of-the-art computational performance (52% of peak device throughput) on thousands of GPUs.

Specifically, PTD-P leverages a combination of pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism to practically train models with a trillion parameters. The method also employs graceful scaling in an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers.

Using these techniques to train LLMs requires not only the highest-performing GPUs to be efficient, but also high-bandwidth networking for optimal communication -- InfiniBand is often used to move data between nodes.
But this of course comes with a cost. Leveraging thousands of high-performing GPUs and high-bandwidth networks to train LLMs is infrastructure-intensive. For example, a back-of-the-envelope calculation estimated that the cost of the PaLM model (540B, Google) might be as high as $23MM (see the detailed analysis).
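For readers who want to sanity-check numbers like this, here is a hedged sketch of how such an estimate is typically built: total training FLOPs from the 6·N·D approximation, divided by sustained accelerator throughput, multiplied by an assumed hourly price. The token count, throughput, utilization, and price below are illustrative assumptions, not figures from the linked analysis.

```python
def estimate_training_cost(n_params, n_tokens, peak_flops_per_chip,
                           utilization, price_per_chip_hour):
    """Back-of-the-envelope training cost, using C ~= 6 * N * D FLOPs."""
    total_flops = 6.0 * n_params * n_tokens
    sustained_flops = peak_flops_per_chip * utilization
    chip_hours = total_flops / sustained_flops / 3600.0
    return chip_hours, chip_hours * price_per_chip_hour

if __name__ == "__main__":
    # Illustrative assumptions for a PaLM-like run (540B params, ~780B tokens).
    chip_hours, cost = estimate_training_cost(
        n_params=540e9,
        n_tokens=780e9,
        peak_flops_per_chip=275e12,   # assumed bf16 peak of a TPU v4-class chip
        utilization=0.45,             # assumed sustained hardware utilization
        price_per_chip_hour=3.0,      # assumed on-demand price, USD
    )
    print(f"~{chip_hours / 1e6:.1f}M chip-hours, ~${cost / 1e6:.0f}M")
```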
To implement distributed deep learning training systems, software toolkits such as Distributed TensorFlow, Torch Distributed, and Horovod, and libraries such as DeepSpeed and Megatron, are often needed. There is implementation complexity here, so it requires systems expertise if you're going to be successful.

In addition, the following techniques and strategies are commonly employed to achieve parallelism:
Gradient accumulation

Gradient accumulation involves adding up gradients from multiple batches before performing one weight update step on all accumulated gradients at once.

This approach reduces communication overhead between GPUs by allowing them to work independently on their own local batch of data until they synchronize with each other again, after accumulating enough gradients for a single optimization step.
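Here is a minimal PyTorch sketch of the pattern: gradients from several micro-batches are accumulated locally and the optimizer steps only once per accumulation window. The model, data, and the choice of 8 accumulation steps are placeholders.

```python
import torch

model = torch.nn.Linear(512, 512)           # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

accumulation_steps = 8                      # effective batch = 8 micro-batches

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(16, 512)                # one micro-batch
    y = torch.randn(16, 512)

    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient
                                            # matches one big-batch gradient

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # single weight update per window
        optimizer.zero_grad()
```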
Asynchronous stochastic gradient descent optimization

Asynchronous stochastic gradient descent optimization methods can also be employed when performing model optimization over multiple GPUs.
This method uses small subsets (micro-batches) of data from each node instead of loading all data at once, which helps reduce memory requirements while still allowing for fast convergence rates due to its asynchronous nature. It works like this:

• First, we fetch the most up-to-date parameters of the model needed to process the current mini-batch from the parameter servers.
• We then compute gradients of the loss with respect to these parameters.
• Finally, these gradients are sent back to the parameter servers, which then update the model accordingly.
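The sketch below mimics that loop on a toy problem: a tiny in-process "parameter server" holds the weights behind a lock, and several worker threads repeatedly fetch the latest parameters, compute a gradient on their own shard of data, and push the update back without waiting for each other. It only illustrates the fetch/compute/push protocol, not a distributed implementation.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the model parameters; workers fetch and push updates asynchronously."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad                 # apply update as soon as it arrives

def worker(server, data, steps=200):
    x, y = data                                      # this worker's shard of the data
    for _ in range(steps):
        w = server.fetch()                           # 1) fetch latest parameters
        grad = 2 * x.T @ (x @ w - y) / len(y)        # 2) gradient of mean squared error
        server.push(grad)                            # 3) send the gradient back

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=8)
    server = ParameterServer(dim=8)

    threads = []
    for _ in range(4):                               # four asynchronous workers
        x = rng.normal(size=(64, 8))
        y = x @ true_w
        t = threading.Thread(target=worker, args=(server, (x, y)))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    print("parameter error:", np.linalg.norm(server.w - true_w))
```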
Micro-batching

Micro-batching combines small mini-batches into larger ones so that more batches can be processed in less time and with fewer synchronization points between devices during backpropagation operations. It has become increasingly popular for training very large models across many GPUs due to its ability to reduce memory consumption and improve scalability. Overall, micro-batching is an effective way to leverage distributed deep learning techniques when dealing with very large datasets or models that require significant amounts of processing power.
Now that we've gone through scaling, hardware, and some techniques for parallelizing your training runs, let's look at what your LLM will actually learn from: data.
DATASET COLLECTION

Bad data leads to bad models. But careful processing of high-quality, high-volume, diverse datasets directly contributes to model performance in downstream tasks as well as model convergence.

Dataset diversity is especially important for LLMs. That's because diversity improves the cross-domain knowledge of the model, as well as its downstream generalization capability. Training on diverse examples effectively broadens the ability of your LLM to perform well on myriad nuanced tasks.

A typical training dataset is comprised of textual data from diverse sources, such as crawled public data, online publication or book repositories, code data from GitHub, Wikipedia, news, social media conversations, etc.
For example, consider The Pile. The Pile is a popular text corpus created by EleutherAI for large-scale language modeling. It contains data from 22 data sources, coarsely broken down into five broad categories:

• Academic Writing: PubMed Abstracts and PubMed Central, arXiv, FreeLaw, USPTO Backgrounds, PhilPapers, NIH Exporter
• Online or Scraped Resources: CommonCrawl, OpenWebText2, Stack Exchange, Wikipedia
• Prose: BookCorpus2, Bibliotik, Project Gutenberg
• Dialog: YouTube subtitles, Ubuntu IRC, OpenSubtitles, Hacker News, Europarl
• Miscellaneous: GitHub, the DeepMind Mathematics dataset, Enron emails
Note that The Pile is one of the very few large-scale text datasets that is free for the public. For most of the existing models like GPT-3, PaLM, and Galactica, their training and evaluation datasets are not publicly available. Given the large-scale effort it takes to compile and pre-process these datasets for LLM training, most companies have kept them in-house to maintain a competitive advantage. That makes datasets like The Pile and a few datasets from AllenAI extremely valuable for public large-scale NLP research purposes.
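When you do assemble a multi-source corpus like this, a common pattern is to stream each source and mix them with explicit sampling weights rather than concatenating everything on disk. The sketch below uses the Hugging Face datasets library with local JSONL files as stand-in sources; the file paths, mixing probabilities, and the "text" field name are hypothetical placeholders.

```python
# Requires: pip install datasets
from datasets import load_dataset, interleave_datasets

# Hypothetical local shards standing in for different corpus sources.
web = load_dataset("json", data_files="data/web_crawl.jsonl", split="train", streaming=True)
code = load_dataset("json", data_files="data/github_code.jsonl", split="train", streaming=True)
books = load_dataset("json", data_files="data/books.jsonl", split="train", streaming=True)

# Mix the sources with explicit sampling weights to control dataset diversity.
mixed = interleave_datasets([web, code, books], probabilities=[0.6, 0.2, 0.2], seed=42)

for i, example in enumerate(mixed):
    print(example["text"][:80])   # assumes each JSON record has a "text" field
    if i >= 4:
        break
```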
Another thing worth mentioning is that, during dataset collection, general data can be collected by non-experts, but data for specific domains normally needs to be collected or reviewed by subject matter experts.