Current Best Practices for Training LLMs from Scratch
Authors: Rebecca Li, Andrea Parker, Justin Tenuto
Weights & Biases
Table of Contents
Introduction
Build vs. Buy Pre-trained LLM Models
The Scaling Laws
Hardware
Memory vs. Compute Efficiency
Techniques for Parallelization
Dataset Collection
Dataset Pre-processing
Dataset Handling
Tokenization
Pre-training Steps
Model Evaluation
Bias and Toxicity
Instruction Tuning
Reinforcement Learning through Human Feedback (RLHF)
Conclusion
References
Appendix
LLM Overview
Transformer Model Architecture
The Original LLM Scaling Laws
Introduction

Although we're only a few years removed from the transformer breakthrough, LLMs have already grown massively in performance, cost, and promise. At W&B, we've been fortunate to see more teams try to build LLMs than anyone else. But many of the critical details and key decision points are often passed down by word of mouth.

The goal of this whitepaper is to distill the best practices for training your own LLM from scratch. We'll cover everything from scaling and hardware to dataset selection and model training, letting you know which tradeoffs to consider and flagging some potential pitfalls along the way. This is meant to be a fairly exhaustive look at the key steps and considerations you'll make when training an LLM from scratch.

The first question you should ask yourself is whether training one from scratch is right for your organization. As such, we'll start there:
BUILD VS. BUY PRE-TRAINED LLM MODELS

Before starting LLM pre-training, the first question you need to ask is whether you should pre-train an LLM by yourself or use an existing one. There are three basic approaches:

• Option 1: Use the API of a commercial LLM, e.g. GPT-3 (OpenAI, 2020), Cohere APIs, AI21 J-1
• Option 2: Use an existing open-sourced LLM, e.g. GPT-J (EleutherAI, 2021), GPT-NeoX (EleutherAI, 2022), Galactica (Meta AI), UL2 (Google, 2022), OPT (Meta AI, 2022), BLOOM (BigScience, 2022), Megatron-LM (NVIDIA, 2021), CodeGen (Salesforce, 2022)
• Option 3: Pre-train an LLM by yourself or with consultants: you can either manage your own training or hire LLM consultants & platforms. For example, MosaicML provides training services focusing on LLMs.

That said, there are a lot of details to consider when making your choice. Here are the pros, cons, and applicable scenarios for each option:
Pros

Option 1: Use the API of a commercial LLM
• Requires the least LLM training technical skills.
• Minimum upfront training/exploration cost, given the main cost is incurred at inference time.
• The least data-demanding option. Only a few examples (or no examples) are needed for models to perform inference.
• Can leverage the best-performing LLMs in the market and build a superior experience.
• Reduces time-to-market of your apps and de-risks your project with a working LLM model.

Option 2: Use an existing open-sourced LLM
• A good way to leverage what LLMs have learned from a vast amount of internet data and build on top of it without paying for the IP at inference.
• Compared to option one, you are less dependent on the future direction of LLM service providers and thus have more control regarding roadmap & backwards compatibility.
• Compared to option three, you have a much faster time-to-value given you are not building LLMs from scratch, also leading to less data, training time, and training budget needed.

Option 3: Pre-train an LLM by yourself or with consultants
• Compared to options one and two, you have the most control of your LLM's performance and future direction, giving you lots of flexibility to innovate on techniques and/or customize to your downstream tasks.
• Gain full control of the training datasets used for pre-training, which directly impacts model quality, bias, and toxicity issues. In comparison, those issues are less controllable in option one or two.
• Training your own LLM also gives you a deep moat: superior LLM performance either across horizontal use cases or tailored to your vertical, allowing you to build a sustaining advantage, especially if you create a positive data/feedback loop with LLM deployments.
Cons

Option 1: Use the API of a commercial LLM
• Commercial LLM services can get expensive with a high volume of fine-tuning or inference tasks. It comes down to LLM total cost of ownership (TCO) amortized to each inference.
• Many industries/use cases forbid the use of commercial LLM services, as sensitive data/PII data cannot be seen by the service for compliance reasons (healthcare use cases, for example).
• If building external apps, you'll need to find other moats and de-risk your business if you're highly reliant on external LLM service technology.
• Less flexible downstream: doesn't support edge inference, limited ability to customize the model (fine-tuning gets expensive), limited ability for ongoing model improvements.

Option 2: Use an existing open-sourced LLM
• Not as demanding as building your own, but still requires lots of domain expert skills to train, fine-tune, and host an open-sourced LLM. LLM reproducibility is still a significant issue, so the amount of time and work needed cannot be underestimated.
• Slower time-to-market and less agile if you are building downstream apps, due to a more vertical tech stack.
• Open-sourced models typically lag commercial models in performance by months or years. If your competitor leverages commercial models, they have an advantage on LLM tech and you'll need to find other competitive advantages.

Option 3: Pre-train an LLM by yourself or with consultants
• Very expensive endeavor with high risks. Needs cross-domain knowledge spanning NLP/ML, subject matter expertise, and software and hardware expertise. If not done well, you could end up having spent thousands or even millions of dollars on a suboptimal model. Mistakes, especially late in the training stages, are hard to fix or unwind.
• Less efficient than option two. Option two leverages existing LLMs that have learned from an entire internet's worth of data, which can provide a solid starting point. With option three, you start from scratch and need lots of high-quality, diverse datasets for your model to gain generalized capabilities.
When to consider each option

Option 1: Use the API of a commercial LLM
• Best if you either have less technical teams but want to leverage LLM techniques to build downstream apps, or you want to leverage the best-in-class LLMs for performance reasons (outsourcing the LLM tech).
• Good if you have very limited training datasets and want to leverage an LLM's capability to do zero/few-shot learning.
• Good for prototyping apps and exploring what is possible with LLMs.

Option 2: Use an existing open-sourced LLM
• Between options two and three, if you aren't trying to change the model architecture, it is almost always better to either directly take an existing pre-trained LLM and fine-tune it, or take the weights of an existing pre-trained LLM as a starting point and continue pre-training. The reason is that a good pre-trained LLM like GPT-NeoX has already seen a vast amount of data and thus has learned general capabilities from that data. You can leverage that learning, especially if your training dataset is not huge or diverse.
• Another typical scenario is that you operate in a regulated environment or have user/sensitive data that cannot be fed to commercial LLM services. Or you need edge deployment of the model for latency or locational reasons.

Option 3: Pre-train an LLM by yourself or with consultants
• Best if you need to change the model architecture or training dataset from existing pre-trained LLMs. For example, if you want to use a different tokenizer, change the vocabulary size, or change the number of hidden dimensions, attention heads, or layers.
• Typically, in this case the LLM is a core part of your business strategy and technological moat. You are taking on some or a lot of innovations in LLM training, and have a large investment appetite to train and maintain expensive models on an ongoing basis.
• Typically, you have or will have lots of proprietary data associated with your LLM to create a continuous model improvement loop for sustainable competitive advantage.
It is also worth mentioning that if you only have a very targeted set of use cases and don't need the general-purpose or generative capabilities of LLMs, you might want to consider training or fine-tuning a much smaller transformer or another, much simpler deep learning model. That could result in much less complexity, less training time, and lower ongoing costs.
THE SCALING LAWS

Before you dive into training, it's important to cover how LLMs scale. Understanding scaling lets you effectively balance the size and complexity of your model against the size of the data you'll use to train it.

Some relevant history here: OpenAI originally introduced "the LLM scaling laws" in 2020. They suggested that increasing model size was more important than scaling data size. This held for about two years before DeepMind suggested almost the polar opposite: that previous models were significantly undertrained and that increasing your foundational training datasets actually leads to better performance.

That changed in 2022. Specifically, DeepMind put forward an alternative approach in their Training Compute-Optimal Large Language Models paper. They found that current LLMs are actually significantly undertrained. Put simply: these large models weren't trained on nearly enough data.

DeepMind showcased this with a model called Chinchilla, which is a fourth the size of the Gopher model but trained on 4.6x more data. At that reduced size but with far more training data, Chinchilla outperformed Gopher and other LLMs.

DeepMind claims that model size and the number of training tokens* should instead increase at roughly the same rate to achieve optimal performance. If you get a 10x increase in compute, you should make your model about 3.1x bigger and the data you train on about 3.1x bigger; if you get a 100x increase in compute, you should make your model 10x bigger and your data 10x bigger.
*Note: Tokenization in NLP is an essential step of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. The number of training tokens is the size of the training data in token form after tokenization. We will dive into detailed tokenization methods a little later.
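As a quick illustration of what "training tokens" means in practice, the snippet below counts tokens with an off-the-shelf subword tokenizer. It assumes the Hugging Face transformers package and uses the GPT-2 tokenizer purely as an example; choosing your own tokenizer is covered later in the tokenization section.

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer, used here only as an illustrative example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization splits text into smaller units called tokens."
token_ids = tokenizer.encode(text)

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(tokenizer.convert_ids_to_tokens(token_ids))

# The "number of training tokens" for a corpus is simply this count
# summed over every document in the training set.
```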
DeepMind provides the following chart showing how much training data and compute you'd need to optimally train models of various sizes.

[Figure: Estimated optimal training FLOPs and training tokens for various model sizes, from Training Compute-Optimal Large Language Models]

That said, most existing LLMs are still undertrained:

[Figure: Data/compute-optimal (Chinchilla) heatmap, from "Chinchilla data-optimal scaling laws: In plain English"]
In summary, the current best practices in choosing the size of your LLM models are largely based on two rules:

• Decide on your dataset and find the Chinchilla-optimal model size based on data size (or close to Chinchilla-optimal, within the boundary of your data collection limitations)
• Determine the data and model size combination that's best for your model, based on your training compute budget and inference latency requirements

[Figure: To the left of the minima on each curve, models are too small -- a larger model trained on less data would be an improvement. To the right of the minima on each curve, models are too large -- a smaller model trained on more data would be an improvement. The best models are at the minima.]
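To make these rules concrete, here is a minimal sketch of the arithmetic behind them. It assumes the common approximations C ≈ 6·N·D for training FLOPs and roughly 20 tokens per parameter at the Chinchilla-optimal point; both are rules of thumb drawn from the Chinchilla results, not exact prescriptions.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough Chinchilla-style split of a compute budget.

    Uses the approximation C ~= 6 * N * D (training FLOPs for N parameters
    over D tokens) plus the rule of thumb D ~= 20 * N at the optimum, so
    N ~= sqrt(C / (6 * 20)) and D ~= 20 * N.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):
        n, d = chinchilla_optimal(budget)
        print(f"C = {budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
    # A 10x compute increase scales both N and D by sqrt(10) ~= 3.16x,
    # matching the "3.1x bigger model, 3.1x more data" guidance above.
```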
HARDWARE

It should come as no surprise that pre-training LLMs is a hardware-intensive effort. The following examples of current models are a good guide here:

• PaLM (540B, Google): 6144 TPU v4 chips used in total, made of two TPU v4 Pods connected over data center network (DCN), using a combination of model and data parallelism
• OPT (175B, Meta AI): 992 80GB A100 GPUs, utilizing fully sharded data parallelism with Megatron-LM tensor parallelism
• GPT-NeoX (20B, EleutherAI): 96 40GB A100 GPUs in total
• Megatron-Turing NLG (530B, NVIDIA & MSFT): 560 DGX A100 nodes, each cluster node having 8 NVIDIA 80GB A100 GPUs

Training LLMs is challenging from an infrastructure perspective for two big reasons. For starters, it is simply no longer possible to fit all the model parameters in the memory of even the largest GPU (e.g. the NVIDIA 80GB A100), so you'll need some parallel architecture here. The other challenge is that the large number of compute operations can result in unrealistically long training times if you aren't concurrently optimizing your algorithms, software, and hardware stack (e.g. training GPT-3 with 175B parameters would require about 288 years with a single V100 NVIDIA GPU).
Memory vs. Compute Efficiency

To achieve the full potential of thousands of distributed GPUs, it is crucial to design parallelism into your architecture to balance memory and compute efficiency.

Memory efficiency

Training an LLM requires terabytes of aggregate memory for model weights, gradients, and optimizer states - far beyond what is available on a single GPU. One typical mitigation strategy is gradient accumulation, in which the full training batch is split into micro-batches that are processed in sequence, with their resulting gradients accumulated before updating the model weights. That means your training batch size can scale without increasing the peak resident activation memory.

Compute efficiency

While large GPU clusters can have thousands of high-throughput GPUs, achieving high compute efficiency at this scale is challenging. A large batch size can be an effective way to increase compute efficiency, because it increases the arithmetic intensity of a GPU kernel and helps amortize the time spent stalled on communication and synchronization. However, using too large of a batch size can have negative effects on model quality.

While parallelization is paramount, there are many different ways to do it. We'll get into the most common in our next section.

Techniques for Parallelization

Parallelization refers to splitting up tasks and distributing them across multiple processors or devices, such as GPUs, so that they can be completed simultaneously. This allows for more efficient use of compute resources and faster completion times compared to running on a single processor or device. Parallelized training across multiple GPUs is an effective way to reduce the overall time needed for the training process.

There are several different strategies that can be used to parallelize training, including gradient accumulation, micro-batching, data parallelization, tensor parallelization, pipeline parallelization, and more. Typical LLM pre-training employs a combination of these methods. Let's define each:

Data Parallelism

Data parallelism is the best and most common approach for dealing with large datasets that cannot fit into a single machine in a deep learning workflow.

More specifically, data parallelism divides the training data into multiple shards (partitions) and distributes them to various nodes. Each node first works with its local data to train its sub-model, and then communicates with the other nodes to combine their results at certain intervals in order to obtain the global model. The parameter updates for data parallelism can be either asynchronous or synchronous.

The advantage of this method is that it increases compute efficiency and that it is relatively easy to implement. The biggest downside is that during the backward pass you have to pass the whole gradient to all other GPUs. It also replicates the model and optimizer across all workers, which is rather memory inefficient.
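As a concrete reference point, here is a minimal sketch of synchronous data parallelism using PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are placeholders; the point is the pattern of wrapping the model and sharding the data so each worker sees a different partition while gradients are averaged across workers.

```python
# Minimal synchronous data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")       # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; a real run would use a transformer and a text dataset.
    model = torch.nn.Linear(512, 512).cuda()
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    sampler = DistributedSampler(dataset)         # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                  # reshuffle shards each epoch
        for x, y in loader:
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()                       # DDP all-reduces gradients here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```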
Tensor Parallelism

Tensor parallelism divides large matrix multiplications into smaller submatrix calculations, which are then executed simultaneously using multiple GPUs.

This allows for faster training times due to its asynchronous nature and the ability to reduce communication overhead between nodes. The benefit of this method is that it is memory-efficient. The downside, however, is that it introduces additional communication of activations in each forward and backward propagation, and therefore requires high communication bandwidth to be efficient.
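To make the idea concrete, here is a small self-contained sketch of the underlying math: a single linear layer's weight matrix is split column-wise across two "devices" (here just two tensors on CPU), the partial results are computed independently, and concatenating them reproduces the full output. A real implementation such as Megatron-LM does the same thing across GPUs with collective communication.

```python
import torch

torch.manual_seed(0)

batch, d_in, d_out = 4, 8, 6
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)          # full weight matrix of one linear layer

# Column-parallel split: each "device" owns half of the output columns.
w_shard_0, w_shard_1 = w.chunk(2, dim=1)

# Each shard computes its partial output independently (in parallel on real GPUs).
y_shard_0 = x @ w_shard_0             # shape: (batch, d_out // 2)
y_shard_1 = x @ w_shard_1

# Gathering (concatenating) the shards reproduces the full result.
y_parallel = torch.cat([y_shard_0, y_shard_1], dim=1)
y_reference = x @ w

print(torch.allclose(y_parallel, y_reference))   # True
```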
Pipeline parallelism and model parallelism

Pipeline parallelism improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel.

This helps significantly with overall throughput while adding the smallest communication overhead. You can think of pipeline parallelism as "inter-layer parallelism" (where tensor parallelism can be thought of as "intra-layer parallelism").

Similar to pipeline parallelism, model parallelism is when you split the model among GPUs and use the same data for each model; each GPU works on a part of the model rather than a part of the data. The downside of pipeline and model parallelism is that it cannot scale infinitely, given that the degree of pipeline parallelism is bounded by the depth of the model.
As mentioned at the start of this section, it's not uncommon for teams to leverage a combination of parallelism techniques during training. For example, PaLM (Google Brain, 2022) and OPT (Meta AI, 2022) both used a combination of tensor model parallelism and data parallelism.

NVIDIA approached things a little differently in their Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM paper. They proposed a PTD-P technique that combines pipeline, tensor, and data parallelism to achieve state-of-the-art computational performance (52% of peak device throughput) on thousands of GPUs.

Specifically, PTD-P leverages a combination of pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism to practically train models with a trillion parameters. The method also employs graceful scaling in an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers.

Using these techniques to train LLMs requires not only the highest-performing GPUs to be efficient, but also high-bandwidth networking for optimal communication -- InfiniBand is often used to move data between nodes.
But this of course comes with a cost. Leveraging thousands of high-performing GPUs and high-bandwidth networks to train LLMs is infrastructure-intensive. For example, a back-of-the-envelope calculation estimated that the cost of the PaLM model (540B, Google) might be as high as $23MM (see the detailed analysis).
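For readers who want to sanity-check numbers like this, here is a hedged sketch of how such an estimate is typically built: total training FLOPs from the 6·N·D approximation, divided by sustained accelerator throughput, multiplied by an assumed hourly price. The token count, throughput, utilization, and price below are illustrative assumptions, not figures from the linked analysis.

```python
def estimate_training_cost(n_params, n_tokens, peak_flops_per_chip,
                           utilization, price_per_chip_hour):
    """Back-of-the-envelope training cost, using C ~= 6 * N * D FLOPs."""
    total_flops = 6.0 * n_params * n_tokens
    sustained_flops = peak_flops_per_chip * utilization
    chip_hours = total_flops / sustained_flops / 3600.0
    return chip_hours, chip_hours * price_per_chip_hour

if __name__ == "__main__":
    # Illustrative assumptions for a PaLM-like run (540B params, ~780B tokens).
    chip_hours, cost = estimate_training_cost(
        n_params=540e9,
        n_tokens=780e9,
        peak_flops_per_chip=275e12,   # assumed bf16 peak of a TPU v4-class chip
        utilization=0.45,             # assumed sustained hardware utilization
        price_per_chip_hour=3.0,      # assumed on-demand price, USD
    )
    print(f"~{chip_hours / 1e6:.1f}M chip-hours, ~${cost / 1e6:.0f}M")
```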
To implement distributed deep learning training systems, software toolkits such as Distributed TensorFlow, Torch Distributed, and Horovod, and libraries such as DeepSpeed and Megatron, are often needed. There is implementation complexity here, so it requires systems expertise if you're going to be successful.

In addition, the following techniques and strategies are commonly employed to achieve parallelism:
Gradient accumulation

Gradient accumulation involves adding up gradients from multiple batches before performing one weight update step on all accumulated gradients at once.

This approach reduces communication overhead between GPUs by allowing them to work independently on their own local batch of data until they synchronize with each other again, after accumulating enough gradients for a single optimization step.
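Here is a minimal PyTorch sketch of the pattern: gradients from several micro-batches are accumulated locally and the optimizer steps only once per accumulation window. The model, data, and the choice of 8 accumulation steps are placeholders.

```python
import torch

model = torch.nn.Linear(512, 512)           # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

accumulation_steps = 8                      # effective batch = 8 micro-batches

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(16, 512)                # one micro-batch
    y = torch.randn(16, 512)

    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient
                                            # matches one big-batch gradient

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # single weight update per window
        optimizer.zero_grad()
```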
Asynchronous stochastic gradient descent optimization

Asynchronous stochastic gradient descent optimization methods can also be employed when performing model optimization over multiple GPUs.
This method uses small subsets (micro-batches) of data from each node instead of loading all data at once, which helps reduce memory requirements while still allowing for fast convergence rates due to its asynchronous nature. It works like this:

• First, we fetch the most up-to-date parameters of the model needed to process the current mini-batch from the parameter servers.
• We then compute gradients of the loss with respect to these parameters.
• Finally, these gradients are sent back to the parameter servers, which then update the model accordingly.
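The sketch below mimics that loop on a toy problem: a tiny in-process "parameter server" holds the weights behind a lock, and several worker threads repeatedly fetch the latest parameters, compute a gradient on their own shard of data, and push the update back without waiting for each other. It only illustrates the fetch/compute/push protocol, not a distributed implementation.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the model parameters; workers fetch and push updates asynchronously."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad                 # apply update as soon as it arrives

def worker(server, data, steps=200):
    x, y = data                                      # this worker's shard of the data
    for _ in range(steps):
        w = server.fetch()                           # 1) fetch latest parameters
        grad = 2 * x.T @ (x @ w - y) / len(y)        # 2) gradient of mean squared error
        server.push(grad)                            # 3) send the gradient back

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=8)
    server = ParameterServer(dim=8)

    threads = []
    for _ in range(4):                               # four asynchronous workers
        x = rng.normal(size=(64, 8))
        y = x @ true_w
        t = threading.Thread(target=worker, args=(server, (x, y)))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    print("parameter error:", np.linalg.norm(server.w - true_w))
```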
Micro-batching

Micro-batching combines small mini-batches into larger ones so that more batches can be processed in less time and with fewer synchronization points between devices during backpropagation operations. It has become increasingly popular for training very large models across many GPUs due to its ability to reduce memory consumption and improve scalability. Overall, micro-batching is an effective way to leverage distributed deep learning techniques when dealing with very large datasets or models that require significant amounts of processing power.
Now that we've gone through scaling, hardware, and some techniques for parallelizing your training runs, let's look at what your LLM will actually learn from: data.
DATASET COLLECTION

Bad data leads to bad models. But careful processing of high-quality, high-volume, diverse datasets directly contributes to model performance in downstream tasks as well as model convergence.

Dataset diversity is especially important for LLMs. That's because diversity improves the cross-domain knowledge of the model, as well as its downstream generalization capability. Training on diverse examples effectively broadens the ability of your LLM to perform well on myriad nuanced tasks.

A typical training dataset is comprised of textual data from diverse sources, such as crawled public data, online publication or book repositories, code data from GitHub, Wikipedia, news, social media conversations, etc.
For example, consider The Pile. The Pile is a popular text corpus created by EleutherAI for large-scale language modeling. It contains data from 22 data sources, coarsely broken down into five broad categories:

• Academic Writing: PubMed Abstracts and PubMed Central, arXiv, FreeLaw, USPTO Backgrounds, PhilPapers, NIH Exporter
• Online or Scraped Resources: CommonCrawl, OpenWebText2, Stack Exchange, Wikipedia
• Prose: BookCorpus2, Bibliotik, Project Gutenberg
• Dialog: YouTube subtitles, Ubuntu IRC, OpenSubtitles, Hacker News, Europarl
• Miscellaneous: GitHub, the DeepMind Mathematics dataset, Enron emails
Note that The Pile is one of the very few large-scale text datasets that is free for the public. For most of the existing models like GPT-3, PaLM, and Galactica, their training and evaluation datasets are not publicly available. Given the large-scale effort it takes to compile and pre-process these datasets for LLM training, most companies have kept them in-house to maintain a competitive advantage. That makes datasets like The Pile and a few datasets from AllenAI extremely valuable for public large-scale NLP research purposes.
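When you do assemble a multi-source corpus like this, a common pattern is to stream each source and mix them with explicit sampling weights rather than concatenating everything on disk. The sketch below uses the Hugging Face datasets library with local JSONL files as stand-in sources; the file paths, mixing probabilities, and the "text" field name are hypothetical placeholders.

```python
# Requires: pip install datasets
from datasets import load_dataset, interleave_datasets

# Hypothetical local shards standing in for different corpus sources.
web = load_dataset("json", data_files="data/web_crawl.jsonl", split="train", streaming=True)
code = load_dataset("json", data_files="data/github_code.jsonl", split="train", streaming=True)
books = load_dataset("json", data_files="data/books.jsonl", split="train", streaming=True)

# Mix the sources with explicit sampling weights to control dataset diversity.
mixed = interleave_datasets([web, code, books], probabilities=[0.6, 0.2, 0.2], seed=42)

for i, example in enumerate(mixed):
    print(example["text"][:80])   # assumes each JSON record has a "text" field
    if i >= 4:
        break
```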
Another thing worth mentioning is that, during dataset collection, general data can be collected by non-experts, but data for specific domains normally needs to be collected or reviewed by subject matter experts.