BloombergGPT: A Large Language Model for Finance

Shijie Wu1,*, Ozan İrsoy1,*, Steven Lu1,*, Vadim Dabravolski1, Mark Dredze1,2, Sebastian Gehrmann1, Prabhanjan Kambadur1, David Rosenberg1, Gideon Mann1

1 Bloomberg, New York, NY USA
2 Computer Science, Johns Hopkins University, Baltimore, MD USA
* Co-first authors.

Abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.

Contents

1 Introduction
1.1 BloombergGPT
1.2 Broader Contributions
2 Dataset
2.1 Financial Datasets (363B tokens – 54.2% of training)
2.1.1 Web (298B tokens – 42.01% of training)
2.1.2 News (38B tokens – 5.31% of training)
2.1.3 Filings (14B tokens – 2.04% of training)
2.1.4 Press (9B tokens – 1.21% of training)
2.1.5 Bloomberg (5B tokens – 0.70% of training)
2.2 Public Datasets (345B tokens – 48.73% of training)
2.2.1 The Pile (184B tokens – 25.9% of training)
2.2.2 C4 (138B tokens – 19.48% of training)
2.2.3 Wikipedia (24B tokens – 3.35% of training)
2.3 Tokenization
3 Model
3.1 Architecture
3.2 Model Scaling
3.3 Training Configuration
3.4 Large-scale Optimization
4 Training Run
5 Evaluation
5.1 Few-shot Methodology
5.2 Heldout Loss
5.3 Financial Tasks
5.3.1 External Financial Tasks
5.3.2 Internal Task: Sentiment Analysis
5.3.3 Exploratory Task: NER
5.4 BIG-bench Hard
5.5 Knowledge Assessments
5.6 Reading Comprehension
5.7 Linguistic Tasks
5.8 Summary
6 Qualitative Samples
7 Related Work
8 Ethics, Limitations, and Implications
8.1 Ethical Use
9 Conclusion
A Architecture
A.0 Notation
A.1 Full Architecture
A.2 Self-Attention with ALiBi (SA)
A.3 LayerNorm (LN)
A.4 Feed Forward Network (FFN)
A.5 List of All Trainable Parameters
B Details on external financial tasks

1. Introduction

The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs). GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. This ability improves well above random as we increase the size of language models. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.

After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data- and parameter-efficient training sizes (Hoffmann et al., 2022).

These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles (Gao et al., 2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general-purpose LLMs on tasks within those domains, such as science (Taylor et al., 2022) and medicine (Bolton et al., 2023; Luo et al., 2022; Lehman et al., 2023). These findings motivate further development of models focused on specific domains.

Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role (Xing et al., 2018; Fisher et al., 2016; Dredze et al., 2016). Financial NLP tasks (Shah et al., 2022) include sentiment analysis (Araci, 2019), named entity recognition (Salinas Alvarado et al., 2015), news classification (Sinha and Khandait, 2020), and question answering (Chen et al., 2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general – few-shot learning, text generation, conversational systems, etc. – it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain (Araci, 2019), no LLM has been tuned for or evaluated on tasks for this domain.
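To make the few-shot setting concrete: for a financial sentiment task, a k-shot prompt simply concatenates k labeled examples followed by an unlabeled query, and the model is asked to continue the text with a label. The sketch below assembles such a prompt in Python; the headlines, labels, instruction wording, and the commented-out generate() call are hypothetical placeholders, not the prompts or API used in this work (see §5.1 for the actual few-shot methodology).

```python
# Minimal sketch of few-shot prompting for financial sentiment analysis.
# The labeled examples and the generate() call are hypothetical placeholders;
# they only illustrate how a k-shot prompt is assembled.

FEW_SHOT_EXAMPLES = [
    ("Shares of Acme Corp jumped 8% after it raised full-year guidance.", "positive"),
    ("The bank reported a surprise quarterly loss and suspended its dividend.", "negative"),
    ("The company said its annual meeting will be held on May 12.", "neutral"),
]

def build_prompt(query: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Concatenate k labeled demonstrations followed by the unlabeled query."""
    parts = ["Classify the sentiment of each headline as positive, negative, or neutral.\n"]
    for text, label in examples:
        parts.append(f"Headline: {text}\nSentiment: {label}\n")
    parts.append(f"Headline: {query}\nSentiment:")  # the model completes the label
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt("Regulators fined the brokerage $50 million over disclosure failures.")
    print(prompt)
    # A deployed LLM would then be asked to continue the prompt, e.g.:
    # completion = model.generate(prompt, max_new_tokens=1)  # hypothetical API
```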
1.1 BloombergGPT

We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across diverse tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.

We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022). We validate the model on standard LLM benchmarks, open financial benchmarks, and a suite of Bloomberg-internal benchmarks that most accurately reflect our intended use cases. Our results demonstrate that our mixed training approach leads to a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.

1.2 Broader Contributions

Beyond the construction of an LLM for financial data, our goal is to contribute to the broader research community. Specifically, our experience documented in this paper provides evidence that further develops the community's understanding of several open questions in the literature.

Domain-specific LLMs. The few existing domain-specific LLMs are trained exclusively on domain-specific data sources (Luo et al., 2022; Bolton et al., 2023; Taylor et al., 2022), or adapt a very large general-purpose model to domain-specific tasks (Singhal et al., 2022; Lewkowycz et al., 2022). Our alternative approach – training an LLM on both domain-specific and general data sources – has not been studied so far. The resulting model does very well on domain-specific tasks, but also maintains strong performance on general-purpose benchmarks.

Training data. Nearly all language models rely in large part on web-scraped data, such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2021) (which includes OpenWebText2). This data may be cleaned or subsetted in various ways before use (Touvron et al., 2023; Rae et al., 2020; Scao et al., 2022; Jernite et al., 2022), but issues of data duplication (Carlini et al., 2020) and toxic language remain (Welbl et al., 2021). Our training data is unusual for LLM training in that it includes a significant amount of curated and prepared data from reliable sources.

Evaluation. LLM evaluation remains a challenging and evolving problem (Gehrmann et al., 2022; Goyal et al., 2022), with new benchmarks trying to standardize evaluation across models (Liang et al., 2022; Srivastava et al., 2022). However, for domain-specific tasks, there remains a mismatch between evaluation and actual use cases. Evaluations are built on available datasets and not necessarily on how the model will be used in practice. We provide results on both public financial NLP benchmarks (Shah et al., 2022; Chen et al., 2021b) as well as a selection of internal Bloomberg tasks, which are better aligned with our intended use cases and directly evaluate our model's ability to perform tasks of interest.

Model Size. Early LLMs made a single training pass over a corpus of 200-400 billion tokens (Brown et al., 2020), and Hoffmann et al. (2022) posited that models were undertrained, instead focusing on training smaller models with more data, a strategy most recently employed by Touvron et al. (2023). We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.
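For intuition about this sizing choice, the sketch below applies two common rules of thumb: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and the Hoffmann et al. (2022) heuristic that a compute-optimal model sees on the order of 20 tokens per parameter. These are back-of-the-envelope approximations only and do not reproduce the scaling analysis in §3.2.

```python
# Back-of-the-envelope check of the parameter/token trade-off, assuming the
# common approximations C ~= 6*N*D training FLOPs and D ~= 20*N tokens for a
# compute-optimal ("Chinchilla-style") model. Illustrative only.

N = 50e9    # parameters (BloombergGPT: 50B)
D = 569e9   # training tokens reported above

flops = 6 * N * D
tokens_per_param = D / N
chinchilla_tokens = 20 * N   # rough compute-optimal token budget for N params

print(f"approx. training compute:   {flops:.2e} FLOPs")       # ~1.7e23
print(f"tokens per parameter:       {tokens_per_param:.1f}")   # ~11.4
print(f"~20 tokens/param heuristic: {chinchilla_tokens:.2e} tokens")  # ~1.0e12
```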
Tokenizer. After assembling training data, the critical step of tokenization transforms the text into a format suitable for the language model. The importance of this step is often overlooked (Mielke et al., 2021), and many older LLMs use the same tokenizer and vocabulary, meaning that we have little evidence to support other tokenizers. We take a different approach and use a Unigram model instead of greedy merge-based sub-word tokenizers, since it saves probabilities, allowing for smarter tokenization at inference time (Kudo, 2018).

Model Building Challenges. GPT-3 and subsequent models were the work of large teams and required an enormous amount of computation. Initial work to reproduce these results, such as OPT (Zhang et al., 2022a), did not match the performance of the original model. With the release of each subsequent model, the community's understanding, experience, and software tools increase. In developing BloombergGPT, we benefited from existing code developed as part of the BLOOM effort (Scao et al., 2022), showing that a moderately sized team can produce a competitive model on domain-specific data. We describe our experiences training BloombergGPT in detail to support future training efforts and address each of the above topics.

2. Dataset

To train BloombergGPT, we construct "FinPile", a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purpose text. For a breakdown of the full training set, see Table 1. To improve data quality, we de-duplicate each dataset (The Pile, C4, Wikipedia, FinPile) according to Lee et al. (2022a); as a side effect, the statistics reported in Table 1 might be different from those reported in other papers.

[Table 1: Breakdown of the full training set used to train BloombergGPT, covering FinPile (Web, News, Filings, Press, Bloomberg) and the public datasets (the components of The Pile, C4, and Wikipedia). The statistics provided are the average number of characters per document ("C/D"), the average number of characters per token ("C/T"), and the percentage of the overall tokens ("T%"). Units for each column are denoted in the header.]
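The de-duplication of Lee et al. (2022a) combines exact-substring matching (via suffix arrays) with MinHash-based near-duplicate detection. As a deliberately simplified stand-in, the sketch below flags near-duplicate documents by word n-gram Jaccard similarity; the n-gram size and threshold are arbitrary choices for illustration, and this approach does not scale to corpora of the size used here.

```python
# Simplified illustration of document de-duplication via n-gram Jaccard
# similarity. Lee et al. (2022a) use far more scalable machinery (suffix-array
# exact-substring matching and MinHash LSH); this sketch only conveys the idea.
# The 5-gram size and 0.8 threshold are arbitrary choices for the example.

import re
from typing import List, Set, Tuple

def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_ngrams = [], []
    for doc in docs:
        grams = ngrams(doc)
        if all(jaccard(grams, seen) < threshold for seen in kept_ngrams):
            kept.append(doc)
            kept_ngrams.append(grams)
    return kept

if __name__ == "__main__":
    corpus = [
        "The central bank held rates steady and signaled patience on cuts.",
        "The central bank held rates steady and signaled patience on cuts today.",
        "Quarterly filings show revenue growth driven by the cloud segment.",
    ]
    print(len(deduplicate(corpus)), "documents kept")  # expect 2
```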
2.1 Financial Datasets (363B tokens – 54.2% of training)

The Bloomberg Terminal has provided access to a comprehensive set of diverse structured and unstructured financial data and analytics for the past four decades. In serving this mission, Bloomberg analysts have curated a set of financial documents that were either created internally or acquired from external sources. We utilize this extensive collection of curated and maintained documents to create FinPile, which consists of company filings, financial news, and other data relevant to the financial markets.

Some documents included in FinPile, such as company filings, are available to the general public, although collecting these documents and pre-processing them for LLM training is a non-trivial task. Other documents, such as (a subset of) Bloomberg news, must be purchased. The rest of the documents are private and available, among other sources, through the Bloomberg Terminal. Finally, we clean this data to strip off markup, special formatting, and templates.

Note that each document in FinPile is time-stamped, with dates ranging from 2007-03-01 to 2022-07-31; the quality and quantity of documents increase over this time range. While we do not utilize date information in this work, we plan to use it in the future, such as for evaluation of what the model learns about different time periods. While we cannot release FinPile, our experience training on a large, carefully curated, and clean domain-specific dataset may provide helpful insights to the community on the advantages and challenges of building a financial LLM in particular, and a domain-specific model in general. We provide a breakdown and analysis of FinPile in Table 2 and a brief description of the types of data included below.

2.1.1 Web (298B tokens – 42.01% of training)

Bloomberg collects web content by identifying sites that contain financially relevant information. While this category makes up the majority of FinPile, its classifications are rough, with content classified mainly by the location of the web domain. Within these location-specific sources, e.g., "US" (15.95% of total), "Asia-Pac" (4.72% of total), and "UK" (1.98% of total), document types are highly varied, as would be expected from a web crawl. While web sources are common in existing public LLM training datasets, Bloomberg's web crawl is focused on high-quality websites that have financially relevant information, as opposed to a general-purpose crawl of the web.

2.1.2 News (38B tokens – 5.31% of training)

The News category includes all news sources excluding news articles written by Bloomberg journalists. Overall, there are hundreds of English news sources in FinPile, including "Bloomberg Transcripts" (0.41% of total), which are transcripts of Bloomberg TV news. Generally, the content in this dataset comes from reputable sources of news that are relevant to the financial community, so as to maintain factuality and reduce bias.

2.1.3 Filings (14B tokens – 2.04% of training)

Company Filings are financial statements prepared by (public) companies and made available to the general public. In some countries, like the US, public companies are mandated to prepare and submit their financial statements on a regular cadence; e.g., 10-K annual reports and 10-Q quarterly reports. In our dataset, a majority of the filings come from EDGAR, which is the SEC's online database (1.90% of total). Filings are typically long PDF documents with tables and charts that are dense in financial information, which are processed and normalized in Bloomberg. Filings are substantially different from the types of documents typically used to train LLMs, but contain critically important information for financial decision-making.

[Table 2: The number of tokens (in millions) contained within documents in FinPile, organized by year (rows) and type (columns): Bloomberg, Filings, News, Press, Web, and Total. Units are millions of tokens.]

2.1.4 Press (9B tokens – 1.21% of training)

This category contains press releases typically issued by companies that are financially relevant. Taken together with filings, press releases represent most of the public communications of a company. However, unlike filings, press releases are similar to news stories in terms of content and style.

2.1.5 Bloomberg (5B tokens – 0.70% of training)

This category comprises Bloomberg-authored news and other documents, such as opinions and analyses. The largest sources are "Bloomberg News" (0.44% of total) and "Bloomberg First Word" (0.13% of total), the Bloomberg-authored wire of real-time news. While Bloomberg News covers a wide range of topics, it typically focuses on content relevant to the financial community. This dataset contains documents of varying lengths.

2.2 Public Datasets (345B tokens – 48.73% of training)

We use three widely known and available public datasets in our training corpus.

2.2.1 The Pile (184B tokens – 25.9% of training)

The Pile (Gao et al., 2021) is the dataset used in GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021), and GPT-NeoX (20B) (Black et al., 2022). We include The Pile in our training data for the following reasons. First, it has been used to successfully train an LLM. Second, it has undergone significant data cleaning and pre-processing. Third, it includes multiple domains, and we believe such diverse data will aid generalization to new domains and may even support training on financial data. For example, domains such as FreeLaw and GitHub are useful to teams at Bloomberg that work on legal documents and software development, respectively. Creators of The Pile have deliberately chosen to include duplicate content, with the duplication factor being proportional to the perceived quality of the content. However, as we deduplicate each of our datasets, the size of The Pile is significantly reduced. Additionally, note that our tokenizer (§2.3) is trained on The Pile.

2.2.2 C4 (138B tokens – 19.48% of training)

The Colossal Clean Crawled Corpus (C4) is a common dataset used to train LLMs, and was introduced to support training T5 (Raffel et al., 2020). Although it overlaps with Pile-CC, C4 is cleaned and processed differently; hence, we feel that including C4 in addition to The Pile can add value more than duplicated documents would. We find that C4 contains high-quality natural language documents due to the layers of cleaning, though others have noted that the distribution across web domains is unusual, with a high fraction of data stemming from patents (Dodge et al., 2021).
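The "layers of cleaning" applied to C4 are heuristic filters described by Raffel et al. (2020). The sketch below loosely paraphrases a few of them (terminal punctuation, a minimum number of words per line, a minimum number of sentences per page, and boilerplate markers such as "lorem ipsum" or curly braces); the exact rules and thresholds in C4 differ, and none of this is the processing applied to FinPile.

```python
# Rough paraphrase of a few C4-style cleaning heuristics (Raffel et al., 2020):
# keep only lines that end in terminal punctuation and have enough words, and
# drop pages containing obvious boilerplate markers. Illustrative approximation
# only; the thresholds here are not guaranteed to match the real C4 pipeline.

import re
from typing import Optional

TERMINAL_PUNCT = (".", "!", "?", '"')
BAD_MARKERS = ("lorem ipsum", "{", "}")   # code/boilerplate indicators
MIN_WORDS_PER_LINE = 3
MIN_SENTENCES_PER_PAGE = 5

def clean_page(page: str) -> Optional[str]:
    """Return a cleaned page, or None if the whole page should be discarded."""
    lowered = page.lower()
    if any(marker in lowered for marker in BAD_MARKERS):
        return None
    kept_lines = [
        line.strip()
        for line in page.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)
        and len(line.split()) >= MIN_WORDS_PER_LINE
    ]
    cleaned = "\n".join(kept_lines)
    # Crude sentence count: discard pages left with too few sentences.
    if len(re.findall(r"[.!?]", cleaned)) < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned
```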
2.2.3 Wikipedia (24B tokens – 3.35% of training)

Both The Pile and C4 include out-of-date copies of Wikipedia, so it could be beneficial for the factuality of the model to have up-to-date Wikipedia pages included. Therefore, we include a dump of English Wikipedia from July 1, 2022. This dataset is tokenized quite inefficiently (3.06 characters per token), indicating an above-average amount of markup, which suggests that further cleaning might benefit future model training.

2.3 Tokenization

We choose the Unigram tokenizer (Kudo, 2018)
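As noted in §1.2, the tokenizer is a Unigram model (Kudo, 2018) trained on The Pile rather than a greedy merge-based sub-word scheme. For illustration, the sketch below trains and applies a small Unigram tokenizer with the open-source sentencepiece library; the corpus path, vocabulary size, and sample sentence are placeholders and do not reflect the tokenizer configuration used for BloombergGPT.

```python
# Minimal sketch of training and applying a Unigram tokenizer (Kudo, 2018)
# with the open-source sentencepiece library. The corpus file, vocabulary
# size, and sample sentence are placeholders, not BloombergGPT's settings.

import sentencepiece as spm

# Train a small Unigram model on a plain-text corpus (one document per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path to training text
    model_prefix="unigram_demo",
    vocab_size=8000,             # placeholder; production vocabularies are much larger
    model_type="unigram",
)

# Load the trained model and segment text. The Unigram model keeps piece
# probabilities, which is what allows smarter segmentation at inference time.
sp = spm.SentencePieceProcessor()
sp.load("unigram_demo.model")
print(sp.encode_as_pieces("Company filings are dense in financial information."))
print(sp.encode_as_ids("Company filings are dense in financial information."))
```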
