BloombergGPT: A Large Language Model for Finance

Shijie Wu1,*, Ozan İrsoy1,*, Steven Lu1,*, Vadim Dabravolski1, Mark Dredze1,2, Sebastian Gehrmann1, Prabhanjan Kambadur1, David Rosenberg1, Gideon Mann1

1 Bloomberg, New York, NY USA
2 Computer Science, Johns Hopkins University, Baltimore, MD USA
* Co-first authors.

Abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.

Contents

1 Introduction
1.1 BloombergGPT
1.2 Broader Contributions
2 Dataset
2.1 Financial Datasets (363B tokens – 54.2% of training)
2.1.1 Web (298B tokens – 42.01% of training)
2.1.2 News (38B tokens – 5.31% of training)
2.1.3 Filings (14B tokens – 2.04% of training)
2.1.4 Press (9B tokens – 1.21% of training)
2.1.5 Bloomberg (5B tokens – 0.70% of training)
2.2 Public Datasets (345B tokens – 48.73% of training)
2.2.1 The Pile (184B tokens – 25.9% of training)
2.2.2 C4 (138B tokens – 19.48% of training)
2.2.3 Wikipedia (24B tokens – 3.35% of training)
2.3 Tokenization
3 Model
3.1 Architecture
3.2 Model Scaling
3.3 Training Configuration
3.4 Large-scale Optimization
4 Training Run
5 Evaluation
5.1 Few-shot Methodology
5.2 Heldout Loss
5.3 Financial Tasks
5.3.1 External Financial Tasks
5.3.2 Internal Task: Sentiment Analysis
5.3.3 Exploratory Task: NER
5.4 BIG-bench Hard
5.5 Knowledge Assessments
5.6 Reading Comprehension
5.7 Linguistic Tasks
5.8 Summary
6 Qualitative Samples
7 Related Work
8 Ethics, Limitations, and Implications
8.1 Ethical Use
9 Conclusion
A Architecture
A.0 Notation
A.1 Full Architecture
A.2 Self-Attention with ALiBi (SA)
A.3 LayerNorm (LN)
A.4 Feed Forward Network (FFN)
A.5 List of All Trainable Parameters
B Details on external financial tasks

1. Introduction

The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs). GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. This ability improves well above random as we increase the size of language models. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.

After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data- and parameter-efficient training sizes (Hoffmann et al., 2022).

These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles (Gao et al., 2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general-purpose LLMs on tasks within those domains, such as science (Taylor et al., 2022) and medicine (Bolton et al., 2023; Luo et al., 2022; Lehman et al., 2023). These findings motivate further development of models focused on specific domains.

Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role (Xing et al., 2018; Fisher et al., 2016; Dredze et al., 2016). Financial NLP tasks (Shah et al., 2022) include sentiment analysis (Araci, 2019), named entity recognition (Salinas Alvarado et al., 2015), news classification (Sinha and Khandait, 2020), and question answering (Chen et al., 2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general – few-shot learning, text generation, conversational systems, etc. – it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain (Araci, 2019), no LLM has been tuned for or evaluated on tasks for this domain.
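To make the few-shot setting concrete: for a financial sentiment task, a k-shot prompt simply concatenates k labeled examples followed by an unlabeled query, and the model is asked to continue the text with a label. The sketch below assembles such a prompt in Python; the headlines, labels, instruction wording, and the commented-out generate() call are hypothetical placeholders, not the prompts or API used in this work (see §5.1 for the actual few-shot methodology).

```python
# Minimal sketch of few-shot prompting for financial sentiment analysis.
# The labeled examples and the generate() call are hypothetical placeholders;
# they only illustrate how a k-shot prompt is assembled.

FEW_SHOT_EXAMPLES = [
    ("Shares of Acme Corp jumped 8% after it raised full-year guidance.", "positive"),
    ("The bank reported a surprise quarterly loss and suspended its dividend.", "negative"),
    ("The company said its annual meeting will be held on May 12.", "neutral"),
]

def build_prompt(query: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Concatenate k labeled demonstrations followed by the unlabeled query."""
    parts = ["Classify the sentiment of each headline as positive, negative, or neutral.\n"]
    for text, label in examples:
        parts.append(f"Headline: {text}\nSentiment: {label}\n")
    parts.append(f"Headline: {query}\nSentiment:")  # the model completes the label
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt("Regulators fined the brokerage $50 million over disclosure failures.")
    print(prompt)
    # A deployed LLM would then be asked to continue the prompt, e.g.:
    # completion = model.generate(prompt, max_new_tokens=1)  # hypothetical API
```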
1.1 BloombergGPT

We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across diverse tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.

We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022). We validate the model on standard LLM benchmarks, open financial benchmarks, and a suite of Bloomberg-internal benchmarks that most accurately reflect our intended use cases. Our results demonstrate that our mixed training approach leads to a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.

1.2 Broader Contributions

Beyond the construction of an LLM for financial data, our goal is to contribute to the broader research community. Specifically, our experience documented in this paper provides evidence that further develops the community's understanding of several open questions in the literature.

Domain-specific LLMs. The few existing domain-specific LLMs are trained exclusively on domain-specific data sources (Luo et al., 2022; Bolton et al., 2023; Taylor et al., 2022), or adapt a very large general-purpose model to domain-specific tasks (Singhal et al., 2022; Lewkowycz et al., 2022). Our alternative approach – training an LLM on both domain-specific and general data sources – has not been studied so far. The resulting model does very well on domain-specific tasks, but also maintains strong performance on general-purpose benchmarks.

Training data. Nearly all language models rely in large part on web-scraped data, such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2021) (which includes OpenWebText2). This data may be cleaned or subsetted in various ways before use (Touvron et al., 2023; Rae et al., 2020; Scao et al., 2022; Jernite et al., 2022), but issues of data duplication (Carlini et al., 2020) and toxic language remain (Welbl et al., 2021). Our training data is unusual for LLM training in that it includes a significant amount of curated and prepared data from reliable sources.

Evaluation. LLM evaluation remains a challenging and evolving problem (Gehrmann et al., 2022; Goyal et al., 2022), with new benchmarks trying to standardize evaluation across models (Liang et al., 2022; Srivastava et al., 2022). However, for domain-specific tasks, there remains a mismatch between evaluation and actual use cases. Evaluations are built on available datasets and not necessarily on how the model will be used in practice. We provide results on both public financial NLP benchmarks (Shah et al., 2022; Chen et al., 2021b) as well as a selection of internal Bloomberg tasks, which are better aligned with our intended use cases and directly evaluate our model's ability to perform tasks of interest.

Model Size. Early LLMs made a single training pass over a corpus of 200-400 billion tokens (Brown et al., 2020), and Hoffmann et al. (2022) posited that models were undertrained, instead focusing on training smaller models with more data, a strategy most recently employed by Touvron et al. (2023). We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.
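For intuition about this sizing choice, the sketch below applies two common rules of thumb: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and the Hoffmann et al. (2022) heuristic that a compute-optimal model sees on the order of 20 tokens per parameter. These are back-of-the-envelope approximations only and do not reproduce the scaling analysis in §3.2.

```python
# Back-of-the-envelope check of the parameter/token trade-off, assuming the
# common approximations C ~= 6*N*D training FLOPs and D ~= 20*N tokens for a
# compute-optimal ("Chinchilla-style") model. Illustrative only.

N = 50e9    # parameters (BloombergGPT: 50B)
D = 569e9   # training tokens reported above

flops = 6 * N * D
tokens_per_param = D / N
chinchilla_tokens = 20 * N   # rough compute-optimal token budget for N params

print(f"approx. training compute:   {flops:.2e} FLOPs")       # ~1.7e23
print(f"tokens per parameter:       {tokens_per_param:.1f}")   # ~11.4
print(f"~20 tokens/param heuristic: {chinchilla_tokens:.2e} tokens")  # ~1.0e12
```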
Tokenizer. After assembling training data, the critical step of tokenization transforms the text into a format suitable for the language model. The importance of this step is often overlooked (Mielke et al., 2021), and many older LLMs use the same tokenizer and vocabulary, meaning that we have little evidence to support other tokenizers. We take a different approach and use a Unigram model instead of greedy merge-based sub-word tokenizers, since it saves probabilities, allowing for smarter tokenization at inference time (Kudo, 2018).

Model Building Challenges. GPT-3 and subsequent models were the work of large teams and required an enormous amount of computation. Initial work to reproduce these results, such as OPT (Zhang et al., 2022a), did not match the performance of the original model. With the release of each subsequent model, the community's understanding, experience, and software tools increase. In developing BloombergGPT, we benefited from existing code developed as part of the BLOOM effort (Scao et al., 2022), showing that a moderately sized team can produce a competitive model on domain-specific data. We describe our experiences training BloombergGPT in detail to support future training efforts and address each of the above topics.

2. Dataset

To train BloombergGPT, we construct "FinPile", a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purpose text. For a breakdown of the full training set, see Table 1. To improve data quality, we de-duplicate each dataset (The Pile, C4, Wikipedia, FinPile) according to Lee et al. (2022a); as a side effect, the statistics reported in Table 1 might be different from those reported in other papers.

[Table 1: Breakdown of the full training set used to train BloombergGPT, covering FinPile (Web, News, Filings, Press, Bloomberg) and the public datasets (the components of The Pile, C4, and Wikipedia). The statistics provided are the average number of characters per document ("C/D"), the average number of characters per token ("C/T"), and the percentage of the overall tokens ("T%"). Units for each column are denoted in the header.]
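The de-duplication of Lee et al. (2022a) combines exact-substring matching (via suffix arrays) with MinHash-based near-duplicate detection. As a deliberately simplified stand-in, the sketch below flags near-duplicate documents by word n-gram Jaccard similarity; the n-gram size and threshold are arbitrary choices for illustration, and this approach does not scale to corpora of the size used here.

```python
# Simplified illustration of document de-duplication via n-gram Jaccard
# similarity. Lee et al. (2022a) use far more scalable machinery (suffix-array
# exact-substring matching and MinHash LSH); this sketch only conveys the idea.
# The 5-gram size and 0.8 threshold are arbitrary choices for the example.

import re
from typing import List, Set, Tuple

def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_ngrams = [], []
    for doc in docs:
        grams = ngrams(doc)
        if all(jaccard(grams, seen) < threshold for seen in kept_ngrams):
            kept.append(doc)
            kept_ngrams.append(grams)
    return kept

if __name__ == "__main__":
    corpus = [
        "The central bank held rates steady and signaled patience on cuts.",
        "The central bank held rates steady and signaled patience on cuts today.",
        "Quarterly filings show revenue growth driven by the cloud segment.",
    ]
    print(len(deduplicate(corpus)), "documents kept")  # expect 2
```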
2.1 Financial Datasets (363B tokens – 54.2% of training)

The Bloomberg Terminal has provided access to a comprehensive set of diverse structured and unstructured financial data and analytics for the past four decades. In serving this mission, Bloomberg analysts have curated a set of financial documents that were either created internally or acquired from external sources. We utilize this extensive collection of curated and maintained documents to create FinPile, which consists of company filings, financial news, and other data relevant to the financial markets.

Some documents included in FinPile, such as company filings, are available to the general public, although collecting these documents and pre-processing them for LLM training is a non-trivial task. Other documents, such as (a subset of) Bloomberg news, must be purchased. The rest of the documents are private and available, among other sources, through the Bloomberg Terminal. Finally, we clean this data to strip off markup, special formatting, and templates.

Note that each document in FinPile is time-stamped, with dates ranging from 2007-03-01 to 2022-07-31; the quality and quantity of documents increase over this time range. While we do not utilize date information in this work, we plan to use it in the future, such as for evaluation of what the model learns about different time periods. While we cannot release FinPile, our experience training on a large, carefully curated, and clean domain-specific dataset may provide helpful insights to the community on the advantages and challenges of building a financial LLM in particular, and a domain-specific model in general. We provide a breakdown and analysis of FinPile in Table 2 and a brief description of the types of data included below.

2.1.1 Web (298B tokens – 42.01% of training)

Bloomberg collects web content by identifying sites that contain financially relevant information. While this category makes up the majority of FinPile, its classifications are rough, with content classified mainly by the location of the web domain. Within these location-specific sources, e.g., "US" (15.95% of total), "Asia-Pac" (4.72% of total), and "UK" (1.98% of total), document types are highly varied, as would be expected from a web crawl. While web sources are common in existing public LLM training datasets, Bloomberg's web crawl is focused on high-quality websites that have financially relevant information, as opposed to a general-purpose crawl of the web.

2.1.2 News (38B tokens – 5.31% of training)

The News category includes all news sources excluding news articles written by Bloomberg journalists. Overall, there are hundreds of English news sources in FinPile, including "Bloomberg Transcripts" (0.41% of total), which are transcripts of Bloomberg TV news. Generally, the content in this dataset comes from reputable sources of news that are relevant to the financial community, so as to maintain factuality and reduce bias.

2.1.3 Filings (14B tokens – 2.04% of training)

Company Filings are financial statements prepared by (public) companies and made available to the general public. In some countries, like the US, public companies are mandated to prepare and submit their financial statements on a regular cadence; e.g., 10-K annual reports and 10-Q quarterly reports. In our dataset, a majority of the filings come from EDGAR, which is the SEC's online database (1.90% of total). Filings are typically long PDF documents with tables and charts that are dense in financial information, which are processed and normalized in Bloomberg. Filings are substantially different from the types of documents typically used to train LLMs, but contain critically important information for financial decision-making.

[Table 2: The number of tokens (in millions) contained within documents in FinPile, organized by year (rows) and type (columns): Bloomberg, Filings, News, Press, Web, and Total. Units are millions of tokens.]

2.1.4 Press (9B tokens – 1.21% of training)

This category contains press releases typically issued by companies that are financially relevant. Taken together with filings, press releases represent most of the public communications of a company. However, unlike filings, press releases are similar to news stories in terms of content and style.

2.1.5 Bloomberg (5B tokens – 0.70% of training)

This category comprises Bloomberg-authored news and other documents, such as opinions and analyses. The largest sources are "Bloomberg News" (0.44% of total) and "Bloomberg First Word" (0.13% of total), the Bloomberg-authored wire of real-time news. While Bloomberg News covers a wide range of topics, it typically focuses on content relevant to the financial community. This dataset contains documents of varying lengths.

2.2 Public Datasets (345B tokens – 48.73% of training)

We use three widely known and available public datasets in our training corpus.

2.2.1 The Pile (184B tokens – 25.9% of training)

The Pile (Gao et al., 2021) is the dataset used in GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021), and GPT-NeoX (20B) (Black et al., 2022). We include The Pile in our training data for the following reasons. First, it has been used to successfully train an LLM. Second, it has undergone significant data cleaning and pre-processing. Third, it includes multiple domains, and we believe such diverse data will aid generalization to new domains and may even support training on financial data. For example, domains such as FreeLaw and GitHub are useful to teams at Bloomberg that work on legal documents and software development, respectively. Creators of The Pile have deliberately chosen to include duplicate content, with the duplication factor being proportional to the perceived quality of the content. However, as we deduplicate each of our datasets, the size of The Pile is significantly reduced. Additionally, note that our tokenizer (§2.3) is trained on The Pile.

2.2.2 C4 (138B tokens – 19.48% of training)

The Colossal Clean Crawled Corpus (C4) is a common dataset used to train LLMs, and was introduced to support training T5 (Raffel et al., 2020). Although it overlaps with Pile-CC, C4 is cleaned and processed differently; hence, we feel that including C4 in addition to The Pile can add value more than duplicated documents would. We find that C4 contains high-quality natural language documents due to the layers of cleaning, though others have noted that the distribution across web domains is unusual, with a high fraction of data stemming from patents (Dodge et al., 2021).
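The "layers of cleaning" applied to C4 are heuristic filters described by Raffel et al. (2020). The sketch below loosely paraphrases a few of them (terminal punctuation, a minimum number of words per line, a minimum number of sentences per page, and boilerplate markers such as "lorem ipsum" or curly braces); the exact rules and thresholds in C4 differ, and none of this is the processing applied to FinPile.

```python
# Rough paraphrase of a few C4-style cleaning heuristics (Raffel et al., 2020):
# keep only lines that end in terminal punctuation and have enough words, and
# drop pages containing obvious boilerplate markers. Illustrative approximation
# only; the thresholds here are not guaranteed to match the real C4 pipeline.

import re
from typing import Optional

TERMINAL_PUNCT = (".", "!", "?", '"')
BAD_MARKERS = ("lorem ipsum", "{", "}")   # code/boilerplate indicators
MIN_WORDS_PER_LINE = 3
MIN_SENTENCES_PER_PAGE = 5

def clean_page(page: str) -> Optional[str]:
    """Return a cleaned page, or None if the whole page should be discarded."""
    lowered = page.lower()
    if any(marker in lowered for marker in BAD_MARKERS):
        return None
    kept_lines = [
        line.strip()
        for line in page.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)
        and len(line.split()) >= MIN_WORDS_PER_LINE
    ]
    cleaned = "\n".join(kept_lines)
    # Crude sentence count: discard pages left with too few sentences.
    if len(re.findall(r"[.!?]", cleaned)) < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned
```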
2.2.3 Wikipedia (24B tokens – 3.35% of training)

Both The Pile and C4 include out-of-date copies of Wikipedia, so it could be beneficial for the factuality of the model to have up-to-date Wikipedia pages included. Therefore, we include a dump of English Wikipedia from July 1, 2022. This dataset is tokenized quite inefficiently (3.06 characters per token), indicating an above-average amount of markup, which suggests that further cleaning might benefit future model training.

2.3 Tokenization

We choose the Unigram tokenizer (Kudo, 2018)
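As noted in §1.2, the tokenizer is a Unigram model (Kudo, 2018) trained on The Pile rather than a greedy merge-based sub-word scheme. For illustration, the sketch below trains and applies a small Unigram tokenizer with the open-source sentencepiece library; the corpus path, vocabulary size, and sample sentence are placeholders and do not reflect the tokenizer configuration used for BloombergGPT.

```python
# Minimal sketch of training and applying a Unigram tokenizer (Kudo, 2018)
# with the open-source sentencepiece library. The corpus file, vocabulary
# size, and sample sentence are placeholders, not BloombergGPT's settings.

import sentencepiece as spm

# Train a small Unigram model on a plain-text corpus (one document per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path to training text
    model_prefix="unigram_demo",
    vocab_size=8000,             # placeholder; production vocabularies are much larger
    model_type="unigram",
)

# Load the trained model and segment text. The Unigram model keeps piece
# probabilities, which is what allows smarter segmentation at inference time.
sp = spm.SentencePieceProcessor()
sp.load("unigram_demo.model")
print(sp.encode_as_pieces("Company filings are dense in financial information."))
print(sp.encode_as_ids("Company filings are dense in financial information."))
```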
