Google DeepMind

© 2024 Google. All rights reserved.

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google¹

In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

1. Introduction

We present our latest multimodal model from the Gemini line: Gemini 1.5 Pro. This is our first release from Gemini 1.5, a new family of highly capable multimodal models which incorporates a novel mixture-of-experts architecture as well as major advances in training and serving infrastructure that allow it to push the boundary of efficiency, reasoning, and long-context performance. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost a full day of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train.

The ability to model data of increasingly longer contexts has tracked the development of more general and capable language models, from the now-toy 2-gram language model proposed by Shannon (1948), to the modern n-gram models of the 1990s and 2000s (Brants et al., 2007; Chen and Goodman, 1999; Jelinek, 1998; Kneser and Ney, 1995), typically constrained to 5 tokens of context, to the recurrent neural network language models of the 2010s which could effectively condition on hundreds of tokens (Jozefowicz et al., 2016; Mikolov et al., 2010), to the modern Transformer (Vaswani et al., 2017) which can condition on hundreds of thousands of tokens (Anthropic, 2023). Gemini 1.5 Pro continues this trend by extending language model context lengths by over an order of magnitude. Scaling to millions of tokens, we find a continued improvement in predictive performance (Section 4.2.1), near-perfect recall (>99%) on synthetic retrieval tasks (Figure 1 and Section 4.2.1), and a host of surprising new capabilities like in-context learning from entire long documents (Section 4.2).

¹ Please send correspondence to gemini-1_5-report@.


[Figure 1 shows needle-in-a-haystack recall grids for three modality pairs: Video/Video Haystack and Audio/Audio Haystack (x-axes in minutes) and Text/Text Haystack (x-axes in tokens, up to 10M), with needle depth on the y-axis.]

Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (>99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 2M tokens in the audio modality (up to 22 hours); 2.8M tokens in the video modality (up to 3 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones.

To measure the effectiveness of our model's long-context capabilities, we conduct experiments on both synthetic and real-world tasks. In synthetic "needle-in-a-haystack" tasks inspired by Kamradt (2023) that probe how reliably the model can recall information amidst distractor context, we find that Gemini 1.5 Pro achieves near-perfect (>99%) "needle" recall up to multiple millions of tokens of "haystack" in all modalities, i.e., text, video and audio, and even maintains this recall performance when extending to 10M tokens in the text modality. In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. Finally, we qualitatively showcase the in-context learning abilities of Gemini 1.5 Pro enabled by very long context: for example, learning to translate a new language from a single set of linguistic documentation. With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua², and which therefore has almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.

² Kalamang Language: /lang/1891


Gemini 1.5 Pro | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 87.1% (27/31 benchmarks) | Win-rate: 54.8% (17/31 benchmarks)
Text | Win-rate: 100% (13/13 benchmarks) | Win-rate: 77% (10/13 benchmarks)
Vision | Win-rate: 77% (10/13 benchmarks) | Win-rate: 46% (6/13 benchmarks)
Audio | Win-rate: 60% (3/5 benchmarks) | Win-rate: 20% (1/5 benchmarks)

Table 1 | Gemini 1.5 Pro compared to the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases. Detailed results are presented in Table 7.

Importantly, this leap in long-context performance does not come at the expense of the core multimodal capabilities of the model.³ Overall, we find that Gemini 1.5 Pro greatly surpasses Gemini 1.0 Pro, performing better on the vast majority of benchmarks (i.e., 27/31), increasing the margin in particular for Math, Science and Reasoning (+28.9%), Multilinguality (+22.3%), Video Understanding (+11.2%) and Code (+8.9%) (see Table 7 for breakdowns). However, a more striking comparison is the one with Gemini 1.0 Ultra, a state-of-the-art model across many capabilities. Despite Gemini 1.5 Pro using significantly less training compute and being more efficient to serve, we find Gemini 1.5 Pro to perform better on more than half of the benchmarks (16/31), in particular on text benchmarks (10/13) and many of the vision benchmarks (6/13).

In the following sections, we provide an overview of the model architecture and present the results of large-scale quantitative evaluations comparing Gemini 1.5 Pro to other LLMs. We present detailed evaluations of the model's long-context capabilities, followed by evaluations of its core capabilities, similar to the Gemini 1.0 technical report (Gemini-Team et al., 2023), covering well-studied benchmarks across text, code, image, video and audio. Finally, we discuss our approach to responsible deployment, including our process for impact assessment, developing model policies, evaluations, and mitigations of harm before deployment decisions.⁴

2. Model Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on Gemini 1.0's (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google (Clark et al., 2022; Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) and language model research in the broader literature (Anil et al., 2023; Anthropic, 2023; Brown et al., 2020; Chowdhery et al., 2023; Hoffmann et al., 2022; Jiang et al., 2024; Kim et al., 2021; OpenAI, 2023; Rae et al., 2021; Raffel et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023a,b; Vaswani et al., 2017). MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This form of conditional computation (Bengio et al., 2013; Davis and Arel, 2014; Jacobs et al., 1991) allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant.

³ We define the core capabilities as those capabilities of the model that are primarily non-long-context (e.g., math, science, reasoning, multilinguality, code, etc.), similar to the capabilities covered in the Gemini 1.0 technical report (Gemini-Team et al., 2023).

⁴ See the model card (Mitchell et al., 2019a) in Appendix Section 8.1.
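To make the conditional-computation idea concrete, the following Python sketch shows a toy top-k MoE layer: a learned router scores each token, only the k highest-scoring experts are run for that token, and their outputs are gated and summed. This is purely illustrative; the expert count, routing weights, and dimensions below are invented for the example and are not the (undisclosed) Gemini 1.5 Pro configuration.

import numpy as np

# Toy top-k mixture-of-experts layer (illustrative only; NOT the Gemini 1.5 architecture).
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2                       # made-up sizes for the example
tokens = rng.normal(size=(16, d_model))                    # a batch of 16 token vectors

router_w = rng.normal(size=(d_model, n_experts))           # learned routing function (random here)
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

gates = softmax(tokens @ router_w)                         # routing probabilities per token
chosen = np.argsort(-gates, axis=-1)[:, :top_k]            # indices of the top-k experts per token

out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for e in chosen[t]:                                    # only top_k of the n_experts run per token
        out[t] += gates[t, e] * (tokens[t] @ expert_w[e])

print(out.shape)                                           # (16, 64)

Growing n_experts increases the total parameter count, while per-token compute is pinned by top_k, which is exactly the trade-off described above.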

A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra (see Section 5), while using significantly less training compute and being significantly more efficient to serve. Gemini 1.5 Pro also incorporates a series of significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance. Translated into real-world data, this context length enables Gemini 1.5 Pro models to comfortably process almost a day of audio recordings (i.e., 22 hours), more than ten times the entirety of the 1,440-page book (or 587,287 words) "War and Peace", the entire Flax (Heek et al., 2023) codebase (41,070 lines of code), or three hours of video at 1 frame-per-second. Further, since the model is natively multimodal and supports interleaving of data from different modalities, it can support a mix of audio, visual, text, and code inputs in the same input sequence. In Section 4.1, we highlight some of the novel capabilities enabled by these advances, including evaluations that yielded positive results on context lengths up to 10 million. We note that understanding the limits of these capabilities and studying their exciting capabilities and applications remains an area of continued research exploration.
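As a rough cross-check of these conversions, the back-of-envelope Python below uses only numbers quoted in this report (the Figure 1 and Figure 5 captions and the word count above); the implied per-frame and per-word token rates are our own inferences from those figures, not official tokenizer specifications.

# Rough cross-check of the context-length conversions above, using only figures quoted
# in this report; the implied token rates are back-of-envelope inferences.

# Video: Figure 5 reports 684k tokens for 2,674 frames sampled at 1 frame per second.
tokens_per_frame = 684_000 / 2_674                  # roughly 256 tokens per frame
three_hours_video = 3 * 3600 * tokens_per_frame     # about 2.8M tokens, matching Figure 1

# Text: Figure 1 reports ~10M tokens corresponding to approximately 7M words.
tokens_per_word = 10_000_000 / 7_000_000            # roughly 1.4 tokens per word
ten_war_and_peace = 10 * 587_287 * tokens_per_word  # about 8.4M tokens, inside a 10M window

print(f"3 h of video at 1 FPS ~ {three_hours_video / 1e6:.1f}M tokens")
print(f"10x War and Peace     ~ {ten_war_and_peace / 1e6:.1f}M tokens")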

3. Training Infrastructure and Dataset

Like Gemini 1.0 Ultra and 1.0 Pro, Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data. Our pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content. For the instruction-tuning phase we fine-tuned Gemini 1.5 Pro on a collection of multimodal data (containing paired instructions and appropriate responses), with further tuning based on human preference data. We refer readers to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023) for further information.

4. Long-context Evaluation

Existing evaluations are increasingly strained by the new and rapidly advancing capabilities of large multimodal models. They typically focus on individual modalities and/or are restricted to tasks with shorter context lengths. Hence, there is a growing need for benchmarks which exemplify the nuanced requirements of real-world long mixed-modality use cases. Among these, we highlight the quantitative assessment of reasoning capabilities across long mixed-modality sequences as a key challenge.

With the challenges of evaluating increasingly capable models in mind, our evaluation of Gemini 1.5 Pro first focuses on understanding and evaluating its novel capabilities. Subsequently, we explore core benchmarks, covering capabilities studied in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023). Specifically, we evaluate Gemini 1.5 Pro in three main categories:⁵

1. Qualitative long-context multimodal evaluations: manually probe and stress-test the model's long-context abilities, especially for novel capabilities where no quantitative benchmarks exist.

2. Quantitative long-context multimodal evaluations: measure the model's long-context abilities on both synthetic and real-world tasks with well-defined metrics.

3. Quantitative core evaluations: identify progress and regression in core capabilities (e.g., coding, math, science, multilinguality and instruction following).

⁵ We note that all the evaluations are from the same checkpoint of the Gemini 1.5 Pro model that is instruction-tuned post pre-training, unless otherwise stated.


4.1. Qualitative Examples of Multimodal Long-Context Capabilities

The ability to process multiple millions of tokens unlocks practical applications that were not possible before. In this section we demonstrate some surprising interactions we observed with Gemini 1.5 Pro across code, text and video.

As shown in Figure 2, Gemini 1.5 Pro is able to ingest entire large codebases such as JAX (746,152 tokens) and answer very specific queries about them. In Figure 3 we show Gemini 1.5 Pro's ability to learn a new language based only on reference materials given in its input (see Section 4.2 for quantitative metrics for this use case). Additionally, we test Gemini 1.5 Pro's ability to answer an image query given the entire text of Les Misérables and observe that being natively multimodal allows it to locate a famous scene from a hand-drawn sketch, as shown in Figure 4. Lastly, we ask Gemini 1.5 Pro questions about an entire 45-minute movie in Figure 5, which the model answers seamlessly while retrieving moments and timestamps down to the second.⁶

Figure 2 | Given the entire 746,152-token JAX codebase in context, Gemini 1.5 Pro can identify the specific location of a core automatic differentiation method.

Figure 3 | Given a reference grammar book and a bilingual wordlist (dictionary), Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human who learned from the same materials.

⁶ For additional short videos demonstrating the long-context abilities of Gemini 1.5 Pro across video, text, and code, see https://deepmind.google/technologies/gemini/.


Figure 4 | With the entire text of Les Misérables in the prompt (1,382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.

Figure 5 | When prompted with the 45-minute Buster Keaton movie "Sherlock Jr." (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch.


4.2. Long-context Evaluations

For the past few years, LLM research has prioritized expanding the context window from which models can incorporate information (Anthropic, 2023; OpenAI, 2023). This emphasis stems from the recognition that a wider context window allows models to incorporate a larger amount of new, task-specific information not found in the training data at inference time, leading to improved performance in various natural language or multimodal tasks. Recent approaches to improving the long-context capabilities of models fall into a few categories, including novel architectural approaches (Ainslie et al., 2023; Gu and Dao, 2023; Guo et al., 2021; Orvieto et al., 2023; Zaheer et al., 2020), post-training modifications (Bertsch et al., 2023; Chen et al.; Press et al., 2021; Xiong et al., 2023), retrieval-augmented models (Guu et al., 2020; Izacard et al., 2022; Jiang et al., 2022; Karpukhin et al., 2020; Santhanam et al., 2021), memory-augmented models (Bulatov et al., 2022, 2023; Martins et al., 2022; Mu et al., 2023; Wu et al., 2022a,b; Zhong et al., 2022), and techniques for building more coherent long-context datasets (Shi et al., 2023c; Staniszewski et al., 2023). This activity has resulted in measurable improvements in the long-context capabilities of LLMs over the past several months, with the recent concurrent work of Liu et al. (2024) exploring context windows of 7B models up to 1M multimodal tokens. Notably, among the state-of-the-art LLMs, Anthropic has successfully extended the context of their text-only Claude 2 model to 100k tokens, while OpenAI has recently released GPT-4 Turbo reaching 128k tokens. Finally, the latest addition to the series was Claude 2.1, with a context window of 200k tokens.

Gemini 1.5 Pro significantly extends this context-length frontier to multiple millions of tokens with almost no degradation in performance, making it possible to process significantly larger inputs. Compared to Claude 2.1 with a 200k token context window, Gemini 1.5 Pro achieves 100% recall at 200k tokens, surpassing Claude 2.1's 98%. This 100% recall is maintained up to 530k tokens, and recall is 99.7% at 1M tokens. When increasing from 1M tokens to 10M tokens, the model retains 99.2% recall. Moreover, Gemini 1.5 Pro's native multimodal capabilities enable the model to ingest multiple hours of audio and video recordings alongside or interleaved with text. Such recall capabilities are summarized in Figure 1.

Below we report results on long-context evaluations across all three modalities, i.e., text, vision and audio.

The evaluation methodology we followed to measure the long-context capability of Gemini 1.5 Pro consists of both diagnostic-focused probing of the long-context capabilities (e.g., perplexity over long sequences, needle-in-a-haystack retrieval studies) and realistic evaluations specifically designed for multimodal long-context tasks (e.g., long-document QA, long-context automatic speech recognition, learning to translate a new language from only one book, and long-context video QA). To provide a reference point, throughout this section we compare Gemini 1.5 Pro with the leading model available externally for each task. With the evaluation harness we developed for Gemini 1.5 Pro, we are able to quantify the quality of long-context understanding capabilities reliably all the way up to 10M tokens.

4.2.1. Diagnostic Long-Context Evaluations

Perplexity over Long Sequences

We start by reporting results on the text modality. To evaluate the ability of the models to make use of very long contexts to improve next-token prediction, which is the objective function used to train language models, we record the negative log-likelihood (NLL) of tokens at different positions in the input sequences from held-out text (i.e., not used in training). Here, a lower value implies an improved prediction. Typically, we expect tokens at the beginning of a sequence to have high NLL, as there is little to no context that the model can use to predict them, and tokens later in the sequence to have lower NLL as more information becomes available to the model. The shape of the resulting curve indicates the ability of models to reason over long context. A downward trend signifies models making use of long context to reduce their uncertainty. On the other hand, an upward trend signifies that models are unable to effectively use information from the previous context and may be deteriorating in prediction quality, highlighting the limitations in their long-context understanding capability.

[Figure 6 shows two panels, "Cumulative Average NLL for Code" and "Cumulative Average NLL for Long Documents", each plotting negative log-likelihood (lower is better) against sequence position for Gemini 1.0 Pro and Gemini 1.5 Pro, with a power-law fit (r² = 0.998) shown as a dashed line.]

Figure 6 | Cumulative average negative log-likelihood (NLL) as a function of token position in long documents and code data. A lower value demonstrates better prediction. Gemini 1.5 Pro shows improved predictions up to 1M tokens for long documents and 10M tokens for code, whereas Gemini 1.0 Pro improves up to only 32K tokens. The NLL follows a power-law trend up until 1M tokens (documents) and 2M tokens (code), with a deviating trend at 10M tokens.

We perform this analysis on two data sources: (a) a dataset of long documents with up to 1 million tokens, and (b) a dataset of code repositories constructed by first randomly shuffling all the files and then concatenating them. The code dataset contains sequences longer than 1 million tokens with some natural form of semantic association (e.g., a whole repository), allowing for further evaluation of sequences of up to 10M tokens. Figure 6 shows the cumulative NLL up to a specific token index.⁷ We also fit a power law of the form L(x) = αx^β + γ to these data points (dashed line).
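To illustrate the two steps involved (computing the cumulative average NLL at each position, then fitting the power law), here is a minimal Python sketch. The per-token NLL values are synthetic placeholders rather than the held-out data behind Figure 6; only the functional form matches the L(x) = αx^β + γ fit described above.

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic per-token NLLs for one long sequence: placeholder values that simply decay
# with position, standing in for the held-out document and code data used in Figure 6.
positions = np.arange(1, 1_048_577, dtype=float)
per_token_nll = 1.5 + 0.8 * positions ** -0.15 + rng.normal(0.0, 0.05, positions.size)

# Cumulative average NLL up to each evaluation position (the quantity plotted in Figure 6).
eval_points = np.array([256, 1_024, 4_096, 16_384, 65_536, 262_144, 1_048_576])
cum_avg_nll = np.array([per_token_nll[:p].mean() for p in eval_points])

# Fit L(x) = alpha * x**beta + gamma to these points (the dashed line in Figure 6).
def power_law(x, alpha, beta, gamma):
    return alpha * np.power(x, beta) + gamma

(alpha, beta, gamma), _ = curve_fit(power_law, eval_points.astype(float), cum_avg_nll,
                                    p0=(1.0, -0.1, 1.0), maxfev=10_000)
pred = power_law(eval_points.astype(float), alpha, beta, gamma)
r2 = 1.0 - np.sum((cum_avg_nll - pred) ** 2) / np.sum((cum_avg_nll - cum_avg_nll.mean()) ** 2)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.3f}, r^2={r2:.3f}")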

We find in Figure 6 that NLL decreases monotonically with sequence length, and thus prediction accuracy improves, up to the tested sequence lengths (1M for long documents, and 10M for code), indicating that our models can make use of the whole input even at very long context lengths. This suggests that the model is able to improve its predictions by finding useful patterns in tokens, even if they occurred millions of tokens in the past, as in the case of code.

Finally, we see that this improved prediction follows a regular power-law structure. While it is well known that language models follow a power law relating training compute to model performance (NLL) (Kaplan et al., 2020) up to a very large scale, we demonstrate that a power law can hold between log-loss and context length up to extremely long context lengths. We see the power-law fit is quite accurate up to 1M tokens for long documents and about 2M tokens for code. From inspecting longer code-token predictions closer to 10M, we see a phenomenon of the increased context occasionally providing outsized benefit (e.g., due to repetition of code blocks), which may explain the power-law deviation. However, this deserves further study, and may be dependent on the exact dataset used.

⁷ We note that we are unable to obtain logits for other commercially available LLMs for comparison.


[Figure 7 shows needle-in-a-haystack retrieval grids with needle depth (%) on the y-axis and context length in tokens on the x-axis: Gemini 1.5 Pro from 1k to 1M tokens and up to 10M tokens (top), and GPT-4 Turbo from 1k to 128k tokens (bottom).]

Figure 7 | Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo for the text needle-in-a-haystack task. Green cells indicate the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left), and from 1M to 10M tokens (top right). The bottom row shows results for GPT-4 Turbo up to the maximum supported context length of 128k tokens. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones.

Text-Haystack

Next, we move to testing long-context recall using the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023), which tests a model's ability to retrieve a text (i.e., the "needle") inserted at various positions into a sequence (i.e., the "haystack"). Following prior work (Dhinakaran, 2024), we use a set of concatenated and repeated essays written by Paul Graham⁸ to fill the desired context length. We insert a needle at linearly spaced intervals from the beginning to the end of the context, where the needle is of the form "The special magic {city} number is: {number}", with the city and number varied for each query, and prompt the model with "Here is the magic number:". We report whether the magic number recall was correct at various context lengths (x-axis, the haystack) as a function of its position in the input sequence expressed in terms of depth percentage (y-axis); e.g., a depth of 100% would indicate a needle inserted at the very end of the input, whereas 0% indicates the very beginning.
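The construction is straightforward to sketch in Python. The helper below is illustrative only: the filler text stands in for the concatenated Paul Graham essays, the chars-per-token ratio and the default city and number are placeholder assumptions, and the exact prompts behind the reported numbers are not given in this report; only the needle template and the final query follow the description above.

# Build a needle-in-a-haystack prompt as described above (illustrative sketch only).

def build_haystack(filler_text: str, context_tokens: int, depth_pct: float,
                   city: str = "Berlin", number: int = 42, chars_per_token: int = 4) -> str:
    """Fill roughly context_tokens of distractor text and insert the needle at depth_pct."""
    target_chars = context_tokens * chars_per_token           # crude chars-per-token assumption
    haystack = (filler_text * (target_chars // len(filler_text) + 1))[:target_chars]
    needle = f"The special magic {city} number is: {number}"
    cut = int(len(haystack) * depth_pct / 100)                 # 0% = start of context, 100% = end
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:] + "\nHere is the magic number:"

# Example: a ~1k-token haystack with the needle placed halfway through the context.
# In the paper the filler is concatenated, repeated Paul Graham essays; a stand-in is used here.
filler = "The quick brown fox jumps over the lazy dog. "
prompt = build_haystack(filler, context_tokens=1_000, depth_pct=50.0)
print(prompt[-120:])   # the model should answer with the inserted number (42 here)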

As can be seen in Figure 7, Gemini 1.5 Pro achieves 100% recall up to 530k tokens and >99.7% recall up to 1M tokens. This task, while simple, provides a clear demonstration that Gemini 1.5 Pro is able to reliably retrieve information from long documents up to 1M tokens. For reference, we report results for GPT-4 Turbo up to its maximum supported sequence length of 128K tokens.
