Google DeepMind
© 2024 Google. All rights reserved.

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google¹
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.
1. Introduction

We present our latest multimodal model from the Gemini line: Gemini 1.5 Pro. This is our first release from Gemini 1.5, a new family of highly-capable multimodal models which incorporates a novel mixture-of-experts architecture as well as major advances in training and serving infrastructure that allow it to push the boundary of efficiency, reasoning, and long-context performance. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost a day's worth of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train.
The ability to model data of increasingly longer contexts has tracked the development of more general and capable language models, from the now-toy 2-gram language model proposed by Shannon (1948), to the modern n-gram models of the 1990s and 2000s (Brants et al., 2007; Chen and Goodman, 1999; Jelinek, 1998; Kneser and Ney, 1995) typically constrained to 5 tokens of context, to recurrent neural network language models from the 2010s which could effectively condition on hundreds of tokens (Jozefowicz et al., 2016; Mikolov et al., 2010), to the modern Transformer (Vaswani et al., 2017) which can condition on hundreds of thousands of tokens (Anthropic, 2023). Gemini 1.5 Pro continues this trend by extending language model context lengths by over an order of magnitude. Scaling to millions of tokens, we find a continued improvement in predictive performance (Section 4.2.1), near-perfect recall (>99%) on synthetic retrieval tasks (Figure 1 and Section 4.2.1), and a host of surprising new capabilities like in-context learning from entire long documents (Section 4.2).
¹ Please send correspondence to gemini-1_5-report@.
[Figure 1 panels: Video Haystack (x-axis: minutes), Audio Haystack (x-axis: minutes), Text Haystack (x-axis: tokens, up to 10M); y-axis: needle depth.]
Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (>99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 2M tokens in the audio modality (up to 22 hours); 2.8M tokens in the video modality (up to 3 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones.
To measure the effectiveness of our model's long-context capabilities, we conduct experiments on both synthetic and real-world tasks. In synthetic "needle-in-a-haystack" tasks inspired by Kamradt (2023) that probe how reliably the model can recall information amidst distractor context, we find that Gemini 1.5 Pro achieves near-perfect (>99%) "needle" recall up to multiple millions of tokens of "haystack" in all modalities, i.e., text, video and audio, and even maintains this recall performance when extending to 10M tokens in the text modality. In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. Finally, we qualitatively showcase the in-context learning abilities of Gemini 1.5 Pro enabled by very long context: for example, learning to translate a new language from a single set of linguistic documentation. With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua², and which therefore has almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.
² Kalamang Language: /lang/1891
Gemini 1.5 Pro                        Relative to 1.0 Pro                    Relative to 1.0 Ultra
Long-Context Text, Video & Audio      from 32k up to 10M tokens              from 32k up to 10M tokens
Core Capabilities                     Win-rate: 87.1% (27/31 benchmarks)     Win-rate: 54.8% (17/31 benchmarks)
  Text                                Win-rate: 100% (13/13 benchmarks)      Win-rate: 77% (10/13 benchmarks)
  Vision                              Win-rate: 77% (10/13 benchmarks)       Win-rate: 46% (6/13 benchmarks)
  Audio                               Win-rate: 60% (3/5 benchmarks)         Win-rate: 20% (1/5 benchmarks)

Table 1 | Gemini 1.5 Pro compared to Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases. Detailed results are presented in Table 7.
Importantly, this leap in long-context performance does not come at the expense of the core multimodal capabilities of the model.³ Overall, we find that Gemini 1.5 Pro greatly surpasses Gemini 1.0 Pro, performing better on the vast majority of benchmarks (i.e., 27/31), increasing the margin in particular for Math, Science and Reasoning (+28.9%), Multilinguality (+22.3%), Video Understanding (+11.2%) and Code (+8.9%) (see Table 7 for breakdowns). However, a more striking comparison is the one with Gemini 1.0 Ultra, a state-of-the-art model across many capabilities. Despite Gemini 1.5 Pro using significantly less training compute and being more efficient to serve, we find Gemini 1.5 Pro to perform better on more than half of the benchmarks (16/31), in particular on text benchmarks (10/13) and many of the vision benchmarks (6/13).
In the following sections, we provide an overview of the model architecture and present the results of large-scale quantitative evaluations comparing Gemini 1.5 Pro to other LLMs. We present detailed evaluations for the model's long-context capabilities followed by evaluations of its core capabilities, similar to the Gemini 1.0 technical report (Gemini-Team et al., 2023), covering well-studied benchmarks across text, code, image, video and audio. Finally, we discuss our approach to responsible deployment, including our process for impact assessment, developing model policies, evaluations, and mitigations of harm before deployment decisions.⁴
2. Model Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on Gemini 1.0's (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google (Clark et al., 2022; Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) and language model research in the broader literature (Anil et al., 2023; Anthropic, 2023; Brown et al., 2020; Chowdhery et al., 2023; Hoffmann et al., 2022; Jiang et al., 2024; Kim et al., 2021; OpenAI, 2023; Rae et al., 2021; Raffel et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023a,b; Vaswani et al., 2017). MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This form of conditional computation (Bengio et al., 2013; Davis and Arel, 2014; Jacobs et al., 1991) allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant.

³ We define the core capabilities as those capabilities of the model that are primarily non-long-context (e.g., math, science, reasoning, multilinguality, code, etc.), similar to capabilities covered in the Gemini 1.0 technical report (Gemini-Team et al., 2023).
⁴ See the model card (Mitchell et al., 2019a) in Appendix Section 8.1.
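To make the conditional-computation idea described in this section concrete, the toy sketch below implements a top-k MoE layer in plain Python/NumPy: a learned router scores experts per token and only the selected experts' parameters are used. It is an illustrative sketch only; the expert count, top-k value, and softmax router are assumptions for exposition and say nothing about Gemini's actual architecture.

import numpy as np

def moe_layer(x, router_w, expert_ws, top_k=2):
    """Toy mixture-of-experts layer: route each token to its top_k experts.

    x:         (num_tokens, d_model) token activations
    router_w:  (d_model, num_experts) learned routing weights
    expert_ws: list of (d_model, d_model) weight matrices, one per expert
    """
    logits = x @ router_w                               # (tokens, experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)               # softmax over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]          # indices of the top_k experts
        gate = probs[t, chosen] / probs[t, chosen].sum()
        # Only the chosen experts' parameters are activated for this token, so
        # total parameters can grow while per-token compute stays roughly constant.
        for g, e in zip(gate, chosen):
            out[t] += g * (x[t] @ expert_ws[e])
    return out

# Example: 4 tokens routed among 8 experts, 2 active experts per token.
rng = np.random.default_rng(0)
d, n_exp = 16, 8
x = rng.normal(size=(4, d))
y = moe_layer(x, rng.normal(size=(d, n_exp)),
              [rng.normal(size=(d, d)) for _ in range(n_exp)])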
A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra (see Section 5), while using significantly less training compute and being significantly more efficient to serve. Gemini 1.5 Pro also incorporates a series of significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance. Translated into real-world data, this context length enables Gemini 1.5 Pro models to comfortably process almost a day of audio recordings (i.e., 22 hours), more than ten times the entirety of the 1440-page book "War and Peace" (587,287 words), the entire Flax (Heek et al., 2023) codebase (41,070 lines of code), or three hours of video at 1 frame-per-second. Further, since the model is natively multimodal and supports interleaving of data from different modalities, it can support a mix of audio, visual, text, and code inputs in the same input sequence. In Section 4.1, we highlight some of the novel capabilities enabled by these advances, including evaluations that yielded positive results on context lengths up to 10 million tokens. We note that understanding the limits of these capabilities and studying their exciting applications remains an area of continued research exploration.
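As a rough back-of-the-envelope check on these equivalences, the sketch below converts a token budget into approximate media quantities. The per-modality rates are assumptions inferred from the numbers quoted in this report and in Figure 1 (≈2M audio tokens for ≈22 hours, ≈684k video tokens for 2,674 frames at 1 FPS, ≈7M words for 10M text tokens), not documented constants.

# Approximate conversions from a context budget (in tokens) to media quantities.
# The rates below are assumptions derived from figures quoted in this report.
AUDIO_TOKENS_PER_SEC = 2_000_000 / (22 * 3600)   # ~25 tokens per second of audio
VIDEO_TOKENS_PER_FRAME = 684_000 / 2_674         # ~256 tokens per frame at 1 FPS
WORDS_PER_TEXT_TOKEN = 7_000_000 / 10_000_000    # ~0.7 words per text token

def context_budget(tokens: int) -> dict:
    """Rough media equivalents for a given context length in tokens."""
    return {
        "audio_hours": tokens / AUDIO_TOKENS_PER_SEC / 3600,
        "video_hours_at_1fps": tokens / VIDEO_TOKENS_PER_FRAME / 3600,
        "text_words": tokens * WORDS_PER_TEXT_TOKEN,
    }

print(round(context_budget(2_000_000)["audio_hours"], 1))         # ~22.0 hours of audio
print(round(context_budget(2_800_000)["video_hours_at_1fps"], 1)) # ~3.0 hours of video
print(round(context_budget(10_000_000)["text_words"] / 1e6, 1))   # ~7.0 million words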
3. Training Infrastructure and Dataset

Like Gemini 1.0 Ultra and 1.0 Pro, Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data. Our pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content. For the instruction-tuning phase, we fine-tuned Gemini 1.5 Pro on a collection of multimodal data (containing paired instructions and appropriate responses), with further tuning based on human preference data. We refer readers to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023) for further information.
4. Long-context Evaluation

Existing evaluations are increasingly strained by the new and rapidly advancing capabilities of large multimodal models. They typically focus on individual modalities and/or are restricted to tasks with shorter context lengths. Hence, there is a growing need for benchmarks which exemplify the nuanced requirements of real-world long mixed-modality use cases. Among these, we highlight the quantitative assessment of reasoning capabilities across long mixed-modality sequences as a key challenge.

With the challenges of evaluating increasingly capable models in mind, our evaluation of Gemini 1.5 Pro first focuses on understanding and evaluating its novel capabilities. Subsequently, we explore core benchmarks, covering capabilities studied in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023). Specifically, we evaluate Gemini 1.5 Pro in three main categories:⁵
1. Qualitative long-context multimodal evaluations: manually probe and stress-test the model's long-context abilities, especially for novel capabilities where no quantitative benchmarks exist.
2. Quantitative long-context multimodal evaluations: measure the model's long-context abilities on both synthetic and real-world tasks with well-defined metrics.
3. Quantitative core evaluations: identify progress and regression in core capabilities (e.g., coding, math, science, multilinguality and instruction following).
⁵ We note that all the evaluations are from the same checkpoint of the Gemini 1.5 Pro model that is instruction-tuned post pre-training, unless otherwise stated.
4.1. Qualitative Examples of Multimodal Long-Context Capabilities

The ability to process multiple millions of tokens unlocks practical applications that were not possible before. In this section we demonstrate some surprising interactions we observed with Gemini 1.5 Pro across code, text and video.
As shown in Figure 2, Gemini 1.5 Pro is able to ingest entire large codebases such as JAX (746,152 tokens) and answer very specific queries about them. In Figure 3 we show Gemini 1.5 Pro's ability to learn a new language based solely on reference materials given in its input (see Section 4.2 for quantitative metrics for this use case). Additionally, we test Gemini 1.5 Pro's ability to answer an image query given the entire text of Les Misérables and observe that being natively multimodal allows it to locate a famous scene from a hand-drawn sketch, as shown in Figure 4. Lastly, in Figure 5 we ask Gemini 1.5 Pro questions about an entire 45-minute movie, which the model answers seamlessly while retrieving moments and timestamps down to a second.⁶
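As a concrete illustration of the codebase use case, the sketch below assembles an entire repository into a single long-context prompt followed by a specific question. The file-walking logic and the commented-out generate() call are hypothetical placeholders standing in for whatever long-context model API is available; the report does not specify how these prompts were constructed.

import pathlib

def build_codebase_prompt(repo_root: str, question: str) -> str:
    """Concatenate every Python file in a repository into one prompt string."""
    parts = []
    for path in sorted(pathlib.Path(repo_root).rglob("*.py")):
        parts.append(f"--- FILE: {path} ---\n{path.read_text(errors='ignore')}")
    parts.append(f"\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_codebase_prompt(
    "jax/",  # hypothetical local checkout of the JAX repository
    "Where is the core reverse-mode automatic differentiation implemented?",
)
# answer = generate(prompt)  # hypothetical long-context model call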
Figure 2 | Given the entire 746,152-token JAX codebase in context, Gemini 1.5 Pro can identify the specific location of a core automatic differentiation method.
Figure 3 | Given a reference grammar book and a bilingual wordlist (dictionary), Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human who learned from the same materials.
⁶ For additional short videos of demonstrations of the long-context abilities of Gemini 1.5 Pro across video, text, and code, see https://deepmind.google/technologies/gemini/.
Figure 4 | With the entire text of Les Misérables in the prompt (1382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.
Figure 5 | When prompted with a 45-minute Buster Keaton movie "Sherlock Jr." (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch.
4.2. Long-context Evaluations

For the past few years, LLM research has prioritized expanding the context window from which models can incorporate information (Anthropic, 2023; OpenAI, 2023). This emphasis stems from the recognition that a wider context window allows models to incorporate a larger amount of new, task-specific information not found in the training data at inference time, leading to improved performance in various natural language or multimodal tasks. Recent approaches to improving the long-context capabilities of models fall into a few categories, including novel architectural approaches (Ainslie et al., 2023; Gu and Dao, 2023; Guo et al., 2021; Orvieto et al., 2023; Zaheer et al., 2020), post-training modifications (Bertsch et al., 2023; Chen et al.; Press et al., 2021; Xiong et al., 2023), retrieval-augmented models (Guu et al., 2020; Izacard et al., 2022; Jiang et al., 2022; Karpukhin et al., 2020; Santhanam et al., 2021), memory-augmented models (Bulatov et al., 2022, 2023; Martins et al., 2022; Mu et al., 2023; Wu et al., 2022a,b; Zhong et al., 2022), and techniques for building more coherent long-context datasets (Shi et al., 2023c; Staniszewski et al., 2023). This activity has resulted in measurable improvements in the long-context capabilities of LLMs over the past several months, with the recent concurrent work of Liu et al. (2024) exploring context windows of 7B models up to 1M multimodal tokens. Notably, among the state-of-the-art LLMs, Anthropic has successfully extended the context of their text-only Claude 2 model to 100k tokens, while OpenAI has recently released GPT-4 Turbo reaching 128k tokens. Finally, the latest addition to the series was Claude 2.1, with a context window of 200k tokens.
Gemini 1.5 Pro significantly extends this context length frontier to multiple millions of tokens with almost no degradation in performance, making it possible to process significantly larger inputs. Compared to Claude 2.1 with a 200k token context window, Gemini 1.5 Pro achieves 100% recall at 200k tokens, surpassing Claude 2.1's 98%. This 100% recall is maintained up to 530k tokens, and recall is 99.7% at 1M tokens. When increasing from 1M tokens to 10M tokens, the model retains 99.2% recall. Moreover, Gemini 1.5 Pro's native multimodal capabilities enable the model to ingest multiple hours of audio and video recordings alongside or interleaved with text. Such recall capabilities are summarized in Figure 1. Below we report results on long-context evaluations across all three modalities, i.e., text, vision and audio.
The evaluation methodology we followed to measure the long-context capability of Gemini 1.5 Pro consists of both diagnostic-focused probing of the long-context capabilities (e.g., perplexity over long sequences, needle-in-a-haystack retrieval studies) and realistic evaluations specifically designed for multimodal long-context tasks (e.g., long-document QA, long-context automatic speech recognition, learning to translate a new language from only one book, and long-context video QA). To provide a reference point, throughout this section we compare Gemini 1.5 Pro with the leading model available externally for each task. With the evaluation harness we developed for Gemini 1.5 Pro, we are able to quantify the quality of long-context understanding capabilities reliably all the way up to 10M tokens.
4.2.1. Diagnostic Long-Context Evaluations

Perplexity over Long Sequences
We start by reporting results on the text modality. To evaluate the ability of the models to make use of very long contexts to improve next-token prediction, which is the objective function used to train language models, we record the negative log-likelihood (NLL) of tokens at different positions in the input sequences from held-out text (i.e., not used in training). Here, a lower value implies an improved prediction. Typically, we expect tokens at the beginning of a sequence to have high NLL, as there is little to no context that the model can use to predict them, and tokens later in the sequence to have lower NLL as more information becomes available to the model.
[Figure 6 panels: "Cumulative Average NLL for Code" (sequence positions 128 to 10M) and "Cumulative Average NLL for Long Documents" (sequence positions 256 to 1M); curves for Gemini 1.0 Pro and Gemini 1.5 Pro with power-law fits (r² = 0.998); y-axis: negative log-likelihood (lower is better).]
Figure 6 | Cumulative average negative log-likelihood (NLL) as a function of token position in long documents and code data. A lower value demonstrates better prediction. Gemini 1.5 Pro shows improved predictions up to 1M tokens for long documents and 10M tokens for code, whereas Gemini 1.0 Pro improves up to only 32K tokens. The NLL follows a power-law trend up until 1M tokens (documents) and 2M tokens (code), with a deviating trend at 10M tokens.
The shape of the resulting curve indicates the abilities of models to reason over long context. A downward trend signifies models making use of long context to reduce the models' uncertainty. On the other hand, an upward trend signifies that models are unable to effectively use information from the previous context and may be deteriorating in prediction quality, highlighting the limitations in their long-context understanding capability.
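A minimal sketch of the measurement itself is given below: given per-token negative log-likelihoods from any causal language model on held-out text, the cumulative average at position n is simply the mean NLL of the first n tokens. The token_nlls input is assumed to come from an external scoring pass; the report does not prescribe a specific implementation.

import numpy as np

def cumulative_average_nll(token_nlls):
    """Cumulative mean NLL up to each token position (lower is better).

    token_nlls: 1-D array where token_nlls[i] is -log p(token_i | tokens_<i).
    Returns an array y with y[n] = mean(token_nlls[:n+1]).
    """
    nll = np.asarray(token_nlls, dtype=np.float64)
    positions = np.arange(1, nll.size + 1)
    return np.cumsum(nll) / positions

# A downward-sloping curve means later tokens are, on average, easier to
# predict, i.e., the model is exploiting the growing context.
curve = cumulative_average_nll(np.random.default_rng(0).exponential(1.0, 4096))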
We perform this analysis on two data sources: (a) a dataset of long documents with up to 1 million tokens, and (b) a dataset of code repositories constructed by first randomly shuffling all the files and then concatenating them. The code dataset contains sequences longer than 1 million tokens with some natural form of semantic association (e.g., a whole repository), allowing for further evaluation of sequences of up to 10M tokens. Figure 6 shows the cumulative NLL up to a specific token index.⁷ We also fit a power law of the form L(x) = αx^β + γ to these data points (dashed line).
We find in Figure 6 that NLL decreases monotonically with sequence length, and thus prediction accuracy improves up to the tested sequence lengths (1M for long documents, and 10M for code), indicating that our models can make use of the whole input even at very long context lengths. This suggests that the model is able to improve its predictions by finding useful patterns in tokens even if they occurred millions of tokens in the past, as in the case of code.
Finally, we see this improved prediction follows a regular power-law structure. While it is well known that language models follow a power law in terms of training compute to model performance (NLL) (Kaplan et al., 2020) up to a very large scale, we demonstrate that a power law can hold between log-loss and context length up to extremely long context lengths. We see the power-law fit is quite accurate up to 1M tokens for long documents and about 2M tokens for code. From inspecting longer code token predictions closer to 10M, we see a phenomenon of the increased context occasionally providing outsized benefit (e.g., due to repetition of code blocks), which may explain the power-law deviation. However, this deserves further study, and may be dependent on the exact dataset used.
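The power-law fit reported above can be reproduced with a standard curve fit. The sketch below fits L(x) = αx^β + γ to (position, cumulative NLL) pairs and reports an r² value; the initial guesses, the log-spaced evaluation positions, and the synthetic example data are assumptions of this sketch, not details given in the report.

import numpy as np
from scipy.optimize import curve_fit

def power_law(x, alpha, beta, gamma):
    # L(x) = alpha * x**beta + gamma, with beta < 0 for a decreasing NLL curve.
    return alpha * np.power(x, beta) + gamma

def fit_power_law(positions, cum_nll):
    """Fit the cumulative-average-NLL curve; returns (alpha, beta, gamma, r2)."""
    params, _ = curve_fit(power_law, positions, cum_nll,
                          p0=(1.0, -0.1, 0.5), maxfev=20_000)
    pred = power_law(positions, *params)
    ss_res = np.sum((cum_nll - pred) ** 2)
    ss_tot = np.sum((cum_nll - np.mean(cum_nll)) ** 2)
    return (*params, 1.0 - ss_res / ss_tot)

# Example on synthetic data that follows a power law with small noise.
x = np.unique(np.logspace(0, 6, 200).astype(int)).astype(float)
y = 2.0 * x ** -0.15 + 0.8 + np.random.default_rng(0).normal(0, 0.01, x.size)
alpha, beta, gamma, r2 = fit_power_law(x, y)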
⁷ We note that we are unable to obtain logits for other commercially available LLMs for comparison.
[Figure 7 panels: "Gemini 1.5 Pro: From 1k to 1M tokens", "Up to 10M tokens", and "GPT-4 Turbo: From 1k to 128k tokens"; x-axis: tokens, y-axis: needle depth (%).]
Figure 7 | Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo for the text needle-in-a-haystack task. Green cells indicate the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left), and from 1M to 10M tokens (top right). The bottom row shows results on GPT-4 Turbo up to the maximum supported context length of 128k tokens. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones.
Text Haystack

Next, we move to testing long-context recall using the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023), which tests a model's ability to retrieve a text (i.e., the "needle") inserted at various positions into a sequence (i.e., the "haystack"). Following prior work (Dhinakaran, 2024), we use a set of concatenated and repeated essays written by Paul Graham⁸ to fill the desired context length. We insert a needle at linearly spaced intervals from the beginning to the end of the context, where the needle is "The special magic {city} number is: {number}", with the city and number varied for each query, and prompt the model with "Here is the magic number:". We report whether the magic number recall was correct at various context lengths (x-axis, the haystack) as a function of its position in the input sequence expressed in terms of depth percentage (y-axis); e.g., a depth of 100% would indicate a needle inserted at the very end of the input, whereas 0% is at the very beginning.
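A sketch of how such a needle-in-a-haystack instance can be constructed is given below. Only the needle template and the final cue follow the wording quoted above; the filler text, city/number pools, and depth grid are illustrative assumptions, and context length is approximated in words rather than model tokens.

import random

rng = random.Random(0)

def make_haystack_example(filler_text, context_tokens, depth_pct):
    """Build one needle-in-a-haystack prompt (context measured in words here
    as a stand-in for tokens); returns (prompt, expected_answer)."""
    city = rng.choice(["Berlin", "Nairobi", "Lima", "Osaka"])
    number = rng.randint(10_000, 99_999)
    needle = f"The special magic {city} number is: {number}."
    filler = filler_text.split()
    words = (filler * (context_tokens // len(filler) + 1))[:context_tokens]
    insert_at = int(len(words) * depth_pct / 100)   # 0% = very beginning, 100% = very end
    words[insert_at:insert_at] = needle.split()
    return " ".join(words) + "\nHere is the magic number:", str(number)

# Score recall over a grid of context lengths and needle depths.
essay = "Paul Graham essay text goes here. " * 50   # placeholder filler text
for depth in (0, 25, 50, 75, 100):
    prompt, answer = make_haystack_example(essay, context_tokens=2_000, depth_pct=depth)
    # correct = answer in model_generate(prompt)    # hypothetical model call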
As can be seen in Figure 7, Gemini 1.5 Pro achieves 100% recall up to 530k tokens and >99.7% recall up to 1M tokens. This task, while simple, provides a clear demonstration that Gemini 1.5 Pro is able to reliably retrieve information from long documents up to 1M tokens. For reference, we report results for GPT-4 Turbo up to its maximum supported sequence length of 128K tokens.