Microsoft: Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Abstract

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

Contents

1 Introduction
  1.1 Our approach to studying GPT-4's intelligence
  1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
  2.1 Integrative ability
  2.2 Vision
    2.2.1 Image generation beyond memorization
    2.2.2 Image generation following detailed instructions (à la Dall-E)
    2.2.3 Possible application in sketch generation
  2.3 Music
3 Coding
  3.1 From instructions to code
    3.1.1 Coding challenges
    3.1.2 Real world scenarios
  3.2 Understanding existing code
4 Mathematical abilities
  4.1 A mathematical conversation with GPT-4
    4.1.1 A first generalization of the original question
    4.1.2 A second variant of the original question
    4.1.3 Analysis of the limitations highlighted by conversation
  4.2 Performance on mathematical problem datasets
  4.3 Mathematical modeling in various domains
  4.4 Higher level mathematics
5 Interaction with the world
  5.1 Tool use
    5.1.1 Using multiple tools to solve more complex tasks
    5.1.2 Discussion
  5.2 Embodied Interaction
    5.2.1 Warmup: navigating a map
    5.2.2 Text-based games
    5.2.3 Real world problems
    5.2.4 Discussion
6 Interaction with humans
  6.1 Understanding Humans: Theory of Mind
    6.1.1 Testing specific aspects of theory of mind
    6.1.2 Testing theory of mind in realistic scenarios
    6.1.3 Discussion
  6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
  7.1 PII Detection
  7.2 Misconceptions and Fact-Checking
    7.2.1 Why Are Current Metrics Insufficient?
    7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
  8.1 Warm-up with two basic examples
  8.2 Lack of planning in arithmetic/reasoning problems
  8.3 Lack of planning in text generation
9 Societal influences
  9.1 Challenges of erroneous generations
  9.2 Misinformation and manipulation
  9.3 Bias
  9.4 Human expertise, jobs, and economics
  9.5 Constellation of influences and considerations
10 Directions and Conclusions
  10.1 Definitions of intelligence, AI, and AGI
  10.2 On the path to more general artificial intelligence
  10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
  B.1 Further details on integrative ability results
  B.2 Further details on vision results
  B.3 Graphic novel design example
C Appendix for the Coding section
  C.1 Measuring human performance on LeetCode
  C.2 Example of GPT-4 visualizing IMDb data
  C.3 More examples on visualization
  C.4 Example for 2D HTML game development
  C.5 Example for graphical user interface programming
  C.6 Example for reverse engineering
  C.7 Testing GPT-4's ability to execute (pseudo) code
D Additional examples for mathematical reasoning
  D.1 Limitations
  D.2 Further examples
  D.3 Generating math problems with GPT-4
  D.4 Mitigating calculation errors via external code execution
E Additional Interpretability Examples
  E.1 Explanation Agent Mismatches
F Additional examples for interaction with the world
  F.1 Interact with tools
  F.2 Examples for interaction with environments

Something unknown is doing we don't know what.
- Sir Arthur Eddington

1 Introduction

Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists. An attempt to capture its essence was made in 1994 by a group of 52 psychologists who signed onto a broad definition published in an editorial about the science of intelligence [Got97]. The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities. Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial intelligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06]. Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning (e.g., [NSS59], [LBFL93]) and construction of knowledge bases containing large corpora of commonsense knowledge [Len95]. However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively. In the late-1990s and into the 2000s, there were increasing calls for developing more general AI systems (e.g., [SBD+96]) and scholarship in the field has sought to identify principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]). The phrase, "artificial general intelligence" (AGI), was popularized in the early-2000s (see [Goe14]) to emphasize the aspiration of moving from the "narrow AI", as demonstrated in the focused, real-world applications being developed, to broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research. We use AGI to refer to systems that demonstrate broad capabilities of intelligence as captured in the 1994 definition above, with the additional requirement, perhaps implicit in the work of the consensus group, that these capabilities are at or above human-level. We note however that there is no single definition of AGI that is broadly accepted, and we discuss other definitions in the conclusion section.

The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs). These neural network models are based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at its core a self-supervised objective of predicting the next word in a partial sentence. In this paper, we report on evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4 [Ope23], exhibits many traits of intelligence, according to the 1994 definition. Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more. We interacted with GPT-4 during its early development by OpenAI using purely natural language queries (prompts)¹. In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to write a proof of infinitude of primes in the form of a poem, to draw a unicorn in TiKZ (a language for creating graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical problem. It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable from (or even better than) what humans could produce. We also compare GPT-4's performance to that of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20]. In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ unicorn drawing. While the system performs non-trivially on both tasks, there is no comparison with the outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great variety of tasks. The combination of the generality of GPT-4's capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.

¹ As GPT-4's development continued after our experiments, one should expect different responses from the final version of GPT-4. In particular, all quantitative results should be viewed as estimates of the model's potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.

GPT-4

Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?

GPT-4:
Yes, I think I can,
Though it might take a clever plan.
I'll start by noting Euclid's proof,
Which shows that primes aren't just aloof.
Assume we have a finite list,
Of primes, and that none have been missed.
Multiply them all together,
And add one, just to be clever.
The result will be a number,
That has no prime factors, oh wonder!
But every number has a prime,
That divides it, it's just a matter of time.
So we've found a contradiction,
And our finite list needs eviction.
There must be infinitely many primes,
And that's the end of my rhyming lines.

Prompt: Draw a unicorn in TiKZ.

GPT-4: [Produces LaTeX compiling to the following picture.]

Prompt: Can you generate a pyplot for the following data: x = [1, 3, 5, 6, 8], y1 = [2, 3, 5, 18, 1], y2 = [3, 5, 6, 8, 1], y3 = [5, 1, 2, 3, 4], y4 = [9, 7, 2, 3, 1]. I want y1, y2 to be in the same plot, but y3 is in another plot next to that plot, y4 is in below. I want the legend of y1 to be "bob", y2 to be "alice", y3 to be "bilbo", y4 to be "allie". I want the x-axis to be labeled with "time" and y axis to be labeled with "money". I want to add a 10%-40% random error bar to each curve, through all times (including non-integers). I want smoothed curves to show the plot, and smoothed error bar. Do not use linear interpolation, use smooth interpolation! I want to also add some small zig-zag to the smoothed curve to make it look more real. I want to put a baseline as the mean of each line. I want to put a pie chart below indicating the fraction of the four people in each time step. I also want an animation to show how the fractions are changing in the pie chart in **continuous time**. Interpolate the missing fractions! I also want vertical line animation in other three plots to match the pie chart. I want the fanciest plot. Please add as many fancy things as possible.
Here is a possible code to generate a pyplot for the following data, with some comments to explain the steps:

GPT-4: [We give two snapshots from the animation resulting from the produced code]

Prompt: Suppose g(x) = f^{-1}(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6 what is f(f(f(6)))?

GPT-4: There are different ways to approach this problem, but one possible method is to use the fact that g(x) = f^{-1}(x) implies that f(g(x)) = x for any x in the domain of g. That is, applying f to both sides of the equation g(x) = f^{-1}(x) gives f(g(x)) = f(f^{-1}(x)) = x. This means that we can undo the effect of f by applying g, and vice versa. Using this idea, we can try to find f(6) by looking for the value of x such that g(x) = 6. From the given information, we see that g(9) = 6, so f(6) = f(g(9)) = 9. Similarly, we can find f(9) by looking for the value of x such that g(x) = 9. We see that g(7) = 9, so f(9) = f(g(7)) = 7. Finally, we can find f(7) by looking for the value of x such that g(x) = 7. We see that g(4) = 7, so f(7) = f(g(4)) = 4. Therefore, f(f(f(6))) = f(f(9)) = f(7) = 4.

Figure 1.1: Preliminary examples of GPT-4's capabilities in language, vision, coding, and mathematics.

ChatGPT

Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?

ChatGPT:
Sure! Here is a proof that there are infinitely many primes, with each line of the proof rhyming:
Consider the set of all primes.
Call this set P and let it start with just two, the first prime.
If we take all the primes in P and add one,
We get a new set of numbers that has never been done.
This set, call it S, is composed of only composite numbers.
No primes will be found among them, that's for sure.
But we know that there must be at least one prime,
Since S is infinite, and that much is clear.
If we take the smallest number in S that is not in P,
And call it p, we can add it to our set, don't you see?
But this process can be repeated indefinitely.
Thus, our set P must also be infinite, you'll agree.
So there you have it, a proof that's quite solid.
There are infinitely many primes, that much we can see.
No need to be coy, no need to be contrite.
Infinity is a big number, just like P.

Prompt: Draw a unicorn in TiKZ.

ChatGPT: [Produces LaTeX compiling to the following picture.]

Figure 1.2: ChatGPT output on the first two prompts from Figure 1.1.

Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do (which is one of the usual definitions of AGI; see the conclusion section for more on this), or that it has inner motivation and goals (another key aspect in some definitions of AGI). In fact, even within the restricted context of the 1994 definition of intelligence, it is not fully clear how far GPT-4 can go along some of those axes of intelligence, e.g., planning (see Section 8), and arguably it is entirely missing the part on "learn quickly and learn from experience" as the model is not continuously updating (although it can learn within a session, see Section 5 for example). Overall GPT-4 still has many limitations and biases, which we discuss in detail below and that are also covered in OpenAI's report [Ope23]. In particular it still suffers from some of the well-documented shortcomings of LLMs such as the problem of hallucinations [MNBM20] (see Figure 1.8) or making basic arithmetic mistakes [CKB+21] (see Appendix D), and yet it has also overcome some fundamental obstacles such as acquiring many non-linguistic capabilities (e.g., it solves most of the LLM failure modes described in [MIB+23], and it also made great progress on common-sense, see Figure 1.7 for a first example and Appendix A for more). This highlights the fact that, while GPT-4 is at or beyond human-level for many tasks, overall its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has improved throughout our time testing it, see Figure 1.3 for the evolution of the unicorn drawing over the course of a month of training². Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely (see again the conclusion section for more discussion on this). Our primary goal in composing this paper is to share our exploration of GPT-4's capabilities and limitations in support of our assessment that a technological leap has been achieved. We believe that GPT-4's intelligence signals a true paradigm shift in the field of computer science and beyond.

² Note that the improving we refer to here is a slow type of learning, which eventually comes to a halt, as opposed to the fast-paced and real-time learning one would expect from an AGI.

Figure 1.3: We queried GPT-4 three times, at roughly equal time intervals over the span of a month while the system was being refined, with the prompt "Draw a unicorn in TikZ". We can see a clear evolution in the sophistication of GPT-4's drawings.

1.1 Our approach to studying GPT-4's intelligence

How can we measure the intelligence of an LLM that has been trained on an unknown but extremely vast corpus of web-text data? The standard approach in machine learning is to evaluate the system on a set of standard benchmark datasets, ensuring that they are independent of the training data and that they cover a range of tasks and domains. This approach is designed to separate true learning from mere memorization, and is backed up by a rich theoretical framework [SSBD14, MRT18]. However, this methodology is not necessarily suitable for studying GPT-4, for two reasons. First, since we do not have access to the full details of its vast training data, we have to assume that it has potentially seen every existing benchmark, or at least some similar data. For example, it seems like GPT-4 knows the recently proposed BIG-bench [SRR+22] (at least GPT-4 knows the canary GUID from BIG-bench). Of course, OpenAI themselves have access to all the training details, and thus their report [Ope23] contains a lot of detailed benchmark results. Nevertheless, the second reason for going beyond traditional benchmarks is probably more significant: One of the key aspects of GPT-4's intelligence is its generality, the ability to seemingly understand and connect any topic, and to perform tasks that go beyond the typical scope of narrow AI systems. Some of GPT-4's most impressive performances are on tasks that do not admit a single solution, such as writing a graphical user interface (GUI) or helping a human brainstorm on some work-related problem. Benchmarks for such generative or interactive tasks can be designed too, but the metric of evaluation becomes a challenge (see e.g., [PSZ+21] for some recent progress on this active research area in NLP). We note that criticisms of the standard approach to measure AI systems were also made in [Cho19], where a new benchmark was proposed to evaluate general intelligence. We do not test GPT-4 on the latter benchmark for the reasons previously mentioned, as well as the fact that the benchmark is visual in nature and thus more appropriate for the multimodal version of GPT-4 described in [Ope23].

To overcome the limitations described above, we propose here a different approach to studying GPT-4 which is closer to traditional psychology than to machine learning, leveraging human creativity and curiosity. We aim to generate novel and difficult tasks and questions that convincingly demonstrate that GPT-4 goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and domains (a somewhat similar approach was also proposed in [CWF+22]). We also aim to probe GPT-4's responses and behaviors, to verify its consistency, coherence, and correctness, and to uncover its limitations and biases. We acknowledge that this approach is somewhat subjective and informal, and that it may not satisfy the rigorous standards of scientific evaluation. However, we believe that it is a useful and necessary first step to appreciate the remarkable capabilities and challenges of GPT-4, and that such a first step opens up new opportunities for developing more formal and comprehensive methods for testing and analyzing AI systems with more general intelligence.

To illustrate our approach to assessing GPT-4's intelligence, let us consider the first two example interactions with GPT-4 that we have in Figure 1.1. The first example is asking GPT-4 to write a proof of the infinitude of primes in the form of a poem. This is a challenging task that requires combining elementary

Figure 1.4: We gave to GPT-4 a transformed version of the TikZ c
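
The inverse-function question in Figure 1.1 can be checked mechanically. The following Python sketch is not part of the paper: it assumes the five given values of g are the only ones needed, builds f by inverting that table, and applies f three times starting from 6. The helper names `invert` and `iterate` are hypothetical and introduced only for illustration.

```python
# Values of g as given in the prompt of Figure 1.1; g is stated to be the inverse of f.
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}


def invert(mapping):
    """Return the inverse of a one-to-one mapping (hypothetical helper)."""
    inverse = {}
    for x, y in mapping.items():
        if y in inverse:
            raise ValueError("mapping is not one-to-one, so it has no inverse")
        inverse[y] = x
    return inverse


def iterate(table, start, times):
    """Apply the function represented by `table` to `start`, `times` times."""
    value = start
    for _ in range(times):
        value = table[value]  # raises KeyError if the value is outside the known domain
    return value


# Since g = f^{-1}, f is the inverse of g: g(9) = 6 means f(6) = 9, and so on.
f = invert(g)

result = iterate(f, 6, 3)  # f(6) = 9, f(9) = 7, f(7) = 4
print(result)  # 4
```

The lookup-table approach mirrors the reasoning in GPT-4's answer: each step finds the x with g(x) equal to the current value. It only works because every intermediate value (6, 9, 7) happens to appear among the given outputs of g.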
