arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

AGENT AI:

SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante1†∗, Qiuyuan Huang2‡∗, Naoki Wake2∗, Ran Gong3†, Jae Sung Park4†, Bidipta Sarkar1†, Rohan Taori1†, Yusuke Noda5, Demetri Terzopoulos3, Yejin Choi4, Katsushi Ikeuchi2, Hoi Vo5, Li Fei-Fei1, Jianfeng Gao2

1Stanford University; 2Microsoft Research, Redmond;

3University of California, Los Angeles; 4University of Washington; 5Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data. We present the general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm.

∗Equal Contribution. ‡Project Lead. †Work done while interning at Microsoft Research, Redmond.

Agent AI:

Surveying the Horizons of Multimodal Interaction A PREPRINT

ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Contents

1 Introduction
    1.1 Motivation
    1.2 Background
    1.3 Overview
2 Agent AI Integration
    2.1 Infinite AI agent
    2.2 Agent AI with Large Foundation Models
        2.2.1 Hallucinations
        2.2.2 Biases and Inclusivity
        2.2.3 Data Privacy and Usage
        2.2.4 Interpretability and Explainability
        2.2.5 Inference Augmentation
        2.2.6 Regulation
    2.3 Agent AI for Emergent Abilities
3 Agent AI Paradigm
    3.1 LLMs and VLMs
    3.2 Agent Transformer Definition
    3.3 Agent Transformer Creation
4 Agent AI Learning
    4.1 Strategy and Mechanism
        4.1.1 Reinforcement Learning (RL)
        4.1.2 Imitation Learning (IL)
        4.1.3 Traditional RGB
        4.1.4 In-context Learning
        4.1.5 Optimization in the Agent System
    4.2 Agent Systems (zero-shot and few-shot level)
        4.2.1 Agent Modules
        4.2.2 Agent Infrastructure
    4.3 Agentic Foundation Models (pretraining and finetune level)
5 Agent AI Categorization
    5.1 Generalist Agent Areas
    5.2 Embodied Agents
        5.2.1 Action Agents
        5.2.2 Interactive Agents
    5.3 Simulation and Environments Agents
    5.4 Generative Agents
        5.4.1 AR/VR/mixed-reality Agents
    5.5 Knowledge and Logical Inference Agents
        5.5.1 Knowledge Agent
        5.5.2 Logic Agents
        5.5.3 Agents for Emotional Reasoning
        5.5.4 Neuro-Symbolic Agents
    5.6 LLMs and VLMs Agent
6 Agent AI Application Tasks
    6.1 Agents for Gaming
        6.1.1 NPC Behavior
        6.1.2 Human-NPC Interaction
        6.1.3 Agent-based Analysis of Gaming
        6.1.4 Scene Synthesis for Gaming
        6.1.5 Experiments and Results
    6.2 Robotics
        6.2.1 LLM/VLM Agent for Robotics
        6.2.2 Experiments and Results
    6.3 Healthcare
        6.3.1 Current Healthcare Capabilities
    6.4 Multimodal Agents
        6.4.1 Image-Language Understanding and Generation
        6.4.2 Video and Language Understanding and Generation
        6.4.3 Experiments and Results
    6.5 Video-language Experiments
    6.6 Agent for NLP
        6.6.1 LLM agent
        6.6.2 General LLM agent
        6.6.3 Instruction-following LLM agents
        6.6.4 Experiments and Results
7 Agent AI Across Modalities, Domains, and Realities
    7.1 Agents for Cross-modal Understanding
    7.2 Agents for Cross-domain Understanding
    7.3 Interactive agent for cross-modality and cross-reality
    7.4 Sim to Real Transfer
8 Continuous and Self-improvement for Agent AI
    8.1 Human-based Interaction Data
    8.2 Foundation Model Generated Data
9 Agent Dataset and Leaderboard
    9.1 “CuisineWorld” Dataset for Multi-agent Gaming
        9.1.1 Benchmark
        9.1.2 Task
        9.1.3 Metrics and Judging
        9.1.4 Evaluation
    9.2 Audio-Video-Language Pre-training Dataset
10 Broader Impact Statement
11 Ethical Considerations
12 Diversity Statement
References
Appendix
A GPT-4V Agent Prompt Details
B GPT-4V for Bleeding Edge
C GPT-4V for Microsoft Flight Simulator
D GPT-4V for Assassin’s Creed Odyssey
E GPT-4V for Gears of War 4
F GPT-4V for Starfield
Author Biographies
Acknowledgements

1 Introduction

1.1 Motivation

Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky’s MIT group built in 1970 a robotics system, called the “Copy Demo,” that observed “blocks world” scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and further research was necessary. The AI field fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.

To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle’s Final Cause, the teleological “why the system exists”, which may have been overlooked in previous rounds of AI development.

With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise (OpenAI, 2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules (Xi et al., 2023). These scenarios, characterized by complex reasoning, understanding of the agent’s role and its environment, along with multi-step planning, test the agent’s ability to make highly nuanced and intricate decisions within its environmental constraints (Wu et al., 2023; Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).

Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments. In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that Agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.

1.2 Background

We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.

Large Foundation Models: LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a,b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.


Embodied AI: A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically exploiting the LLMs’ WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing a natural language instruction into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, these works incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
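The decompose-execute-replan pattern described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the method of any cited work: `ToyWorld`, `stub_planner`, `low_level_controller`, and `run` are illustrative names, and the planner is a hard-coded rule where a real system would query an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWorld:
    """Hypothetical environment: a cup and a drawer that starts closed."""
    drawer_open: bool = False
    holding: str = ""
    log: List[str] = field(default_factory=list)


def stub_planner(instruction: str, feedback: str = "") -> List[str]:
    """Stand-in for an LLM call: turns an instruction (plus feedback) into subtasks."""
    plan = ["pick cup", "place cup in drawer"]
    if "drawer is closed" in feedback:
        # Replan using the error message reported by the environment.
        plan = ["open drawer"] + plan
    return plan


def low_level_controller(world: ToyWorld, subtask: str) -> str:
    """Executes one subtask; returns "" on success or an error message on failure."""
    if subtask == "open drawer":
        world.drawer_open = True
    elif subtask == "pick cup":
        world.holding = "cup"
    elif subtask == "place cup in drawer":
        if not world.drawer_open:
            return "drawer is closed"
        world.holding = ""
    else:
        return f"unknown subtask: {subtask}"
    world.log.append(subtask)
    return ""


def run(instruction: str, world: ToyWorld, max_replans: int = 3) -> bool:
    """Plans, executes subtasks in order, and replans when a subtask fails."""
    feedback = ""
    for _ in range(max_replans):
        plan = stub_planner(instruction, feedback)
        feedback = ""
        for subtask in plan:
            feedback = low_level_controller(world, subtask)
            if feedback:
                break  # stop this plan and feed the error back to the planner
        if not feedback:
            return True
    return False


world = ToyWorld()
succeeded = run("put the cup in the drawer", world)
```

In this toy run the first plan fails at the closed drawer, the error string is passed back to the planner, and the second plan succeeds. The design point is only that the planner and the controller communicate through plain subtask strings and feedback messages.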

Interactive Learning: AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback (Li et al., 2023b; Yu et al., 2023a; Parakh et al., 2023; Zha et al., 2023; Wake et al., 2023a,b,c). For example, if a user corrects the AI’s response, the AI can use this information to improve future responses (Zha et al., 2023; Liu et al., 2023a). 2) Observational Learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language in multimodal settings, interpret cross-reality context, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent’s performance generally improves continuously. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
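As a rough illustration of the two interaction modes above, the following Python sketch stores explicit user corrections (feedback-based learning) and counts incoming queries (a very simple form of observational learning). The class `InteractiveAgent` and its methods are hypothetical names introduced for this example; a lookup table stands in for a trained model, so this is a sketch of the control flow only, not of how the cited systems learn.

```python
from collections import Counter
from typing import Dict, List


class InteractiveAgent:
    """Toy agent: fixed base answers, overridden by user corrections."""

    def __init__(self, base_answers: Dict[str, str]) -> None:
        self.base_answers = {self._key(q): a for q, a in base_answers.items()}
        self.corrections: Dict[str, str] = {}
        self.query_counts: Counter = Counter()

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so similar queries share one entry.
        return " ".join(query.lower().split())

    def respond(self, query: str) -> str:
        key = self._key(query)
        self.query_counts[key] += 1  # observational signal: what users ask
        if key in self.corrections:
            return self.corrections[key]  # a correction overrides the base answer
        return self.base_answers.get(key, "I am not sure.")

    def give_feedback(self, query: str, corrected: str) -> None:
        """Feedback-based learning: remember the user's correction."""
        self.corrections[self._key(query)] = corrected

    def frequent_queries(self, top_n: int = 1) -> List[str]:
        """Most frequently observed queries, most common first."""
        return [q for q, _ in self.query_counts.most_common(top_n)]


agent = InteractiveAgent({"capital of australia?": "Sydney"})
first = agent.respond("Capital of Australia?")
agent.give_feedback("capital of australia?", "Canberra")
second = agent.respond("Capital of  Australia?")
```

Here the first response is the incorrect base answer and the second reflects the user's correction. In a learned system the correction would instead feed a fine-tuning or preference-learning step, ideally with the human oversight the paragraph above describes.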

1.3 Overview

Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge on the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance their research. Specific learning outcomes include:

MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.

Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.

Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.

Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.

Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.

Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, it must be natural to interact with and generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.

This paper aims to provide general and comprehensive knowledge about the current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces Agent AI applications in gaming, robotics, and healthcare. Section 7 explores the research community’s efforts to develop a versatile Agent AI, capable of being applied across various modalities and domains and of bridging the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users. Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI. Section 11 discusses the ethical considerations of AI agents, along with the limitations and societal impact of our paper.

2 Agent AI Integration

Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios (Huang et al., 2023a; Zeng et al., 2023). Consequently, these limitations lead to sub-optimal outputs from AI agents. Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g., text or string representations of the world state) and generally use domain and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment (Xi et al., 2023; Wang et al., 2023c; Gong et al., 2023a; Wu et al., 2023). In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results that indicate knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions (Huang et al., 2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input to form a complex and adaptive H
