2024 Stanford Agent AI Paper (English)

arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

AGENT AI:

SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante1†∗, Qiuyuan Huang2‡∗, Naoki Wake2∗, Ran Gong3†, Jae Sung Park4†, Bidipta Sarkar1†, Rohan Taori1†, Yusuke Noda5, Demetri Terzopoulos3, Yejin Choi4, Katsushi Ikeuchi2, Hoi Vo5, Li Fei-Fei1, Jianfeng Gao2

1Stanford University; 2Microsoft Research, Redmond;

3University of California, Los Angeles; 4University of Washington; 5Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data. We present the general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm.

∗Equal Contribution. ‡Project Lead. †Work done while interning at Microsoft Research, Redmond.

Agent AI: Surveying the Horizons of Multimodal Interaction    A PREPRINT

ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Contents

1 Introduction
    1.1 Motivation
    1.2 Background
    1.3 Overview

2 Agent AI Integration
    2.1 Infinite AI agent
    2.2 Agent AI with Large Foundation Models
        2.2.1 Hallucinations
        2.2.2 Biases and Inclusivity
        2.2.3 Data Privacy and Usage
        2.2.4 Interpretability and Explainability
        2.2.5 Inference Augmentation
        2.2.6 Regulation
    2.3 Agent AI for Emergent Abilities

3 Agent AI Paradigm
    3.1 LLMs and VLMs
    3.2 Agent Transformer Definition
    3.3 Agent Transformer Creation

4 Agent AI Learning
    4.1 Strategy and Mechanism
        4.1.1 Reinforcement Learning (RL)
        4.1.2 Imitation Learning (IL)
        4.1.3 Traditional RGB
        4.1.4 In-context Learning
        4.1.5 Optimization in the Agent System
    4.2 Agent Systems (zero-shot and few-shot level)
        4.2.1 Agent Modules
        4.2.2 Agent Infrastructure
    4.3 Agentic Foundation Models (pretraining and finetune level)

5 Agent AI Categorization
    5.1 Generalist Agent Areas
    5.2 Embodied Agents
        5.2.1 Action Agents
        5.2.2 Interactive Agents
    5.3 Simulation and Environments Agents
    5.4 Generative Agents
        5.4.1 AR/VR/mixed-reality Agents
    5.5 Knowledge and Logical Inference Agents
        5.5.1 Knowledge Agent
        5.5.2 Logic Agents
        5.5.3 Agents for Emotional Reasoning
        5.5.4 Neuro-Symbolic Agents
    5.6 LLMs and VLMs Agent

6 Agent AI Application Tasks
    6.1 Agents for Gaming
        6.1.1 NPC Behavior
        6.1.2 Human-NPC Interaction
        6.1.3 Agent-based Analysis of Gaming
        6.1.4 Scene Synthesis for Gaming
        6.1.5 Experiments and Results
    6.2 Robotics
        6.2.1 LLM/VLM Agent for Robotics
        6.2.2 Experiments and Results
    6.3 Healthcare
        6.3.1 Current Healthcare Capabilities
    6.4 Multimodal Agents
        6.4.1 Image-Language Understanding and Generation
        6.4.2 Video and Language Understanding and Generation
        6.4.3 Experiments and Results
    6.5 Video-language Experiments
    6.6 Agent for NLP
        6.6.1 LLM agent
        6.6.2 General LLM agent
        6.6.3 Instruction-following LLM agents
        6.6.4 Experiments and Results

7 Agent AI Across Modalities, Domains, and Realities
    7.1 Agents for Cross-modal Understanding
    7.2 Agents for Cross-domain Understanding
    7.3 Interactive agent for cross-modality and cross-reality
    7.4 Sim to Real Transfer

8 Continuous and Self-improvement for Agent AI
    8.1 Human-based Interaction Data
    8.2 Foundation Model Generated Data

9 Agent Dataset and Leaderboard
    9.1 “CuisineWorld” Dataset for Multi-agent Gaming
        9.1.1 Benchmark
        9.1.2 Task
        9.1.3 Metrics and Judging
        9.1.4 Evaluation
    9.2 Audio-Video-Language Pre-training Dataset

10 Broader Impact Statement

11 Ethical Considerations

Diversity Statement

References

Appendix

GPT-4V Agent Prompt Details

GPT-4V for Bleeding Edge

GPT-4V for Microsoft Flight Simulator

GPT-4V for Assassin’s Creed Odyssey

GPT-4V for GEARS of WAR 4

GPT-4V for Starfield

Author Biographies

Acknowledgements


1 Introduction

1.1 Motivation

Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky’s MIT group built in 1970 a robotics system, called the “Copy Demo,” that observed “blocks world” scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and further research was necessary. The AI field fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.

To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle’s Final Cause, the teleological “why the system exists”, which may have been overlooked in previous rounds of AI development.

With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise (OpenAI, 2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules (Xi et al., 2023). These scenarios, characterized by complex reasoning, understanding of the agent’s role and its environment, along with multi-step planning, test the agent’s ability to make highly nuanced and intricate decisions within its environmental constraints (Wu et al., 2023; Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).

Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments. In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that Agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.

1.2 Background

We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.

Large Foundation Models: LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.


Embodied AI: A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically the LLMs’ WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing a natural language instruction into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, these works incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
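The plan-then-execute loop described for robotics can be sketched roughly as follows. This is a minimal illustration and not the method of any cited work: `run_plan`, `toy_planner`, `ToyController`, and `StepResult` are hypothetical names, the planner is a hard-coded stand-in for an LLM call, and the controller is a toy stand-in for a low-level robot controller. The point is only to show how a failure message from the environment can be fed back into the next planning round.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """Outcome of executing one subtask (hypothetical structure)."""
    subtask: str
    success: bool
    feedback: str = ""


def toy_planner(instruction: str, feedback: str) -> List[str]:
    """Stand-in for an LLM planner: returns subtasks, revised using feedback."""
    plan = ["pick up cup", "place cup on shelf"]
    if feedback == "gripper occupied":
        # Environmental feedback changes the plan on the next round.
        plan.insert(0, "release object")
    return plan


class ToyController:
    """Stand-in for a low-level controller with a one-object gripper."""

    def __init__(self) -> None:
        self.holding = "block"  # the gripper starts out occupied

    def __call__(self, subtask: str) -> StepResult:
        if subtask == "release object":
            self.holding = None
            return StepResult(subtask, True)
        if subtask == "pick up cup":
            if self.holding is not None:
                return StepResult(subtask, False, "gripper occupied")
            self.holding = "cup"
            return StepResult(subtask, True)
        if subtask == "place cup on shelf":
            if self.holding != "cup":
                return StepResult(subtask, False, "not holding cup")
            self.holding = None
            return StepResult(subtask, True)
        return StepResult(subtask, False, "unknown subtask")


def run_plan(
    instruction: str,
    planner: Callable[[str, str], List[str]],
    controller: Callable[[str], StepResult],
    max_replans: int = 2,
) -> Tuple[bool, List[StepResult]]:
    """Decompose an instruction, execute subtasks, and replan on failure."""
    history: List[StepResult] = []
    feedback = ""
    for _ in range(max_replans + 1):
        failed = False
        for subtask in planner(instruction, feedback):
            result = controller(subtask)
            history.append(result)
            if not result.success:
                feedback = result.feedback  # pass the failure to the planner
                failed = True
                break
        if not failed:
            return True, history
    return False, history


if __name__ == "__main__":
    ok, steps = run_plan("put the cup on the shelf", toy_planner, ToyController())
    for step in steps:
        print(step)
    print("succeeded:", ok)
```

In a real system the planner would be a prompted language model and the feedback could be richer (sensor readings, error traces), but the control flow would likely keep this shape: plan, execute, observe, replan.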

Interactive Learning: AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback (Li et al., 2023b; Yu et al., 2023a; Parakh et al., 2023; Zha et al., 2023; Wake et al., 2023a). For example, if a user corrects the AI’s response, the AI can use this information to improve future responses (Zha et al., 2023; Liu et al., 2023a). 2) Observational learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. Interactive learning allows the AI agent to understand and process human language and multi-modal settings, interpret the cross-reality context, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent’s performance generally continues to improve. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
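The two interaction modes described above, feedback-based and observational learning, can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the `InteractiveAgent` class and its methods are hypothetical, and a dictionary of canned answers stands in for a pretrained model. A real agent would update model weights, prompts, or a memory store rather than a lookup table.

```python
from collections import Counter
from typing import Dict, List


class InteractiveAgent:
    """Toy agent that adapts from corrections and from usage patterns."""

    def __init__(self, base_answers: Dict[str, str]) -> None:
        # Stand-in for knowledge acquired during initial training.
        self.answers = dict(base_answers)
        # Passive record of what users ask (observational signal).
        self.query_counts: Counter = Counter()

    def respond(self, query: str) -> str:
        """Answer a query and silently log it for observational learning."""
        self.query_counts[query] += 1
        return self.answers.get(query, "I am not sure.")

    def correct(self, query: str, better_answer: str) -> None:
        """Feedback-based learning: store the user's correction."""
        self.answers[query] = better_answer

    def frequent_queries(self, n: int = 1) -> List[str]:
        """Observational learning: surface the most common requests."""
        return [q for q, _ in self.query_counts.most_common(n)]


if __name__ == "__main__":
    agent = InteractiveAgent({"capital of France?": "Lyon"})
    print(agent.respond("capital of France?"))   # initial (wrong) answer
    agent.correct("capital of France?", "Paris")  # explicit user feedback
    print(agent.respond("capital of France?"))   # corrected answer
    print(agent.frequent_queries())
```

The human oversight mentioned in the text would sit around `correct`: in practice, corrections would likely be reviewed before being accepted, so that a single user cannot teach the agent a biased or incorrect pattern.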

1.3 Overview

Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge on the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance their research. Specific learning outcomes include:

MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.

Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.

Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.

Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.

Emerging Trends and Future Challenges: Categorize the latest developments in each domain and discuss the future directions.

Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, it must be natural to interact with and generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and discussion of experiments, ensuring a comprehensive and engaging learning experience for all researchers.

This paper aims to provide general and comprehensive knowledge about the current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces Agent AI applications in gaming, robotics, and healthcare. Section 7 explores the research community’s efforts to develop a versatile Agent AI, capable of being applied across various modalities and domains, and of bridging the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users. Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI. Section 11 discusses the ethical considerations of AI agents, along with the limitations and societal impact of our paper.

2 Agent AI Integration

Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios (Huang et al., 2023a; Zeng et al., 2023). Consequently, these limitations lead to sub-optimal outputs from AI agents. Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g., text or string representations of the world state) and generally use domain- and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment (Xi et al., 2023; Wang et al., 2023c; Gong et al., 2023a; Wu et al., 2023). In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results that indicate knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions (Huang et al., 2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input to form a complex and adaptive H
