
Building Agentic Systems in an Era of Large Language Models

Charles Packer

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2024-223

/Pubs/TechRpts/2024/EECS-2024-223.html

December 19, 2024

Copyright © 2024, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Fall 2024

Building Agentic Systems in an Era of Large Language Models

By

Charles Packer

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Joseph E. Gonzalez, Chair
Professor Ion Stoica
Professor Matei Zaharia
Doctor Yuandong Tian

Building Agentic Systems in an Era of Large Language Models

Copyright 2024 by

Charles Packer


Abstract

Building Agentic Systems in an Era of Large Language Models

by

Charles Packer

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Joseph E. Gonzalez, Chair

Building intelligent autonomous systems that can reason, adapt, and interact with their environment has been a long-standing goal in artificial intelligence. This thesis explores the evolution of agentic systems through the deep learning revolution, from reinforcement learning to modern Large Language Models (LLMs), focusing on the critical components needed to create reliable autonomous agents.

First, we address the fundamental challenge of generalization in deep reinforcement learning (RL), introducing a systematic framework for evaluating and improving how learned policies transfer across environments. Building on this foundation, we present Hindsight Task Relabeling (HTR), a novel approach that enables meta-RL algorithms to learn adaptation strategies in sparse reward settings without requiring dense reward signals during training.
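
To make the relabeling idea concrete, the short Python sketch below shows the core hindsight step in a simplified single-goal setting: a trajectory that failed to reach its intended goal is relabeled as a successful attempt for the state it actually reached, producing a sparse reward signal that can be replayed during training. The trajectory format, the distance-based success test, and the goal_threshold value are hypothetical simplifications for illustration, not the thesis's actual implementation.

    import numpy as np

    def relabel_trajectory(trajectory, goal_threshold=0.1):
        """Relabel a failed goal-reaching trajectory in hindsight.

        trajectory: list of (state, action, reward, next_state) tuples,
        with states as NumPy arrays, collected while pursuing some
        original goal that was never reached.
        """
        # Treat the final achieved state as the hindsight goal.
        hindsight_goal = trajectory[-1][3]
        relabeled = []
        for state, action, _, next_state in trajectory:
            # Sparse reward: 1 only when the agent is near the hindsight goal.
            reward = float(np.linalg.norm(next_state - hindsight_goal) < goal_threshold)
            relabeled.append((state, action, reward, next_state, hindsight_goal))
        return relabeled

HTR itself applies this idea at the level of meta-RL tasks rather than single goals, which requires different sampling and relabeling machinery (see Chapter 3).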

Finally, we address the emerging challenges of building reliable agents using Large Language Models. While LLMs demonstrate unprecedented reasoning capabilities, their effectiveness as autonomous agents is limited by fundamental constraints in their architecture, most notably their stateless nature and fixed context windows. We present MemGPT, an operating system-inspired framework that enables LLMs to manage their own memory and state, introducing concepts like virtual context management and self-directed memory operations. MemGPT demonstrates that by treating LLMs as a new fundamental unit of compute, analogous to how CPUs were the fundamental unit in traditional operating systems, we can build more reliable and capable autonomous agents.
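
The following sketch illustrates the kind of self-directed memory operation this describes: the LLM's completion is interpreted as a function call, an executor moves data between the in-context window and external storage, and a chaining flag lets the model request an immediate follow-up inference. The function names, the JSON message format, and the llm_complete callback are illustrative placeholders chosen for this example, not MemGPT's actual interface.

    import json

    def archival_insert(archive, content):
        """Write a memory to external (out-of-context) storage."""
        archive.append(content)
        return "stored"

    def archival_search(archive, query):
        """Retrieve out-of-context memories matching a query."""
        return [m for m in archive if query.lower() in m.lower()]

    FUNCTIONS = {"archival_insert": archival_insert, "archival_search": archival_search}

    def agent_step(llm_complete, main_context, archive):
        """One processor step: run the LLM, execute the function call it
        emits, and loop again if it requested a heartbeat (chaining)."""
        while True:
            completion = llm_complete(main_context)  # completion tokens (a JSON string)
            call = json.loads(completion)            # {"function": ..., "args": ..., "request_heartbeat": ...}
            result = FUNCTIONS[call["function"]](archive, **call["args"])
            # Append the function result to main context so the next inference sees it.
            main_context.append({"role": "function", "content": str(result)})
            if not call.get("request_heartbeat", False):
                break
        return main_context

The design point emphasized here is that the model itself decides when to page information in and out of its limited context, rather than relying on a fixed external retrieval pipeline.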

Together, these systems trace the evolution of agentic AI systems and provide key building blocks for creating more reliable and capable autonomous agents. By addressing core challenges in generalization, adaptation, and memory management, this thesis establishes a foundation for engineering the next generation of AI systems that can effectively reason and interact with the world.


To my parents


Contents

List of Figures    v
List of Tables    ix
Acknowledgments    x

1 Introduction    1
  1.1 Background    1
    1.1.1 The Deep Learning Revolution in Robotics and Control    1
    1.1.2 The Rise of Foundation Models    2
  1.2 Deep Learning for Agentic Systems    2
  1.3 The LLM Agent Paradigm    3

2 Assessing Generalization in Deep Reinforcement Learning    4
  2.1 Introduction    4
  2.2 Background    6
  2.3 Notation    7
  2.4 Algorithms    8
  2.5 Environments    9
  2.6 Experimental setup    11
  2.7 Experimental setup    12
  2.8 Results and discussion    14
  2.9 Conclusion    15
  2.10 Additional details    16
    2.10.1 Environment Details    16
    2.10.2 Training Hyperparameters    16
    2.10.3 Detailed Experimental Results    18
    2.10.4 Behavior of MountainCar    18
    2.10.5 Training Curves    21
    2.10.6 Videos of trained agents    21

3 Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL    26
  3.1 Introduction    26
  3.2 Related work    27
  3.3 Background    28
    3.3.1 Meta-Reinforcement Learning (Meta-RL)    29
    3.3.2 Off-Policy Meta-Reinforcement Learning    29
    3.3.3 Hindsight Experience Replay    30
  3.4 Leveraging Hindsight in Meta-Reinforcement Learning    31
    3.4.1 Algorithm Design    32
    3.4.2 Single Episode Relabeling (SER) strategy    33
    3.4.3 Episode Clustering (EC) strategy    33
    3.4.4 Comparison of HTR and HER    34
    3.4.5 Limitations    34
  3.5 Experiments    35
    3.5.1 Environments    35
    3.5.2 HTR enables meta-training using only sparse reward    36
    3.5.3 Varying key hyperparameters    38
  3.6 Conclusion    39
  3.7 Experimental Setup (additional details)    40
    3.7.1 Computing Infrastructure    40
    3.7.2 Hyperparameters    40
    3.7.3 Reward Functions    40
    3.7.4 Changing the Distance to Goal    41
  3.8 Algorithm Specifics    41
    3.8.1 Sample-Time vs Data Generation Relabelling    41
    3.8.2 Single Episode Relabelling Implementation Details    41
    3.8.3 Episode Clustering Implementation Details    42
    3.8.4 Time and Space Complexity    43

4 MemGPT: Towards LLMs as Operating Systems    44
  4.1 Introduction    44
  4.2 MemGPT (MemoryGPT)    46
    4.2.1 Main context (prompt tokens)    46
    4.2.2 Queue Manager    47
    4.2.3 Function executor (handling of completion tokens)    47
    4.2.4 Control flow and function chaining    48
  4.3 Experiments    49
  4.4 Experiments    49
    4.4.1 MemGPT for conversational agents    50
    4.4.2 MemGPT for document analysis    52
  4.5 Related work    55
  4.6 Conclusion    56
  4.7 Additional details    56
    4.7.1 Limitations    56
    4.7.2 MemGPT pseudocode    57
    4.7.3 MemGPT function set    58
    4.7.4 Prompts and instructions    61
    4.7.5 Balancing Working Context and the FIFO Queue    67

5 From Serving Models to Serving Agents: The Missing Pieces for Supporting Agentic Workloads    69
  5.1 Introduction    69
    5.1.1 The Existing Stateless LLM Programming Model    69
    5.1.2 Agentic Programming Model    70
    5.1.3 Agent State    70
  5.2 The Agent Hosting Layer    70
    5.2.1 LLM Inference: Co-optimization with the inference layer    71
    5.2.2 State & Context Management    71
    5.2.3 Multi-agent communication and orchestration    71

6 Conclusion & Future Work    72

Bibliography    74

List of Figures

2.1 Schematic of the three versions of an environment    17

2.2 MountainCar: heatmap of the rewards achieved by A2C with the FF architecture on DR and DE. The axes are the two environment parameters varied in R and E.    22

2.3 Pendulum: heatmap of the rewards achieved by A2C with the FF architecture on DR and DE. The axes are the two environment parameters varied in R and E.    23

2.4 PPO with FF architecture    24

2.5 PPO with RC architecture    24

2.6 EPOpt-PPO with FF architecture    24

2.7 EPOpt-PPO with RC architecture    24

2.8 RL2-PPO    24

2.9 Training curves for the PPO-based algorithms on CartPole, all three environment versions. Note that the decrease in mean episode reward at 10000 episodes in the two EPOpt-PPO plots is due to the fact that it transitions from being computed using all generated episodes (ϵ = 1) to only the 10% with lowest reward (ϵ = 0.1).    24

2.10 Video frames of agents trained with A2C on HalfCheetah, trained in the Deterministic (D), Random (R), and Extreme (E) settings (from top to bottom). All agents evaluated in the D setting    25

2.11 Video frames of agents trained with PPO on HalfCheetah, trained in the Deterministic (D), Random (R), and Extreme (E) settings (from top to bottom). All agents evaluated in the D setting    25

3.1 In goal-conditioned RL (a), an agent must navigate to a provided goal location g (filled circle, revealed to the agent). An unsuccessful attempt for goal g provides no sparse reward signal, but can be relabelled as a successful attempt for goal g′, creating sparse reward that can be used to train the agent. In meta-RL (b), the task T (i.e., goal, hollow circle) is never revealed to the agent, and instead must be inferred using experience on prior tasks and limited experience (τ1:t−1) on the new task. In (b), there is no shared optimal task T′ to relabel all attempts with. HTR relabels each attempt τ under its own hindsight task T′, and modifies the underlying meta-RL training loop to learn adaptation strategies on the relabelled tasks. Note that we include multiple trajectories τ in (b) vs a single trajectory in (a) to highlight the adaptation stage in meta-RL, which does not exist in goal-conditioned RL and requires significantly different sampling and relabeling procedures    27

3.2 Sparse reward environments for meta-RL that require temporally-extended exploration. In each environment, the task (the top-left circle in (a), the green sphere in (b) and (c)) is not revealed to the agent via the observation. The agent must instead infer the task through temporally-extended exploration (illustrated by the dotted lines in (a)), since no reward signal is provided until the task is successfully completed. Prior meta-RL methods such as PEARL (Rakelly et al. 2019) and MAESN (Gupta et al. 2018b) are only able to (meta-)learn meaningful adaptation strategies using dense reward functions. Our approach, Hindsight Task Relabeling (HTR), can (meta-)train with the original sparse reward function and does not require additional dense reward functions    30

3.3 Illustration of Hindsight Task Relabeling (HTR) in a meta-RL training loop. HTR is agnostic to the underlying (off-policy) meta-RL algorithm; the agent architecture and/or training specifics (e.g., the encoder φ, actor π and Q-function neural networks shown in blue) can be modified independently of the relabeling scheme. HTR can also be performed in an 'eager' fashion at the data collection stage (as opposed to 'lazy' relabeling in the data sampling stage), see Section 3 for details    31

3.4 HTR algorithm    33

3.5 Evaluating adaptation to train tasks progressively during meta-training. Y-axis measures average sparse return during adaptation throughout meta-training (shaded std dev), though the oracle is still trained using dense reward. Conventional meta-RL methods struggle to learn using sparse reward. Hindsight Task Relabeling (HTR) is comparable to dense reward meta-training performance    36

3.6 Evaluating adaptation to test tasks after meta-training. Y-axis measures average (sparse) return during adaptation using context collected online, using sparse reward only. Adaptation strategies learned with Hindsight Task Relabeling (HTR) generalize to held-out tasks as well as the oracle which is learned using shaped reward functions. Without HTR or access to a shaped reward during meta-training, the agent is unable to learn a reasonable strategy    37

3.7 Visualizing exploration behavior learned during meta-training using 300 pre-adaptation trajectories (i.e., sampled from the latent task prior). In the sparse reward setting, without HTR (middle row) the agent is unable to learn a meaningful exploration strategy and appears to explore randomly near the origin. With HTR (bottom row), the agent learns to explore near the true task distribution (grey circles), similar to an agent trained with a shaped dense reward function (top row)    38

3.8 Comparing HTR with SER vs EC on Point Robot    38

3.9 Average return when varying K on Point Robot    38

3.10 Average task distance when varying K on Point Robot    38

3.11 Relative reward signal from hindsight vs ground truth tasks using Point Robot.    39

3.12 Meta-training on Point Robot with varying goal distances. If the distance to the goal is short enough for random exploration to lead to sparse reward, meta-training is possible using only the sparse reward function. Once this is no longer the case, meta-training is only possible with a proxy dense reward function, or by using Hindsight Task Relabelling on the original sparse reward function    41

3.13 Illustration of Hindsight Task Relabeling (HTR) using Episode Clustering (EC) in a meta-RL training loop, where relabelling occurs at the data collection stage.    42

4.1 MemGPT writes data to persistent memory after it receives a system alert about limited context space    45

4.2 MemGPT can search out-of-context data to bring relevant information into the current context window    45

4.3 In MemGPT, a fixed-context LLM processor is augmented with a hierarchical memory system and functions that let it manage its own memory. The LLM's prompt tokens (inputs), or main context, consist of the system instructions, working context, and a FIFO queue. The LLM completion tokens (outputs) are interpreted as function calls by the function executor. MemGPT uses functions to move data between main context and external context (the archival and recall storage databases). The LLM can request immediate follow-up LLM inference to chain function calls together by generating a special keyword argument (request_heartbeat=true) in its output; function chaining is what allows MemGPT to perform multi-step retrieval to answer user queries    46

4.4 Comparing context lengths of commonly used models and LLM APIs (data collected 1/2024). *Approximate message count assuming a preprompt of 1k tokens, and an average message size of 50 tokens (250 characters)    48

4.5 An example conversation snippet where MemGPT updates stored information. Here the information is stored in working context memory (located within the prompt tokens)    48

4.6 Document QA task performance. MemGPT's performance is unaffected by increased context length. Methods such as truncation can extend the effective context lengths of fixed-length models such as GPT-4, but such compression methods will lead to performance degradation as the necessary compression grows. Running MemGPT with GPT-4 and GPT-4 Turbo have equivalent results on this task    52

4.7 An example of MemGPT solving the document QA task. A database of Wikipedia documents is uploaded to archival storage. MemGPT queries archival storage via function calling, which pulls paginated search results into main context    52

4.8 Nested KV retrieval task performance. MemGPT is the only approach that is able to consistently complete the nested KV task beyond 2 nesting levels. While GPT-4 Turbo performs better as a baseline, MemGPT with GPT-4 Turbo performs worse than MemGPT with GPT-4    54

4.9 An example of MemGPT solving the nested KV task (UUIDs shortened for readability). The example key-value pair has two nesting levels, and the MemGPT agent returns the final answer when a query for the final value (f37 617) only returns one result (indicating that it is not also a key)    54

4.10 MemGPT algorithm pseudocode    57

List of Tables

2.1 Generalization performance (in % success) of each algorithm, averaged over all environments (mean and standard deviation over five runs)    14

2.2 Ranges of parameters for each version of each environment, using set notation    17

2.3 Mean and standard deviation over five runs of generalization performance (in % success) on Acrobot    18

2.4 Mean and standard deviation over five runs of generalization performance (in % success) on CartPole    19

2.5 Mean and standard deviation over five runs of generalization performance (in % success) on MountainCar    19

2.6 Mean and standard deviation over five runs of generalization performance (in % success) on Pendulum    20

2.7 Mean and standard deviation over five runs of generalization performance (in % success) on HalfCheetah    20

2.8 Mean and standard deviation over five runs of generalization performance (in % success) on Hopper    21

4.1 Deep memory retrieval (DMR) performance. In this task, the agent is asked a specific question about a topic discussed in a prior conversation (sessions 1–5). The agent's response is scored against the gold answer. MemGPT significantly outperforms the fixed-context baselines. 'R-L' is ROUGE-L    49

4.2 Conversation opener performance. The agent's conversation opener is evaluated using similarity scores to the gold persona labels (SIM-1/3) and to the human-created opener (SIM-H). MemGPT is able to exceed the performance of the human-created conversation opener with a variety of underlying models    49

Acknowledgments

First and foremost, I want to thank my family, who always pushed me to achieve more. They are the reason I love to do hard things.

Next I would like to thank my advisor, Professor Joseph E. Gonzalez. Joey helped me achieve my one true goal in the PhD: to make science fiction into science reality. His flexibility and encouragement, regardless of where my research interests led (even when not directly in his critical research path), were instrumental to my success. I could not have asked for a better PhD advisor.

I am also deeply grateful to my other thesis committee members: Ion Stoica, Matei Zaharia, and Yuandong Tian. Having such renowned world experts in AI and systems research on my committee was an incredible honor.

My journey in AI research began at UC San Diego, where I worked with Professors Julian McAuley and Kamalika Chaudhuri as an undergraduate. This led to my work with Professor Lawrence Holder during an REU at Washington State University, where I wrote my first first-author paper. After graduation, Professor Dawn Song took a chance on me, hiring me after a brief chat at a Starbucks in Hayes Valley, a moment that brought me to Berkeley and set me on my path toward the PhD.

Several mentors were crucial to my development as a researcher during my time at Berkeley. Vladlen Koltun taught me invaluable lessons about research discipline, particularly about knowing when to abandon 'zombie' research projects, advice I wish I had followed more closely. Richard Shin and Katelyn Gao worked closely with me during my first two years at Berkeley and were great mentors. Once I began the PhD, Rowan McAllister and Nick Rhinehart guided my research in autonomous vehicles and helped maintain my research momentum during the challenging middle years of my PhD. I'm also grateful to Pieter Abbeel and Sergey Levine, who, though not my formal advisors, provided crucial feedback that helped several papers cross the finish line to publication.

The RISELab was an incredible home for my research. I was fortunate to work alongside amazing colleagues in Joey's group: Kevin Lin, Lisa Dunlap, Justin Wong, Shishir Patil, Tianjun Zhang, Paras Jain, Sukrit Kalra, and Suzie Petryk. The infamous "Star Factory" cubicle, which allegedly housed the Databricks founders and later the Anyscale founders, became the birthplace of MemGPT, Gorilla, and SkyPlane during my time there, an unmatched density of open source research contributions in a single cubicle space.

And finally, I would like to thank Sarah Wooders and Kevin Lin, who are joining me on an exciting new adventure post-PhD, where we'll be taking our research on context management for LLM agents into the real world.

This thesis, and the journey it represents, would not have been possible without the support, guidance, and encouragement of all these incredible people. Thank you.

Additional context around this thesis: This thesis was written during an extraordinary period in artificial intelligence research (2017-2024). When I began my PhD, deep reinforcement learning was at the forefront of autonomous systems research, with breakthroughs like AlphaGo and OpenAI Five demonstrating superhuman performance in complex games.

Then came the transformer revolution. What started as incremental improvements in natural language processing rapidly evolved into something far more profound. The release of ChatGPT in late 2022 marked a paradigm shift not just in AI research, but in how society viewed artificial intelligence. Large Language Models demonstrated capabilities that seemed impossible just a few years earlier: sophisticated reasoning and intelligence that was general.

I had the unique privilege of not just witnessing this revolution, but actively participating in it. My research journey paralleled this transition: from working on fundamental challenges in deep reinforcement learning, to ultimately helping pioneer new approaches for building reliable autonomous systems using Large Language Models. This thesis reflects both the 'before' and 'after' of this pivotal moment in AI history; a time that will likely be remembered as the beginning of the foundation model era.

The speed of progress during this period was unprecedented. Papers that seemed cutting-edge when I started my PhD quickly became historical artifacts. Research directions that appeared promising were suddenly obsolete. Yet this rapid evolution created extraordinary opportunities to contribute to genuinely new directions in computer science: to help establish the foundations for how we build AI systems in this new era.

This thesis represents my small contribution to this remarkable period in computing history.


Chapter 1

Introduction

Building intelligent autonomous systems that can effectively reason, adapt, and interact with their environment has been a longstanding goal in artificial intelligence. The recent deep learning revolution, particularly the emergence of Large Language Models (LLMs), has dramatically changed our approach to building such systems. This thesis traces this evolution through several key advances in building agentic systems, from deep reinforcement learning to modern LLM-based approaches, focusing on the critical components needed to create reliable autonomous agents.

1.1 Background

The development of agentic systems has undergone several significant paradigm shifts, each introducing new capabilities and challenges. Understanding these shifts and their implications is crucial for building effective autonomous agents.

1.1.1 The Deep Learning Revolution in Robotics and Control

The integration of deep neural networks with reinforcement learning marked a significant advancement in autonomous systems. This combination enabled:

• End-to-End Learning: Deep RL allowed systems to learn directly from raw sensory input, eliminating the need for hand-engineered features.

• Complex Policy Learning: Neural networks as function approximators enabled learning sophisticated control policies for high-dimensional tasks.

• Improved Generalization: Deep architectures promised better transfer of learned behaviors across similar tasks.

However, several key challenges emerged:


• Limited Generalization: Learned policies often failed to transfer beyond their specific training conditions

• Sample Inefficiency: Deep RL systems required extensive
