
warwick.ac.uk/lib-publications

A Thesis Submitted for the Degree of PhD at the University of Warwick

Permanent WRAP URL: http://wrap.warwick.ac.uk/179925

Copyright and reuse: This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself. Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository homepage.

For more information, please contact the WRAP Team at: wrap@warwick.ac.uk


THE UNIVERSITY OF WARWICK

Learning to Communicate in Cooperative Multi-Agent Reinforcement Learning

by

Emanuele Pesce

Thesis

Submitted to the University of Warwick in partial fulfilment of the requirements for admission to the degree of Doctor of Philosophy in Engineering

Warwick Manufacturing Group

February 2023


Contents

List of Tables iv
List of Figures v
Acknowledgments viii
Declarations ix
1 Publications ix
2 Sponsorships and grants x
Abstract xi
Acronyms xii

Chapter 1 Introduction 1
1.1 Research objectives 3
1.2 Contributions 3
1.3 Outline 4

Chapter 2 Literature Review 5
2.1 Reinforcement learning 6
2.2 Deep reinforcement learning 8
2.3 Multi-agent deep reinforcement learning 9
2.4 Cooperative methods 12
2.5 Emergence of communication 13
2.6 Communication methods 14
2.6.1 Attention mechanisms to support communication 17
2.6.2 Graph-based communication mechanisms 18

Chapter 3 Memory-driven communication 20
3.1 Introduction 20
3.2 Memory-driven MADDPG 22
3.2.1 Problem setup 22
3.2.2 Memory-driven communication 23
3.2.3 MD-MADDPG decentralised execution 29
3.3 Experimental settings 29
3.3.1 Environments 29
3.4 Experimental results 32
3.4.1 Main results 32
3.4.2 Implementation details 35
3.4.3 Increasing the number of agents 36
3.5 Communication analysis 37
3.6 Ablation studies 41
3.6.1 Investigate the memory components 41
3.6.2 Corrupting the memory 42
3.6.3 Multiple seeds 44
3.6.4 Multiple memory sizes 47
3.7 Summary 47

Chapter 4 Connectivity-driven communication 49
4.1 Introduction 49
4.2 Connectivity-driven communication 52
4.2.1 Problem setup 52
4.2.2 Learning the dynamic communication graph 52
4.2.3 Learning a time-dependent attention mechanism 54
4.2.4 Heat kernel: additional details and an illustration 56
4.2.5 Reinforcement learning algorithm 58
4.3 Experimental settings 61
4.3.1 Environments 61
4.3.2 Implementation details 64
4.4 Experimental results 65
4.4.1 Main results 65
4.4.2 Varying the number of agents 71
4.5 Communication analysis 72
4.6 Ablation studies 78
4.6.1 Investigating the heat-kernel components 78
4.6.2 Heat-kernel threshold 79
4.7 Summary 80

Chapter 5 Benchmarking MARL methods for cooperative missions of unmanned aerial vehicles 82
5.1 Introduction 82
5.2 Proposed drone environment 84
5.3 Competing algorithms 87
5.4 Experimental settings 92
5.5 Experimental results 94
5.6 Discussion 96
5.7 Summary 97

Chapter 6 Conclusions and future work 98
6.1 Conclusion 98
6.2 Future work 100
6.3 Ethical implications 102

List of Tables

3.1 Comparing MD-MADDPG with other baselines 33
3.2 Increasing the number of agents - Cooperative Navigation 36
3.3 Increasing the number of agents - PO Cooperative Navigation 37
3.4 Ablation study on MD-MADDPG components 43
3.5 Corrupting the memory content 43
4.1 Comparing CDC with other baselines 66
4.2 Comparative summary of MARL algorithms 67
4.3 Varying the number of agents 71
4.4 Graph analysis 75
4.5 Heat-kernel threshold 80
5.1 Common parameters of the environments 86
5.2 Summary of selected MARL algorithms 92
5.3 Environment parameters 93
5.4 Benchmarking results 94

List of Figures

2.1 Reinforcement learning 7
2.2 MADDPG 11
3.1 The MD-MADDPG framework 24
3.2 Environment illustrations 32
3.3 Learned communication strategies - write 38
3.4 Learned communication strategy - read 41
3.5 Changing seeds on Swapping Cooperative Navigation 45
3.6 Changing seeds on Sequential Cooperative Navigation 46
3.7 Investigating different memory dimensions 47
4.1 The CDC framework 53
4.2 An edge selection example 58
4.3 Environment illustrations 63
4.4 Learning curves - Navigation Control and Line Control 69
4.5 Learning curves - Formation Control and Dynamic Pack Control 70
4.6 Communication networks - Navigation Control and Line Control 73
4.7 Communication networks - Formation Control and Dynamic Pack Control 74
4.8 Average communication graphs - Navigation Control and Line Control 76
4.9 Average communication graphs - Formation Control and Dynamic Pack Control 77
4.10 Ablation study on CDC components 79
5.1 UAV representation 86
5.2 Learning curves 96

To my beloved Simona, for the endless love, care, and support throughout all these years together, and for believing in me more than anyone else.

Acknowledgments

This thesis would not have been possible without the help and support of many people. Firstly, I would like to express my gratitude to Professor Giovanni Montana and the WMG department for granting me the opportunity to pursue a fully-funded PhD at the University of Warwick. I am thankful to Giovanni for his guidance throughout this journey, consistently providing me with useful advice and contributing to the revision of my manuscripts. Additionally, I am grateful to Kurt Debattista for his support over the years. I would also like to extend my thanks to Luke Owen and Ramon Dalmau-Codina for their contributions to the development of the UAV environment. I would like to acknowledge Jeremie Houssineau and Raúl Santos-Rodríguez, my examiners, for their valuable advice to enhance this thesis. A big thank you goes to Professor Tony McNally for his assistance in coordinating the examination process. I also wish to extend my thanks to Professor Roberto Tagliaferri for being a source of inspiration and instilling in me a love for this field.

Furthermore, I wish to thank all the friends I have had the privilege of meeting along this path. Ruggiero, with whom I have shared countless memorable moments and technical discussions. Demetris, for his encouragement and inspiration every time I needed it. I am also thankful to Kevin, Massimo, Ozsel, Saad and Francesco, all of whom have played significant roles in making this PhD journey a more enjoyable and sociable experience.

Finally, I am incredibly grateful to my parents, Rocco and Rita, for always believing in me and encouraging me to be who I am. A special thanks goes to Simona, my beloved partner, who is the most caring person I have met and has consistently been a source of support during both joyful and challenging times.

Declarations

This thesis is submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy. It has been composed by myself and has not been submitted in any previous application for any degree.

1 Publications

Parts of this thesis have been previously published by the author in the following:

[125] Emanuele Pesce and Giovanni Montana. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. Machine Learning, 109(9):1727–1747, 2020.

[126] Emanuele Pesce and Giovanni Montana. Learning multi-agent coordination through connectivity-driven communication. Machine Learning, 2022. doi:10.1007/s10994-022-06286-6.

[127] Emanuele Pesce, Ramon Dalmau, Luke Owen, and Giovanni Montana. Benchmarking multi-agent deep reinforcement learning for cooperative missions of unmanned aerial vehicles. In Proceedings of the International Workshop on Citizen-Centric Multiagent Systems, pages 49–56. CMAS, 2023.

All the work published in [125–127] is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence visit http://creativecommons.org/licenses/by/4.0/.

2 Sponsorships and grants

This research was funded by the University of Warwick.

Abstract

Recent advances in deep reinforcement learning have produced unprecedented results. The success obtained on single-agent applications led to exploring these techniques in the context of multi-agent systems, where several additional challenges need to be considered. Communication has always been crucial to achieving cooperation in multi-agent domains, and learning to communicate represents a fundamental milestone for multi-agent reinforcement learning algorithms. In this thesis, different multi-agent reinforcement learning approaches are explored. These provide architectures that are learned end-to-end and capable of achieving effective communication protocols that can boost the system performance in cooperative settings. Firstly, we investigate a novel approach where inter-agent communication happens through a shared memory device that can be used by the agents to exchange messages through learnable read and write operations. Secondly, we propose a graph-based approach where connectivities are shaped by exchanging pairwise messages, which are then aggregated through a novel form of attention mechanism based on a graph diffusion model. Finally, we present a new set of environments with real-world inspired constraints that we utilise to benchmark the most recent state-of-the-art solutions. Our results show that communication can be a fundamental tool to overcome some of the intrinsic difficulties that characterise cooperative multi-agent systems.

Acronyms

CDC Connectivity-driven communication.
CLDE Centralised learning decentralised execution.
CN Cooperative navigation.
DDPG Deep deterministic policy gradient.
DNN Deep neural network.
DPG Deterministic policy gradient.
DQN Deep Q-network.
DRL Deep reinforcement learning.
ER Experience replay.
GNN Graph neural network.
HK Heat kernel.
KL Kullback-Leibler.
LSTM Long short-term memory.
MA Multi-agent.
MADRL Multi-agent deep reinforcement learning.
MA-MADDPG Meta-agent MADDPG.
MADDPG Multi-agent DDPG.
MARL Multi-agent reinforcement learning.
MD-MADDPG Memory-driven MADDPG.
MDP Markov decision process.
NN Neural network.
PC Principal component.
PCA Principal component analysis.
PG Policy gradient.
PO Partial observability.
PPO Proximal policy optimization.
RL Reinforcement learning.
RNN Recurrent neural network.
TRPO Trust region policy optimisation.
UAS Unmanned aerial system.
VDN Value decomposition network.

Chapter 1

Introduction

Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals [157]. Recently, deep neural networks (DNNs) [89, 141] have had a noticeable impact on RL [94]. They provide flexible models for learning value functions and policies, overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics [29, 121, 122]. Deep reinforcement learning (DRL) algorithms, which usually rely on deep neural networks to approximate functions, have been successfully employed in single-agent systems, including video game playing [111], robot locomotion [97], object localisation [18] and data-center cooling [38]. Following the uptake of DRL in single-agent domains, there is now a need to develop improved learning algorithms for multi-agent (MA) systems, where additional challenges arise. Multi-agent reinforcement learning (MARL) extends RL to problems characterized by the interplay of multiple agents operating in a shared environment. This scenario is typical of many real-world applications, including robot navigation [162], autonomous vehicle coordination [15], traffic management [36], and supply chain management [90]. Compared to single-agent systems, MARL presents additional layers of complexity. Early approaches started exploring how deep reinforcement learning techniques can be utilised in multi-agent settings [23, 53, 155], where a need emerged for novel techniques specifically designed to tackle MA challenges.

Markov Decision Processes (MDPs), upon which DRL methods rely, assume that the reward distribution and dynamics are stationary [58]. When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents' actions [86]. This issue, known as the moving-target problem [166], removes convergence guarantees and introduces additional learning instabilities. Further difficulties arise from environments characterized by partial observability [23, 128, 151], whereby the agents do not have full access to the world state, and where coordination skills are essential.

An important challenge in multi-agent deep reinforcement learning (MADRL) is how to facilitate communication among interacting agents. Communication is widely known to play a critical role in promoting coordination between humans [159]. Humans have been proven to excel at communicating even in the absence of a conventional code [32]. When coordination is required and no common languages exist, simple communication protocols are likely to emerge [144]. Human communication involves more than sending and receiving messages: it requires specialized interactive intelligence, where receivers have the ability to recognize intentions and senders can properly design messages [178]. The emergence of communication has been widely investigated [47, 163]; for example, new signs and symbols can emerge when it comes to representing real concepts. Fusaroli et al. [46] demonstrated that language can be seen as a social coordination device, learnt through reciprocal interaction with the environment, for optimizing coordinative dynamics. The relation between communication and coordination has been widely discussed [34, 71, 109, 170]. Communication is an essential skill in many tasks. For instance, emergency response organizations must properly manage critical and urgent situations, which requires establishing a clear way of communicating: information must be shared amongst the different agents involved, something usually accomplished through years of common training [28]. In multiplayer video games, it is often essential to reach the sufficiently high level of coordination required to succeed, which is often acquired via communicating [20]. We believe that communication is a promising tool that needs to be exploited by MADRL models in order to enhance their performance in multi-agent environments. When this research was started, we noticed a lack of methods to enable inter-agent communication, so we decided to explore this area to contribute to filling a gap that had the potential for improving the collaboration process in a MA system.

1.1 Research objectives

The aim of this research is to explore novel communication models to enhance the performance of existing MARL methods. In particular, we focus on cooperative scenarios, which is where communication is needed the most by the agents in order to properly succeed and complete the assigned tasks. We investigate different approaches to achieving effective ways of communicating to boost the level of cooperation in multi-agent settings. The resulting communication protocols are learned end-to-end so that, at training time, they can be adapted by the agents to overcome the difficulties posed by the underlying environmental configuration. In addition, we also aim to analyse the content of the learned communication.

1.2 Contributions

The main contributions made in this thesis are summarised as follows:

• in Chapter 3, we propose a novel multi-agent approach where inter-agent communication is obtained by providing a centralised shared memory that each agent has to learn to use in order to read and write messages for the others in sequential order;

• in Chapter 4, we discuss a novel multi-agent model that first constructs a graph of connectivities to encode pairwise messages, which are then used to generate an agent-specific set of encodings through a proposed attention mechanism that utilises a diffusion model such as the heat kernel (HK);

• in Chapter 5, we propose an environment to simulate drone behaviours in realistic settings and present a range of experiments in order to evaluate the performance of several state-of-the-art methods in such scenarios.

1.3 Outline

This section provides an outline of this thesis. The rest of this document is structured as follows. Chapter 2 reviews the existing MADRL models that relate to this work, with a special focus on cooperative algorithms. Chapter 3 introduces the first research contribution, which proposes a novel form of communication based on a shared memory cell. Chapter 4 presents the second research contribution, in which a graph-based architecture is exploited by a diffusion model to generate agent-specific messages. Chapter 5 proposes a novel environment to simulate a realistic scenario of drone navigation and discusses an extensive comparison of several state-of-the-art MADRL models. Chapter 6 concludes this work with a discussion of the results obtained and recommendations for future work.

Chapter 2

Literature Review

In this chapter, we introduce the RL setting and review the existing works related to multi-agent reinforcement learning. In Section 2.1, we discuss significant milestones in single-agent reinforcement learning to establish the foundational knowledge for the basic learning techniques that are later extended or utilized. Section 2.2 presents deep learning extensions of the previously mentioned approaches, serving as a connection between single-agent and multi-agent methodologies. Moving on to Section 2.3, we focus on how these approaches have been expanded to operate in multi-agent scenarios, with a particular emphasis on the training phases that are commonly employed in state-of-the-art works. We then categorize the multi-agent literature into the following groups:

• Cooperative methods (Section 2.4): works that concentrate on achieving cooperation between agents;

• Emergence of communication (Section 2.5): works that investigate how autonomous agents can learn languages;

• Communication methods (Section 2.6): works where agents must learn to communicate to enhance system performance.

In this review, we intentionally omitted specific research areas such as traditional game theory approaches [120, 123, 145], microgrid systems [27, 70, 72], and programming for parallel executions of agents [24, 45, 136]. Our primary focus was on multi-agent works based on reinforcement learning approaches, with a particular emphasis on communication methodologies.

Some of the methods mentioned in this review have also been chosen as baselines for the experiments presented in the subsequent chapters, particularly in Chapter 5, which plays a crucial role as it serves as a practical context for the proposed multi-agent approaches discussed in Chapters 3 and 4. By introducing a specifically designed environment to simulate drone behaviours in realistic settings, Chapter 5 indeed provides a practical platform for evaluating the performance of state-of-the-art MADRL models that employ the different communication and coordination methods discussed in this chapter.

2.1 Reinforcement learning

Reinforcement learning methods formalise the interaction of an agent (or actor) with its environment using a Markov decision process [129]. An MDP is defined as a tuple 〈S, A, R, T, γ〉, where S is the set that contains all the states of a given environment, A is a set of finite actions that can be selected by the agent, and the reward function R : S × A → ℝ defines the reward received by an agent when executing the action a ∈ A while being in a state s ∈ S. A transition function T : S × A → S describes how the environment determines the next state when starting from a state s ∈ S and given an action a ∈ A. The discount factor γ balances the trade-off between current and future rewards. As represented in Figure 2.1, an agent interacts with the environment by producing an action given the current state and receiving a reward in return.

Figure 2.1: A reinforcement learning setting. The environment provides an observation while the agent produces an action and receives a reward in return.

MDPs are suitable models for taking decisions in fully observable environments, where a complete description of all their elements is available to the agents and can be exploited by techniques such as the value iteration algorithm [9], which iteratively computes a value function that estimates the potential reward of each state. A state-action value is instead calculated when the potential reward is estimated using both the state and the action. When an MDP is solved, a stochastic policy π : S × A → [0, 1] is obtained to map states into actions. RL algorithms often make use of the agents' past experience of interacting with the environment. A well-known algorithm is Q-learning [176], a tabular approach that keeps track of the Q-functions Q(s, a), which estimate the discounted sum of future rewards for a given state-action pair. Every time the agent moves from a state s into a state s' using an action a, the respective tabular entry is updated as follows:

$$Q(s,a) = Q(s,a) + \alpha \left[ \left( r + \gamma \max_{a'} Q(s',a') \right) - Q(s,a) \right] \quad (2.1)$$

where α ∈ [0, 1] is the learning rate.
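For illustration, below is a minimal tabular Q-learning sketch implementing the update of Eq. 2.1. The Gym-style `env` interface, its `actions` attribute, and the hyperparameter values are assumptions made for this example, not details taken from the thesis.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (Eq. 2.1) with epsilon-greedy exploration."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0.0

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Eq. 2.1: move Q(s, a) towards the bootstrapped target.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Note that the dictionary-backed table only works when the state and action spaces are small and discrete, which is exactly the limitation that motivates the function approximators of Section 2.2.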

Policy gradient (PG) methods [157] represent an alternative approach to Q-learning, where the parameters θ of the policy are directly adjusted to maximise the objective function J(θ), defined as the expected γ-discounted sum of rewards over the time-steps t ∈ {1, . . . , T} of the environment, by taking steps in the direction of its gradient. Such a gradient is calculated as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \, Q(s,a) \right] \quad (2.2)$$

The REINFORCE algorithm [179] utilises Eq. 2.2 in conjunction with a Monte Carlo estimation over fully sampled trajectories to learn the policy parameters in the following way:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t|s_t) \quad (2.3)$$

where G_t is the γ-discounted return sampled from time-step t onwards.
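A short PyTorch-style sketch of this Monte Carlo estimate follows. The `policy` network (assumed to return a `torch.distributions` object) and the trajectory format are hypothetical choices made for the example, not the thesis's implementation.

```python
import torch

def reinforce_loss(policy, trajectory, gamma=0.99):
    """Monte Carlo policy-gradient loss for one sampled episode (Eqs. 2.2-2.3).

    `trajectory` is assumed to be a list of (state, action, reward) tuples,
    with states and actions already stored as tensors.
    """
    # Compute the gamma-discounted returns G_t backwards through the episode.
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Accumulate -log pi(a_t | s_t) * G_t; minimising this loss takes a step
    # along the sampled estimate of the gradient in Eq. 2.2.
    loss = torch.tensor(0.0)
    for (s, a, _), g_t in zip(trajectory, returns):
        loss = loss - policy(s).log_prob(a) * g_t
    return loss
```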

Policy gradient algorithms are known to suffer from high variance, which can significantly slow the learning process [79]. This issue is often mitigated by adding a baseline, such as the average reward or the state value function, that aims to correct the high variation at training time. Actor-critic methods [79] are composed of an actor module that selects the actions to take and a critic that provides the feedback necessary for the learning process. When the critic is able to learn both the state-action and the value functions, an advantage function can be calculated as the difference between these two estimates. A popular actor-critic algorithm is the Deterministic Policy Gradient (DPG) [149], in which the actor is updated through the gradient of the policy, while the critic utilises the standard Q-learning approach. In DPG the policy is assumed to be a deterministic function μθ : S → A, and the gradient that maximises the objective function can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s,a) \big|_{a=\mu_\theta(s)} \right] \quad (2.4)$$

where D is an experience replay (ER) buffer that stores the historical transitions, and μθ and Q(s, a) represent the actor and the critic, respectively.
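In an autograd framework, the actor update implied by Eq. 2.4 reduces to a few lines; a hedged PyTorch sketch follows, in which `actor`, `critic`, `actor_optimiser`, and the batch of `states` sampled from the replay buffer D are assumed interfaces introduced for the example.

```python
import torch

def dpg_actor_step(actor, critic, actor_optimiser, states):
    """One deterministic policy-gradient step (Eq. 2.4).

    `states` is a batch sampled from the experience replay buffer D.
    Minimising -Q(s, mu(s)) ascends J(theta): autograd chains
    grad_a Q(s, a)|_{a=mu(s)} through grad_theta mu(s), as in Eq. 2.4.
    """
    actions = actor(states)                      # a = mu_theta(s)
    actor_loss = -critic(states, actions).mean()
    actor_optimiser.zero_grad()
    actor_loss.backward()
    actor_optimiser.step()
```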

2.2 Deep reinforcement learning

Deep learning techniques [89] have been widely adopted to overcome the major limitations of traditional reinforcement learning algorithms, such as learning in environments with large state spaces or having to provide hand-specified features [158]. Deep neural networks (DNNs) as function approximators have indeed made it possible to approximate value functions and agents' policies [12]. In DQN [110], the Q-learning framework is extended with DNNs in order to approximate the state provided by the environment, while still keeping the historical experience in an experience replay buffer, which is used to sample data at training time. DQN learns to approximate the optimal action-value function.
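As a sketch of how the replay buffer is used at training time, the following DQN-style update samples a batch of stored transitions and regresses the network towards bootstrapped targets. The buffer layout, the separate `target_net` (the frozen-copy trick of the original DQN), and all tensor shapes are assumptions made for this example.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimiser, buffer, batch_size=64, gamma=0.99):
    """One DQN learning step from an experience replay buffer.

    `buffer` is assumed to store (state, action, reward, next_state, done)
    tuples of plain Python floats/lists; both networks map a state batch
    to one value estimate per action.
    """
    # Sampling at random decorrelates consecutive transitions.
    s, a, r, s2, done = map(torch.as_tensor, zip(*random.sample(buffer, batch_size)))

    # Q(s, a) for the actions that were actually taken.
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped targets, computed without tracking gradients.
    with torch.no_grad():
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())

    loss = F.mse_loss(q, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```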
