




warwick.ac.uk/lib-publications

A Thesis Submitted for the Degree of PhD at the University of Warwick

Permanent WRAP URL:
http://wrap.warwick.ac.uk/179925

Copyright and reuse:
This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself.
Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository home page.

For more information, please contact the WRAP Team at: wrap@warwick.ac.uk
THE UNIVERSITY OF WARWICK

Learning to Communicate in Cooperative Multi-Agent Reinforcement Learning

by

Emanuele Pesce

Thesis
Submitted to the University of Warwick in partial fulfilment of the requirements for admission to the degree of Doctor of Philosophy in Engineering

Warwick Manufacturing Group

February 2023
Contents

List of Tables  iv
List of Figures  v
Acknowledgments  viii
Declarations  ix
1 Publications  ix
2 Sponsorships and grants  x
Abstract  xi
Acronyms  xii
Chapter 1  Introduction  1
1.1 Research objectives  3
1.2 Contributions  3
1.3 Outline  4
Chapter 2  Literature Review  5
2.1 Reinforcement learning  6
2.2 Deep reinforcement learning  8
2.3 Multi-agent deep reinforcement learning  9
2.4 Cooperative methods  12
2.5 Emergence of communication  13
2.6 Communication methods  14
2.6.1 Attention mechanisms to support communication  17
2.6.2 Graph-based communication mechanisms  18
Chapter 3  Memory-driven communication  20
3.1 Introduction  20
3.2 Memory-driven MADDPG  22
3.2.1 Problem setup  22
3.2.2 Memory-driven communication  23
3.2.3 MD-MADDPG decentralised execution  29
3.3 Experimental settings  29
3.3.1 Environments  29
3.4 Experimental results  32
3.4.1 Main results  32
3.4.2 Implementation details  35
3.4.3 Increasing the number of agents  36
3.5 Communication analysis  37
3.6 Ablation studies  41
3.6.1 Investigate the memory components  41
3.6.2 Corrupting the memory  42
3.6.3 Multiple seeds  44
3.6.4 Multiple memory sizes  47
3.7 Summary  47
Chapter 4  Connectivity-driven communication  49
4.1 Introduction  49
4.2 Connectivity-driven communication  52
4.2.1 Problem setup  52
4.2.2 Learning the dynamic communication graph  52
4.2.3 Learning a time-dependent attention mechanism  54
4.2.4 Heat kernel: additional details and an illustration  56
4.2.5 Reinforcement learning algorithm  58
4.3 Experimental settings  61
4.3.1 Environments  61
4.3.2 Implementation details  64
4.4 Experimental results  65
4.4.1 Main results  65
4.4.2 Varying the number of agents  71
4.5 Communication analysis  72
4.6 Ablation studies  78
4.6.1 Investigating the heat-kernel components  78
4.6.2 Heat-kernel threshold  79
4.7 Summary  80
Chapter 5  Benchmarking MARL methods for cooperative missions of unmanned aerial vehicles  82
5.1 Introduction  82
5.2 Proposed drone environment  84
5.3 Competing algorithms  87
5.4 Experimental settings  92
5.5 Experimental results  94
5.6 Discussion  96
5.7 Summary  97
Chapter 6  Conclusions and future work  98
6.1 Conclusion  98
6.2 Future work  100
6.3 Ethical implications  102
List of Tables

3.1 Comparing MD-MADDPG with other baselines  33
3.2 Increasing the number of agents - Cooperative Navigation  36
3.3 Increasing the number of agents - PO Cooperative Navigation  37
3.4 Ablation study on MD-MADDPG components  43
3.5 Corrupting the memory content  43
4.1 Comparing CDC with other baselines  66
4.2 Comparative summary of MARL algorithms  67
4.3 Varying the number of agents  71
4.4 Graph analysis  75
4.5 Heat-kernel threshold  80
5.1 Common parameters of the environments  86
5.2 Summary of selected MARL algorithms  92
5.3 Environment parameters  93
5.4 Benchmarking results  94
List of Figures

2.1 Reinforcement learning  7
2.2 MADDPG  11
3.1 The MD-MADDPG framework  24
3.2 Environment illustrations  32
3.3 Learned communication strategies - write  38
3.4 Learned communication strategy - read  41
3.5 Changing seeds on Swapping Cooperative Navigation  45
3.6 Changing seeds on Sequential Cooperative Navigation  46
3.7 Investigating different memory dimensions  47
4.1 The CDC framework  53
4.2 An edge selection example  58
4.3 Environment illustrations  63
4.4 Learning curves - Navigation Control and Line Control  69
4.5 Learning curves - Formation Control and Dynamic Pack Control  70
4.6 Communication networks - Navigation Control and Line Control  73
4.7 Communication networks - Formation Control and Dynamic Pack Control  74
4.8 Average communication graphs - Navigation Control and Line Control  76
4.9 Average communication graphs - Formation Control and Dynamic Pack Control  77
4.10 Ablation study on CDC components  79
5.1 UAV representation  86
5.2 Learning curves  96
To my beloved Simona, for the endless love, care, and support throughout all these years together, and for believing in me more than anyone else.
Acknowledgments
This thesis would not have been possible without the help and support of many people. Firstly, I would like to express my gratitude to Professor Giovanni Montana and the WMG department for granting me the opportunity to pursue a fully-funded PhD at the University of Warwick. I am thankful to Giovanni for his guidance throughout this journey, consistently providing me with useful advice and contributing to the revision of my manuscripts. Additionally, I am grateful to Kurt Debattista for his support over the years. I would also like to extend my thanks to Luke Owen and Ramon Dalmau-Codina for their contributions to the development of the UAV environment. I would like to acknowledge Jeremie Houssineau and Raúl Santos-Rodríguez, my examiners, for their valuable advice to enhance this thesis. A big thank you goes to Professor Tony McNally for his assistance in coordinating the examination process. I also wish to extend my thanks to Professor Roberto Tagliaferri for being a source of inspiration and instilling in me a love for this field.

Furthermore, I wish to thank all the friends I have had the privilege of meeting along this path. Ruggiero, with whom I have shared countless memorable moments and technical discussions. Demetris, for his encouragement and inspiration every time I needed it. I am also thankful to Kevin, Massimo, Ozsel, Saad and Francesco, all of whom have played significant roles in making this PhD journey a more enjoyable and sociable experience.

Finally, I am incredibly grateful to my parents, Rocco and Rita, for always believing in me and encouraging me to be who I am. A special thanks goes to Simona, my beloved partner, who is the most caring person I have met and has consistently been a source of support during both joyful and challenging times.
Declarations
This thesis is submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy. It has been composed by myself and has not been submitted in any previous application for any degree.
1 Publications
Parts of this thesis have been previously published by the author in the following:

[125] Emanuele Pesce and Giovanni Montana. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. Machine Learning, 109(9):1727–1747, 2020.

[126] Emanuele Pesce and Giovanni Montana. Learning multi-agent coordination through connectivity-driven communication. Machine Learning, 2022. doi:10.1007/s10994-022-06286-6.

[127] Emanuele Pesce, Ramon Dalmau, Luke Owen, and Giovanni Montana. Benchmarking multi-agent deep reinforcement learning for cooperative missions of unmanned aerial vehicles. In Proceedings of the International Workshop on Citizen-Centric Multiagent Systems, pages 49–56. CMAS, 2023.
All the work published in [125–127] is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence visit /licenses/by/4.0/.
2 Sponsorships and grants

This research was funded by the University of Warwick.
Abstract
Recent advances in deep reinforcement learning have produced unprecedented results. The success obtained on single-agent applications led to exploring these techniques in the context of multi-agent systems, where several additional challenges need to be considered. Communication has always been crucial to achieving cooperation in multi-agent domains, and learning to communicate represents a fundamental milestone for multi-agent reinforcement learning algorithms. In this thesis, different multi-agent reinforcement learning approaches are explored. These provide architectures that are learned end-to-end and capable of achieving effective communication protocols that can boost the system performance in cooperative settings. Firstly, we investigate a novel approach where inter-agent communication happens through a shared memory device that can be used by the agents to exchange messages through learnable read and write operations. Secondly, we propose a graph-based approach where connectivities are shaped by exchanging pairwise messages, which are then aggregated through a novel form of attention mechanism based on a graph diffusion model. Finally, we present a new set of environments with real-world inspired constraints that we utilise to benchmark the most recent state-of-the-art solutions. Our results show that communication can be a fundamental tool to overcome some of the intrinsic difficulties that characterise cooperative multi-agent systems.
Acronyms
CDC  Connectivity-driven communication.
CLDE  Centralised learning decentralised execution.
CN  Cooperative navigation.
DDPG  Deep deterministic policy gradient.
DNN  Deep neural network.
DPG  Deterministic policy gradient.
DQN  Deep Q-network.
DRL  Deep reinforcement learning.
ER  Experience replay.
GNN  Graph neural network.
HK  Heat kernel.
KL  Kullback-Leibler.
LSTM  Long short-term memory.
MA  Multi-agent.
MADRL  Multi-agent deep reinforcement learning.
MA-MADDPG  Meta-agent MADDPG.
MADDPG  Multi-agent DDPG.
MARL  Multi-agent reinforcement learning.
MD-MADDPG  Memory-driven MADDPG.
MDP  Markov decision process.
NN  Neural network.
PC  Principal component.
PCA  Principal component analysis.
PG  Policy gradient.
PO  Partial observability.
PPO  Proximal policy optimization.
RL  Reinforcement learning.
RNN  Recurrent neural network.
TRPO  Trust region policy optimisation.
UAS  Unmanned aerial system.
VDN  Value decomposition network.
Chapter 1

Introduction
Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals [157]. Recently, deep neural networks (DNNs) [89, 141] have had a noticeable impact on RL [94]. They provide flexible models for learning value functions and policies, overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics [29, 121, 122]. Deep reinforcement learning (DRL) algorithms, which usually rely on deep neural networks to approximate functions, have been successfully employed in single-agent systems, including video game playing [111], robot locomotion [97], object localisation [18] and data-center cooling [38]. Following the uptake of DRL in single-agent domains, there is now a need to develop improved learning algorithms for multi-agent (MA) systems, where additional challenges arise. Multi-agent reinforcement learning (MARL) extends RL to problems characterized by the interplay of multiple agents operating in a shared environment. This scenario is typical of many real-world applications including robot navigation [162], autonomous vehicles coordination [15], traffic management [36], and supply chain management [90]. Compared to single-agent systems, MARL presents additional layers of complexity. Early approaches started exploring how deep reinforcement learning techniques can be utilised in multi-agent settings [23, 53, 155], where a need emerged for novel techniques specifically designed to tackle MA challenges.
Markov Decision Processes (MDP), upon which DRL methods rely, assume that the reward distribution and dynamics are stationary [58]. When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents' actions [86]. This issue, known as the moving-target problem [166], removes convergence guarantees and introduces additional learning instabilities. Further difficulties arise from environments characterized by partial observability [23, 128, 151], whereby the agents do not have full access to the world state, and where coordination skills are essential.
An important challenge in multi-agent deep reinforcement learning (MADRL) is how to facilitate communication among interacting agents. Communication is widely known to play a critical role in promoting coordination between humans [159]. Humans have been proven to excel at communicating even in the absence of a conventional code [32]. When coordination is required and no common languages exist, simple communication protocols are likely to emerge [144]. Human communication involves more than sending and receiving messages; it requires specialized interactive intelligence where receivers have the ability to recognize intentions and senders can properly design messages [178]. The emergence of communication has been widely investigated [47, 163]; for example, new signs and symbols can emerge when it comes to representing real concepts. Fusaroli et al. [46] demonstrated that language can be seen as a social coordination device learnt through reciprocal interaction with the environment for optimizing coordinative dynamics. The relation between communication and coordination has been widely discussed [34, 71, 109, 170]. Communication is an essential skill in many tasks. For instance, emergency response organizations must establish a clear way of communicating in order to properly manage critical and urgent situations; this is achieved by sharing information amongst the different agents involved, which is usually accomplished through years of common training [28]. In multiplayer video games, it is often essential to reach the sufficiently high level of coordination required to succeed, which is often acquired via communicating [20]. We believe that communication is a promising tool that needs to be exploited by MADRL models in order to enhance their performance in multi-agent environments. When this research was started, we noticed a lack of methods to enable inter-agent communication, so we decided to explore this area to contribute to filling a gap that had the potential for improving the collaboration process in a MA system.
1.1 Research objectives
The aim of this research is to explore novel communication models to enhance the performance of existing MARL methods. In particular, we focus on cooperative scenarios, which is where communication is needed the most by the agents in order to properly succeed and complete the assigned tasks. We investigate different approaches to achieving effective ways of communicating to boost the level of cooperation in multi-agent settings. The resulting communication protocols are learned end-to-end so that, at training time, they can be adapted by the agents to overcome the difficulties posed by the underlying environmental configuration. In addition, we also aim to analyse the content of the learned communication protocols.
1.2 Contributions
The main contributions made in this thesis are summarised as follows:

• in Chapter 3, we propose a novel multi-agent approach where inter-agent communication is obtained by providing a centralised shared memory that each agent has to learn to use in order to read and write messages for the others in sequential order;

• in Chapter 4, we discuss a novel multi-agent model that first constructs a graph of connectivities to encode pair-wise messages, which are then used to generate an agent-specific set of encodings through a proposed attention mechanism that utilises a diffusion model such as the heat-kernel (HK);

• in Chapter 5, we propose an environment to simulate drone behaviours in realistic settings and present a range of experiments in order to evaluate the performance of several state-of-the-art methods in such scenarios.
1.3 Outline
This section provides an outline of this thesis. The rest of this document is structured as follows. Chapter 2 reviews the existing MADRL models that relate to this work, with a special focus on cooperative algorithms. Chapter 3 introduces the first research contribution, which proposes a novel form of communication based on a shared memory cell. Chapter 4 presents the second research contribution, in which a graph-based architecture is exploited by a diffusion model to generate agent-specific messages. Chapter 5 proposes a novel environment to simulate a realistic scenario of drone navigation and discusses an extensive comparison of several state-of-the-art MADRL models. Chapter 6 concludes this work with a discussion of the results obtained and recommendations for future work.
Chapter 2

Literature Review
In this chapter, we introduce the RL setting and review the existing works related to multi-agent reinforcement learning. In Section 2.1, we discuss significant milestones in single-agent reinforcement learning to establish the foundational knowledge for the basic learning techniques that are later extended or utilized. Section 2.2 presents deep learning extensions of the previously mentioned approaches, serving as a connection between single-agent and multi-agent methodologies. Moving on to Section 2.3, we focus on how these approaches have been expanded to operate in multi-agent scenarios, with a particular emphasis on the training phases that are commonly employed in state-of-the-art works. We then categorize the multi-agent literature into the following groups:

• Cooperative methods (Section 2.4): works that concentrate on achieving cooperation between agents;

• Emergence of communication (Section 2.5): works that investigate how autonomous agents can learn languages;

• Communication methods (Section 2.6): works where agents must learn to communicate to enhance system performance.
In this review, we intentionally omitted specific research areas such as traditional game theory approaches [120, 123, 145], microgrid systems [27, 70, 72], and programming for parallel executions of agents [24, 45, 136]. Our primary focus was on multi-agent works based on reinforcement learning approaches, with a particular emphasis on communication methodologies.
Some of the methods mentioned in this review have also been chosen as baselines for the experiments presented in the subsequent chapters, particularly in Chapter 5, which plays a crucial role as it serves as a practical context for the proposed multi-agent approaches discussed in Chapters 3 and 4. By introducing a specifically designed environment to simulate drone behaviours in realistic settings, Chapter 5 indeed provides a practical platform for evaluating the performance of state-of-the-art MADRL models that employ the different communication and coordination methods discussed in this chapter.
2.1 Reinforcement learning
Reinforcement learning methods formalise the interaction of an agent (or actor) with its environment using a Markov decision process [129]. An MDP is defined as a tuple 〈S, A, R, T, γ〉, where S is the set that contains all the states of a given environment, A is a set of finite actions that can be selected by the agent, and the reward function R : S × A → ℝ defines the reward received by an agent when executing the action a ∈ A while being in a state s ∈ S. A transition function T : S × A → S describes how the environment determines the next state when starting from a state s ∈ S and given an action a ∈ A. The discount factor γ balances the trade-off between current and future rewards. As represented in Figure 2.1, an agent interacts with the environment by producing an action given the current state and receiving a reward in return. MDPs are suitable models for taking decisions in fully observable environments, where a complete description of all their elements is available to the agents and can be exploited by techniques such as the value iteration algorithm [9], which iteratively computes a value function that estimates the potential reward function of each state. A state-action value is instead calculated when the potential reward function is estimated using both the state and the action. When an MDP is solved, a stochastic policy π : S × A → [0, 1] is obtained to map states into actions.

Figure 2.1: A reinforcement learning setting. The environment provides an observation while the agent produces an action and receives a reward in return.

RL algorithms often make use of the agents' past experience of interacting with the environment. A well-known algorithm is Q-learning [176], a tabular approach that keeps track of the Q-functions Q(s, a), which estimate the discounted sum of future rewards for a given state-action pair. Every time the agent moves from a state s into a state s′ after taking an action a, the respective tabular entry is updated as follows:

Q(s, a) = Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]   (2.1)

where α ∈ [0, 1] is the learning rate.
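For concreteness, the snippet below is a minimal sketch of how the tabular update in Eq. 2.1 can be implemented, assuming an ε-greedy behaviour policy and a simple environment interface (`env.reset()` returning a discrete state, `env.step(a)` returning the next state, the reward and a termination flag); these interface details and hyperparameter values are illustrative assumptions rather than choices made in this thesis.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning following Eq. 2.1 (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) = Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```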
Policy gradient (PG) methods [157] represent an alternative approach to Q-learning, where the parameters θ of the policy are directly adjusted to maximise the objective function J(θ), defined as the expected γ-discounted sum of rewards over the time-steps t ∈ {1, . . . , T} of the environment, by taking steps in the direction of its gradient. Such gradient is calculated as follows:

∇_θ J(θ) = E_{a∼π_θ} [ ∇_θ log π_θ(a|s) Q(s, a) ]   (2.2)
The REINFORCE algorithm [179] utilises Eq. 2.2 in conjunction with a Monte Carlo estimation of full sampled trajectories to learn policy parameters in the following way:

θ ← θ + α ∇_θ log π_θ(a_t | s_t) G_t   (2.3)

where G_t is the Monte Carlo return computed from the sampled trajectory from time-step t onwards and α is a learning rate.
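As an illustration of how Eqs. 2.2 and 2.3 translate into practice, the following is a minimal PyTorch-style sketch of one REINFORCE update from a single sampled trajectory; the assumed `policy` callable (returning a `torch.distributions.Categorical`), the optimiser and the discount value are hypothetical and not taken from the methods discussed later.

```python
import torch

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a single sampled trajectory.

    `trajectory` is a list of (state, action, reward) tuples, where `action`
    is a tensor, and `policy(state)` is assumed to return a Categorical.
    """
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):      # Monte Carlo returns G_t
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    loss = 0.0
    for (s, a, _), G_t in zip(trajectory, returns):
        log_prob = policy(s).log_prob(a)      # log pi_theta(a_t | s_t)
        loss = loss - log_prob * G_t          # negated for gradient ascent on Eq. 2.2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```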
Policy gradient algorithms are renowned for suffering from high variance, which can significantly slow the learning process [79]. This issue is often mitigated by adding a baseline, such as the average reward or the state value function, that aims to correct the high variation at training time. Actor-critic methods [79] are composed of an actor module that selects the actions to take and a critic that provides the feedback necessary for the learning process. When the critic is able to learn both the state-action and the value functions, an advantage function can be calculated as the difference between these two estimates. A popular actor-critic algorithm is the Deterministic Policy Gradient (DPG) [149], in which the actor is updated through the gradient of the policy, while the critic utilises the standard Q-learning approach. In DPG the policy is assumed to be a deterministic function μ_θ : S → A and the gradient that maximises the objective function can be written as:

∇_θ J(θ) = E_{s∼D} [ ∇_θ μ_θ(s) ∇_a Q(s, a) |_{a=μ_θ(s)} ]   (2.4)

where D is an experience replay (ER) buffer that stores the historical transitions, and μ_θ and Q(s, a) represent the actor and the critic, respectively.
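The deterministic policy gradient in Eq. 2.4 is typically implemented by backpropagating the critic's value through the action produced by the actor. The sketch below illustrates this actor update for a single batch of states; the network sizes, optimiser and dummy batch are assumptions made purely for illustration and do not reproduce the architectures used in DPG or in the methods of this thesis.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # illustrative sizes

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    """Ascend Eq. 2.4: maximise Q(s, mu_theta(s)) over a batch of states."""
    actions = actor(states)                               # a = mu_theta(s)
    q_values = critic(torch.cat([states, actions], dim=-1))
    loss = -q_values.mean()                               # negated for gradient ascent
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

# usage with a dummy batch; in practice the states are sampled from the replay buffer D
actor_update(torch.randn(32, state_dim))
```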
2.2 Deep reinforcement learning

Deep learning techniques [89] have been widely adopted to overcome the major limitations of traditional reinforcement learning algorithms, such as learning in environments with large state spaces or having to provide hand-specified features [158]. Deep neural networks (DNNs) as function approximators have indeed made it possible to approximate value functions and agents' policies [12]. In DQN [110], the Q-learning framework is extended with DNNs in order to approximate the state provided by the environment, while still keeping the historical experience in an experience replay buffer which is used to sample data at training time. DQN learns to approximate the optimal action-value function.
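To make the above description concrete, the following sketch shows a single DQN gradient step computed from a replay-buffer batch, using a separate target network for the bootstrapped target; the network architecture, batch format and the omitted target-network synchronisation schedule are simplifying assumptions for illustration only.

```python
import torch
import torch.nn as nn

n_state_features, n_actions = 4, 3           # illustrative sizes
q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch, gamma=0.99):
    """One gradient step on a batch (s, a, r, s_next, done) sampled from the ER buffer.

    `a` is a batch of integer action indices and `done` a batch of 0/1 floats.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():                                         # bootstrapped target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```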