A Survey of Meta-Reinforcement Learning

arXiv [cs.LG], 19 Jan 2023

Jacob Beck* (University of Oxford) jacob.beck@cs.ox.ac.uk
Risto Vuorio* (University of Oxford) risto.vuorio@cs.ox.ac.uk
Evan Zheran Liu (Stanford University)
Zheng Xiong (University of Oxford) zheng.xiong@cs.ox.ac.uk
Luisa Zintgraf† (University of Oxford)
Chelsea Finn (Stanford University)
Shimon Whiteson (University of Oxford) shimon.whiteson@cs.ox.ac.uk

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its sample inefficiency. Meta-RL addresses this by using (sample-inefficient) machine learning to learn RL algorithms, or components of them, that are themselves sample efficient. This survey describes the meta-RL problem setting and the major families of meta-RL methods.

1 Introduction

Meta-reinforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL algorithms, or components thereof. Learning to learn has been treated as a machine learning problem in its own right for a significant period of time.

2 Background

2.1 Reinforcement learning

In RL, an agent interacts with a Markov decision process (MDP), referred to as the agent's environment. An MDP is defined by a tuple M = ⟨S, A, P, P_0, R, γ, T⟩, where S is the set of states, A the set of actions, P(s_{t+1} | s_t, a_t): S × A × S → R+ the probability of transitioning from state s_t to state s_{t+1} after taking action a_t, P_0(s_0): S → R+ a distribution over initial states, R(s_t, a_t): S × A → R the reward function, γ a discount factor, and T the horizon of an episode. A policy is a function π(a | s): S × A → R+ that maps states to action probabilities. This way, the policy and the MDP together induce a distribution over trajectories τ = (s_0, a_0, r_0, ..., s_T):

P(τ) = P_0(s_0) ∏_{t=0}^{T-1} π(a_t | s_t) P(s_{t+1} | s_t, a_t).

The standard RL objective is the expected discounted return,

J(π) = E_{τ∼P(τ)} [ Σ_{t=0}^{T-1} γ^t r_t ].

In practice, multiple episodes are gathered. If H episodes have been gathered, then D = {τ_h}_{h=0}^{H-1} is all of the data gathered so far, and we define an RL algorithm as a function f(D): ((S × A × R)^T)^H → Φ that maps this data to the parameters of a policy.

2.2 Meta-RL

The defining idea of meta-RL is instead to learn (parts of) the algorithm f using machine learning. Where RL learns a policy, meta-RL learns the RL algorithm, relieving the human from directly designing and implementing RL algorithms. The learned algorithm f_θ, with meta-parameters θ, is trained to maximize a meta-RL objective, and outputs the parameters of π_φ directly: φ = f_θ(D). We refer to the policy π_φ as the base policy with base parameters φ. Here, D is a meta-trajectory containing all of the data collected so far in the current task. Accordingly, we may call θ the outer-loop parameters and φ the inner-loop parameters: the outer loop meta-trains f_θ across tasks drawn from a distribution p(M), while the inner loop f_θ adapts the base policy within each task. In principle, meta-training can be supported by any set of tasks; however, it is common to require the state and action spaces S and A to be shared between all of the tasks, so that the tasks differ only in their reward and transition functions.

The interaction with each sampled task M_i, lasting H episodes, is called a trial, and the first K episodes of a trial may be reserved for free exploration. The meta-RL objective is then to maximize the expected return obtained after this exploration phase:

J(θ) = E_{M_i∼p(M)} [ E_D [ Σ_{τ∈D_{K:H}} G(τ) | f_θ, M_i ] ],    (3)

where G(τ) is the discounted return in the MDP M_i and H is the length of the trial, or the task-horizon. The inner expectation is over the meta-trajectories D generated by running the inner loop f_θ(D) in M_i.

2.3 Example algorithms

Two widely used example algorithms are Model-Agnostic Meta-Learning (MAML), which uses meta-gradients, and Fast RL via Slow RL (RL²), which uses recurrent neural networks [46, 239]. Many meta-RL algorithms build on designs similar to those used in MAML and RL², which makes them excellent illustrative examples.

MAML. Many designs of the inner-loop algorithm f_θ build on existing RL algorithms and use meta-learning to improve them. MAML [55] is an influential design following this pattern. Its inner loop is a policy gradient algorithm whose initial parameters φ_0 are the meta-parameters, meta-trained with gradient descent to be a good starting point for learning on tasks from the task distribution. When adapting to a new task M_i ∼ p(M), MAML collects data D with the initial policy and takes a policy gradient step:

φ_1 = f_θ(D, φ_0) = φ_0 + α ∇_{φ_0} Ĵ(D, π_{φ_0}),

where Ĵ(D, π_{φ_0}) is an estimate of the returns of the policy π_{φ_0} for the task M_i and α is the inner-loop learning rate. In the outer loop, the initialization is updated by gradient ascent on the post-adaptation returns:

φ_0 ← φ_0 + β ∇_{φ_0} E_{M_i∼p(M)} [ Ĵ(D_1, π_{φ_1}) ],

where π_{φ_1} is the policy for task i updated once by the inner loop, D_1 is data collected by that adapted policy, and β is a learning rate. (A toy numerical sketch of these two updates is given below.)

RL². In contrast, RL² [46, 239] represents the entire inner loop as a recurrent neural network that consumes the meta-trajectory D and whose hidden state adapts the base policy's behavior; the recurrent weights θ are trained end-to-end in the outer loop with a standard policy gradient, typically with a baseline for variance reduction. RL² directly optimizes the meta-RL objective in Equation 3, in general for any K, up to differences in the discounting, whereas MAML cannot trivially accommodate higher values of K.

2.4 Problem Categories

While the given problem setting applies to all of meta-RL, distinct clusters have emerged in the literature along two axes: whether adaptation must happen within a few episodes (few-shot) or may take many episodes (many-shot), and whether the agent must adapt to many related tasks (multi-task) or accelerate learning on a single task (single-task). In the few-shot multi-task setting, an agent must quickly adapt to a new task sampled from the same distribution used during training; RL² [46, 239] and MAML [55] are examples. In the many-shot multi-task setting, methods such as LPG and MetaGenRL instead aim to learn general-purpose RL algorithms from a diverse set of tasks. Methods for the many-shot single-task setting tend to meta-learn over windows of experience within one long-lived task, without resets, in order to accelerate standard RL.

[Figure: Problem categories in meta-RL. Few-shot meta-RL meta-learns over multiple similar tasks with the goal of learning new tasks within a few steps or episodes; zero-shot methods (e.g., RL², L2RL, VariBAD) must perform well from the start, while few-shot methods (e.g., MAML, DREAM) are given a free exploration phase. Many-shot multi-task methods (e.g., LPG, MetaGenRL) meta-learn over multiple diverse tasks with the goal of learning new tasks better than standard RL algorithms, and many-shot single-task methods (e.g., STACX, FRODO) meta-learn over windows of a single task, with no reset, with the goal of accelerating standard RL algorithms.]
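To make the inner- and outer-loop updates of MAML (Section 2.3) concrete, the following is a minimal numerical sketch. A toy quadratic objective stands in for the policy-gradient return estimate Ĵ, and the task distribution, dimensions, and learning rates are illustrative assumptions rather than details from the survey; in a real implementation the meta-gradient would come from automatic differentiation rather than the closed-form Jacobian used here.

```python
# A minimal numerical sketch of MAML-style inner- and outer-loop updates. A
# simple quadratic J_hat(phi) = -||phi - c||^2 stands in for the policy-gradient
# return estimate of a task whose target c plays the role of M_i; the task
# distribution, dimensions, and learning rates are illustrative assumptions,
# not details taken from the survey.
import numpy as np

rng = np.random.default_rng(0)
DIM, ALPHA, BETA = 3, 0.1, 0.5               # parameter dim., inner and outer learning rates
TASK_MEAN = np.array([2.0, -1.0, 0.5])       # p(M): task targets c ~ N(TASK_MEAN, 0.3^2 I)

def grad_J(phi, c):
    # Gradient of J_hat(phi) = -||phi - c||^2 with respect to phi.
    return -2.0 * (phi - c)

phi0 = np.zeros(DIM)                          # meta-parameters: the shared initialization
for step in range(200):
    tasks = TASK_MEAN + 0.3 * rng.normal(size=(8, DIM))   # batch of tasks M_i ~ p(M)
    meta_grads = []
    for c in tasks:
        # Inner loop: phi1 = phi0 + alpha * grad_phi0 J_hat(D, pi_phi0).
        phi1 = phi0 + ALPHA * grad_J(phi0, c)
        # Meta-gradient: d J_hat(phi1) / d phi0 = (d phi1 / d phi0)^T grad J_hat(phi1).
        # For this quadratic the inner-update Jacobian is (1 - 2*alpha) * I, so the
        # "gradient of a gradient" reduces to a scalar factor; in general it is
        # obtained with automatic differentiation or approximated.
        meta_grads.append((1.0 - 2.0 * ALPHA) * grad_J(phi1, c))
    # Outer loop: phi0 <- phi0 + beta * E_{M_i}[ grad_phi0 J_hat(D_1, pi_phi1) ].
    phi0 = phi0 + BETA * np.mean(meta_grads, axis=0)

print("meta-learned initialization:", phi0.round(2))      # approaches the task mean
```

Because the inner update is itself a gradient step, the outer loop differentiates through it (a gradient of a gradient); Section 3.1 discusses how such meta-gradients are estimated or approximated in practice.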
Few-shot meta-RL methods, grouped by how the inner loop is represented (see Sections 3.1 to 3.3):

Parameterized policy gradients
• MAML-like: Finn et al. [55], Li et al. [124], Sung et al. [219], Vuorio et al. [235], and Zintgraf et al.
• Distributional MAML-like
• Meta-gradient estimation: Foerster et al. [60], Al-Shedivat et al. [207], Stadie et al. [216], Liu et al. [133], Mao et al. [139], Fallah et al. [52], Tang [222], and Vuorio et al. [234]

Black box
• Recurrent inner loop: Heess et al. [88], Duan et al. [46], Wang et al. [239], Humplik et al. [95], Fakoor et al. [51], Yan et al. [256], Zintgraf et al. [281], Liu et al. [130], and Zintgraf et al. [282]
• Attention: Mishra et al. [150], Fortunato et al. [62], Emukpere et al. [49], Ritter et al. [190], Wang et al. [240], and Melo [141]
• Hypernetworks: Xian et al. [250] and Beck et al. [17]

Task inference
• Multi-task pre-training: Humplik et al. [95], Kamienny et al. [104], Raileanu et al. [182], Liu et al. [130], and Peng et al. [174]
• Latent variables: Zhou et al. [278], Raileanu et al. [182], Zintgraf et al. [281], Zhang et al. [268], Zintgraf et al. [282], Beck et al. [17], He et al. [86], and Imagawa et al. [97]
• Contrastive learning: Fu et al. [64]

3 Few-Shot Meta-RL

Consider a robot chef that must learn to cook in home kitchens. Training a new policy from scratch in every kitchen where the robot is deployed would require far too much data, so we want the robot to adapt to each new kitchen within a few episodes of interaction. Training such an agent with meta-RL involves challenges unique to this few-shot setting. Recall that meta-RL itself learns a learning algorithm f_θ. This places unique demands on how the inner loop is represented and trained, and the methods discussed in this section differ primarily in that choice:

• Parameterized policy gradient methods build the structure of existing policy gradient algorithms into the inner loop (Section 3.1).
• Black box methods represent the inner loop with a general-purpose function approximator, such as a recurrent network, and impose little structure on it (Section 3.2).
• Task inference methods structure the inner loop around identifying the task (Section 3.3).

[Figure: Comparison of PPG and black box methods. PPG methods (e.g., MAML) obtain their inductive bias from the structure of the inner loop, which can aid generalization, while black box methods must obtain their inductive bias from data.]

Learning the inner loop also raises challenges that cut across these method families. One such challenge concerns supervision: in the standard meta-RL problem setting, rewards are available during both meta-training and meta-testing, yet providing this supervision can itself be difficult. For example, it may be difficult to manually design an informative task distribution for meta-training. Section 3.6 discusses these and related challenges, along with alternative sources of supervision.

3.1 Parameterized Policy Gradient Methods

Meta-RL learns a learning algorithm f_θ, the inner loop, and we call the choice of function class for f_θ its parameterization. In this section we discuss one way of parameterizing the inner loop that builds in the structure of existing standard RL algorithms. Parameterized policy gradient (PPG) methods generally take the form

φ_{j+1} = f_θ(D_j, φ_j) = φ_j + α_θ ∇_{φ_j} Ĵ_θ(D_j, π_{φ_j}),

where j indexes the inner-loop updates and the meta-parameters θ may include, for example, the initial base parameters φ_0 and the learning rate α_θ. More generally, the gradient can be preconditioned by a meta-learned matrix M_θ:

φ_{j+1} = φ_j + α_θ M_θ ∇_{φ_j} Ĵ_θ(D_j, π_{φ_j})   [255, 170, 58].

While a value-based method could be used in the inner loop instead, most few-shot methods build on policy gradients; in general, the base parameters can be updated by back-propagation in a PPG method or produced by a neural network in a black box method. Some PPG methods learn a full distribution over initial policy parameters, p(φ_0) [82, 260, 242, 285, 73]; this distribution can be fit via variational inference [82, 73], and the distribution itself can be updated in the inner loop. Other methods adapt only the weights and biases of the last layer of the policy [181], while leaving the rest of the parameters fixed, or adapt only a context vector on which the policy is conditioned; in this case, the input to the policy itself parameterizes the adaptation.

Meta-gradient estimation in outer-loop optimization. Estimating gradients for the outer loop is complicated by the fact that the inner loop itself contains policy-gradient updates. Optimizing the outer loop therefore requires taking the gradient of a gradient, or a meta-gradient, which involves second-order derivatives of the inner-loop objective. A further subtlety is the dependence of the data used by the inner loop on prior policies: these prior policies affect the distribution of data sampled in D, used later by the inner-loop learning algorithm, so ignoring the corresponding gradient terms in the policy yields a biased estimate of the meta-gradient. To avoid the cost of exact meta-gradients, a method may use a first-order approximation [63], or use gradient-free optimization to optimize the outer loop.

Outer-loop algorithms. While most PPG methods use a policy-gradient algorithm in the outer loop as well, this is not the only option. Additionally, one can train task-specific experts and then use these for imitation learning in the outer loop, while the inner loop still learns exploratory behavior by optimizing Equation 3.

3.2 Black Box Methods

In contrast, black box methods represent the inner loop f_θ with a universal function approximator. This places fewer constraints on the function f_θ than with a PPG method, whose updates are constrained by the structure of a policy gradient. The inner loop can adapt the base policy in different ways. By conditioning the policy on a context vector produced by the inner loop, all of the weights and biases of the policy are shared and must generalize between all tasks. However, when significantly distinct policies are required for different tasks, it can be preferable to produce the policy directly: the inner loop may output all of the parameters of a feed-forward policy network, for example via a hypernetwork.

Inner-loop representation. While many black box methods represent the inner loop with recurrent neural networks [88], it can also be represented with other sequence models, such as attention mechanisms.
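As a concrete illustration of a black-box inner loop, the sketch below rolls out an RL²-style recurrent policy: the network consumes the stream of (state, previous action, reward, done) tuples, and its hidden state, carried across episode boundaries within a trial, plays the role of the adapted base parameters. The tiny bandit-like environment, the network sizes, and the random (untrained) weights are illustrative assumptions; meta-training the weights end-to-end in the outer loop is omitted.

```python
# A minimal sketch of a black-box (RL^2-style) inner loop: a recurrent network
# consumes the stream of (state, previous action, reward, done) tuples, and its
# hidden state, carried across episode boundaries within a trial, plays the
# role of the adapted base parameters. The environment, sizes, and random
# (untrained) weights are illustrative assumptions; meta-training the weights
# end-to-end in the outer loop is omitted.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS, HIDDEN = 4, 3, 32
IN_DIM = STATE_DIM + NUM_ACTIONS + 2      # state, one-hot prev action, reward, done flag

# The recurrent weights are the meta-parameters theta.
W_in = rng.normal(0.0, 0.1, (HIDDEN, IN_DIM))
W_h = rng.normal(0.0, 0.1, (HIDDEN, HIDDEN))
W_pi = rng.normal(0.0, 0.1, (NUM_ACTIONS, HIDDEN))

def rnn_policy(h, state, prev_action, reward, done):
    """One inner-loop step: fold the latest transition into the hidden state
    and return the updated hidden state plus action probabilities."""
    x = np.concatenate([state, np.eye(NUM_ACTIONS)[prev_action], [reward], [float(done)]])
    h = np.tanh(W_in @ x + W_h @ h)
    logits = W_pi @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

# A stand-in task M_i ~ p(M): one action pays off, the others do not.
goal_action = int(rng.integers(NUM_ACTIONS))
def env_reset():
    return rng.normal(size=STATE_DIM)
def env_step(action):
    return rng.normal(size=STATE_DIM), float(action == goal_action), False

# One trial: the hidden state is NOT reset between episodes, so information
# gathered while exploring in early episodes can shape behavior in later ones.
h = np.zeros(HIDDEN)
prev_action, reward, done = 0, 0.0, False
for episode in range(3):
    state, ep_return = env_reset(), 0.0
    for t in range(10):
        h, probs = rnn_policy(h, state, prev_action, reward, done)
        prev_action = int(rng.choice(NUM_ACTIONS, p=probs))
        state, reward, done = env_step(prev_action)
        ep_return += reward
    print(f"episode {episode}: return {ep_return:.0f}")
```

With meta-trained weights, the return in later episodes would improve as the hidden state accumulates evidence about which action the current task rewards.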
Outer-loop algorithms. While many black box methods use on-policy algorithms in the outer loop [46, 239, 281], it is straightforward to use off-policy algorithms instead [185, 51, 130], which brings the usual gains in sample efficiency.

Black box trade-offs. One key benefit of black box methods is that they can rapidly alter their policies in response to each new piece of data, rather than waiting for enough data to estimate a policy gradient. On the other hand, they often struggle to generalize outside of the meta-training distribution p(M) [252]. Consider the robot chef: while it may adapt within a few timesteps to kitchens resembling those seen during meta-training, a black box inner loop may fail in a substantially different kitchen. The approaches can also be combined: after deploying a fully black-box method, the policy or inner loop can be fine-tuned with policy gradients at meta-test time.

3.3 Task Inference Methods

Task inference methods cast adaptation as identifying the task. If the task were known, the agent could simply condition its behavior on the task, as it does during training for each task, with no planning required. In fact, training a policy over a distribution of tasks, with access to the true task, can be taken as the definition of multi-task RL [263]. In this spirit, task inference methods infer a representation of the current task from the data collected so far and either condition the base policy on it or map the inferred task directly to the weights of the policy.

Task inference with privileged information. A straightforward way to train task inference is to use privileged information available during meta-training, such as a ground-truth description c_M of the task, as a supervised target: an inference network is trained to predict c_M, or an embedding of it, from the meta-trajectory D.

Task inference with multi-task training. Some research uses the multi-task setting to improve task inference. First, a multi-task policy is trained, conditioned on an embedding g_θ(c_M) of the task that is learned so that it contains only the information needed to solve the task [95, 130]. After this, g_θ(c_M) can be inferred in meta-learning by training an inference network to predict it from the collected data. Multi-task RL may be easier than meta-RL, but exploration is still needed for the meta-RL policy to identify the task. In this case, instead of only inferring the task from passively collected data, the agent must take sufficiently many exploratory actions; and when tasks require significantly different behavior, sharing policies between them becomes less feasible. Often, intrinsic rewards are used to train such exploratory behavior.

Task inference without privileged information. Other task inference methods do not rely on privileged information at all. For instance, a task can be represented by a latent variable inferred from the agent's experience of the reward or transition function [278, 281, 268, 280, 86], and task inference can use contrastive learning [64].

[Figure: Illustration of a trial of H episodes with free exploration in the first K episodes, followed by exploitation in the remaining episodes.]

The inferred distribution over tasks can also be modeled using a variational information bottleneck.

3.4 Exploration and Meta-Exploration

Exploration in standard RL should work for any MDP and may consist of random on-policy exploration, epsilon-greedy exploration, or similar general-purpose strategies. In meta-RL, this type of exploration still occurs in the outer loop. However, there additionally exists exploration in the inner loop, often called meta-exploration (Zhou et al. [278], Gurumurthy et al. [83], Fu et al. [64], Liu et al. [130], and Zhang et al. [268]), which can be tailored to the task distribution p(M) to enable sample-efficient adaptation. Recall that in the few-shot adaptation setting, on each trial the agent is placed into a new task sampled from the task distribution. The agent must explore to gather information about this task and then focus on solving it in the next few episodes (i.e., over the H−K episodes in Equation 3). It must balance exploration, potentially even beyond the initial few shots, with exploiting what it already knows to achieve high rewards. It is always optimal to explore in the first K episodes, since their returns do not count toward the meta-RL objective. In the remaining episodes, when H−K is large, sacrificing short-term rewards to learn a better policy for higher later returns pays dividends, while when H−K is small, the agent must exploit more to obtain any reward it can, optimally trading off the two.

End-to-end optimization. Perhaps the simplest approach is to learn to explore and exploit end-to-end by directly maximizing the meta-RL objective (Equation 3), as done by black box meta-RL approaches [46, 239, 150, 216, 26]. Approaches in this category implicitly learn to explore, as they directly optimize the meta-RL objective, whose maximization requires exploration. More specifically, the returns in the later episodes, Σ_{τ∈D_{K:H}} G(τ), can only be maximized if the policy appropriately explores in the first K episodes, so maximizing the meta-RL objective can yield optimal exploration in principle. This approach works well when complicated exploration strategies are not needed. For example, if attempting several tasks in the distribution of tasks is a reasonable form of exploration for a particular task distribution, then end-to-end optimization may work well. When rewards are sparse, however, the signal for exploration is weak: the robot chef is only rewarded for locating ingredients (i.e., exploring) if doing so results in a cooked meal. Hence, it is challenging to learn exploration end-to-end in such cases.

Posterior sampling. To circumvent the challenge of implicitly learning to explore, Rakelly et al. propose to maintain a distribution over what the identity of the task is, and then to iteratively refine this distribution by interacting with the environment: in each episode, the agent samples a task from the current posterior and acts as if it were the true task, in analogy to posterior sampling in standard RL. Alternatively, exploration can be driven directly by task inference, for example by rewarding information gain over the task distribution [64, 130] or a reduction in uncertainty of the posterior over the dynamics and reward function. Some methods decouple the two phases entirely: an exploration policy explores for the first K episodes, and then the exploitation policy exploits for the remaining H−K episodes. A risk of exploring only to reduce uncertainty is that the resulting observations may carry information about the task dynamics, but are irrelevant for a robot chef trying to cook a meal.
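The sketch below gives a minimal, self-contained instance of this kind of posterior-sampling exploration on a toy Bernoulli-bandit task family: the agent keeps a belief over which task it is in, samples one hypothesis per episode, acts as if that hypothesis were true, and updates the belief from the observed rewards. The task family, constants, and names are illustrative assumptions, not details taken from the survey.

```python
# A minimal sketch of posterior-sampling exploration: the agent keeps a belief
# over which task it is in, samples one hypothesis per episode, acts as if that
# hypothesis were true, and updates the belief from the observed rewards. The
# Bernoulli-bandit task family and all constants are illustrative assumptions
# standing in for p(M), not details taken from the survey.
import numpy as np

rng = np.random.default_rng(1)
NUM_TASKS = 5          # task i: arm i pays reward with prob. 0.9, every other arm with prob. 0.1
EPISODE_LEN = 4
H = 8                  # episodes in the trial

def reward_prob(arm, task):
    return 0.9 if arm == task else 0.1

true_task = int(rng.integers(NUM_TASKS))               # the unknown task the agent is dropped into
posterior = np.full(NUM_TASKS, 1.0 / NUM_TASKS)        # initial belief over tasks

for episode in range(H):
    hypothesis = int(rng.choice(NUM_TASKS, p=posterior))   # sample a task from the posterior
    for t in range(EPISODE_LEN):
        arm = hypothesis                                    # act as if the hypothesis were true
        r = float(rng.random() < reward_prob(arm, true_task))
        # Bayes update of the belief given the observed (arm, reward) pair.
        likelihood = np.array([reward_prob(arm, k) if r > 0 else 1.0 - reward_prob(arm, k)
                               for k in range(NUM_TASKS)])
        posterior = posterior * likelihood
        posterior /= posterior.sum()
    print(f"episode {episode}: hypothesis={hypothesis}, "
          f"belief in true task={posterior[true_task]:.2f}")
```

Exploration here is a by-product of uncertainty in the belief: once the posterior concentrates on the true task, sampling from it and exploiting the sampled hypothesis coincide.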
[Figure: Comparison of optimal exploration, posterior sampling, and irrelevant exploration over the first few episodes of a trial.]

Intrinsic rewards of this kind have the additional benefit that they can be used to train an exploration policy exclusively from off-policy data. For example, using random network distillation [29], a reward may add an incentive for novelty [282], or add an incentive for gathering data where the TD-error is high [77]. Many of these rewards encourage the agent to reduce its uncertainty for its own sake.

3.5 Bayes-Adaptive Optimality

In contrast, a Bayes-optimal agent does not reward reductions in uncertainty for their own sake. Instead, optimal exploration only reduces uncertainty insofar as doing so increases expected return, since the time available for exploration is limited. Therefore, in this section we discuss approximate Bayes-optimal policies and analyze their behavior using the framework of Bayes-adaptive Markov decision processes (BAMDPs). To determine the optimal exploration strategy, the agent must account for its uncertainty about the dynamics and reward function. From a high level, an agent that acts optimally in the BAMDP maximizes returns when placed into an unknown MDP. Crucially, the dynamics of the BAMDP are defined over an augmented state that includes a belief b_t, which characterizes the current uncertainty as a distribution over potential reward and transition functions given the trajectory τ_{:t} = (s_0, a_0, r_0, ..., s_t) observed so far; the initial belief b_0 is the prior p(R, P). The states of the BAMDP are then the augmented states (s_t, b_t). Specifically, the BAMDP reward is the expected reward under the current belief,

R⁺(s_t, b_t, a_t) = E_{R∼b_t} [ R(s_t, a_t) ],    (4)

and the transition function of the BAMDP is

P⁺(s_{t+1}, b_{t+1} | s_t, b_t, a_t) = E_{R,P∼b_t} [ P(s_{t+1} | s_t, a_t) δ(b_{t+1} = p(R, P | τ_{:t+1})) ].    (5)

(A small worked instance of this construction on a discrete problem is sketched below.)

Learning an approximately Bayes-optimal policy. Directly computing Bayes-optimal policies requires planning over beliefs, which is intractable for all but the smallest problems. Instead, an approximate belief can be represented with an inference network over latent variables m, which can be learned by rolling out the policy to obtain data. Even when it is not feasible to learn exactly Bayes-adaptive optimal policies, the framework of BAMDPs can still offer a helpful perspective on meta-RL methods. First, black box meta-RL algorithms such as RL² learn a recurrent policy that conditions not only on the current state s_t, but on the history of observed states, actions, and rewards τ_{:t}. Since this history is sufficient for computing the belief state, such meta-RL algorithms can in principle learn Bayes-adaptive policies. In practice, however, they often struggle to do so, because the optimization is challenging. Liu et al. highlight one such optimization challenge for black box meta-RL where the agent is given a few "free" episodes to explore, and the objective is to maximize the returns beginning from the first time step. Approximately Bayes-optimal behavior can still be useful; for the robot chef, the approximation may merely result in using less suitable utensils or ingredients. Second, a policy conditioned on a representation of the current task, which is equivalent to the belief state, can be sufficient for optimally solving the meta-RL problem, even if it does not make use of all of this state.

3.6 Supervision

In this section, we discuss most of the different types of supervision considered in meta-RL. In several of these settings, the agent has access to additional supervision (e.g., expert trajectories or other privileged information during meta-training and/or testing). Each of these settings is discussed in turn.

[Table: settings with varying supervision, including meta-RL via imitation.]
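The following is the small worked instance of the BAMDP construction in Equations 4 and 5 referenced in Section 3.5, assuming a toy three-state chain whose dynamics are known but whose reward function is one of two hypotheses. The augmented state (s_t, b_t) carries a belief over the hypotheses; the BAMDP reward is the expected reward under that belief, and the belief part of the transition is the Bayesian posterior after each observation. The environment and all names are illustrative assumptions, not an example from the survey.

```python
# A small worked instance of the BAMDP construction in Eqs. 4-5, under the
# assumption of a toy 3-state chain: dynamics are known, but the reward
# function is one of two hypotheses (goal at the left or right end). The
# augmented state (s_t, b_t) carries a belief over hypotheses; the BAMDP
# reward is the expected reward under the belief (Eq. 4) and the belief
# transitions to the Bayesian posterior after each observation (Eq. 5).
# The environment and all names here are illustrative, not from the survey.
import numpy as np

LEFT, RIGHT = 0, 1
GOALS = [0, 2]                            # hypothesis k: state GOALS[k] is rewarding

def step_dynamics(s, a):
    # Known, deterministic chain dynamics shared by all hypotheses.
    return max(0, s - 1) if a == LEFT else min(2, s + 1)

def hyp_reward(s, a, k):
    # Hypothesised reward function R_k(s, a): reward 1 for stepping onto goal k.
    return 1.0 if step_dynamics(s, a) == GOALS[k] else 0.0

def bamdp_reward(s, b, a):
    # Eq. 4: R+(s_t, b_t, a_t) = E_{R ~ b_t}[ R(s_t, a_t) ].
    return sum(b[k] * hyp_reward(s, a, k) for k in range(len(GOALS)))

def belief_update(b, s, a, r):
    # Belief part of Eq. 5: the next belief is (a delta on) the Bayesian
    # posterior over reward hypotheses given the newly observed reward.
    likelihood = np.array([1.0 if hyp_reward(s, a, k) == r else 0.0
                           for k in range(len(GOALS))])
    post = b * likelihood
    return post / post.sum()

# One short trajectory in the BAMDP, starting uncertain in the middle state.
s, b = 1, np.array([0.5, 0.5])            # augmented state (s_t, b_t) with uniform prior b_0
true_k = 1                                # the actual (unknown to the agent) reward hypothesis
for a in [RIGHT, LEFT, LEFT]:
    print(f"s={s}, b={b}, a={a}, expected BAMDP reward={bamdp_reward(s, b, a):.2f}")
    r = hyp_reward(s, a, true_k)          # reward observed in the underlying MDP
    b = belief_update(b, s, a, r)         # belief part of the BAMDP transition
    s = step_dynamics(s, a)               # state part (dynamics are known here)
```

A Bayes-optimal policy for this BAMDP would pick actions that maximize expected return given both s_t and b_t, exploring one end of the chain only because resolving the belief increases expected future reward.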
