A Survey of Meta-Reinforcement Learning

arXiv [cs.LG], 19 Jan 2023

Jacob Beck* (jacob.beck@cs.ox.ac.uk), University of Oxford
Risto Vuorio* (risto.vuorio@cs.ox.ac.uk), University of Oxford
Evan Zheran Liu, Stanford University
Zheng Xiong (zheng.xiong@cs.ox.ac.uk), University of Oxford
Luisa Zintgraf, University of Oxford
Chelsea Finn, Stanford University
Shimon Whiteson (shimon.whiteson@cs.ox.ac.uk), University of Oxford

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile suc[cesses ...], [it is hel]d back from more widespread adoption by [...]

1 Introduction

[Meta-re]inforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL [algorithms ...] [trea]ted as a machine learning problem for a significant period of time. Intriguingly, [...]

2 Background

2.1 Reinforcement learning

[... referred] to as the agent's environment. An MDP is defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, P_0, R, \gamma, T \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P(s_{t+1} \mid s_t, a_t): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ the probability of transitioning from state $s_t$ to state $s_{t+1}$ after taking action $a_t$, and $P_0(s_0): \mathcal{S} \to \mathbb{R}_+$ a distribution [over initial states ...]. A policy is a function $\pi(a \mid s): \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ that maps states to action probabilities. This way, [the policy and the MDP together define a distribution over trajectories,]

$$P(\tau) = P_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).$$

[The RL objective is the expected discounted return,]

$$J(\pi) = \mathbb{E}_{\tau \sim P(\tau)}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big].$$

[... In meta-RL, mu]ltiple episodes are gathered. If $H$ episodes have been gathered, then $\mathcal{D} = \{\tau_h\}_{h=0}^{H-1}$ is all of the data [collected so far. We can then] define an RL algorithm as the function $f(\mathcal{D}): ((\mathcal{S} \times \mathcal{A} \times \mathbb{R})^T)^H \to \Phi$. In practice, the data may include [...]

2.2 Meta-RL definition

[The idea of meta-RL] is instead to learn (parts of) an algorithm $f$ using machine learning. Where RL learns a policy, [meta-RL learns the algorithm] $f$, [relieving] the human from directly designing and implementing the RL algorithms. [The algorithm is parameterized by meta-parameters $\theta$, trained] to maximize a meta-RL objective. Hence, $f_\theta$ outputs the parameters of $\pi_\phi$ directly: $\phi = f_\theta(\mathcal{D})$. We refer to the policy $\pi_\phi$ as the base policy with base parameters $\phi$. Here, $\mathcal{D}$ is a meta-trajectory [...]. [Accor]dingly, we may call $\theta$ the outer-loop parameters and $\phi$ the [inner-loop parameters. ... In principle the setting can b]e supported by any set of tasks. However, [it is common to require] $\mathcal{S}$ and $\mathcal{A}$ to be shared between all of the tasks, and the tasks to only [differ in their reward and transition functions ...]

[The meta-RL objective is]

$$\mathcal{J}(\theta) = \mathbb{E}_{\mathcal{M}_i \sim p(\mathcal{M})}\Big[\mathbb{E}_{\mathcal{D}}\Big[\sum_{\tau \in \mathcal{D}_{K:H}} G(\tau) \;\Big|\; f_\theta, \mathcal{M}_i\Big]\Big], \qquad (3)$$

where $G(\tau)$ is the discounted return in the MDP $\mathcal{M}_i$ and $H$ is the length of the trial, or the task-[horizon ... the inner loop] $f_\theta(\mathcal{D})$.

2.3 Example algorithms

[Two canonical examples are Model-Agnostic] Meta-Learning (MAML), which uses meta-gradients, and Fast RL via Slow RL (RL²), which uses recurrent neural networks [46, 239]. Many meta-RL algorithms build on [ideas] similar to those used in MAML and RL², which makes them excellent [illustrative examples].

MAML. Many designs of the inner-loop algorithm $f_\theta$ build on existing RL algorithms and use meta-learning to improve them. MAML [55] is an influential design following this pattern. Its [inner loop adapts initial policy paramete]rs [...] with gradient descent, [meta-learned] to be a good starting point for learning on tasks from the task distribution. When adapting to a new task, MAML collects data [with the initial policy and takes a policy gradient] step for a task $\mathcal{M}_i \sim p(\mathcal{M})$:

$$\phi_1 = f_\theta(\mathcal{D}, \phi_0) = \phi_0 + \alpha \nabla_{\phi_0} \hat{J}(\mathcal{D}, \pi_{\phi_0}),$$

where $\hat{J}(\mathcal{D}, \pi_{\phi_0})$ is an estimate of the returns of the policy $\pi_{\phi_0}$ for the task $\mathcal{M}_i$ and $\alpha$ is the [inner-loop learning rate. The outer loop updates the initialization using the returns of the adapted policies:]

$$\phi_0 \leftarrow \phi_0 + \beta \nabla_{\phi_0} \mathbb{E}_{\mathcal{M}_i \sim p(\mathcal{M})}\big[\hat{J}(\mathcal{D}_1, \pi_{\phi_1^i})\big],$$

where $\pi_{\phi_1^i}$ is the policy for task $i$ updated once by the inner loop, $\beta$ is a learning rate, and [the outer-loop gradient is a policy gradi]ent [...] policy for variance reduction. [...] higher values of $K$ [... in gene]ral with $K$ [...], up to differences in the discounting. To optimize [...] the RNN. However, MAML cannot trivially [...]
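To make the MAML-style inner and outer loops above concrete, the following is a minimal sketch in PyTorch. It is not the full RL algorithm of [55]: the task distribution and the return estimate $\hat{J}$ are replaced by a toy differentiable surrogate (an assumption made purely for illustration), but the structure is the same, namely one inner gradient step from a meta-learned initialization followed by an outer update that differentiates through that step.

```python
# Minimal MAML-style sketch with a toy surrogate objective (not the paper's RL setup).
# Assumption for illustration: each "task" is a quadratic return surrogate
# j_hat(phi) = -||phi - c||^2 with a task-specific optimum c, standing in for the
# policy-gradient return estimate J_hat(D, pi_phi).
import torch

torch.manual_seed(0)
dim, alpha, beta, n_tasks = 5, 0.1, 0.01, 8
phi0 = torch.zeros(dim, requires_grad=True)           # meta-learned initialization
meta_opt = torch.optim.SGD([phi0], lr=beta)

def sample_task():
    """Sample a task M_i ~ p(M); here just a random optimum c."""
    return torch.randn(dim)

def j_hat(phi, c):
    """Toy stand-in for the return estimate J_hat(D, pi_phi) on task c."""
    return -((phi - c) ** 2).sum()

for meta_step in range(1000):
    meta_opt.zero_grad()
    meta_objective = 0.0
    for _ in range(n_tasks):
        c = sample_task()
        # Inner loop: one gradient *ascent* step from phi0; keep the graph so the
        # outer loop can differentiate through the adaptation (the meta-gradient).
        grad = torch.autograd.grad(j_hat(phi0, c), phi0, create_graph=True)[0]
        phi1 = phi0 + alpha * grad                     # adapted parameters phi_1
        meta_objective = meta_objective + j_hat(phi1, c)
    # Outer loop: ascend the post-adaptation objective averaged over tasks.
    (-meta_objective / n_tasks).backward()
    meta_opt.step()
```

In a real PPG-style method, `j_hat` would be a policy-gradient surrogate computed from trajectories collected with $\pi_{\phi_0}$, and the outer update would itself be estimated with a policy gradient.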
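RL², by contrast, implements the inner loop as the forward pass of a recurrent network. Below is a minimal sketch of such a history-conditioned policy; the input layout, dimensions, and the choice of a GRU cell are illustrative assumptions rather than the exact architecture of [46, 239], and the outer-loop RL training that would optimize it end-to-end is omitted.

```python
# Sketch of an RL^2-style black-box inner loop: a recurrent policy whose input at each
# step is (state, previous action, previous reward, done flag) and whose hidden state is
# carried across episode boundaries within a trial. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RL2Policy(nn.Module):
    def __init__(self, state_dim=4, num_actions=3, hidden_dim=64):
        super().__init__()
        in_dim = state_dim + num_actions + 2           # state, one-hot prev action, prev reward, done
        self.rnn = nn.GRUCell(in_dim, hidden_dim)
        self.pi = nn.Linear(hidden_dim, num_actions)   # action logits
        self.num_actions = num_actions

    def forward(self, state, prev_action, prev_reward, done, h):
        a_onehot = F.one_hot(prev_action, self.num_actions).float()
        x = torch.cat([state, a_onehot, prev_reward, done], dim=-1)
        h = self.rnn(x, h)                             # hidden state accumulates task knowledge
        return torch.distributions.Categorical(logits=self.pi(h)), h

# Usage: the hidden state h is reset between trials but *not* between episodes of the
# same trial, so adaptation to the current task happens purely in the RNN activations.
policy = RL2Policy()
h = torch.zeros(1, 64)
state = torch.zeros(1, 4)
prev_action = torch.zeros(1, dtype=torch.long)
prev_reward, done = torch.zeros(1, 1), torch.zeros(1, 1)
dist, h = policy(state, prev_action, prev_reward, done, h)
action = dist.sample()
```

Because the hidden state is only reset between trials, experience from earlier episodes of a trial can change behavior in later episodes without any gradient update, which is the sense in which the recurrent dynamics implement $f_\theta$.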
2.4 Problem Categories

While the given problem setting applies to all of meta-RL, distinct clusters in the literature have [emerged ...] multi-task setting. In this setting, an agent must quickly [...] [du]ring training. Methods for this many-shot single-task setting tend to [...]

[Figure: categories of meta-RL settings. Few-shot multi-task methods meta-learn over multiple (similar) tasks with the adaptation goal of learning new tasks within a few steps or episodes; zero-shot methods that must perform well from the start include RL², L2RL, and VariBAD, while few-shot methods with a free exploration phase include MAML and DREAM. Many-shot multi-task methods (e.g., LPG, MetaGenRL) meta-learn over multiple (diverse) tasks with the goal of learning new tasks better than standard RL algorithms. Many-shot single-task methods (e.g., STACX, FRODO) meta-learn over windows of a single task, with no reset, with the goal of accelerating standard RL algorithms.]

[Table: few-shot meta-RL methods grouped by inner-loop parameterization.]

Parameterized policy gradients
- MAML-like: Finn et al. [55], Li et al. [124], Sung et al. [219], Vuorio et al. [235], Zintgraf [...]
- Distributional MAML[-like ...]
- Meta-gradient estimation: Foerster et al. [60], Al-Shedivat et al. [207], Stadie et al. [216], Liu et al. [133], Mao et al. [139], Fallah et al. [52], Tang [222], and Vuorio et al. [234]

Black box
- [Recurrent in]ner loop: Heess et al. [88], Duan et al. [46], Wang et al. [239], Humplik et al. [95], Fakoor et al. [51], Yan et al. [256], Zintgraf et al. [281], Liu et al. [130], and Zintgraf et al. [282]
- Attention: Mishra et al. [150], Fortunato et al. [62], Emukpere et al. [49], Ritter et al. [190], Wang et al. [240], and Melo [141]
- Hypernetworks: Xian et al. [250] and Beck et al. [17]

Task inference
- Multi-task pre-training: Humplik et al. [95], Kamienny et al. [104], Raileanu et al. [182], Liu et al. [130], and Peng et al. [174]
- Latent [variables] for [task inference]: Zhou et al. [278], Raileanu et al. [182], Zintgraf et al. [281], Zhang et al. [268], Zintgraf et al. [282], Beck et al. [17], He et al. [86], and Imagawa et al. [97]
- Contrastive learning: Fu et al. [64]

3 Few-Shot Meta-RL

[Consider a robot chef that must learn to coo]k in home kitchens. Training a new [policy from scratch for every kitchen would be far too slow; instead, when deployed in a new kitchen, the agent should quickly lear]n to cook in it. However, training such an agent with meta-RL involves unique [challenges specific to this] few-shot setting. Recall that meta-RL itself learns a learning algorithm $f_\theta$. This places unique [demands on how $f_\theta$ is parameterized:]

• Parameterized policy gradient methods build the structure of existing policy gradient [algorithms into the inner loop ...]
[...]

[Figure: PPG methods versus black box methods. PPG methods obtain inductive bias from the structure built into the inner loop, whereas black box methods obtain inductive bias from data; the figure relates this difference to generalization.]

[...] challenges. One such [challenge conce]rns supervision. In the standard meta-RL problem setting, rewards are available during both meta-[training and meta-testing. ... For ex]ample, it may be difficult to manually design an informative task distribution for meta-training [...]

3.1 Parameterized Policy Gradient Methods

Meta-RL learns a learning algorithm $f_\theta$, the inner loop. We call the parameterization of $f_\theta$ the [inner-loop structure. In this] section, we discuss one way of parameterizing the inner loop that builds in the structure of existing standard RL algorithms. Parameterized policy gradients (PPG) [methods structure the inner loop as a policy gradient update,]

$$\phi_{j+1} = f_\theta(\mathcal{D}_j, \phi_j) = \phi_j + \alpha_\theta \nabla_{\phi_j} \hat{J}_\theta(\mathcal{D}_j, \pi_{\phi_j}),$$

[or, more generally, with a preconditioned gradient,] $\phi_{j+1} = \phi_j + \alpha_\theta M_\theta \nabla_{\phi_j} \hat{J}_\theta(\mathcal{D}_j, \pi_{\phi_j})$ [255, 170, 58]. While a value-based method could be used [...] can be updated with back-propagation in a PPG method or by a neural network in a black box [method. ... Some PPG meth]ods learn a full distribution over initial policy parameters, $p(\phi_0)$ [82, 260, 242, 285, 73]. This [...] distribution fit via variational inference [82, 73]. Moreover, the distribution itself can be updated in the [inner loop ...]. [... Other methods adapt only the] weights and biases of the last layer of the policy [181], while leaving the rest of the parameters [...]. [... the policy can be context-]vector conditioned. In this case, the input to the policy itself parameterizes a [...]

Meta-gradient estimation in outer-loop optimization. Estimating gradients for the outer loop is [complicated by its dependence on the i]nner loop. Therefore, optimizing the outer loop requires taking the gradient of a gradient, or a meta-gradient, which involves [second-order derivatives ...] of data used by the inner loop on prior [...]ted by data in the outer loop. Still, these prior policies do affect the distribution of data sampled in $\mathcal{D}$, used later by the inner-loop learning algorithm. Thus, ignoring the gradient terms in the policy [... sampling distribution introduces bias ...]. [... Alternatively, a] method may use a first-order approximation [63], or use gradient-free optimization to opti[mize the outer loop ...]

Outer-loop algorithms. While most PPG methods use a policy-gradient algorithm in the outer [loop ...]. Additionally, one can train task-specific experts and then use these for imitation learning in the [... explora]tory behavior by optimizing Equation [...], they can [...] over PPG [...]
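As a concrete instance of the parameterized update $\phi_{j+1} = \phi_j + \alpha_\theta M_\theta \nabla_{\phi_j} \hat{J}_\theta$, the sketch below meta-learns both the policy initialization and a per-parameter step size (a diagonal $M_\theta$) applied over several inner-loop steps. As in the MAML sketch earlier, the return estimate is a toy differentiable surrogate, an assumption made only so the example runs standalone.

```python
# PPG-style inner loop with meta-learned per-parameter step sizes: several steps of
# phi <- phi + exp(log_alpha) * grad J_hat(phi), where both the initialization phi0 and
# log_alpha are meta-parameters. The quadratic j_hat is a toy stand-in (assumption) for
# a policy-gradient return estimate computed from collected trajectories.
import torch

torch.manual_seed(0)
dim, n_inner = 5, 3
phi0 = torch.zeros(dim, requires_grad=True)        # meta-learned initialization
log_alpha = torch.zeros(dim, requires_grad=True)   # meta-learned per-parameter step sizes
meta_opt = torch.optim.Adam([phi0, log_alpha], lr=1e-2)

def j_hat(phi, c):
    return -((phi - c) ** 2).sum()                 # toy return surrogate for task c

for meta_step in range(500):
    meta_opt.zero_grad()
    c = torch.randn(dim)                           # sample a task M_i ~ p(M)
    phi = phi0
    for _ in range(n_inner):                       # parameterized inner-loop updates
        grad = torch.autograd.grad(j_hat(phi, c), phi, create_graph=True)[0]
        phi = phi + torch.exp(log_alpha) * grad    # learned diagonal preconditioner
    (-j_hat(phi, c)).backward()                    # outer loop: post-adaptation returns
    meta_opt.step()
```

Keeping `create_graph=True` gives the full meta-gradient through all inner steps; dropping it (so the inner gradient is treated as a constant) would roughly correspond to the first-order approximation mentioned above.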
3.2 Black Box Methods

[In contrast, black box methods represent the inner loop with] a universal function approximator. This places fewer constraints on the function $f_\theta$ than with a [PPG method, where the update is constrain]ed by structure. By conditioning a policy on a context vector, all of the weights and biases of [the network] must generalize between all tasks. However, when significantly distinct policies are required for different tasks, [... it can be preferable to output the parameters of the poli]cy directly. The inner loop may produce all of the parameters of a feed[-forward policy ...]

Inner-loop representation. While many black box methods use recurrent neural networks [88, ...], [... others use atten]tion mecha[nisms ...]

Outer-loop algorithms. While many black box methods use on-policy algorithms in the outer loop [46, 239, 281], it is straightforward to use off-policy algorithms [185, 51, 130], which bring [...]

Black box trade-offs. One key benefit of black box methods is that they can rapidly alter their [policies ... but they] often struggle to generalize outside of $p(\mathcal{M})$ [..., 252]. Consider the robot chef: while it [...]. [... Even when us]ing a fully black-box method, the policy or inner loop can be fine-tuned with policy gradients at [meta-test time ...]

3.3 Task Inference Methods

[...] training for each task, with no planning required. In fact, training a policy over a distribution of tasks, with access to the true task, can be taken as the definition of multi-task RL [263]. In the [...]. [... Some metho]ds map the task directly to the weights [of the] policy [...]

Task inference with privileged information. A straightforward method for inferring the task is to [... predict the tas]k [description] $c_{\mathcal{M}}$ [...]

Task inference with multi-task training. Some research uses the multi-task setting to improve [... learning a representa]tion that encodes the task repre[sentation such that] it contains only this information [95, 130]. After this, $g_\theta(c_{\mathcal{M}})$ can be inferred in meta-learning [...]. [... multi-]task RL may be [...] is needed for the meta-RL policy to identify the task. In this case, instead of only inferring the [...]. [... Without suf]ficiently many exploratory [... interactions to identify] the task, sharing policies becomes less feasible. Often, intrinsic rewards are [...]

Task inference without privileged information. Other task inference methods do not rely on [privileged task information ...]. For instance, a task can be [represented by a latent variable of a model of the reward] or transition function [278, 281, 268, 280, 86]; and task inference can use contrastive learning [...]

[Figure: illustration of a trial of H episodes with free exploration in the first K episodes (yellow) followed by exploitation (white).]

[...] distribution using a variational information bottleneck [...]. On the other hand, training the [...]
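The common architectural pattern behind these task-inference methods can be sketched as follows: an encoder maps the context $\mathcal{D}$ of recent transitions to a task embedding, and the base policy conditions on that embedding. The module choices and dimensions below are illustrative assumptions; actual methods differ mainly in how the encoder is trained (privileged task labels, variational inference, or contrastive learning, as discussed above).

```python
# Sketch of a task-inference architecture: a permutation-invariant encoder maps the
# context D (a set of (s, a, r, s') transitions) to a latent task embedding, and the
# policy conditions on (state, embedding). All sizes and modules are illustrative.
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, latent_dim=8):
        super().__init__()
        in_dim = 2 * state_dim + action_dim + 1        # (s, a, r, s') flattened
        self.phi = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, context):                        # context: (num_transitions, in_dim)
        return self.phi(context).mean(dim=0)           # averaging makes the encoding order-invariant

class ContextConditionedPolicy(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim))

    def forward(self, state, task_embedding):
        return self.net(torch.cat([state, task_embedding], dim=-1))   # e.g. action means

encoder, policy = TaskEncoder(), ContextConditionedPolicy()
context = torch.randn(10, 2 * 4 + 2 + 1)               # ten transitions from the current task
z = encoder(context)                                   # inferred task representation
action = policy(torch.randn(4), z)
```

In the privileged-information setting, the embedding would additionally be trained to predict the task description $c_{\mathcal{M}}$ during meta-training; in the variational setting, it would be the mean of an inferred posterior regularized by an information bottleneck.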
3.4 Exploration and Meta-Exploration

[... Standard RL exploration] should work for any MDP and may consist of random on-policy exploration, epsilon-greedy ex[ploration, ...]. [Th]is type of exploration still occurs in the [...]; additionally, there exists exploration in the [...] (Zhou et al. [278], Gurumurthy et al. [83], Fu et al. [64], Liu et al. [130], and Zhang et al. [268]) [...] $p(\mathcal{M})$. To enable sample-efficient adaptation, [... exploration must be tailored to the task] distribution. Recall that in the few-shot adaptation setting, on each trial, the agent is placed into a new task [... and must focus] on solving the task in the next few episodes (i.e., over the $H-K$ episodes in Equation 3). [... The agent must balance exploration, po]tentially even beyond the initial few shots, with exploiting what it already knows to achieve high rewards. It is always optimal to explore in the first $K$ episodes, since no [rewards from these episodes count toward the objective. ... When $H-K$ is large, sacrif]icing short-term rewards to learn a better policy for higher later returns pays dividends, while when $H-K$ is small, the agent must exploit more to obtain any reward it can, optimally [...].

End-to-end optimization. Perhaps the simplest approach is to learn to explore and exploit end-to-end by directly maximizing the meta-RL objective (Equation 3), as done by black box meta-RL approaches [46, 239, 150, 216, 26]. Approaches in this category implicitly learn to explore, as they directly optimize the meta-RL objective, whose maximization requires exploration. More specifically, the returns in the later $H-K$ episodes, $\sum_{\tau \in \mathcal{D}_{K:H}} G(\tau)$, can only be maximized if the policy appropriately explores in the first $K$ episodes, so maximizing the meta-RL objective can yield optimal exploration in principle. This approach works well when complicated exploration strategies are not needed. For example, if attempting several tasks in the distribution of tasks is a reasonable form of exploration for a particular task distribution, then end-to-end optimization may work well. [... The robot chef is only rewarded for trying new] ingredients (i.e., explore) if doing so results in a cooked meal. Hence, it is challenging to learn [...]

Posterior sampling. To circumvent the challenge of implicitly learning to explore, Rakelly et al. [... explore via posterior sampling: the agent maintains a distribution over] what the identity of the task is, and then iteratively refines this distribution by interacting with [the environment ...] via [...] its initial position [...]. [... Related objectives reward lear]ning the dynamics and reward function [...], information gain over the task distribution [64, 130], or a reduction in uncertainty of the poste[rior ...]. [... a separate exploration policy explores for the] first $K$ episodes, and then the exploitation policy exploits for the remaining $H-K$ [episodes ...]. [... Some exploration strategies gather] information about the task dynamics but are irrelevant for a robot chef trying to cook a meal.

[Figure: toy comparison of exploration strategies across Episodes 1-3, with panels labelled Optimal, 0-shot, Posterior Sampling, and Irrelevant Exploration; the caption is truncated in extraction ("... [opt]imal exploration and posterior sampling. The third row ...").]

[...] considering that this intrinsic reward can be used to train a policy exclusively for off-policy data [...]. For example, using random network distillation [29], a reward may add an incentive for novelty [282], or add an incentive for getting data where the TD-error is high [77]. Many of these rewards [...]

3.5 Bayes-Adaptive Optimality

[... A Bayes-optimal policy does not eliminate all unce]rtainty. Instead, optimal exploration only reduces uncertainty [insofar as doing so increases expected return, since the ti]me for exploration is limited. Therefore, [... in this section we di]scuss [how to learn appro]ximate Bayes-optimal policies and analyze the behavior of [meta-RL agents through the lens of] Bayes-adaptive Markov decision processes. To determine the optimal exploration strategy, we [must account for the agent's uncertainty about the dynam]ics and reward function. From a high level, [a policy acting in] the Bayes-adaptive Markov deci[sion process] (BAMDP) maximizes returns when placed into an unknown MDP. Crucially, the dynamics of the [BAMDP include a belief state that] characterizes the current uncertainty as a distribution over potential [reward and transition functions. The belief $b_t$ is the posterior over $R$ and $P$ given the interac]tions $\tau_{:t} = (s_0, a_0, r_0, \ldots, s_t)$ so far, and the initial belief $b_0$ is a prior $p(R, P)$. Then, the states of [the BAMDP are augmented with the belief, and its reward and transition functions are defined in expectation under the current uncertain]ty. Specifically, the BAMDP reward [is]

$$R^+(s_t, b_t, a_t) = \mathbb{E}_{R \sim b_t}\big[R(s_t, a_t)\big], \qquad (4)$$

[and the transition dynamics of t]he BAMDP [are]

$$P^+(s_{t+1}, b_{t+1} \mid s_t, b_t, a_t) = \mathbb{E}_{R, P \sim b_t}\big[P(s_{t+1} \mid s_t, a_t)\, \delta\big(b_{t+1} = p(R, P \mid \tau_{:t+1})\big)\big]. \qquad (5)$$

That is, the reward received is the expected reward under the current belief, $r = R^+(s_t, b_t, a_t) = \mathbb{E}_{R \sim b_t}[R(s_t, a_t)]$.

Learning an approximate Bayes-optimal policy. Directly computing Bayes-optimal policies [is intractable in general ...], and the latent variables $m$ can be learned by rolling out the policy to obtain [...]. [... While it is generally intractable to lea]rn Bayes-adaptive optimal policies, the framework of BAMDPs can still offer a helpful [perspective]. First, black box meta-RL algorithms such as RL² learn a recurrent policy that not only conditions on the current state $s_t$, but on the history of observed states, actions, and rewards $\tau_{:t}$, [which is sufficient for co]mputing the belief state. [Hence, such] meta-RL algorithms can in principle learn Bayes-adaptive [optimal policies ...]. [In practice, these] meta-RL algorithms struggle to learn [them, because the optimization] is challenging. Liu et al. highlight one such optimization challenge for black box meta-RL [...], where the agent is given a few "free" episodes to explore, and the objective is to maximize the [retur]ns beginning from the first time step. These [... may] result in using less suitable utensils or ingredients, though, especially when optimized at lower [...]. [... invo]lve inter[actions that identify t]he current task, which is equivalent to the belief state. Then, [... such a poli]cy can be sufficient for optimally solving the meta-RL problem, even if it does not make use of all of this state [...]

3.6 Supervision

In this section, we discuss most of the different types of supervision considered in meta-RL. In [... some settings, additional supervision is available] (e.g., expert trajectories or other privileged information during meta-training and/or testing). Each of [...]

[Table: meta-RL supervision settings; the column headers visible in the extraction include "Meta-RL", "Meta-RL with [...]", and "Meta-RL via Imitation" (remainder truncated).]
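To ground the belief-state machinery of Section 3.5 (Equations 4 and 5), here is a tiny worked example for a two-armed Bernoulli bandit, where the unknown reward function has a conjugate Beta posterior so the belief update is exact. Acting by sampling a reward function from the belief and exploiting that sample is the posterior sampling discussed in Section 3.4. The bandit setting and all constants are illustrative assumptions, not an example taken from the survey.

```python
# Bayes-adaptive view of a 2-armed Bernoulli bandit. The unknown reward function R is a
# pair of success probabilities; the belief b_t is a Beta posterior per arm. Equation (4)
# becomes the expected reward under b_t, and the belief transition inside Equation (5)
# reduces to the conjugate Beta update below. Actions are chosen by posterior sampling.
import random

random.seed(0)
true_probs = [0.3, 0.8]                      # hidden reward function R
belief = [[1.0, 1.0], [1.0, 1.0]]            # b_t: Beta(alpha, beta) per arm; b_0 is uniform

def bamdp_reward(belief, action):
    """R+(s_t, b_t, a_t) = E_{R ~ b_t}[R(s_t, a_t)], i.e. Equation (4) for this bandit."""
    a, b = belief[action]
    return a / (a + b)

def update_belief(belief, action, reward):
    """Deterministic belief transition from Equation (5): condition b_t on the outcome."""
    new_belief = [arm[:] for arm in belief]
    new_belief[action][0] += reward          # alpha counts successes
    new_belief[action][1] += 1 - reward      # beta counts failures
    return new_belief

for t in range(50):
    # Posterior sampling: draw a plausible reward function from the belief, exploit it.
    sampled = [random.betavariate(a, b) for a, b in belief]
    action = max(range(2), key=lambda i: sampled[i])
    reward = 1 if random.random() < true_probs[action] else 0
    belief = update_belief(belief, action, reward)

print("expected rewards under final belief:", [round(bamdp_reward(belief, i), 2) for i in range(2)])
```

A Bayes-optimal policy would instead plan through the belief dynamics rather than act on a single posterior sample, which is exactly what makes Bayes-optimal behavior expensive to compute and why the approximations discussed in Section 3.5 are needed.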