第10章_强化学习PPT课件

上传人：伊*** IP属地：上海上传时间：2022-05-06 格式：PPTX 页数：90 大小：1.54MB 积分：20 举报 版权申诉

已阅读5页，还剩85页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、2021-11-17强化学习史忠植1内容提要内容提要引言强化学习模型动态规划蒙特卡罗方法时序差分学习 Q学习强化学习中的函数估计应用第1页/共90页2021-11-17强化学习史忠植2引言引言人类通常从与外界环境的交互中学习。所谓强化（reinforcement）学习是指从环境状态到行为映射的学习，以使系统行为从环境中获得的累积奖励值最大。在强化学习中，我们设计算法来把外界环境转化为最大化奖励量的方式的动作。我们并没有直接告诉主体要做什么或者要采取哪个动作,而是主体通过看哪个动作得到了最多的奖励来自己发现。主体的动作的影响不只是立即得到的奖励，而且还影响接下来的动作和最终

2、的奖励。试错搜索(trial-and-error search)和延期强化(delayed reinforcement)这两个特性是强化学习中两个最重要的特性。第2页/共90页2021-11-17强化学习史忠植3引言引言强化学习技术是从控制理论、统计学、心理学等相关学科发展而来，最早可以追溯到巴甫洛夫的条件反射实验。但直到上世纪八十年代末、九十年代初强化学习技术才在人工智能、机器学习和自动控制等领域中得到广泛研究和应用，并被认为是设计智能系统的核心技术之一。特别是随着强化学习的数学基础研究取得突破性进展后，对强化学习的研究和应用日益开展起来，成为目前机器学习领域的研究热点之一。第3页/

3、共90页2021-11-17强化学习史忠植4引言强化思想最先来源于心理学的研究。1911年Thorndike提出了效果律（Law of Effect）：一定情景下让动物感到舒服的行为，就会与此情景增强联系（强化），当此情景再现时，动物的这种行为也更易再现；相反，让动物感觉不舒服的行为，会减弱与情景的联系，此情景再现时，此行为将很难再现。换个说法，哪种行为会“记住”，会与刺激建立联系，取决于行为产生的效果。动物的试错学习,包含两个含义：选择（selectional）和联系（associative），对应计算上的搜索和记忆。所以，1954年，Minsky在他的博士论文中实现了计算上的试错学习

4、。同年，Farley和Clark也在计算上对它进行了研究。强化学习一词最早出现于科技文献是1961年Minsky 的论文“Steps Toward Artificial Intelligence”，此后开始广泛使用。1969年， Minsky因在人工智能方面的贡献而获得计算机图灵奖。第4页/共90页2021-11-17强化学习史忠植5引言 1953到1957年，Bellman提出了求解最优控制问题的一个有效方法：动态规划（dynamic programming） Bellman于 1957年还提出了最优控制问题的随机离散版本，就是著名的马尔可夫决策过程（MDP, Markov decisio

5、n processe），1960年Howard提出马尔可夫决策过程的策略迭代方法，这些都成为现代强化学习的理论基础。 1972年，Klopf把试错学习和时序差分结合在一起。1978年开始，Sutton、Barto、 Moore，包括Klopf等对这两者结合开始进行深入研究。 1989年Watkins提出了Q-学习Watkins 1989，也把强化学习的三条主线扭在了一起。 1992年，Tesauro用强化学习成功了应用到西洋双陆棋（backgammon）中，称为TD-Gammon 。第5页/共90页2021-11-17强化学习史忠植6内容提要内容提要引言强化学习模型动态规划蒙特卡罗方

6、法时序差分学习 Q学习强化学习中的函数估计应用第6页/共90页2021-11-17强化学习史忠植7强化学习模型i: inputr: reward s: statea: action状态 sisi+1ri+1奖励 ri动作动作 aia0a1a2s0s1s2s3第7页/共90页2021-11-17强化学习史忠植8描述一个环境（问题） Accessible vs. inaccessible Deterministic vs. non-deterministic Episodic vs. non-episodic Static vs. dynamic Discrete vs. continu

7、ousThe most complex general class of environments are inaccessible, non-deterministic, non-episodic, dynamic, and continuous.第8页/共90页2021-11-17强化学习史忠植9强化学习问题 Agent-environment interaction States, Actions, Rewards To define a finite MDP state and action sets : S and A one-step “dynamics” defined by

8、transition probabilities (Markov Property): reward probabilities: EnvironmentactionstaterewardRLAgent1Pr, for all ,( ).asstttPssss aas sS aA s11, for all ,( ).assttttRE rss aa sss sS aA s第9页/共90页2021-11-17强化学习史忠植10与监督学习对比 Reinforcement Learning Learn from interaction learn from its own experience,

9、and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.RLSystemInputsOutputs (“actions”)Training Info = evaluations (“rewards” / “penalties”)lSupervised Learning Learn from ex

10、amples provided by a knowledgable external supervisor.第10页/共90页2021-11-17强化学习史忠植11强化学习要素 Policy: stochastic rule for selecting actions Return/Reward: the function of future rewards agent tries to maximize Value: what is good because it predicts reward Model: what follows whatPolicyRewardValueModel

11、ofenvironmentIs unknownIs my goalIs I can getIs my method第11页/共90页2021-11-17强化学习史忠植12在策略下的Bellman公式1t1t4t23t2t1t4t33t22t1ttRrrrrrrrrrRThe basic idea: So: sssVrEssRE)s(Vt1t1tttOr, without the expectation operator: asassass)s(VRP)a, s()s(V is the discount rate第12页/共90页2021-11-17强化学习史忠植13BellmanBellm

12、an最优策略公式最优策略公式11( )( )( )max(),max( )tttta A saassssa A ssVsE rVsss aaPRVs其中：V*：状态值映射S：环境状态R：奖励函数P：状态转移概率函数：折扣因子第13页/共90页2021-11-17强化学习史忠植14马尔可夫决策过程马尔可夫决策过程 MARKOV DECISION PROCESS 由四元组定义。环境状态集S 系统行为集合A 奖励函数R：SA 状态转移函数P：SAPD（S）记R（s，a，s）为系统在状态s采用a动作使环境状态转移到s获得的瞬时奖励值；记P（s，a，s）为系统在状态s采用a动作使环境状态转

13、移到s的概率。第14页/共90页2021-11-17强化学习史忠植15马尔可夫决策过程马尔可夫决策过程 MARKOV DECISION PROCESS 马尔可夫决策过程的本质是：当前状态向下一状态转移的概率和奖励值只取决于当前状态和选择的动作，而与历史状态和历史动作无关。因此在已知状态转移概率函数P和奖励函数R的环境模型知识下，可以采用动态规划技术求解最优策略。而强化学习着重研究在P函数和R函数未知的情况下，系统如何学习最优行为策略。第15页/共90页2021-11-17强化学习史忠植16MARKOV DECISION PROCESSCharacteristics of MDP:a set

14、 of states : Sa set of actions : Aa reward function :R : S x A RA state transition function:T: S x A ( S) T (s,a,s): probability of transition from s to s using action a第16页/共90页2021-11-17强化学习史忠植17马尔可夫决策过程马尔可夫决策过程 MARKOV DECISION PROCESS第17页/共90页2021-11-17强化学习史忠植18MDP EXAMPLE:TransitionfunctionSta

15、tes and rewardsBellman Equation:(Greedy policy selection)第18页/共90页2021-11-17强化学习史忠植19MDP Graphical Representation, : T (s, action, s )Similarity to Hidden Markov Models (HMMs)第19页/共90页2021-11-17强化学习史忠植20Reinforcement Learning Deterministic transitionsStochastic transitionsaijMis the probability to

16、 reaching state j when taking action a in state istart3211234+1-1A simple environment that presents the agent with a sequential decision problem:Move cost = 0.04(Temporal) credit assignment problem sparse reinforcement problemOffline alg: action sequences determined ex anteOnline alg: action sequenc

17、es is conditional on observations along the way; Important in stochastic environment (e.g. jet flying)第20页/共90页2021-11-17强化学习史忠植21Reinforcement Learning M = 0.8 in direction you want to go 0.2 in perpendicular 0.1 left0.1 rightPolicy: mapping from states to actions3211234+1-10.7053211234+1-1 0.8120

18、.762 0.868 0.912 0.660 0.655 0.611 0.388An optimal policy for the stochastic environment:utilities of states:EnvironmentObservable (accessible): percept identifies the statePartially observableMarkov property: Transition probabilities depend on state only, not on the path to the state.Markov decisio

19、n problem (MDP).Partially observable MDP (POMDP): percepts does not have enough info to identify transition probabilities.第21页/共90页2021-11-17强化学习史忠植22动态规划动态规划Dynamic Programming 动态规划(dynamic programming)的方法通过从后继状态回溯到前驱状态来计算赋值函数。动态规划的方法基于下一个状态分布的模型来接连的更新状态。强化学习的动态规划的方法是基于这样一个事实：对任何策略和任何状态s，有(10.9)式迭

20、代的一致的等式成立)()()|()|()(sVssRasssasVaas(as)是给定在随机策略下状态s时动作a的概率。(ssa)是在动作a下状态s转到状态s的概率。这就是对V的Bellman(1957)等式。第22页/共90页2021-11-17强化学习史忠植23动态规划动态规划Dynamic Programming - Problem A discrete-time dynamic system States 1, , n + termination state 0 Control U(i) Transition Probability pij(u) Accumulative cost

21、structure Policies10),(juirk,10)()|(1ipiijipkijkk第23页/共90页2021-11-17强化学习史忠植24lFinite Horizon ProblemlInfinite Horizon ProblemlValue Iteration动态规划动态规划Dynamic Programming Iterative Solution iiiiiriGEiVNkkkkkkNNN0101),(,()()(iiiiirEiVNkkkkkkn0101),(,(lim)(nijVjuirupiVTnjijiUu, 1,)(),()(max)(0)()()(1iV

22、TiVkk)(min)(*iViV第24页/共90页2021-11-17强化学习史忠植25动态规划中的策略迭代动态规划中的策略迭代/ /值迭值迭代代11( )argmax( )( )( , )( )( )max( )aassssasaaksssskasaaksssskassPRVsVss aPRVsVsPRVs*1010VVV policy evaluationpolicy improvement“greedification”Policy IterationValue Iteration第25页/共90页2021-11-17强化学习史忠植26动态规划方法动态规划方法11( )()tttV

23、 sErV sTTTTstrt1st1TTTTTTTTT第26页/共90页2021-11-17强化学习史忠植27自适应动态规划自适应动态规划(ADP)Idea: use the constraints (state transition probabilities) between states to speed learning.Solve jijjUMiRiU)()()(= value determination.No maximization over actions because agent is passive unlike in value iteration.using DP

24、Large state spacee.g. Backgammon: 1050 equations in 1050 variables第27页/共90页2021-11-17强化学习史忠植28Value Iteration AlgorithmAN ALTERNATIVE ITERATION: (Singh,1993)(Important for model free learning)Stop Iteration when V(s) differs less than .Policy difference ratio = 2 / (1- ) ( Williams & Baird 1993

25、b)第28页/共90页2021-11-17强化学习史忠植29Policy Iteration Algorithm Policies converge faster than values.Why faster convergence? 第29页/共90页2021-11-17强化学习史忠植30动态规划动态规划Dynamic Programming 典型的动态规划模型作用有限，很多问题很难给出环境的完整模型。仿真机器人足球就是这样的问题，可以采用实时动态规划方法解决这个问题。在实时动态规划中不需要事先给出环境模型，而是在真实的环境中不断测试，得到环境模型。可以采用反传神经网络实现对状态泛化，网

26、络的输入单元是环境的状态s, 网络的输出是对该状态的评价V(s)。第30页/共90页2021-11-17强化学习史忠植31没有模型的方法没有模型的方法Model Free MethodsModels of the environment:T: S x A ( S) and R : S x A RDo we know them? Do we have to know them? Monte Carlo Methods Adaptive Heuristic Critic Q Learning第31页/共90页2021-11-17强化学习史忠植32蒙特卡罗方法蒙特卡罗方法 Monte Carlo

27、 Methods 蒙特卡罗方法不需要一个完整的模型。而是它们对状态的整个轨道进行抽样，基于抽样点的最终结果来更新赋值函数。蒙特卡罗方法不需要经验，即从与环境联机的或者模拟的交互中抽样状态、动作和奖励的序列。联机的经验是令人感兴趣的，因为它不需要环境的先验知识，却仍然可以是最优的。从模拟的经验中学习功能也很强大。它需要一个模型，但它可以是生成的而不是分析的，即一个模型可以生成轨道却不能计算明确的概率。于是，它不需要产生在动态规划中要求的所有可能转变的完整的概率分布。第32页/共90页2021-11-17强化学习史忠植33Monte Carlo方法. state following return

28、 actual the is where)()()(ttttttsRsVRsVsVTTTTTTTTTTstTTTTTTTTTT第33页/共90页2021-11-17强化学习史忠植34蒙特卡罗方法蒙特卡罗方法 Monte Carlo Methods Idea: Hold statistics about rewards for each state Take the average This is the V(s) Based only on experience Assumes episodic tasks (Experience is divided into episodes and a

29、ll episodes will terminate regardless of the actions selected.) Incremental in episode-by-episode sense not step-by-step sense. 第34页/共90页2021-11-17强化学习史忠植35Monte Carlo策略策略评价评价lGoal: learn V (s) under P and R are unknown in advancelGiven: some number of episodes under which contain slIdea: Average r

30、eturns observed after visits to slEvery-Visit MC: average returns for every time s is visited in an episodelFirst-visit MC: average returns only for first time s is visited in an episodelBoth converge asymptotically12345第35页/共90页2021-11-17强化学习史忠植36Problem: Unvisited pairs(problem of maintaining exp

31、loration)For every make sure that:P( selected as a start state and action) 0 (Assumption of exploring starts ) 蒙特卡罗方法蒙特卡罗方法第36页/共90页2021-11-17强化学习史忠植37蒙特卡罗控制蒙特卡罗控制How to select Policies:(Similar to policy evaluation) MC policy iteration: Policy evaluation using MC methods followed by policy improv

32、ement Policy improvement step: greedify with respect to value (or action-value) function第37页/共90页2021-11-17强化学习史忠植38时序差分学习时序差分学习 Temporal-Difference时序差分学习中没有环境模型，根据经验学习。每步进行迭代，不需要等任务完成。预测模型的控制算法，根据历史信息判断将来的输入和输出，强调模型的函数而非模型的结构。时序差分方法和蒙特卡罗方法类似，仍然采样一次学习循环中获得的瞬时奖惩反馈，但同时类似与动态规划方法采用自举方法估计状态的值函数。然后通过多次迭代

33、学习，去逼近真实的状态值函数。第38页/共90页2021-11-17强化学习史忠植39时序差分学习时序差分学习 TD)()()()(11tttttsVsVrsVsVTTTTTTTTTTst1rt1stTTTTTTTTTT第39页/共90页2021-11-17强化学习史忠植40时序差分学习时序差分学习 Temporal-Difference)()()(:methodCarlo Monte visit-every SimplettttsVRsVsV )()()()(:TD(0) method, TD simplest The11tttttsVsVrsVsVtarget: the actual

34、return after time ttarget: an estimate of the return第40页/共90页2021-11-17强化学习史忠植41时序差分学习时序差分学习 (TD)Idea: Do ADP backups on a per move basis, not for the whole state space.)()()()()(iUjUiRiUiUTheorem: Average value of U(i) converges to the correct value.Theorem: If is appropriately decreased as a func

35、tion of times a state is visited (=Ni), then U(i) itself converges to the correct value第41页/共90页2021-11-17强化学习史忠植42TD(l l) A Forward View TD(l) is a method for averaging all n-step backups weight by ln-1 (time since visitation) l-return: Backup using l-return:111)(1)(11)1 ()1 (tTnttTntnntnntRRRRlll

36、lll)()(tttttsVRsVl第42页/共90页2021-11-17强化学习史忠植43时序差分学习算法时序差分学习算法 TD(l l)Idea: update from the whole epoch, not just on state transition.kmmmmkmiUiUiRiUiU)()()()()(1lSpecial cases:l=1: Least-mean-square (LMS), Mont Carlol=0: TDIntermediate choice of l (between 0 and 1) is best. Interplay with 第43页/共90

37、页2021-11-17强化学习史忠植44时序差分学习算法时序差分学习算法 TD(l l)算法算法 10.1 TD(0)学习算法Initialize V(s) arbitrarily, to the policy to be evaluatedRepeat (for each episode) Initialize s Repeat (for each step of episode) Choose a from s using policy derived from V(e.g., -greedy) Take action a, observer r, s Until s is termin

38、al V sV srV sV sss 第44页/共90页2021-11-17强化学习史忠植45时序差分学习算法第45页/共90页2021-11-17强化学习史忠植46时序差分学习算法收敛性TD(l l)Theorem: Converges w.p. 1 under certain boundaries conditions.Decrease i(t) s.t. )()(2tttitiIn practice, often a fixed is used for all i and t.第46页/共90页2021-11-17强化学习史忠植47时序差分学习 TD第47页/共90页2021-11

39、-17强化学习史忠植48Q-learningWatkins, 1989在Q学习中，回溯从动作结点开始，最大化下一个状态的所有可能动作和它们的奖励。在完全递归定义的Q学习中，回溯树的底部结点一个从根结点开始的动作和它们的后继动作的奖励的序列可以到达的所有终端结点。联机的Q学习，从可能的动作向前扩展，不需要建立一个完全的世界模型。Q学习还可以脱机执行。我们可以看到，Q学习是一种时序差分的方法。第48页/共90页2021-11-17强化学习史忠植49Q-learning在Q学习中，Q是状态-动作对到学习到的值的一个函数。对所有的状态和动作： Q: (state x action) value 对

40、Q学习中的一步：),(),(MAX),()1 (),(11tttatttttasQasQrcasQcasQ (10.15)其中c和都1，rt+1是状态st+1的奖励。第49页/共90页2021-11-17强化学习史忠植50Q-Learning Estimate the Q-function using some approximator (for example, linear regression or neural networks or decision trees etc.). Derive the estimated policy as an argument of the ma

41、ximum of the estimated Q-function. Allow different parameter vectors at different time points. Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function.第50页/共90页2021-11-17强化学习史忠植51Q-learningQ (a,i),(max)(iaQiUajaaijjaQ

42、MiRiaQ), (max)(),(Direct approach (ADP) would require learning a model .Q-learning does not:aijM),(), (max)(),(),(iaQjaQiRiaQiaQaDo this update after each state transition:第51页/共90页2021-11-17强化学习史忠植52ExplorationTradeoff between exploitation (control) and exploration (identification) Extremes: greed

43、y vs. random acting(n-armed bandit models)Q-learning converges to optimal Q-values if* Every state is visited infinitely often (due to exploration),* The action selection becomes greedy as time approaches infinity, and* The learning rate is decreased fast enough but not too fast (as we discussed in

44、TD learning)第52页/共90页2021-11-17强化学习史忠植53Common exploration methods)(sA1.In value iteration in an ADP agent: Optimistic estimate of utility U+(i)2.-greedy methodNongreedy actions Greedy action3.Boltzmann explorationjaijaiaNjUMfiRiU),(),(max)()(Exploration func),(nufR+ if nNu o.w.aTjUMTjUMajaijjaijee

45、P)()(*)(1sA第53页/共90页2021-11-17强化学习史忠植54Q-Learning AlgorithmQ学习算法 Initialize Q(s,a) arbitrarily Repeat (for each episode) Initialize s Repeat (for each step of episode) Choose a from s using policy derived from Q (e.g., -greedy) Take action a, observer r, s Until s is terminal,max,aQ s aQ s arQ s aQ

46、 s ass第54页/共90页2021-11-17强化学习史忠植55Q-Learning Algorithm Set For The estimated policy satisfies. 01TQ.);,(1minarg).;,(max, 0,.,1,21,11111niitittitttttttattQYnaQrYTTttASAS.),;,(maxarg),(1,tQttttatttQtasas第55页/共90页2021-11-17强化学习史忠植56What is the intuition? Bellman equation gives If and the training set

47、 were infinite, then Q-learning minimizes which is equivalent to minimizingttttttattttaQrEQtASAS,| ),(max),(11*1*1AS*11ttQQ2*);,(ttttQQEAS211*1);,(),(max1tttttttatQaQrEtASAS第56页/共90页2021-11-17强化学习史忠植57A-Learning Murphy, 2003 and Robins, 2004 Estimate the A-function (advantages) using some approxima

48、tor, as in Q-learning. Derive the estimated policy as an argument of the maximum of the estimated A-function. Allow different parameter vectors at different time points. Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss f

49、unction.第57页/共90页2021-11-17强化学习史忠植58A-Learning Algorithm (Inefficient Version) For The estimated policy satisfies);,(max);,();,(.),|();,(1minarg).;,(, 0,.,1,1,21, 1,1tttttattttttttniitittitittittTtjjjjjTtjjtaEYnrYTTttASASASASASAS.),;,(maxarg),(1,tttttatttAtasas第58页/共90页2021-11-17强化学习史忠植59Differenc

50、es between Q and A-learning Q-learning At time t we model the main effects of the history, (St,At-1) and the action At and their interaction Our Yt-1 is affected by how we modeled the main effect of the history in time t, (St,At-1) A-learning At time t we only model the effects of At and its interac

51、tion with (St,At-1) Our Yt-1 does not depend on a model of the main effect of the history in time t, (St,At-1) 第59页/共90页2021-11-17强化学习史忠植60Q-Learning Vs. A-Learning Relative merits and demerits are not completely known till now. Q-learning has low variance but high bias. A-learning has high varianc

52、e but low bias. Comparison of Q-learning with A-learning involves a bias-variance trade-off.第60页/共90页2021-11-17强化学习史忠植61POMDP部分感知马氏决策过程 Rather than observing the state we observe some function of the state. Ob Observable functiona random variable for each states. Problem : different states may look

53、 similarThe optimal strategy might need to consider the history.第61页/共90页2021-11-17强化学习史忠植62Framework of POMDPPOMDP由六元组定义。其中定义了环境潜在的马尔可夫决策模型上，是观察的集合，即系统可以感知的世界状态集合，观察函数：SAPD（）。系统在采取动作a转移到状态s时，观察函数确定其在可能观察上的概率分布。记为（s, a, o）。第62页/共90页2021-11-17强化学习史忠植63POMDPsWhat if state information (from sensors)

54、is noisy?Mostly the case!MDP techniques are suboptimal!Two halls are not the same.第63页/共90页2021-11-17强化学习史忠植64POMDPs A Solution StrategySE: Belief State Estimator (Can be based on HMM): MDP Techniques第64页/共90页2021-11-17强化学习史忠植65POMDP_信度状态方法 Idea: Given a history of actions and observable value, we

55、 compute a posterior distribution for the state we are in (belief state) The belief-state MDP States: distribution over S (states of the POMDP) Actions: as in POMDP Transition: the posterior distribution (given the observation)Open Problem : How to deal with the continuous distribution? 第65页/共90页202

56、1-11-17强化学习史忠植66The Learning Process of Belief MDP , , ,Pr, ,Pr,s SO s a oP s a s b sb ss o a bo a b, ,Pr, ,Pr,o Ob a bb a b oo a b ,s Sb ab s R s a第66页/共90页2021-11-17强化学习史忠植67Major Methods to Solve POMDP 算法名称基本思想学习值函数Memoryless policies直接采用标准的强化学习算法直接采用标准的强化学习算法Simple memory based approaches使用使用k

57、个历史观察表示当前状态个历史观察表示当前状态UDM(Utile Distinction Memory)分解状态，构建有限状态机模型分解状态，构建有限状态机模型NSM(Nearest Sequence Memory)存储状态历史，进行距离度量存储状态历史，进行距离度量USM(Utile Suffix Memory)综合综合UDM和和NSM两种方法两种方法Recurrent-Q使用循环神经网络进行状态预测使用循环神经网络进行状态预测策略搜索Evolutionary algorithms使用遗传算法直接进行策略搜索使用遗传算法直接进行策略搜索Gradient ascent method使用梯度下降（

58、上升）法搜索使用梯度下降（上升）法搜索第67页/共90页2021-11-17强化学习史忠植68强化学习中的函数估计强化学习中的函数估计RLFASubset of statesValue estimate as targetsV (s)Generalization of the value function to the entire state space00000,V M VM VMM VMM Vis the TD operator.Mis the function approximation operator.第68页/共90页2021-11-17强化学习史忠植69并行两个迭代过程并行

59、两个迭代过程值函数迭代过程值函数逼近过程,1, ,max,aQ s aVs ar s a sVs a,Vs aM Q s aHow to construct the M function? Using state cluster, interpolation, decision tree or neural network?第69页/共90页2021-11-17强化学习史忠植70lFunction Approximator: V( s) = f( s, w)lUpdate: Gradient-descent Sarsa: w w + rt+1 + Q(st+1,at+1)- Q(st,a

60、t) w f(st,at,w)weight vectorStandard gradienttarget valueestimated valueOpen Problem : How to design the non-liner FA system which can converge with the incremental instances? 并行两个迭代过程并行两个迭代过程第70页/共90页2021-11-17强化学习史忠植71Semi-MDPDiscrete timeHomogeneous discountContinuous timeDiscrete eventsInterval-dependent discountDiscrete t

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

第10章_强化学习PPT课件

文档简介

温馨提示

最新文档

评论

第10章_强化学习PPT课件

文档简介

温馨提示

最新文档

评论

相关文档