Advanced Artificial Intelligence
Chapter 10: Reinforcement Learning

Outline
- Introduction
- The reinforcement learning model
- Dynamic programming
- Monte Carlo methods
- Temporal-difference learning
- Q-learning
- Function approximation in reinforcement learning
- Applications

Introduction

Humans typically learn from interaction with the external environment. Reinforcement learning is the learning of a mapping from environment states to actions, so as to maximize the cumulative reward the system receives from the environment. In reinforcement learning we design algorithms that turn the external environment into actions that maximize the amount of reward. The agent is not told directly what to do or which action to take; instead, it discovers for itself which actions yield the most reward by trying them. An action affects not only the immediate reward but also subsequent actions and the final reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Reinforcement learning techniques developed out of control theory, statistics, psychology, and related disciplines, and can be traced back to Pavlov's conditioning experiments. But it was not until the late 1980s and early 1990s that reinforcement learning was widely studied and applied in artificial intelligence, machine learning, and automatic control, where it came to be regarded as one of the core techniques for designing intelligent systems. Especially after breakthroughs in the mathematical foundations of reinforcement learning, research and applications have flourished, making it one of the current hot topics in machine learning.

The idea of reinforcement first came from psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal uncomfortable weakens its association with the situation and becomes harder to reproduce. In other words, which behavior is "remembered" and associated with a stimulus depends on the effect the behavior produces. An animal's trial-and-error learning has two aspects, selectional and associative, which correspond computationally to search and memory. Accordingly, in 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis; in the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence", after which it came into wide use. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

From 1953 to 1957 Bellman proposed an effective method for solving optimal control problems: dynamic programming. In 1957 Bellman also formulated the stochastic discrete-time version of the optimal control problem, the well-known Markov decision process (MDP), and in 1960 Howard proposed the policy iteration method for Markov decision processes. These became the theoretical foundations of modern reinforcement learning. In 1972 Klopf brought trial-and-error learning and temporal differences together, and from 1978 onward Sutton, Barto, Moore, and Klopf himself studied this combination in depth. In 1989 Watkins proposed Q-learning [Watkins 1989], tying the three threads of reinforcement learning together. In 1992 Tesauro successfully applied reinforcement learning to backgammon, producing TD-Gammon.

The Reinforcement Learning Model

The agent observes the environment state s_i, selects an action a_i, and receives a reward r_{i+1} as the environment moves to state s_{i+1}. The interaction unrolls as a trajectory s_0, a_0, s_1, a_1, s_2, a_2, s_3, ... (i: input, r: reward, s: state, a: action).
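The agent-environment loop above can be sketched in a few lines. The toy chain environment and the random policy below are illustrative assumptions, not part of the slides; the point is only the shape of the interaction (state in, action out, reward and next state back).

```python
import random

# Minimal sketch of the agent-environment interaction loop.
# The 4-state chain environment and the random policy are assumed
# for illustration only.

class ChainEnv:
    """Toy episodic environment: states 0..3, state 3 is terminal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action +1 moves right, -1 moves left (clipped at 0)
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 3 else 0.0   # reward only at the goal
        done = self.state == 3
        return self.state, reward, done

def random_policy(state):
    return random.choice([-1, +1])

env = ChainEnv()
s = env.reset()
total_reward = 0.0
while True:
    a = random_policy(s)          # agent picks action a_i in state s_i
    s, r, done = env.step(a)      # environment returns s_{i+1} and r_{i+1}
    total_reward += r
    if done:
        break
print(total_reward)  # 1.0: exactly one goal reward per episode
```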

Describing an Environment (Problem)
- Accessible vs. inaccessible
- Deterministic vs. non-deterministic
- Episodic vs. non-episodic
- Static vs. dynamic
- Discrete vs. continuous
The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The Reinforcement Learning Problem
- Agent-environment interaction: states, actions, rewards.
- To define a finite MDP:
  - state and action sets S and A;
  - one-step dynamics defined by transition probabilities (Markov property): P(s' | s, a) = Pr{s_{t+1} = s' | s_t = s, a_t = a};
  - reward probabilities: R(s, a, s') = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'].

Comparison with Supervised Learning
- Reinforcement learning: learn from interaction. The learner learns from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Training info = evaluations ("rewards"/"penalties"); the system maps inputs to outputs ("actions").
- Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of Reinforcement Learning
- Policy: a stochastic rule for selecting actions (it is my method).
- Return/Reward: the function of future rewards the agent tries to maximize (it is my goal).
- Value: what is good because it predicts reward (it is what I can get).
- Model of the environment: what follows what (it is unknown).

The Bellman Equation under a Policy π
The basic idea: the value of a state is the expected return from it,
V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s ].
So
V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ],
or, without the expectation operator,
V^π(s) = Σ_a π(s, a) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ],
where γ is the discount rate.

The Bellman Optimality Equation
V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ].

Markov Decision Process
A k-armed bandit gives immediate reward; an MDP adds delayed reward. Characteristics of an MDP:
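The Bellman equation above can be turned directly into iterative policy evaluation: sweep the expectation-form backup until the values stop changing. The tiny two-state MDP below is an illustrative assumption.

```python
import numpy as np

# Iterative policy evaluation for the Bellman equation
# V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')].
# The 2-state, 2-action MDP below uses assumed toy numbers.

n_states, n_actions = 2, 2
gamma = 0.9

P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.zeros((n_states, n_actions, n_states))   # R[s, a, s']
R[:, :, 1] = 1.0                                # reward for landing in state 1

pi = np.full((n_states, n_actions), 0.5)        # uniform random policy

V = np.zeros(n_states)
for _ in range(500):                            # sweep to (approximate) fixpoint
    Q = (P * (R + gamma * V)).sum(axis=2)       # Q[s,a] = sum_s' P (R + gamma V)
    V_new = (pi * Q).sum(axis=1)                # V[s] = sum_a pi(a|s) Q[s,a]
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```

The converged V satisfies the Bellman equation: re-applying the backup leaves it unchanged.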

- a set of states S;
- a set of actions A;
- a reward function R: S × A → ℝ;
- a state transition function T: S × A → Π(S), where T(s, a, s') is the probability of a transition from s to s' using action a.

MDP Example
(The slide shows a concrete transition function, the states and rewards, and the Bellman equation with greedy policy selection.)

MDP Graphical Representation
- Arcs between states are labeled with T(s, action, s').
- The representation is similar to Hidden Markov Models (HMMs).

Dynamic Programming: the Problem
- A discrete-time dynamic system.
- States 1, ..., n, plus a termination state 0.
- Controls U(i).
- Transition probabilities p_ij(u).
- An accumulative cost structure.
- Policies; finite-horizon and infinite-horizon problems; value iteration.

Dynamic Programming: Iterative Solution
- Policy iteration alternates policy evaluation with policy improvement ("greedification").
- Value iteration folds the improvement step into each backup.
(full-backup diagram omitted)

Adaptive Dynamic Programming (ADP)
- Idea: use the constraints (state transition probabilities) between states to speed learning.
- Solving the resulting system of equations = value determination using DP; there is no maximization over actions because the agent is passive, unlike in value iteration.
- Problem: large state spaces. E.g. backgammon: 10^50 equations in 10^50 variables.

Value Iteration Algorithm
- Repeat the full backup V(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ] over all states.
- An alternative iteration (Singh, 1993) is important for model-free learning.
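The value iteration algorithm just described can be sketched compactly with array operations; the small random MDP is an illustrative assumption, and the stopping threshold plays the role of ε in the criterion below.

```python
import numpy as np

# Value iteration sketch: V(s) <- max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')],
# stopping when successive value functions differ by less than epsilon.
# The small random MDP is an assumed example.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, epsilon = 4, 2, 0.95, 1e-8

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize rows to distributions
R = rng.random((n_states, n_actions, n_states))

V = np.zeros(n_states)
while True:
    Q = (P * (R + gamma * V)).sum(axis=2)  # one-step lookahead
    V_new = Q.max(axis=1)                  # greedy backup over actions
    if np.max(np.abs(V_new - V)) < epsilon:
        break
    V = V_new
policy = Q.argmax(axis=1)                  # greedy policy w.r.t. converged V
print(V, policy)
```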

- Stop iteration when V(s) differs by less than ε between sweeps.
- The resulting policy difference ratio is bounded by 2γε / (1 − γ) (Williams & Baird, 1993b).

Policy Iteration Algorithm
- Policies converge faster than values. Why the faster convergence? Many different value functions induce the same greedy policy, so the policy can stabilize long before the values do.

Reinforcement Learning in a Grid World
- Deterministic transitions vs. stochastic transitions: M^a_ij is the probability of reaching state j when taking action a in state i.
- A simple 4×3 environment presents the agent with a sequential decision problem: the agent starts in the lower-left corner; there are terminal cells with rewards +1 and −1; each move costs 0.04.
- This raises the (temporal) credit assignment problem and the sparse reinforcement problem.
- Offline algorithms: action sequences are determined ex ante. Online algorithms: action sequences are conditional on observations along the way, which is important in stochastic environments (e.g. flying a jet).
- Stochastic motion model M: with probability 0.8 the agent moves in the intended direction, with probability 0.1 perpendicular to the left, and 0.1 perpendicular to the right.
- A policy is a mapping from states to actions. The slide shows an optimal policy for this stochastic environment together with the utilities of the states: 0.812, 0.868, 0.912 along the top row next to the +1 cell; 0.762 and 0.660 in the middle row next to the −1 cell; 0.705, 0.655, 0.611, 0.388 along the bottom row.

Environment properties
- Observable (accessible): the percept identifies the state; otherwise the environment is partially observable.
- Markov property: transition probabilities depend only on the state, not on the path to the state. This gives a Markov decision problem (MDP).
- Partially observable MDP (POMDP): percepts do not carry enough information to identify the transition probabilities.

Model-Free Methods
- Models of the environment: T: S × A → Π(S) and R: S × A → ℝ. Do we know them? Do we have to know them?
- Monte Carlo methods; Adaptive Heuristic Critic; Q-learning.

Monte Carlo Policy Evaluation
- Goal: learn V^π(s) when P and R are unknown in advance.
- Given: some number of episodes under π which contain s.
- Idea: average the returns observed after visits to s.
- Every-visit MC: average the returns for every time s is visited in an episode.
- First-visit MC: average the returns only for the first time s is visited in an episode.
- Both converge asymptotically.

Monte Carlo Methods
- Idea: hold statistics about rewards for each state and take the average; this average is V(s).
- Based only on experience.
- Assumes episodic tasks: experience is divided into episodes, and all episodes terminate regardless of the actions selected.
- Incremental in an episode-by-episode sense, not a step-by-step sense.
- Problem: unvisited (state, action) pairs, i.e. the problem of maintaining exploration. For every (s, a), make sure that P((s, a) is selected as a start state and action) > 0 (the assumption of exploring starts).
(backup diagram omitted)

Monte Carlo Control
- How to select policies: similar to policy evaluation.
- MC policy iteration: policy evaluation using MC methods followed by policy improvement.
- Policy improvement step: greedify with respect to the value (or action-value) function.
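First-visit Monte Carlo policy evaluation as described above can be sketched as follows. The random-walk environment and the random policy are illustrative assumptions; since every episode here ends with a single reward of 1 and γ = 1, every visited state's estimated value converges to exactly 1.

```python
import random
from collections import defaultdict

# First-visit Monte Carlo policy evaluation: average the return observed
# after the FIRST visit to s in each episode. Environment and policy
# below are assumed for illustration.

gamma = 1.0

def run_episode():
    """Random walk on states 0..4; the episode ends at state 4 with reward 1."""
    s, trajectory = 2, []
    while s != 4:
        a = random.choice([-1, +1])
        s2 = max(0, s + a)
        r = 1.0 if s2 == 4 else 0.0
        trajectory.append((s, r))
        s = s2
    return trajectory

returns = defaultdict(list)
for _ in range(2000):
    episode = run_episode()
    # compute the return following each time step (gamma = 1 here)
    G, G_after = 0.0, []
    for _, r in reversed(episode):
        G = r + gamma * G
        G_after.append(G)
    G_after.reverse()
    seen = set()
    for (s, _), G_t in zip(episode, G_after):
        if s not in seen:               # first visit to s in this episode
            seen.add(s)
            returns[s].append(G_t)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)  # every visited state has value 1.0 in this toy task
```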

Temporal-Difference Learning
- Monte Carlo target: the actual return after time t. TD target: an estimate of the return.
- Idea: do ADP backups on a per-move basis, not for the whole state space.
- Theorem: the average value of U(i) converges to the correct value.
- Theorem: if α is appropriately decreased as a function of the number of times a state is visited (α = f(N_i)), then U(i) itself converges to the correct value.
(backup diagram omitted)

TD(λ): a Forward View
- TD(λ) averages all n-step backups, weighting the n-step backup by λ^{n−1} (time since visitation); the λ-return combines them, and the backup uses the λ-return.
- Idea: update from the whole epoch, not just from one state transition.
- Special cases: λ = 1 gives least-mean-square (LMS), i.e. Monte Carlo; λ = 0 gives one-step TD.
- An intermediate choice of λ (between 0 and 1) is best; there is interplay with α.

Convergence of TD(λ)
- Theorem: TD(λ) converges with probability 1 under certain boundary conditions, namely when α_i(t) is decreased such that Σ_t α_i(t) = ∞ and Σ_t α_i(t)² < ∞.
- In practice, a fixed α is often used for all i and t.
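The per-transition TD(0) backup (the λ = 0 special case above) can be sketched on the classic five-state random walk; the environment, the fixed α, and the seed are illustrative assumptions. The true values of the three interior states are 0.25, 0.5, and 0.75, and the estimates fluctuate around them because α is held fixed.

```python
import random

# Tabular TD(0): after each transition s -> s' with reward r,
# V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
# Random walk on states 0..4; 0 and 4 are terminal, reward 1 on
# reaching state 4. The setup is an assumed example.

random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.0] * 5                      # terminal states keep value 0

for _ in range(5000):
    s = 2
    while s not in (0, 4):
        s2 = s + random.choice([-1, +1])
        r = 1.0 if s2 == 4 else 0.0
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD(0) backup
        s = s2

print([round(v, 2) for v in V])  # roughly [0, 0.25, 0.5, 0.75, 0]
```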

Q-Learning (Watkins, 1989)
- Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees).
- Derive the estimated policy as an argument of the maximum of the estimated Q-function.
- Allow different parameter vectors at different time points.
- We illustrate the algorithm with linear regression as the approximator and, naturally, squared error as the loss function.
- A direct approach (ADP) would require learning a model. Q-learning does not; it performs this update after each state transition:
  Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ].

Exploration
- There is a tradeoff between exploitation (control) and exploration (identification). The extremes are greedy vs. random acting (n-armed bandit models).
- Q-learning converges to the optimal Q-values if:
  - every state is visited infinitely often (due to exploration);
  - the action selection becomes greedy as time approaches infinity; and
  - the learning rate α is decreased fast enough but not too fast (as discussed for TD learning).

Common Exploration Methods
- In value iteration in an ADP agent: an optimistic estimate of utility, U+(i).
- The ε-greedy method: with probability ε take a nongreedy action, otherwise the greedy action.
- Boltzmann exploration.
- An exploration function, e.g. f(u, n) = R+ if n < N, and u otherwise.

Q-Learning Algorithm
- Initialize the parameters; update them from each observed transition; the estimated policy is greedy with respect to the estimated Q-function.

What Is the Intuition?
- The Bellman equation gives Q(s, a) = E[ r + γ max_{a'} Q(s', a') ]. If the training set were infinite, Q-learning would minimize the squared deviation from this fixed point, which is equivalent to minimizing the Bellman error.
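Tabular Q-learning with ε-greedy exploration, as described above, can be sketched on a small deterministic corridor; the environment, parameters, and seed are illustrative assumptions.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
# Corridor on states 0..5, reward 1 on reaching state 5 (assumed example).

random.seed(0)
actions = [-1, +1]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(6) for a in actions}

def epsilon_greedy(s):
    qs = {a: Q[(s, a)] for a in actions}
    if random.random() < epsilon or qs[-1] == qs[+1]:
        return random.choice(actions)        # explore / break ties randomly
    return max(qs, key=qs.get)               # exploit the greedy action

for _ in range(500):
    s = 0
    while s != 5:
        a = epsilon_greedy(s)
        s2 = min(5, max(0, s + a))
        r = 1.0 if s2 == 5 else 0.0
        # off-policy target: greedy value of the successor state
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

greedy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(5)}
print(greedy)  # learned greedy policy: move right in every state
```

Note that the update uses max over next-state actions regardless of which action the exploring policy actually takes next; this is exactly what makes Q-learning off-policy.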

A-Learning (Murphy, 2003; Robins, 2004)
- Estimate the A-function (advantages) using some approximator, as in Q-learning.
- Derive the estimated policy as an argument of the maximum of the estimated A-function.
- Allow different parameter vectors at different time points.
- Again, we illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-Learning Algorithm (Inefficient Version)
- For each time point, fit the advantage model; the estimated policy satisfies the arg-max relation with respect to the estimated A-function.

Differences between Q-learning and A-learning
- Q-learning: at time t we model the main effects of the history (S_t, A_{t−1}), the action A_t, and their interaction. Our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}).
- A-learning: at time t we model only the effects of A_t and its interaction with (S_t, A_{t−1}). Our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).

Q-Learning vs. A-Learning
- The relative merits and demerits are not yet completely known.
- Q-learning has low variance but high bias; A-learning has high variance but low bias.
- Comparing Q-learning with A-learning therefore involves a bias-variance trade-off.
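The advantage quantity that A-learning models can be illustrated numerically: A(s, a) = Q(s, a) − V(s), where V is the policy-weighted average of Q. The Q-table and policy below are assumed toy values.

```python
import numpy as np

# Relationship between Q-values, state values, and advantages:
# A(s,a) = Q(s,a) - V(s), with V(s) = sum_a pi(a|s) Q(s,a).
# Q and pi below are assumed toy numbers.

Q = np.array([[1.0, 3.0],       # Q[s, a] for 2 states, 2 actions
              [2.0, 2.5]])
pi = np.array([[0.5, 0.5],      # action probabilities under the policy
               [0.2, 0.8]])

V = (pi * Q).sum(axis=1)        # V(s): policy-weighted average of Q
A = Q - V[:, None]              # advantage of each action over the policy

print(V)  # [2.0, 2.4]
print(A)  # [[-1.0, 1.0], [-0.4, 0.1]]
```

By construction the policy-weighted advantages sum to zero in each state, which is one reason advantages are a lower-variance target than raw Q-values.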

POMDP: Partially Observable Markov Decision Processes
- Rather than observing the state, we observe some function of the state: an observable function, i.e. a random variable for each state.
- Problem: different states may look similar, so the optimal strategy might need to consider the history.

Framework of a POMDP
A POMDP is defined by a six-tuple (S, A, P, R, Ω, O). Here (S, A, P, R) defines the underlying Markov decision model of the environment; Ω is the set of observations, i.e. the set of world states the system can perceive; and the observation function O: S × A → Π(Ω) gives, when the system takes action a and transitions to state s, a probability distribution over the possible observations, written O(s, a, o). Ω may be a subset of S, or unrelated to S.

POMDPs in Practice
- What if state information (from sensors) is noisy? This is mostly the case! MDP techniques are then suboptimal: two identical-looking halls are not the same state.

A Solution Strategy for POMDPs
- SE, a belief state estimator (which can be based on an HMM), followed by MDP techniques applied to the belief state.

POMDP: the Belief-State Method
- Idea: given a history of actions and observed values, compute a posterior distribution over the state we are in (the belief state).
- The belief-state MDP: its states are distributions over S (the states of the POMDP); its actions are as in the POMDP; its transition is the posterior distribution (given the observation).
- Open problem: how to deal with the continuous distribution?
(The learning process of the belief MDP is shown as a diagram on the slide.)

Major Methods to Solve POMDPs
Value-function learning:
- Memoryless policies: directly apply a standard reinforcement learning algorithm.
- Simple memory-based approaches: use the last k observations to represent the current state.
- UDM (Utile Distinction Memory): split states and build a finite-state-machine model.
- NSM (Nearest Sequence Memory): store state histories and apply a distance metric.
- USM (Utile Suffix Memory): combine the UDM and NSM approaches.
- Recurrent-Q: use a recurrent neural network for state prediction.
Policy search:
- Evolutionary algorithms: search policy space directly with genetic algorithms.
- Gradient ascent methods: search by gradient descent (ascent).
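The belief-state update at the heart of the method above is a Bayes filter: predict through the transition model, then reweight by the observation likelihood and normalize. The two-state model below is an illustrative assumption.

```python
import numpy as np

# Bayesian belief-state update for a POMDP:
# b'(s') ∝ O(o | s', a) * sum_s T(s' | s, a) * b(s).
# The 2-state model (a "listen" action that leaves the state unchanged)
# is assumed for illustration.

T = np.array([[1.0, 0.0],       # T[s, s'] for the listen action:
              [0.0, 1.0]])      # listening does not change the state
O = np.array([0.85, 0.15])      # O[s'] = P(observation | s', listen)

def belief_update(b, T, O):
    b_pred = T.T @ b            # predict: sum_s T(s'|s,a) b(s)
    b_new = O * b_pred          # correct with the observation likelihood
    return b_new / b_new.sum()  # normalize back to a distribution

b = np.array([0.5, 0.5])        # uniform prior over the two states
b = belief_update(b, T, O)
print(b)  # [0.85, 0.15] after one observation
```

Repeating the update with successive observations concentrates the belief, which is exactly the continuous-state process the belief MDP has to plan over.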

Function Approximation in Reinforcement Learning
- The RL component passes a subset of states, with value estimates as targets, to the FA component; the FA component generalizes the value function V(s) to the entire state space.
- One operator is the TD operator; the other is the function approximation operator (call it M).

Two Parallel Iterative Processes
- The value-function iteration process and the value-function approximation process run in parallel.
- How should the M function be constructed: using state clustering, interpolation, a decision tree, or a neural network?

Function Approximator
- V(s) = f(s, w), with weight vector w.
- Gradient-descent Sarsa update:
  w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w),
  where the bracketed term is the target value minus the estimated value, and ∇_w f is the standard gradient.
- Open problem: how to design a nonlinear FA system that converges with incremental instances?
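The gradient-descent Sarsa update above can be sketched for the linear case, where the gradient of Q(s, a) = w · φ(s, a) with respect to w is simply the feature vector φ(s, a). The one-hot feature encoding and the example transition are illustrative assumptions.

```python
import numpy as np

# Semi-gradient Sarsa with a linear approximator Q(s,a) = w . phi(s,a):
# w <- w + alpha * (r + gamma * Q(s',a') - Q(s,a)) * grad_w Q(s,a),
# and grad_w Q(s,a) = phi(s,a) in the linear case. Features and the
# sample transition below are assumed.

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9
w = np.zeros(n_states * n_actions)

def phi(s, a):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q(s, a):
    return w @ phi(s, a)

def sarsa_update(s, a, r, s2, a2):
    global w
    td_error = r + gamma * q(s2, a2) - q(s, a)
    w = w + alpha * td_error * phi(s, a)    # gradient step along phi(s,a)

sarsa_update(s=0, a=1, r=1.0, s2=1, a2=0)   # one example transition
print(q(0, 1))  # 0.1: alpha * td_error with all-zero initial weights
```

With one-hot features this reduces exactly to the tabular update; richer features are what give the generalization to the entire state space.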

Semi-MDPs
- Variants: discrete time with homogeneous discount; continuous time with discrete events and interval-dependent discount; discrete time with discrete events and interval-dependent discount.
- A discrete-time SMDP overlaid on an MDP can be analyzed at either level.
- This is one approach to temporal hierarchical RL. (The defining equations are given on the slide.)

Multi-agent MDPs
- Distributed RL; Markov games; best response.
- Several RL agents share one environment: each receives the state and a reward, and emits actions.

Three Viewpoints on Multi-agent RL
- Cooperative multi-agent RL. Problem space: distributed, homogeneous, cooperative environments. Main methods: exchanging states, exchanging experience, exchanging policies, exchanging advice. Criterion: speeding up learning convergence.
- Equilibrium-based multi-agent RL. Problem space: homogeneous or heterogeneous, cooperative or competitive environments. Main methods: minimax-Q, NASH-Q, CE-Q, WoLF. Criterion: rationality and convergence.
- Best-response multi-agent RL. Problem space: heterogeneous, competitive environments. Main methods: PHC, IGA, GIGA, GIGA-WoLF. Criterion: convergence and no-regret.

Markov Games
In a system of n agents, define a discrete state set S (i.e. the set of games G), action sets A_i with joint action set A = A_1 × ... × A_n, joint reward functions R_i: S × A_1 × ... × A_n → ℝ, and a state transition function P: S × A_1 × ... × A_n → PD(S).

Equilibrium-based Reinforcement Learning
- Open problem: is a Nash equilibrium (or some other equilibrium) enough? The optimal policy in a single-agent game is a Nash equilibrium.

Applications of RL
- Checkers [Samuel 59].
- TD-Gammon [Tesauro 92].
- The world's best down-peak elevator dispatcher [Crites et al 95].
- Inventory management [Bertsekas et al 95]: 10-15% better than the industry standard.
- Dynamic channel assignment [Singh & Bertsekas, Nie & Haykin 95]: outperforms the best heuristics in the literature.
- Cart-pole [Michie & Chambers 68-] with bang-bang control.
- Robotic manipulation [Grupen et al. 93-].
- Path planning; robot docking [Lin 93]; parking.
- Football [Stone 98]; Tetris.
- Multiagent RL [Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch 95-].

- Multiagent RL (continued): lots of work since.
- Combinatorial optimization: maintenance and repair.
- Control of reasoning [Zhang & Dietterich IJCAI-95].

Application: Simulated Robot Soccer

The Q-learning algorithm was applied to 2-on-1 training in simulated robot soccer; the purpose of the training is for the agent to acquire a strategic awareness that lets it cooperate during an attack [宋志伟, 2003].

Scenario: forward A has the ball inside the shooting area, but A no longer has a shooting angle; teammate B is also in the shooting area and has a good shooting angle. If A passes to B and B takes the shot, the attacking combination succeeds. Through Q-learning on this 2-on-1 shooting drill, A learns that in this state passing to B is the optimal action. The agent acquires the policy through a large amount of training (a large number of states, with repeated visits to the same states), which makes the approach more adaptive.

State description: the attacking penalty area is divided into small square regions of side 2 m, so a two-dimensional array can describe the area. The positions of the three agents describe the environment state of the 2-on-1 attack, and the division shown in Fig. 10.11 is used to generalize the state: agents within the same strategic region are considered to be in similar states. This state description is not exact, but what the design requires is a strategic-level description, and agents can be assumed to move actively within a strategic region, so the method meets the requirement. A particular state is thus described by a triple of region numbers: the region number of attacker A, the region number of attacker B, and the region number of the goalkeeper, computed by the region-numbering formula given on the slide; the stored state value is this triple of region numbers. (Fig. 10.11: the division of positions inside the attacking penalty area.)

The action set available to A is fixed; B's policy is obtained by learning from probability-based shooting training. That policy is to always dribble toward the region that is least threatened and has the highest shooting success rate. To realize this strategic goal, the attacking area is divided into several strategic regions, shooting is evaluated in each region, and each region's shooting success rate is recorded. The Pass policy itself is very simple.
