




Finite-Sample Analysis for SARSA with Linear Function Approximation

Shaofeng Zou
Department of Electrical Engineering
University at Buffalo, The State University of New York
Buffalo, NY 14228

Tengyu Xu
Department of ECE
The Ohio State University
Columbus, OH 43210

Yingbin Liang
Department of ECE
The Ohio State University
Columbus, OH 43210

Abstract

SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under non-i.i.d. data, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically [28, 23]. However, its non-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d. samples and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithms, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we provide the finite-sample analysis on the mean square error of the SARSA algorithm. We then further study a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant in [28] as special cases. This fitted SARSA algorithm provides a more general framework for iterative on-policy fitted policy iteration, which is more memory and computationally efficient. For this fitted SARSA algorithm, we also provide its finite-sample analysis.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction

SARSA, originally proposed in [31], is an on-policy reinforcement learning algorithm, which continuously updates the behavior policy towards attaining as large an accumulated reward as possible over time. Specifically, SARSA is initialized with a state and a policy. At each time instance, it takes an action based on the current policy, observes the next state, and receives a reward. Using the newly observed information, it first updates the estimate of the action-value function, and then improves the behavior policy by applying a policy improvement operator, e.g., $\epsilon$-greedy, to the estimated action-value function. Such a process is taken iteratively until it converges (see Algorithm 1 for a precise description of the SARSA algorithm). With the tabular approach that stores the action-value function, the convergence of SARSA has been established in [33]. However, the tabular approach may not be applicable when the state space is large or continuous. For this purpose, SARSA that incorporates parametrized function approximation is commonly used, and is more efficient and scalable. With the function approximation approach, SARSA is not guaranteed to converge in general when the $\epsilon$-greedy or softmax policy improvement operators are used [13, 10]. However, under certain conditions, its convergence can be established. For example, a variant of SARSA with linear function approximation was constructed in [28], where between two policy improvements, a temporal difference (TD) learning algorithm is applied to learn the action-value function till its convergence. The convergence of this algorithm was established in [28] using a contraction argument under the condition that the policy improvement operator is Lipschitz continuous and the Lipschitz constant is not too large. The convergence of the original SARSA algorithm under the same Lipschitz condition was later established using an O.D.E. approach in [23].

Previous studies on SARSA in [28, 23] mainly focused on the asymptotic convergence analysis, which does not suggest how fast SARSA converges and how the accuracy of the solution depends on the number of samples, i.e., the sample complexity. The goal of this paper is to provide such a non-asymptotic finite-sample analysis of SARSA and to further understand how the parameters of the underlying Markov process and the algorithm affect the convergence rate. Technically, such an analysis does not follow directly from the existing finite-sample analyses of temporal difference (TD) learning [4, 34] and Q-learning [32], where samples are taken by a Markov process with a fixed transition kernel. The analysis of SARSA necessarily needs to deal with samples taken from a Markov decision process with a time-varying transition kernel, and in this paper, we develop novel techniques to explicitly characterize the stochastic bias for a Markov decision process with a time-varying transition kernel, which may be of independent interest.
1.1 Contributions

In this paper, we design a novel approach to analyze SARSA and a more general fitted SARSA algorithm, and develop the corresponding finite-sample error bounds. In particular, we consider the on-line setting where a single sample trajectory with Markovian noise is available, i.e., samples are not identically and independently distributed (i.i.d.).

Bias characterization for time-varying Markov process. One major challenge in our analysis is due to the fact that the estimate of the "gradient" is biased with non-i.i.d. Markovian noise. Existing studies mostly focus on the case where the samples are generated according to a Markov process with a fixed transition kernel, e.g., TD learning [4, 34] and Q-learning with nearest neighbors [32], so that the uniform ergodicity of the Markov process can be exploited to decouple the dependency on the Markovian noise, and then to explicitly bound the stochastic bias. For Markov processes with a time-varying transition kernel, such a property of uniform ergodicity does not hold in general. In this paper, we develop a novel approach to explicitly characterize the stochastic bias induced by non-i.i.d. samples generated from Markov processes with time-varying transition kernels. The central idea of our approach is to construct auxiliary Markov chains, which are uniformly ergodic, to approximate the dynamically changing Markov process and thereby facilitate the analysis. Our approach can also be applied more generally to analyze stochastic approximation (SA) algorithms with time-varying Markov transition kernels, which may be of independent interest.

Finite-sample analysis for on-policy SARSA. For the on-policy SARSA algorithm, as the estimate of the action-value function changes with time, the behavior policy also changes. By a gradient descent type of analysis [4] and our bias characterization technique for analyzing time-varying Markov processes, we develop the finite-sample analysis for the on-policy SARSA algorithm with a continuous state space and linear function approximation. Our analysis is for the on-line case with a single sample trajectory and non-i.i.d. data. To the best of our knowledge, this is the first finite-sample analysis for this type of on-policy algorithm with a time-varying behavior policy.

Fitted SARSA algorithm. We propose a more general on-line fitted SARSA algorithm, where between two policy improvements, a "fitted" step is taken to obtain a more accurate estimate of the action-value function of the corresponding behavior policy via multiple iterations, rather than taking only a single iteration as in the original SARSA. In particular, it includes the variant of SARSA in [28] as a special case, in which each fitted step is required to converge before the policy improvement is performed. We provide a non-asymptotic analysis for the convergence of the proposed algorithm. Interestingly, our analysis indicates that the fitted step can stop at any time (not necessarily until convergence) without affecting the overall convergence of the fitted SARSA algorithm.
1.2 Related Work

Finite-sample analysis for TD learning. The asymptotic convergence of the TD algorithm was established in [36]. The finite-sample analysis of the TD algorithm was provided in [9, 19] under the i.i.d. setting and in [4, 34] recently under the non-i.i.d. setting, where a single sample trajectory is available. The finite-sample analysis of two-time-scale methods for TD learning was also studied very recently under the i.i.d. setting in [8], under the non-i.i.d. setting with constant step sizes in [15], and under the non-i.i.d. setting with diminishing step sizes in [38]. Differently from TD, the goal of which is to estimate the value function of a fixed policy, SARSA aims to continuously update its estimate of the action-value function to obtain an optimal policy. While samples of the TD algorithm are generated by following a time-invariant behavior policy, the behavior policy that generates samples in SARSA follows from an instantaneous estimate of the action-value function, which changes over time.

Q-learning with function approximation. The asymptotic convergence of Q-learning with linear function approximation was established in [23] under certain conditions. An approach based on a combination of Q-learning and kernel-based nearest neighbor regression was proposed in [32], which first discretizes the entire state space, and then uses the nearest neighbor regression method to estimate the action-value function. Such an approach was shown to converge, and a finite-sample analysis of the convergence rate was further provided. The Q-learning algorithms in [23, 32] are off-policy algorithms, where a fixed behavior policy is used to collect samples, whereas SARSA is an on-policy algorithm with a time-varying behavior policy. Moreover, differently from the nearest neighbor approach, we consider SARSA with linear function approximation. These differences require different techniques to characterize the non-asymptotic convergence rate.

On-policy SARSA algorithm. SARSA was originally proposed in [31], and its convergence with the tabular approach was established in [33]. With function approximation, SARSA is not guaranteed to converge if $\epsilon$-greedy or softmax operators are used. With a smooth enough Lipschitz continuous policy improvement operator, the asymptotic convergence of SARSA was shown in [23, 28]. In this paper, we further develop the non-asymptotic finite-sample analysis for SARSA under the Lipschitz continuity condition.

Fitted value/policy iteration algorithms. Least-squares temporal difference (LSTD) learning algorithms have been extensively studied in [6, 5, 25, 20, 12, 29, 30, 35, 37] and the references therein, where in each iteration a least-squares regression problem based on a batch of data is solved. Approximate (fitted) policy iteration (API) algorithms further extend fitted value iteration with policy improvement. Several variants were studied, which adopt different objective functions, including least-squares policy iteration (LSPI) algorithms in [18, 21, 39], fitted policy iteration based on Bellman residual minimization (BRM) in [1, 11], and classification-based policy iteration algorithms in [22]. The fitted SARSA algorithm in this paper uses an iterative method (the TD(0) algorithm) to estimate the action-value function between two policy improvements, which is more memory and computationally efficient than the batch methods. Differently from [28], we do not require a convergent TD(0) run for each fitted step. For this algorithm, we provide its non-asymptotic convergence analysis.
2 Preliminaries

2.1 Markov Decision Process

Consider a general reinforcement learning setting, where an agent interacts with a stochastic environment, which is modeled as a Markov decision process (MDP). Specifically, we consider an MDP that consists of $(\mathcal{X}, \mathcal{A}, \mathsf{P}, r, \gamma)$, where $\mathcal{X} \subset \mathbb{R}^d$ is a continuous state space and $\mathcal{A}$ is a finite action set. We further let $X_t \in \mathcal{X}$ denote the state at time $t$, and $A_t \in \mathcal{A}$ denote the action at time $t$. Then, the measure $\mathsf{P}$ defines the action-dependent transition kernel for the underlying Markov chain $\{X_t\}_{t \ge 0}$: $\mathbb{P}(X_{t+1} \in U \mid X_t = x, A_t = a) = \int_U \mathsf{P}(dy \mid x, a)$ for any measurable set $U \subset \mathcal{X}$. The one-stage reward at time $t$ is given by $r(X_t, A_t)$, where $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, which is assumed to be uniformly bounded, i.e., $r(x,a) \in [0, r_{\max}]$ for any $(x,a) \in \mathcal{X} \times \mathcal{A}$. Finally, $\gamma$ denotes the discount factor.

A stationary policy $\pi$ maps a state $x \in \mathcal{X}$ to a probability distribution $\pi(\cdot \mid x)$ over $\mathcal{A}$, which does not depend on time. For a policy $\pi$, the corresponding value function $V^\pi : \mathcal{X} \to \mathbb{R}$ is defined as the expected total discounted reward obtained by actions executed according to $\pi$: $V^\pi(x_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t) \mid X_0 = x_0\right]$. The action-value function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is defined as $Q^\pi(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \mathsf{P}(dy \mid x, a)\, V^\pi(y)$. The goal is to find an optimal policy that maximizes the value function from any initial state. The optimal value function is defined as $V^*(x) = \sup_\pi V^\pi(x)$ for all $x \in \mathcal{X}$, and the optimal action-value function is defined as $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$ for all $(x,a) \in \mathcal{X} \times \mathcal{A}$. The optimal policy $\pi^*$ is then greedy with respect to $Q^*$, and it can be verified that $Q^* = Q^{\pi^*}$. The Bellman operator $H$ is defined as $(HQ)(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \max_{b \in \mathcal{A}} Q(y,b)\, \mathsf{P}(dy \mid x, a)$. It is clear that $H$ is a contraction in the sup norm defined as $\|Q\|_{\sup} = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} |Q(x,a)|$, and the optimal action-value function $Q^*$ is the fixed point of $H$ [3].
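As a reminder of why the contraction holds (a standard one-line argument using only the definitions above): for any $Q_1, Q_2$ and any $(x,a)$,
$$|(HQ_1)(x,a) - (HQ_2)(x,a)| = \gamma\left|\int_{\mathcal{X}} \Big(\max_{b\in\mathcal{A}} Q_1(y,b) - \max_{b\in\mathcal{A}} Q_2(y,b)\Big)\mathsf{P}(dy \mid x,a)\right| \le \gamma\,\|Q_1 - Q_2\|_{\sup},$$
since $|\max_b Q_1(y,b) - \max_b Q_2(y,b)| \le \max_b |Q_1(y,b) - Q_2(y,b)|$ and $\mathsf{P}(\cdot \mid x,a)$ is a probability measure. Taking the supremum over $(x,a)$ yields $\|HQ_1 - HQ_2\|_{\sup} \le \gamma\|Q_1 - Q_2\|_{\sup}$, so $Q^*$ is the unique fixed point of $H$ by the Banach fixed-point theorem.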
2.2 Linear Function Approximation

Let $\mathcal{Q} = \{Q_\theta : \theta \in \mathbb{R}^N\}$ be a family of real-valued functions defined on $\mathcal{X} \times \mathcal{A}$. We consider the problem where any function in $\mathcal{Q}$ is a linear combination of a set of $N$ fixed base functions $\phi_i : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ for $i = 1, \dots, N$. Specifically, for $\theta \in \mathbb{R}^N$, $Q_\theta(x,a) = \sum_{i=1}^{N} \theta_i \phi_i(x,a) = \phi^T(x,a)\theta$. We assume that $\|\phi(x,a)\|_2 \le 1$ for all $(x,a) \in \mathcal{X} \times \mathcal{A}$, which can be ensured by normalizing $\{\phi_i\}_{i=1}^N$. The goal is to find a $Q_\theta$ with a compact representation in $\theta$ to approximate the optimal action-value function $Q^*$ over a continuous state space.
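To make the parameterization concrete, the sketch below constructs one possible feature map with $\|\phi(x,a)\|_2 \le 1$ and the corresponding linear action-value estimate. The radial-basis features, the one-dimensional state, and all constants are our own illustrative choices and are not specified in the paper.

```python
import numpy as np

# Hypothetical feature map: RBF features over a 1-D state, one block per action,
# rescaled so that ||phi(x, a)||_2 <= 1 as assumed above.
N_CENTERS = 10               # RBF centers per action
N_ACTIONS = 3                # size of the finite action set A
N = N_CENTERS * N_ACTIONS    # dimension of theta

centers = np.linspace(0.0, 1.0, N_CENTERS)
BANDWIDTH = 0.1

def phi(x, a):
    """Feature vector phi(x, a) in R^N with ||phi(x, a)||_2 <= 1."""
    rbf = np.exp(-((x - centers) ** 2) / (2 * BANDWIDTH ** 2))
    rbf /= max(np.linalg.norm(rbf), 1e-12)   # normalize the active block to unit norm
    feat = np.zeros(N)
    feat[a * N_CENTERS:(a + 1) * N_CENTERS] = rbf
    return feat

def Q(theta, x, a):
    """Linear action-value estimate Q_theta(x, a) = phi(x, a)^T theta."""
    return float(phi(x, a) @ theta)
```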
3 Finite-Sample Analysis for SARSA

3.1 SARSA with Linear Function Approximation

We consider a $\theta$-dependent behavior policy, which changes with time. Specifically, the behavior policy $\pi_{\theta_t}$ is given by $\Gamma(\phi^T(x,a)\theta_t)$, where $\Gamma$ is a policy improvement operator, e.g., greedy, $\epsilon$-greedy, softmax and mellowmax [2]. Suppose that $\{x_t, a_t, r_t\}_{t \ge 0}$ is a sample trajectory of states, actions and rewards obtained from the MDP following the time-dependent behavior policy $\pi_{\theta_t}$ (see Algorithm 1). The projected SARSA with linear function approximation updates as follows:
$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t g_t(\theta_t)\big), \qquad (1)$$
where $g_t(\theta_t) = \nabla_\theta Q_{\theta_t}(x_t,a_t)\,\delta_t = \phi(x_t,a_t)\,\delta_t$, $\delta_t$ denotes the temporal difference at time $t$:
$$\delta_t = r(x_t,a_t) + \gamma\phi^T(x_{t+1},a_{t+1})\theta_t - \phi^T(x_t,a_t)\theta_t,$$
and $\mathrm{proj}_{2,R}(\theta) := \arg\min_{\theta': \|\theta'\|_2 \le R} \|\theta - \theta'\|_2$. In this paper, we refer to $g_t$ as the gradient, although it is not the gradient of any function.

Algorithm 1 SARSA
Initialization: $\theta_0$, $x_0$, $R$, $\phi_i$ for $i = 1, 2, \dots, N$
Method:
$\pi_0 \leftarrow \Gamma(\phi^T\theta_0)$
Choose $a_0$ according to $\pi_0$
for $t = 1, 2, \dots$ do
  Observe $x_t$ and $r(x_{t-1}, a_{t-1})$
  Choose $a_t$ according to $\pi_{t-1}$
  $\theta_t \leftarrow \mathrm{proj}_{2,R}\big(\theta_{t-1} + \alpha_{t-1}\, g_{t-1}(\theta_{t-1})\big)$
  Policy improvement: $\pi_t \leftarrow \Gamma(\phi^T\theta_t)$
end for
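To connect the update (1) and Algorithm 1 to running code, the following is a minimal sketch of projected SARSA on a toy problem, reusing phi, Q, N and N_ACTIONS from the feature sketch in Section 2.2. The softmax policy standing in for the improvement operator $\Gamma$, the random-walk environment, the projection radius and the step-size schedule are assumptions made purely for illustration.

```python
import numpy as np

# Minimal projected SARSA (in the style of Algorithm 1) on a toy problem; reuses
# phi, Q, N, N_ACTIONS from the feature sketch above. Everything else here is an
# illustrative assumption, not something prescribed by the paper.
rng = np.random.default_rng(0)
GAMMA, R, TEMP = 0.9, 10.0, 1.0   # discount factor, projection radius, softmax temperature

def policy(theta, x):
    """Policy improvement operator Gamma: softmax over Q_theta(x, .) (our choice)."""
    q = np.array([Q(theta, x, a) for a in range(N_ACTIONS)])
    p = np.exp((q - q.max()) / TEMP)
    return p / p.sum()

def env_step(x, a):
    """Toy stand-in MDP: bounded random walk on [0, 1]; reward favors staying near 0.5."""
    x_next = float(np.clip(x + 0.1 * (a - 1) + 0.05 * rng.standard_normal(), 0.0, 1.0))
    return x_next, 1.0 - abs(x_next - 0.5)

def project(theta, radius=R):
    """proj_{2,R}: Euclidean projection onto the ball {theta : ||theta||_2 <= R}."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= radius else theta * (radius / nrm)

theta, x = np.zeros(N), 0.5
a = rng.choice(N_ACTIONS, p=policy(theta, x))              # a_0 ~ pi_0
for t in range(1, 5001):
    alpha = 1.0 / (t + 100)                                # diminishing step size alpha_{t-1}
    x_next, reward = env_step(x, a)                        # observe x_t and r(x_{t-1}, a_{t-1})
    a_next = rng.choice(N_ACTIONS, p=policy(theta, x_next))   # a_t ~ pi_{t-1}
    delta = reward + GAMMA * Q(theta, x_next, a_next) - Q(theta, x, a)   # TD error
    theta = project(theta + alpha * phi(x, a) * delta)     # update (1): theta_t
    x, a = x_next, a_next      # policy improvement is implicit: the next action
                               # is drawn from pi_t = Gamma(phi^T theta_t)
```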
In the update (1), the projection step is used to control the norm of the gradient $g_t(\theta_t)$, which is a commonly used technique to control the gradient bias [4, 16, 17, 7, 26]. With a small step size $\alpha_t$ and a bounded gradient, $\theta_t$ does not change too fast. We note that [14] showed that SARSA converges to a bounded region, and thus $\theta_t$ is bounded for all $t \ge 0$. This implies that our analysis still holds without the projection step. We further note that even without exploiting the fact that $\theta_t$ is bounded, the finite-sample analysis for SARSA can still be obtained by combining our approach of analyzing the stochastic bias with an extension of the approach in [34]. However, to convey the central idea of characterizing the stochastic bias of an MDP with a dynamically changing transition kernel, we focus on the projected SARSA in this paper.

We consider the following Lipschitz continuous policy improvement operator $\Gamma$, as in [28, 23]. For any $\theta \in \mathbb{R}^N$, the behavior policy $\pi_\theta = \Gamma(\phi^T\theta)$ is Lipschitz with respect to $\theta$: for all $(x,a) \in \mathcal{X} \times \mathcal{A}$ and any $\theta_1, \theta_2 \in \mathbb{R}^N$,
$$|\pi_{\theta_1}(a \mid x) - \pi_{\theta_2}(a \mid x)| \le C\,\|\theta_1 - \theta_2\|_2, \qquad (2)$$
where $C > 0$ is the Lipschitz constant. Further discussion of this assumption and its impact on the convergence is provided in Section 5.
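As a concrete example (a standard calculation, not taken from the paper), the softmax operator with temperature $\tau > 0$, $\pi_\theta(a \mid x) \propto \exp\big(\phi^T(x,a)\theta/\tau\big)$, satisfies (2). Indeed, since $\|\phi(x,a)\|_2 \le 1$,
$$\big\|\nabla_\theta\, \pi_\theta(a \mid x)\big\|_2 = \frac{\pi_\theta(a \mid x)}{\tau}\,\Big\|\phi(x,a) - \sum_{b \in \mathcal{A}} \pi_\theta(b \mid x)\,\phi(x,b)\Big\|_2 \le \frac{2}{\tau},$$
so the mean value inequality gives (2) with $C \le 2/\tau$: a larger temperature yields a smoother operator and a smaller Lipschitz constant.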
We further assume that for any fixed $\theta \in \mathbb{R}^N$, the Markov chain $\{X_t\}_{t \ge 0}$ induced by the behavior policy $\pi_\theta$ and the transition kernel $\mathsf{P}$ is uniformly ergodic with the invariant measure denoted by $P_\theta$, and satisfies the following assumption.

Assumption 1. There are constants $m > 0$ and $\rho \in (0,1)$ such that
$$\sup_{x \in \mathcal{X}} d_{TV}\big(\mathbb{P}(X_t \in \cdot \mid X_0 = x),\, P_\theta\big) \le m\rho^t, \quad \forall t \ge 0,$$
where $d_{TV}(P, Q)$ denotes the total-variation distance between the probability measures $P$ and $Q$.
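A routine consequence of Assumption 1, stated here only for intuition: for any accuracy $0 < \epsilon < m$, the chain mixes to within $\epsilon$ in total variation after at most
$$\tau_\theta(\epsilon) := \min\{t \ge 0 : m\rho^t \le \epsilon\} \le \left\lceil \frac{\log(m/\epsilon)}{\log(1/\rho)} \right\rceil$$
steps, uniformly over the initial state; quantities of this form, together with $1/(1-\rho)$, are what typically enter finite-sample bounds under Markovian sampling.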
We denote by $\mu_\theta$ the probability measure induced by the invariant measure $P_\theta$ and the behavior policy $\pi_\theta$. We assume that the $N$ base functions $\phi_i$'s are linearly independent in the Hilbert space $L^2(\mathcal{X} \times \mathcal{A}, \mu_{\theta^*})$, where $\theta^*$ is the limit point of Algorithm 1, which will be defined in the next section. For the space $L^2(\mathcal{X} \times \mathcal{A}, \mu_{\theta^*})$, two measurable functions on $\mathcal{X} \times \mathcal{A}$ are equivalent if they are identical except on a set of $\mu_{\theta^*}$-measure zero.

3.2 Finite-Sample Analysis

We first define
$$A_\theta = \mathbb{E}_\theta\big[\phi(X,A)\big(\gamma\phi^T(Y,B) - \phi^T(X,A)\big)\big] \quad \text{and} \quad b_\theta = \mathbb{E}_\theta\big[\phi(X,A)\,r(X,A)\big],$$
where $\mathbb{E}_\theta$ denotes the expectation in which $X$ follows the invariant probability measure $P_\theta$, $A$ is generated by the behavior policy $\pi_\theta(A = \cdot \mid X)$, $Y$ is the subsequent state of $X$ following action $A$, i.e., $Y$ follows the transition kernel $\mathsf{P}(Y \in \cdot \mid X, A)$, and $B$ is generated by the behavior policy $\pi_\theta(B = \cdot \mid Y)$. It was shown in [23] that the algorithm in (1) converges to a unique point $\theta^*$, which satisfies the relation $A_{\theta^*}\theta^* + b_{\theta^*} = 0$, if the Lipschitz constant $C$ is not too large, so that $(A_{\theta^*} + C\lambda I)$ is negative definite. Let $G = r_{\max} + 2R$ and $\lambda = G|\mathcal{A}|\big(2 + \lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\big)$. Recall from (2) that the policy $\pi_\theta$ is Lipschitz with respect to $\theta$ with Lipschitz constant $C$. We then make the following assumption [28, 23].

Assumption 2. The Lipschitz constant $C$ is not too large, so that $(A_{\theta^*} + C\lambda I)$ is negative definite, and we denote the largest eigenvalue of $\frac{1}{2}\big((A_{\theta^*} + C\lambda I) + (A_{\theta^*} + C\lambda I)^T\big)$ by $-w_s < 0$.
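To make the fixed-point relation $A_{\theta^*}\theta^* + b_{\theta^*} = 0$ concrete, the sketch below estimates $A_\theta$ and $b_\theta$ empirically along a single trajectory generated under $\pi_\theta$ and solves the resulting linear system, in the spirit of LSTD. It continues the toy example above (reusing phi, policy, env_step, GAMMA, N, N_ACTIONS, rng and the final theta); the trajectory length, burn-in and solver are illustrative choices rather than part of the paper's algorithm or analysis.

```python
import numpy as np

# Empirical illustration of A_{theta*} theta* + b_{theta*} = 0, in the spirit of LSTD.
# Reuses phi, policy, env_step, GAMMA, N, N_ACTIONS, rng and theta from the sketches above;
# all of those are our own toy choices, not objects defined in the paper.
def estimate_A_b(theta, T=20000, burn_in=1000):
    """Monte Carlo estimates of A_theta and b_theta along one trajectory under pi_theta."""
    A_hat, b_hat = np.zeros((N, N)), np.zeros(N)
    x = 0.5
    a = rng.choice(N_ACTIONS, p=policy(theta, x))
    for t in range(T + burn_in):
        x_next, reward = env_step(x, a)
        a_next = rng.choice(N_ACTIONS, p=policy(theta, x_next))
        if t >= burn_in:   # discard early samples so the chain is close to stationary
            f = phi(x, a)
            A_hat += np.outer(f, GAMMA * phi(x_next, a_next) - f) / T
            b_hat += f * reward / T
        x, a = x_next, a_next
    return A_hat, b_hat

A_hat, b_hat = estimate_A_b(theta)                   # theta from the SARSA sketch above
w = np.linalg.lstsq(A_hat, -b_hat, rcond=None)[0]    # solve A_theta w = -b_theta
print(np.linalg.norm(w - theta))                     # small once SARSA nears its limit point
```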