Reducing the variance in online optimization by transporting past gradients

Sébastien M. R. Arnold (University of Southern California, Los Angeles, CA)
Pierre-Antoine Manzagol (Google Brain, Montréal, QC) manzagop
Reza Babanezhad (University of British Columbia, Vancouver, BC) rezababa@cs.ubc.ca
Ioannis Mitliagkas (Mila, Université de Montréal, Montréal, QC) ioannis@iro.umontreal.ca
Nicolas Le Roux (Mila, Google Brain, Montréal, QC) nlr

Abstract

Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.²

* Work done while at Mila.
² Open-source implementation available at:

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction

We wish to solve the following minimization problem:
$$\theta^* = \operatorname*{argmin}_{\theta} \; E_{x \sim p}\left[f(\theta, x)\right], \qquad (1)$$
where we only have access to samples $x$ and to a first-order oracle that gives us, for a given $\theta$ and a given $x$, the derivative of $f(\theta, x)$ with respect to $\theta$, i.e. $\frac{\partial f(\theta, x)}{\partial \theta} = g(\theta, x)$. It is known [35] that, when $f$ is smooth and strongly convex, there is a converging algorithm for Problem 1 that takes the form $\theta_{t+1} = \theta_t - \alpha_t g(\theta_t, x_t)$, where $x_t$ is a sample from $p$. This algorithm, dubbed stochastic gradient (SG), has a convergence rate of $O(1/t)$ (see for instance [4]), within a constant factor of the minimax rate for this problem. When one has access to the true gradient $g(\theta) = E_{x \sim p}[g(\theta, x)]$ rather than just a sample, this rate dramatically improves to $O(e^{-\nu t})$ for some $\nu > 0$.

In addition to hurting the convergence speed, noise in the gradient makes optimization algorithms harder to tune. Indeed, while full gradient descent is convergent for a constant stepsize, and also amenable to line searches to find a good value for that stepsize, the stochastic gradient method from [35] with a constant stepsize only converges to a ball around the optimum [38].³ Thus, to achieve convergence, one needs to use a decreasing stepsize. While this seems like a simple modification, the precise decrease schedule can have a dramatic impact on the convergence speed. While theory prescribes $\alpha_t = O(t^{-\nu})$ with $\nu \in (1/2, 1]$ in the smooth case, practitioners often use larger stepsizes like $\alpha_t = O(t^{-1/2})$ or even constant stepsizes.

³ Under some conditions, it does converge linearly to the optimum, e.g., [45].
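For concreteness, here is a minimal sketch of the SG update with the $\alpha_t = O(t^{-1/2})$ schedule mentioned above; the quadratic objective, noise level, and constants are hypothetical stand-ins chosen for illustration, not part of the paper.

```python
import numpy as np

def sg(grad_oracle, theta0, alpha0=0.1, T=1000):
    """Stochastic gradient with the alpha_t = alpha0 / sqrt(t) schedule."""
    theta = theta0.copy()
    for t in range(1, T + 1):
        alpha_t = alpha0 / np.sqrt(t)        # practitioners' t^{-1/2} decrease
        theta -= alpha_t * grad_oracle(theta, t)
    return theta

# Hypothetical oracle: noisy gradients of f(theta) = ||theta||^2 / 2.
rng = np.random.default_rng(0)
oracle = lambda theta, t: theta + rng.normal(scale=0.3, size=theta.shape)
print(sg(oracle, np.ones(5)))
```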
When the distribution $p$ has finite support, Eq. 1 becomes a finite sum and, in that setting, it is possible to achieve efficient variance reduction and drive the noise to zero, allowing stochastic methods to achieve linear convergence rates [24, 17, 50, 28, 42, 5]. Unfortunately, the finite support assumption is critical to these algorithms which, while valid in many contexts, do not have the broad applicability of the standard SG algorithm. Several works have extended these approaches to the online setting by applying these algorithms while increasing the mini-batch size $N$ [2, 14], but they need to revisit past examples multiple times and are not truly online. Another line of work reduces variance by averaging iterates [33, 22, 3, 10, 7, 6, 16]. While these methods converge for a constant stepsize in the stochastic case,⁴ their practical speed is heavily dependent on the fraction of iterates kept in the averaging, a hyperparameter that is thus hard to tune, and they are rarely used in deep learning.

⁴ Under some conditions on $f$.

Our work combines two existing ideas and adds a third: a) At every step, it updates the parameters using a weighted average of past gradients, like in SAG [24, 40], albeit with a different weighting scheme; b) It reduces the bias and variance induced by the use of these old gradients by transporting them to “equivalent” gradients computed at the current point, similar to [11]; c) It does so implicitly by computing the gradient at a parameter value different from the current one. The resulting gradient estimator can then be used as a plug-in replacement of the stochastic gradient within any optimization scheme. Experimentally, both SG using our estimator and its momentum variant outperform the most commonly used optimizers in deep learning.

2 Momentum and other approaches to dealing with variance

Stochastic variance reduction methods use an average of past gradients to reduce the variance of the gradient estimate. At first glance, it seems like their updates are similar to that of momentum [32], also known as the heavy ball method, which performs the following updates:⁵
$$v_t = \gamma_t v_{t-1} + (1 - \gamma_t)\, g(\theta_t, x_t), \qquad v_0 = g(\theta_0, x_0)$$
$$\theta_{t+1} = \theta_t - \alpha_t v_t.$$
When $\gamma_t = \gamma$, this leads to
$$\theta_{t+1} = \theta_t - \alpha_t \left( \gamma^t g(\theta_0, x_0) + (1 - \gamma) \sum_{i=1}^{t} \gamma^{t-i} g(\theta_i, x_i) \right).$$
Hence, the heavy ball method updates the parameters of the model using an average of past gradients, bearing similarity with SAG [24], albeit with exponential instead of uniform weights.

⁵ This is slightly different from the standard formulation but equivalent for constant $\gamma_t$.
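The sketch below implements this normalized form of heavy ball with a hypothetical gradient oracle; with constant $\gamma_t = \gamma$, unrolling the recursion recovers exactly the exponentially weighted sum above.

```python
import numpy as np

def heavy_ball(grad_oracle, theta0, alpha=0.01, gamma=0.9, T=100):
    """Heavy ball in normalized form: v_t = gamma v_{t-1} + (1 - gamma) g_t."""
    theta = theta0.copy()
    v = grad_oracle(theta, 0)                # v_0 = g(theta_0, x_0)
    theta = theta - alpha * v
    for t in range(1, T):
        g = grad_oracle(theta, t)
        v = gamma * v + (1.0 - gamma) * g    # exponential average of gradients
        theta = theta - alpha * v
    return theta
```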
Interestingly, while momentum is a popular method for training deep networks, its theoretical analysis in the stochastic setting is limited [44], except in the particular setting when the noise converges to 0 at the optimum [26]. Also surprising is that, despite the apparent similarity with stochastic variance reduction methods, current convergence rates are slower when using $\gamma > 0$ in the presence of noise [39], although this might be a limitation of the analysis.

2.1 Momentum and variance

We propose here an analysis of how, on quadratics, using past gradients as done in momentum does not lead to a decrease in variance. If gradients are stochastic, then $\theta_t$ is a random variable. Denoting $\epsilon_i$ the noise at timestep $i$, i.e. $g(\theta_i, x_i) = g(\theta_i) + \epsilon_i$, and writing $\theta_t - E[\theta_t] = \sum_{i=0}^{t} N_{i,t}\,\epsilon_i$, with $N_{i,t}$ the impact of the noise of the $i$-th datapoint on the $t$-th iterate, we may now analyze the total impact of each $\epsilon_i$ on the iterates. Figure 1 shows the impact of $\epsilon_i$ on $\theta_t - E[\theta_t]$, as measured by $N_{i,t}^2$, for three datapoints ($i = 1$, $i = 25$ and $i = 50$) as a function of $t$ for stochastic gradient ($\gamma = 0$, left) and momentum ($\gamma = 0.9$, right).

Figure 1: Variance induced over time by the noise from three different datapoints ($i = 1$, $i = 25$ and $i = 50$) as well as the total variance for SG ($\gamma = 0$, top left), momentum with fixed $\gamma = 0.9$ (top right), momentum with increasing $\gamma_t = 1 - \frac{1}{t}$ without (bottom left) and with (bottom right) transport. The impact of the noise of each gradient $\epsilon_i$ increases for a few iterations then decreases. Although a larger $\gamma$ reduces the maximum impact of a given datapoint, the total variance does not decrease. With transport, noises are now equal and total variance decreases. The y-axis is on a log scale. (Panels: (a) stochastic gradient; (b) momentum, $\gamma = 0.9$; (c) momentum, $\gamma_t = 1 - \frac{1}{t}$; (d) momentum, $\gamma_t = 1 - \frac{1}{t}$, with IGT.)

As we can see, when using momentum, the variance due to a given datapoint first increases as the noise influences both the next iterate (through the parameter update) and the subsequent updates (through the velocity). Due to the weight $1 - \gamma$ when a point is first sampled, a larger value of $\gamma$ leads to a lower immediate impact of the noise of a given point on the iterates. However, a larger $\gamma$ also means that the noise of a given gradient is kept longer, leading to little or no decrease of the total variance (dashed blue curve). Even in the case of stochastic gradient, the noise at a given timestep carries over to subsequent timesteps, even if the old gradients are not used for the update, as the iterate itself depends on the noise.

At every timestep, the contribution to the noise of the 1st, the 25th and the 50th points in Fig. 1 is unequal. If we assume that the $\epsilon_i$ are i.i.d., then the total variance would be minimal if the contribution from each point was equal. Further, one can notice that the impact of datapoint $i$ is only a function of $t - i$ and not of $t$. This guarantees that the total noise will not decrease over time.

To address these two points, one can increase the momentum parameter over time. In doing so, the noise of new datapoints will have a decreasing impact on the total variance as their gradient is multiplied by $1 - \gamma_t$. Figure 1c shows the impact $N_{i,t}^2$ of each noise $\epsilon_i$ for an increasing momentum $\gamma_t = 1 - \frac{1}{t}$. The peak of noise for $i = 25$ is indeed lower than that of $i = 1$. However, the variance still does not go to 0. This is because, as the momentum parameter increases, the update is an average of many gradients, including stale ones. Since these gradients were computed at iterates already influenced by the noise over previous datapoints, that past noise is amplified, as testified by the higher peak at $i = 1$ for the increasing momentum. Ultimately, increasing momentum does not lead to a convergent algorithm in the presence of noise when using a constant stepsize.
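The qualitative behaviour of the $N_{i,t}^2$ curves in Figure 1 can be reproduced with a short Monte Carlo simulation on a one-dimensional quadratic: run momentum many times, injecting noise only at step $i$, and measure the spread of $\theta_t$ across runs. Everything below (objective, constants, sample counts) is a hypothetical setup chosen for illustration, not the paper's exact protocol.

```python
import numpy as np

def noise_impact(i, T, gamma=0.9, alpha=0.1, h=1.0, runs=2000, seed=0):
    """Monte Carlo estimate of N_{i,T}^2 on f(theta) = h theta^2 / 2."""
    rng = np.random.default_rng(seed)
    thetas = np.zeros(runs)
    for r in range(runs):
        theta, v = 1.0, 0.0
        for t in range(T):
            eps = rng.normal() if t == i else 0.0   # noise only at step i
            g = h * theta + eps
            v = gamma * v + (1 - gamma) * g if t > 0 else g
            theta -= alpha * v
        thetas[r] = theta
    return thetas.var()   # approx N_{i,T}^2 * Var(eps), with Var(eps) = 1

for i in (1, 25, 50):
    print(i, [noise_impact(i, T) for T in (60, 80, 100)])
```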
2.2 SAG and Hessian modelling

The impact of the staleness of the gradients on the convergence is not limited to momentum. In SAG, for instance, the excess error after $k$ updates is proportional to $\left(1 - \min\left\{\frac{1}{16\kappa_b}, \frac{1}{8N}\right\}\right)^k$, compared to the excess error of the full gradient method, which is $\left(1 - \frac{1}{\kappa}\right)^k$ where $\kappa$ is the condition number of the problem.⁶ The difference between the two rates is larger when the minimum in the SAG rate is the second term. This happens either when $\kappa_b$ is small, i.e. the problem is well conditioned and a lot of progress is made at each step, or when $N$ is large, i.e. there are many points in the training set. Both cases imply that a large distance has been travelled between two draws of the same datapoint. Recent works showed that correcting for that staleness by modelling the Hessian [46, 11] leads to improved convergence.

As momentum uses stale gradients, the velocity is an average of current and past gradients and thus can be seen as an estimate of the true gradient at a point which is not the current one but rather a convex combination of past iterates. As past iterates depend on the noise of previous gradients, this bias in the gradients amplifies the noise and leads to a non-converging algorithm. We shall thus “transport” the old stochastic gradients $g(\theta_i, x_i)$ to make them closer to their corresponding value at the current iterate, $g(\theta_t, x_i)$. Past works did so using the Hessian or an explicit approximation thereof, which can be expensive and difficult to compute and maintain. We will resort to using implicit transport, a new method that aims at compensating the staleness of past gradients without making explicit use of the Hessian.

⁶ The $\kappa_b$ in the convergence rate of SAG is generally larger than the $\kappa$ in the full gradient algorithm.
3 Converging optimization through implicit gradient transport

Before showing how to combine the advantages of both increasing momentum and gradient transport, we demonstrate how to transport gradients implicitly. This transport is only exact under a strong assumption that will not hold in practice. However, this result will serve to convey the intuition behind implicit gradient transport. We will show in Section 4 how to mitigate the effect of the unsatisfied assumption.

3.1 Implicit gradient transport

Let us assume that we received samples $x_0, \ldots, x_t$ in an online fashion. We wish to approach the full gradient $\bar g_t(\theta_t) = \frac{1}{t+1}\sum_{i=0}^{t} g(\theta_t, x_i)$ as accurately as possible. We also assume here that: a) we have a noisy estimate $\hat g_{t-1}(\theta_{t-1})$ of $\bar g_{t-1}(\theta_{t-1})$; b) we can compute the gradient $g(\theta, x_t)$ at any location $\theta$. We shall seek a $\theta$ such that
$$\frac{t}{t+1}\,\hat g_{t-1}(\theta_{t-1}) + \frac{1}{t+1}\,g(\theta, x_t) \approx \bar g_t(\theta_t).$$
To this end, we shall make the following assumption:

Assumption 3.1. All individual functions $f(\cdot, x)$ are quadratics with the same Hessian $H$.

This is the same assumption as [10, Section 4.1]. Although it is unlikely to hold in practice, we shall see that our method still performs well when that assumption is violated. Under Assumption 3.1, we then have (see details in Appendix)
$$\bar g_t(\theta_t) = \frac{t}{t+1}\,\bar g_{t-1}(\theta_t) + \frac{1}{t+1}\,g(\theta_t, x_t) \approx \frac{t}{t+1}\,\hat g_{t-1}(\theta_{t-1}) + \frac{1}{t+1}\,g\left(\theta_t + t(\theta_t - \theta_{t-1}),\, x_t\right).$$
Thus, we can transport our current estimate of the gradient by computing the gradient on the new point at a shifted location $\theta = \theta_t + t(\theta_t - \theta_{t-1})$. This extrapolation step is reminiscent of Nesterov's acceleration, with the difference that the factor in front of $\theta_t - \theta_{t-1}$, namely $t$, is not bounded.
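Under Assumption 3.1 this identity can be checked numerically: for quadratics with a shared Hessian, evaluating the new gradient at the extrapolated point makes the running average match the full gradient at $\theta_t$ exactly (when the initial estimate is exact). The sketch below is a self-contained check under that assumption; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 30
H = np.diag(rng.uniform(0.5, 2.0, d))       # shared Hessian (Assumption 3.1)
b = [rng.normal(size=d) for _ in range(T)]  # f(theta, x_i) = 0.5 theta'H theta - b_i'theta
g = lambda theta, i: H @ theta - b[i]

theta_prev = rng.normal(size=d)
theta = rng.normal(size=d)
g_hat = g(theta_prev, 0)                    # exact estimate of g_bar_0(theta_0)
for t in range(1, T):
    shifted = theta + t * (theta - theta_prev)       # theta_t + t(theta_t - theta_{t-1})
    g_hat = t / (t + 1) * g_hat + 1 / (t + 1) * g(shifted, t)
    full = np.mean([g(theta, i) for i in range(t + 1)], axis=0)
    assert np.allclose(g_hat, full)         # transport is exact on quadratics
    theta_prev, theta = theta, theta - 0.1 * g_hat   # any update rule works
print("implicit transport exact on equal-Hessian quadratics")
```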
3.2 Combining increasing momentum and implicit gradient transport

We now describe our main algorithm, Implicit Gradient Transport (IGT). IGT uses an increasing momentum $\gamma_t = \frac{t}{t+1}$. At each step, when updating the velocity, it computes the gradient of the new point at an extrapolated location so that the velocity $v_t$ is a good estimate of the true gradient $g(\theta_t)$. We can rewrite the updates to eliminate the velocity $v_t$, leading to the update:
$$\theta_{t+1} = \frac{2t+1}{t+1}\,\theta_t - \frac{t}{t+1}\,\theta_{t-1} - \frac{\alpha}{t+1}\,g\left(\theta_t + t(\theta_t - \theta_{t-1}),\, x_t\right). \qquad \text{(IGT)}$$
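In velocity form, the same method reads $v_t = \gamma_t v_{t-1} + (1 - \gamma_t)\, g(\theta_t + t(\theta_t - \theta_{t-1}), x_t)$ with $\gamma_t = \frac{t}{t+1}$, followed by $\theta_{t+1} = \theta_t - \alpha v_t$. The sketch below implements it with a hypothetical gradient oracle `grad`; eliminating `v` algebraically recovers the (IGT) recursion above.

```python
import numpy as np

def igt(grad, theta0, alpha=0.1, T=1000):
    """IGT with increasing momentum gamma_t = t / (t + 1)."""
    theta_prev = theta0.copy()
    v = grad(theta0, 0)                      # v_0 = g(theta_0, x_0)
    theta = theta0 - alpha * v
    for t in range(1, T):
        gamma = t / (t + 1)
        shifted = theta + t * (theta - theta_prev)   # transport point
        v = gamma * v + (1 - gamma) * grad(shifted, t)
        theta_prev, theta = theta, theta - alpha * v
    return theta
```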
We see in Fig. 1d that IGT allows a reduction in the total variance, thus leading to convergence with a constant stepsize. This is captured by the following proposition:

Proposition 3.1. If $f$ is a quadratic function with positive definite Hessian $H$ with largest eigenvalue $L$ and condition number $\kappa$, and if the stochastic gradients satisfy $g(\theta, x) = g(\theta) + \epsilon$ with $\epsilon$ a random i.i.d. noise with covariance bounded by $BI$, then Eq. IGT with stepsize $\alpha = 1/L$ leads to iterates $\theta_t$ satisfying
$$E\|\theta_t - \theta^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^{2t} \|\theta_0 - \theta^*\|^2 + \frac{d\,\nu^2 B\,\alpha_0^2}{t}, \qquad \text{with } \nu = 2\kappa + 2\kappa\log\kappa,$$
for every $t \ge 2$. The proof of Prop. 3.1 is provided in the appendix.

Despite this theoretical result, two limitations remain: first, Prop. 3.1 shows that IGT does not improve the dependency on the conditioning of the problem; second, the assumption of equal Hessians is unlikely to be true in practice, leading to an underestimation of the bias. We address the conditioning issue in the next section and the assumption on the Hessians in Section 4.
3.3 IGT as a plug-in gradient estimator

We demonstrated that the IGT estimator has lower variance than the stochastic gradient estimator for quadratic objectives. IGT can also be used as a drop-in replacement for the stochastic gradient in an existing, popular first-order method: the heavy ball (HB). This is captured by the following two propositions:

Proposition 3.2 (Non-stochastic). In the non-stochastic case, where $B = 0$, the variance is equal to $0$ and Heavyball-IGT achieves the accelerated linear rate $O\left(\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^t\right)$ using the known, optimal heavy ball tuning, $\mu = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2$, $\alpha = \frac{(1+\sqrt{\mu})^2}{L}$.

Proposition 3.3 (Online, stochastic). When $B > 0$, there exist constant hyperparameters $\alpha > 0$, $\mu > 0$ such that $\|E[\theta_t] - \theta^*\|^2$ converges to zero linearly, and the variance is $O(1/t)$.

The pseudo-code can be found in Algorithm 1.

Algorithm 1 Heavyball-IGT
1: procedure HEAVYBALL-IGT(Stepsize $\alpha$, Momentum $\mu$, Initial parameters $\theta_0$)
2:   $v_0 \leftarrow g(\theta_0, x_0)$;  $w_0 \leftarrow -\alpha v_0$;  $\theta_1 \leftarrow \theta_0 + w_0$
3:   for $t = 1, \ldots, T-1$ do
4:     $\gamma_t \leftarrow \frac{t}{t+1}$
5:     $v_t \leftarrow \gamma_t v_{t-1} + (1 - \gamma_t)\, g\left(\theta_t + \frac{\gamma_t}{1-\gamma_t}(\theta_t - \theta_{t-1}),\, x_t\right)$
6:     $w_t \leftarrow \mu w_{t-1} - \alpha v_t$
7:     $\theta_{t+1} \leftarrow \theta_t + w_t$
8:   end for
9:   return $\theta_T$
10: end procedure
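A direct Python transcription of Algorithm 1 might look like the following sketch; the stochastic gradient oracle `g(theta, t)` is a hypothetical stand-in for $g(\cdot, x_t)$, and parameters are assumed to be numpy arrays.

```python
def heavyball_igt(g, theta0, alpha, mu, T):
    """Heavyball-IGT (Algorithm 1): heavy ball driven by the IGT estimator."""
    v = g(theta0, 0)                        # v_0 <- g(theta_0, x_0)
    w = -alpha * v                          # w_0 <- -alpha v_0
    theta_prev, theta = theta0, theta0 + w  # theta_1 <- theta_0 + w_0
    for t in range(1, T):
        gamma = t / (t + 1)                 # gamma_t, so gamma/(1-gamma) = t
        shifted = theta + (gamma / (1 - gamma)) * (theta - theta_prev)
        v = gamma * v + (1 - gamma) * g(shifted, t)
        w = mu * w - alpha * v              # heavy ball step on the estimate
        theta_prev, theta = theta, theta + w
    return theta
```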
4 IGT and Anytime Tail Averaging

So far, IGT weighs all gradients equally. This is because, with equal Hessians, one can perfectly transport these gradients irrespective of the distance travelled since they were computed. In practice, the individual Hessians are not equal and might change over time. In that setting, the transport induces an error which grows with the distance travelled. We wish to average a linearly increasing number of gradients, to maintain the $O(1/t)$ rate on the variance, while forgetting about the oldest gradients to decrease the bias. To this end, we shall use anytime tail averaging [23], named in reference to the tail averaging technique used in optimization [16]. Tail averaging is an online averaging technique where only the last points, usually a constant fraction $c$ of the total number of points seen, are kept. Maintaining the exact average at every timestep is memory inefficient and anytime tail averaging performs an approximate averaging using
$$\gamma_t = \frac{c(t-1)}{1 + c(t-1)}\left(1 - \frac{1}{c}\sqrt{\frac{1-c}{t(t-1)}}\right).$$
We refer the reader to [23] for additional details.
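As a sketch, this schedule can be computed as below and substituted for $\gamma_t = \frac{t}{t+1}$ in the IGT updates above; `c` is the kept tail fraction, and the clipping at zero for small $t$ (where the expression can dip below zero) is an assumption of this sketch rather than something stated in the text.

```python
import math

def anytime_tail_gamma(t, c=0.1):
    """Approximate tail-averaging weight gamma_t for a kept fraction c (t >= 2)."""
    lead = c * (t - 1) / (1 + c * (t - 1))
    correction = 1 - math.sqrt((1 - c) / (t * (t - 1))) / c
    return max(0.0, lead * correction)      # clipping: sketch assumption

print([round(anytime_tail_gamma(t), 4) for t in (10, 100, 1000)])
```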
5 Impact of IGT on bias and variance in the ideal case

To understand the behaviour of IGT when Assumption 3.1 is verified, we minimize a strongly convex quadratic function with Hessian $Q \in \mathbb{R}^{100 \times 100}$ with condition number $1000$, and we have access to the gradient corrupted by noise $\epsilon_t$, where $\epsilon_t \sim N(0, 0.3\,I_{100})$. In that scenario where all Hessians are equal and implicit gradient transport is exact, Fig. 2a confirms the $O(1/t)$ rate of IGT with constant stepsize while SGD and HB only converge to a ball around the optimum.
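This ideal-case setup is straightforward to replicate; the sketch below builds a quadratic with the stated condition number and noise level and runs the `igt` routine sketched in Section 3.2. The diagonal spectrum, stepsize $\alpha = 1/L$, and horizon are assumptions of this sketch; only the dimension, condition number, and noise scale come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa = 100, 1000.0
eigs = np.logspace(0, np.log10(kappa), d) / kappa   # spectrum in [1/kappa, 1]
Q = np.diag(eigs)                                   # Hessian with cond(Q) = kappa
theta_star = rng.normal(size=d)

def grad(theta, t):
    # True gradient Q (theta - theta*) corrupted by eps_t ~ N(0, 0.3 I_100).
    return Q @ (theta - theta_star) + rng.normal(scale=np.sqrt(0.3), size=d)

theta = igt(grad, np.zeros(d), alpha=1.0 / eigs.max(), T=10_000)
print(np.linalg.norm(theta - theta_star))  # E||theta - theta*||^2 decays as O(1/t)
```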
To further understand the impact of IGT, we study the quality of the gradient estimate. Standard stochastic methods control the variance of the parameter update by scaling it with a decreasing stepsize, which slows the optimization down. With IGT, we hope to have a low variance while maintaining a norm of the update comparable to that obtained with gradient descent. To validate the quality of our estimator, we optimized a quadratic function using IGT, collecting iterates $\theta_t$. For each iterate, we computed the squared error between the true gradient and either the stochastic or the IGT gradient. In this case where both estimators are unbiased, this is the trace of the noise covariance of our estimators. The results in