




Practical Two-Step Look-Ahead Bayesian Optimization

Jian Wu (wujian046), Peter I. Frazier
School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850
(Peter Frazier is also a Staff Data Scientist at Uber.)
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Expected improvement and other acquisition functions widely used in Bayesian optimization use a "one-step" assumption: they value objective function evaluations assuming no future evaluations will be performed. Because we usually evaluate over multiple steps, this assumption may leave substantial room for improvement. Existing theory gives acquisition functions looking multiple steps into the future, but calculating them requires solving a high-dimensional continuous-state, continuous-action Markov decision process (MDP). Fast exact solutions of this MDP remain out of reach of today's methods. As a result, previous two- and multi-step lookahead Bayesian optimization algorithms are either too expensive to implement in most practical settings or resort to heuristics that may fail to fully realize the promise of two-step lookahead. This paper proposes a computationally efficient algorithm that provides an accurate solution to the two-step lookahead Bayesian optimization problem in seconds to at most several minutes of computation per batch of evaluations. The resulting acquisition function provides increased query efficiency and robustness compared with previous two- and multi-step lookahead methods in both single-threaded and batch experiments. This unlocks the value of two-step lookahead in practice. We demonstrate the value of our algorithm with extensive experiments on synthetic test functions and real-world problems.
1 Introduction

We consider minimization of a continuous black-box function $f$ over a hyperrectangle $A \subseteq \mathbb{R}^d$. We suppose evaluations $f(x)$ are time-consuming to obtain, do not provide first- or second-order derivative information, and are noise-free. Such problems arise when tuning hyperparameters of complex machine learning models [Snoek et al., 2012] and optimizing engineering systems using physics-based simulators [Forrester et al., 2008].

We consider this problem within a Bayesian optimization (BayesOpt) framework [Brochu et al., 2010; Frazier, 2018]. BayesOpt methods contain two components: (1) a statistical model over $f$, typically a Gaussian process [Rasmussen and Williams, 2006]; and (2) an acquisition function computed from the statistical model that quantifies the value of evaluating $f$. After a first stage of evaluations of $f$, often at points chosen uniformly at random from $A$, we behave iteratively: we fit the statistical model to all available data; then optimize the resulting acquisition function (which can be evaluated quickly and often provides derivative information) to find the best point(s) at which to evaluate $f$; perform these evaluations; and repeat until our evaluation budget is exhausted.
The most widely used acquisition functions use a one-step lookahead approach. They consider the direct effect of the evaluation on an immediate measure of solution quality, and do not consider evaluations that will be performed later. This includes expected improvement (EI) [Jones et al., 1998], probability of improvement (PI) [Kushner, 1964], entropy search (ES) [Hernández-Lobato et al., 2014; Wang and Jegelka, 2017], and the knowledge gradient (KG) [Wu and Frazier, 2016]. By myopically maximizing the immediate improvement in solution quality, they may sacrifice even greater gains in solution quality obtainable through coordinated action across multiple evaluations.

Researchers have sought to address this shortcoming through non-myopic acquisition functions. The decision of where to sample next in BayesOpt can be formulated as a partially observable Markov decision process (POMDP) [Ginsbourger and Riche, 2010]. The solution to this POMDP is given by the Bellman recursion [Lam et al., 2016] and yields a non-myopic acquisition function with the best possible average-case performance under the prior. However, the "curse of dimensionality" [Powell, 2007] prevents solving this POMDP for even small-scale problems. The past literature [Lam et al., 2016; Osborne et al., 2009; Ginsbourger and Riche, 2010; González et al., 2016] instead approximates the solution to this POMDP to create non-myopic acquisition functions.

Two-step lookahead is particularly attractive [Osborne et al., 2009; Ginsbourger and Riche, 2010; González et al., 2016] because it is substantially easier to compute than looking ahead more than two steps, but still promises a performance improvement over the one-step acquisition functions used in practice. Indeed, Ginsbourger and Riche [2010] argue that two-step lookahead encourages a particularly beneficial form of exploration: evaluating a high-uncertainty region benefits future evaluations; if the evaluation reveals the region was better than expected, then future evaluations sample nearby to find improvements in solution quality. This benefit occurs even if the first evaluation does not generate a direct improvement in solution quality. In numerical experiments, Osborne et al. [2009] and Ginsbourger and Riche [2010] show that two-step lookahead improves over one-step lookahead in a range of practical problems.
At the same time, optimizing two-step acquisition functions is computationally challenging. Unlike common one-step acquisition functions such as expected improvement, they cannot be computed in closed form and instead require a time-consuming simulation with nested optimization. Simulation creates noise and prevents straightforward differentiation, which hampers optimizing these two-step acquisition functions precisely. Existing approaches [Osborne et al., 2009; Ginsbourger and Riche, 2010; González et al., 2016] use derivative-free optimizers, which can require a large number of iterations to optimize precisely, particularly as the dimension $d$ of the feasible space grows. (Numerical experiments in Osborne et al. [2009] and Ginsbourger and Riche [2010] are restricted to problems with $d \le 3$.) As a result, existing two- and multi-step methods require a prohibitive amount of computation (e.g., Lam [2018] reports that the method in Lam et al. [2016] requires between 10 minutes and 1 hour per evaluation even on low-dimensional problems). If sufficient computation is not performed, then errors in the acquisition-function optimization overwhelm the benefits provided by two-step lookahead, and query efficiency degrades compared to a one-step acquisition function supporting precise optimization. Similar challenges arise for the multi-step method proposed in Lam et al. [2016]. This computational challenge has largely prevented the widespread adoption of non-myopic acquisition functions in practice.
Contributions. This article makes three key innovations unlocking the power of two-step lookahead in practice. First, we provide an unbiased estimator, based on the envelope theorem, for the gradient of the two-step lookahead acquisition function, which can be used within a stochastic optimizer. Second, we show how Monte Carlo variance reduction methods can further reduce the computational cost of estimating both the two-step lookahead acquisition function and its gradient. Third, we prove that when our gradient estimator is used within stochastic gradient ascent to optimize the acquisition function, the sequence of iterates converges to a local stationary point of the acquisition function. Together, these innovations support optimizing the acquisition function accurately with computation requiring between a few seconds and several minutes on a single core. Moreover, this computation can be easily parallelized across cores. It also scales better in the batch size and the dimension of the black-box function compared with the common practice of using a derivative-free optimizer. An implementation is available in the Cornell-MOE codebase, https://github.com/wujian16/Cornell-MOE, and the code to replicate our experiments is available at https://
Our approach leverages computational techniques developed in the literature. The first is infinitesimal perturbation analysis [Heidelberger et al., 1988] and the envelope theorem [Milgrom and Segal, 2002], previously used in Bayesian optimization to optimize the knowledge gradient acquisition function (which is myopic, as noted above) by Wu et al. [2017]. This built on earlier work using infinitesimal perturbation analysis without the envelope theorem to optimize the expected improvement acquisition function (also myopic) in the batch setting [Wang et al., 2016]. The second is a pair of variance reduction methods: Gauss-Hermite quadrature [Liu and Pierce, 1994] and importance sampling [Asmussen and Glynn, 2007]. Our paper is the first to demonstrate the power of these techniques for non-myopic Bayesian optimization.

2 The Two-Step Optimal (2-OPT) Acquisition Function

This section defines the two-step lookahead acquisition function. This acquisition function is optimal when there are two stages of measurements remaining, and so we call it 2-OPT. Before defining 2-OPT, we first provide notation and brief background on Gaussian process regression in Sect. 2.1. We then define 2-OPT in Sect. 2.2 and show how to estimate it with Monte Carlo in Sect. 2.3. While 2-OPT has been defined implicitly in past work, we include a complete description to provide a framework and notation supporting our novel efficient method for maximizing it in Sect. 3.

2.1 Gaussian process model for the objective f
We place a Gaussian process (GP) prior on the objective $f$. Although standard, here we briefly describe inference under a GP to provide notation used later. Our GP prior is characterized by a mean function $\mu(\cdot)$ and a kernel function $K(\cdot,\cdot)$. The posterior distribution on $f$ after observing $f$ at data points $D = (x^{(1)},\dots,x^{(m)})$ is a GP with mean function and kernel defined respectively by
$$\mu(x) + K(x,D)K(D,D)^{-1}\bigl(f(D) - \mu(D)\bigr), \qquad K(x,x') - K(x,D)K(D,D)^{-1}K(D,x'). \tag{1}$$
In (1), $f(D) = \bigl(f(x^{(1)}),\dots,f(x^{(m)})\bigr)$, and similarly for $\mu(D)$. The expressions $K(x,D)$, $K(D,x)$, and $K(D,D)$ similarly evaluate to a row vector, a column vector, and a square matrix, respectively.
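To make the notation concrete, here is a minimal NumPy sketch of the posterior update in (1). The squared-exponential kernel, zero prior mean, and jitter term are illustrative assumptions of ours; the paper does not commit to a particular kernel or mean function.

```python
# Minimal NumPy sketch of the GP posterior update in (1). The squared-exponential
# kernel, zero prior mean, and jitter term are illustrative assumptions only.
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """K(A, B) for a squared-exponential kernel; A is (n, d), B is (m, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def posterior(x, D, f_D, kernel=sq_exp_kernel, mu=lambda A: np.zeros(len(A))):
    """Posterior mean and covariance at points x given noise-free observations f(D),
    following (1): mu(x) + K(x,D) K(D,D)^{-1} (f(D) - mu(D)) and
    K(x,x') - K(x,D) K(D,D)^{-1} K(D,x')."""
    K_DD = kernel(D, D) + 1e-10 * np.eye(len(D))   # jitter for numerical stability
    K_xD = kernel(x, D)
    alpha = np.linalg.solve(K_DD, f_D - mu(D))
    mean = mu(x) + K_xD @ alpha
    cov = kernel(x, x) - K_xD @ np.linalg.solve(K_DD, K_xD.T)
    return mean, cov

# Toy usage: condition on three noise-free observations of a 1-d function.
D = np.array([[0.0], [0.5], [1.0]])
f_D = np.sin(3.0 * D).ravel()
x = np.linspace(0.0, 1.0, 5)[:, None]
mean, cov = posterior(x, D, f_D)
```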
2.2 Two-step lookahead acquisition function

Here we define the 2-OPT acquisition function from a theoretical (but not yet computational) perspective. This formulation follows previous work on two-step and multi-step acquisition functions [Lam et al., 2016; Osborne et al., 2009; Ginsbourger and Riche, 2010; González et al., 2016]. 2-OPT gives optimal average-case behavior when we have two stages of evaluations remaining, and the second stage of evaluations may be chosen based on the results from the first. To support batch evaluations while maintaining computational tractability, our first stage of evaluations uses a batch of $q \ge 1$ simultaneous evaluations, while the second stage uses a single evaluation.

Throughout, we assume that we have already observed a collection of data points $D$, so that the current posterior distribution is a GP with mean function $\mu_0$ and kernel $K_0$ given by (1), and we use $E_0$ to indicate the expectation taken with respect to this distribution. We let $f^*_0 = \min f(D)$ be the best value observed thus far. We index quantities associated with the first stage of evaluations by 1 and the second by 2. We let $X_1$ indicate the set of $q$ points to be evaluated in the first stage. We let $f(X_1) = (f(x) : x \in X_1)$ indicate the corresponding vector of observed values and let $\min f(X_1)$ be the smallest value in this vector. We let $x_2$ indicate the single point evaluated in the second stage. For each $i = 1, 2$, we define $f^*_i$ to be the smallest value observed by the end of stage $i$, so $f^*_1 = \min(f^*_0, \min f(X_1))$ and $f^*_2 = \min(f^*_1, f(x_2))$. We let $\mu_i$ be the mean function and $K_i$ the kernel of the posterior distribution given $D$ and the observations available at the end of stage $i$. We let $E_i$ indicate the expectation taken with respect to the corresponding Gaussian process. The overall loss whose expected value we seek to minimize is $f^*_2$.
Figure 1: We demonstrate 2-OPT and EI minimizing a 1-d synthetic function sampled from a GP. Each row shows the posterior on $f$ (mean $\pm$ one standard deviation) and the corresponding acquisition function, for EI (left) and 2-OPT (right). We plot progress over three iterations. On the first iteration, EI evaluates a point that refines an existing local optimum and could have provided a small one-step improvement, but provides little information of use in future evaluations. In contrast, 2-OPT explores more aggressively, which helps it identify a new global minimum in the next iteration.

To find the optimal sampling strategy, we follow the dynamic programming principle. We first write the expected loss achievable at the end of the second stage, conditioned on the selection of points $X_1$ and results $f(X_1)$ from the first stage. If we choose the final evaluation optimally, then this expected loss is $L_1 = \min_{x_2} E_1[f^*_2]$. The stage-1 posterior, and thus also $L_1$, depends on $X_1$ and $f(X_1)$. Following the derivation of the expected improvement [Jones et al., 1998], we rewrite this as
$$L_1 = \min_{x_2} E_1\bigl[f^*_1 - (f^*_1 - f(x_2))^+\bigr] = f^*_1 - \max_{x_2} E_1\bigl[(f^*_1 - f(x_2))^+\bigr] = f^*_1 - \max_{x_2} \mathrm{EI}_1(x_2),$$
where $y^+ = \max(y,0)$ is the positive part function and $\mathrm{EI}_1(x)$ is the expected improvement under the GP after the first stage of evaluations has been performed:
$$\mathrm{EI}_1(x) = \mathrm{EI}\bigl(f^*_1 - \mu_1(x),\, K_1(x,x)\bigr). \tag{2}$$
Here $\mathrm{EI}(m,v) = m\,\Phi(m/\sqrt{v}) + \sqrt{v}\,\varphi(m/\sqrt{v})$ gives the expected improvement at a point where the difference between the best observed value and the mean is $m$ and the variance is $v$; $\Phi$ is the standard normal cdf and $\varphi$ is the standard normal pdf.
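Since this closed form is used repeatedly below, here is a small SciPy sketch of $\mathrm{EI}(m,v)$; it is our own illustration rather than code from the paper or the Cornell-MOE codebase.

```python
# EI(m, v) = m * Phi(m / sqrt(v)) + sqrt(v) * phi(m / sqrt(v)), where m is the
# difference between the best observed value and the mean and v is the variance.
# This is our own illustration, not code from the Cornell-MOE implementation.
import numpy as np
from scipy.stats import norm

def expected_improvement(m, v):
    s = np.sqrt(v)
    if s == 0.0:                 # degenerate case: no posterior uncertainty left
        return max(m, 0.0)
    z = m / s
    return m * norm.cdf(z) + s * norm.pdf(z)

# Example: best-observed-minus-mean of 0.3 with posterior variance 0.04.
print(expected_improvement(0.3, 0.04))
```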
With this expression for the value achievable at the start of the second stage, the expected loss achieved at the start of the first stage is:
$$E_0[L_1] = E_0\Bigl[f^*_1 - \max_{x_2}\mathrm{EI}_1(x_2)\Bigr] = E_0\Bigl[f^*_0 - \bigl(f^*_0 - \min f(X_1)\bigr)^+ - \max_{x_2}\mathrm{EI}_1(x_2)\Bigr] = f^*_0 - \mathrm{EI}_0(X_1) - E_0\Bigl[\max_{x_2}\mathrm{EI}_1(x_2)\Bigr], \tag{3}$$
where $\mathrm{EI}_0(X_1) = E_0\bigl[(f^*_0 - \min f(X_1))^+\bigr]$ is the multi-points expected improvement [Ginsbourger et al., 2010] under the GP with mean $\mu_0$ and kernel $K_0$. We define our two-step acquisition function to be
$$\text{2-OPT}(X_1) = \mathrm{EI}_0(X_1) + E_0\Bigl[\max_{x_2}\mathrm{EI}_1(x_2)\Bigr]. \tag{4}$$
Because $f^*_0$ does not depend on the set of points $X_1$ we evaluate next, finding the $X_1$ that minimizes (3) is equivalent to finding the $X_1$ that maximizes 2-OPT. In the definition of 2-OPT, we emphasize that $E_0[\max_{x_2}\mathrm{EI}_1(x_2)]$ depends on $X_1$ through the fact that $f(X_1)$ influences the mean function $\mu_1$ and kernel $K_1$ from which $\mathrm{EI}_1$ is computed.
Figure 1 illustrates the behavior of the two-step acquisition function, showing how it provides more exploration than EI.

2.3 Monte Carlo estimation of 2-OPT(·)

$\text{2-OPT}(X_1)$ cannot be computed in closed form. We can, however, estimate it using Monte Carlo. We first use the reparameterization trick [Wilson et al., 2018] to write $f(X_1)$ as $\mu_0(X_1) + C_0(X_1)Z$, where $Z$ is a $q$-dimensional independent standard normal random variable and $C_0(X_1)$ is the Cholesky decomposition of $K_0(X_1,X_1)$. We assume that $K_0(X_1,X_1)$ is positive definite, so $C_0(X_1)$ is of full rank. Then, under (1), for generic $x$,
$$\begin{aligned}
\mu_1(x) &= \mu_0(x) + K_0(x,X_1)K_0(X_1,X_1)^{-1}\bigl(f(X_1) - \mu_0(X_1)\bigr) \\
&= \mu_0(x) + K_0(x,X_1)\bigl(C_0(X_1)C_0(X_1)^\top\bigr)^{-1}C_0(X_1)Z \\
&= \mu_0(x) + \tilde\sigma_0(x,X_1)\,Z, \\
K_1(x,x) &= K_0(x,x) - K_0(x,X_1)K_0(X_1,X_1)^{-1}K_0(X_1,x) \\
&= K_0(x,x) - K_0(x,X_1)\bigl(C_0(X_1)C_0(X_1)^\top\bigr)^{-1}K_0(X_1,x) \\
&= K_0(x,x) - \tilde\sigma_0(x,X_1)\,\tilde\sigma_0(x,X_1)^\top,
\end{aligned}$$
where $\tilde\sigma_0(x,X_1) = K_0(x,X_1)\bigl(C_0(X_1)^\top\bigr)^{-1}$.
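As a concreteness check on this reparameterization, the following short sketch (our own, with hypothetical argument names) computes the fantasized first-stage values and the implied $\mu_1(x)$ and $K_1(x,x)$ from the Cholesky factor $C_0(X_1)$ and a draw of $Z$ using triangular solves.

```python
# Reparameterized stage-1 posterior at a single point x (illustrative names):
#   f(X1)   = mu0(X1) + C0 Z,
#   mu1(x)  = mu0(x) + sigma0_tilde(x, X1) Z,
#   K1(x,x) = K0(x,x) - sigma0_tilde(x, X1) sigma0_tilde(x, X1)^T,
# with sigma0_tilde(x, X1) = K0(x, X1) (C0^T)^{-1}, computed via triangular solves.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def fantasy_posterior(mu0_x, k0_xx, k0_X1x, mu0_X1, K0_X1X1, Z):
    """Fantasized f(X1), mu1(x), and K1(x, x) for one draw Z ~ N(0, I_q)."""
    C0 = cholesky(K0_X1X1, lower=True)             # K0(X1, X1) = C0 C0^T
    f_X1 = mu0_X1 + C0 @ Z                         # reparameterized first-stage values
    v = solve_triangular(C0, k0_X1x, lower=True)   # v = C0^{-1} K0(X1, x); sigma0_tilde = v^T
    mu1_x = mu0_x + v @ Z                          # sigma0_tilde Z
    K1_xx = k0_xx - v @ v                          # K0(x,x) - sigma0_tilde sigma0_tilde^T
    return f_X1, mu1_x, K1_xx
```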
With this notation, we can write $\mathrm{EI}_1(x_2)$ explicitly as
$$\mathrm{EI}_1(x_2) = \mathrm{EI}\bigl(f^*_1 - \mu_0(x_2) - \tilde\sigma_0(x_2,X_1)Z,\;\, K_0(x_2,x_2) - \tilde\sigma_0(x_2,X_1)\tilde\sigma_0(x_2,X_1)^\top\bigr) =: \Lambda(X_1,x_2,Z),$$
where we have defined a new function $\Lambda$ for later convenience. Thus, we can rewrite the 2-OPT acquisition function as $\text{2-OPT}(X_1) = E_0\bigl[\widehat{\text{2-OPT}}(X_1,Z)\bigr]$, where
$$\widehat{\text{2-OPT}}(X_1,Z) = \bigl(f^*_0 - \min f(X_1)\bigr)^+ + \max_{x_2}\mathrm{EI}_1(x_2) = \max\bigl(f^*_0 - \mu_0(X_1) - C_0(X_1)Z\bigr)^+ + \max_{x_2}\Lambda(X_1,x_2,Z),$$
and $\max(y)^+$ is the largest non-negative component of $y$, or $0$ if all components are negative. Then, to provide an unbiased estimate of $\text{2-OPT}(X_1)$, we sample $Z$ and compute $\widehat{\text{2-OPT}}(X_1,Z)$, using a nonlinear global optimization routine to calculate the inner maximization. Averaging many such replications provides a strongly consistent estimate of $\text{2-OPT}(X_1)$.
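A sketch of this Monte Carlo estimator is below. The posterior callables mu0 and k0, the random-restart inner optimizer, and all function and parameter names are our own illustrative assumptions; this is a sketch of the recipe above, not the paper's Cornell-MOE implementation.

```python
# Monte Carlo estimator of 2-OPT(X1) following the recipe above. The posterior
# callables mu0/k0, the random-restart inner optimizer, and all names are our own
# illustrative choices; this is a sketch, not the paper's Cornell-MOE implementation.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def _ei(m, v):
    """Closed-form EI(m, v) with a small floor on v for numerical safety."""
    s = np.sqrt(max(v, 1e-16))
    z = m / s
    return m * norm.cdf(z) + s * norm.pdf(z)

def two_opt_mc(X1, mu0, k0, f0_star, bounds, n_z=64, n_restarts=5, seed=0):
    """Average of hat{2-OPT}(X1, Z) over n_z draws of Z ~ N(0, I_q).

    mu0(A) -> (n,) mean vector; k0(A, B) -> (n, m) kernel matrix;
    f0_star: best value observed so far; bounds: (d, 2) array describing A.
    """
    rng = np.random.default_rng(seed)
    q = X1.shape[0]
    C0 = np.linalg.cholesky(k0(X1, X1) + 1e-10 * np.eye(q))
    mu0_X1 = mu0(X1)
    total = 0.0
    for _ in range(n_z):
        Z = rng.standard_normal(q)
        f_X1 = mu0_X1 + C0 @ Z                      # fantasized first-stage values
        term1 = max(f0_star - f_X1.min(), 0.0)      # (f*_0 - min f(X1))^+
        f1_star = min(f0_star, f_X1.min())

        def neg_ei1(x2):                            # -EI1(x2) under the fantasized posterior
            x2 = np.atleast_2d(x2)
            v = np.linalg.solve(C0, k0(X1, x2).ravel())   # C0^{-1} K0(X1, x2)
            mu1 = mu0(x2)[0] + v @ Z
            K1 = k0(x2, x2)[0, 0] - v @ v
            return -_ei(f1_star - mu1, K1)

        # Inner maximization over x2 with a few random restarts (a stand-in for a
        # nonlinear global optimization routine).
        best = 0.0
        for _ in range(n_restarts):
            x_start = rng.uniform(bounds[:, 0], bounds[:, 1])
            best = max(best, -minimize(neg_ei1, x_start, bounds=bounds).fun)
        total += term1 + best
    return total / n_z
```

Here mu0 and k0 play the role of the current posterior mean $\mu_0$ and kernel $K_0$ from Sect. 2.1, and averaging over independent draws of $Z$ gives the strongly consistent estimate described above.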
Previous approaches [Osborne et al., 2009; Ginsbourger and Riche, 2010; González et al., 2016] use this or a similar simulation method to obtain an estimator of 2-OPT, and then use this estimator within a derivative-free optimization approach. This requires extensive computation because:

1. The nested optimization over $x_2$ is time-consuming and must be done for each simulation.
2. Noise in the simulation requires either a noise-tolerant derivative-free optimization method, which would typically require more iterations, or that the simulation be averaged over enough replications on each iteration to make noise negligible. This increases the number of simulations required to optimize accurately.
3. It does not leverage derivative information, causing optimization to require more iterations, especially as the dimension $d$ of the search space or the batch size $q$ grows.
3 Efficiently Optimizing 2-OPT

Here we describe a novel computational approach to optimizing 2-OPT that is substantially more efficient than previously proposed methods. Our approach includes two components: a novel simulation-based stochastic gradient estimator, which can be used within multistart stochastic gradient ascent; and variance reduction techniques that reduce the variance of this stochastic gradient estimator.

3.1 Unbiased Estimation of the Gradient of 2-OPT

We now show how to obtain an unbiased and strongly consistent estimator of the gradient of 2-OPT. The main idea is to exchange the expectation and gradient operators:
$$\begin{aligned}
\nabla\,\text{2-OPT}(X_1) &= \nabla E_0\Bigl[\widehat{\text{2-OPT}}(X_1,Z)\Bigr] = \nabla E_0\Bigl[\max\bigl(f^*_0 - \mu_0(X_1) - C_0(X_1)Z\bigr)^+ + \max_{x_2}\Lambda(X_1,x_2,Z)\Bigr] \\
&= E_0\Bigl[\nabla\Bigl(\max\bigl(f^*_0 - \mu_0(X_1) - C_0(X_1)Z\bigr)^+ + \Lambda(X_1,x^*_2,Z)\Bigr)\Bigr],
\end{aligned}$$
where $x^*_2 \in \arg\max_{x_2 \in A}\Lambda(X_1,x_2,Z)$ is fixed and the last equality follows, under some regularity conditions, from the envelope theorem [Milgrom and Segal, 2002].
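To see why fixing $x^*_2$ is enough, the toy sketch below (our own illustration with a made-up smooth stand-in for $\Lambda$, using PyTorch autodiff; it is not the paper's estimator) checks numerically that differentiating $\Lambda$ at the fixed inner maximizer reproduces the gradient of $\max_{x_2}\Lambda$, which is exactly the step the envelope theorem licenses.

```python
# Toy check (ours, not the paper's code) of the envelope-theorem step: for
# g(X1) = max_{x2} Lambda(X1, x2, Z), the gradient with respect to X1 equals the
# partial gradient of Lambda at the fixed maximizer x2*, so a single backward pass
# through a one-sample estimate yields the stochastic gradient.
import torch

def lam(x1, x2, z):
    """A smooth, made-up stand-in for Lambda(X1, x2, Z); purely illustrative."""
    return torch.exp(-((x2 - x1) ** 2)) * (1.0 + 0.5 * z * torch.sin(x1 + x2))

x1 = torch.tensor(0.3, requires_grad=True)
z = torch.tensor(0.7)

# Inner maximization over a dense grid, done without tracking gradients.
with torch.no_grad():
    grid = torch.linspace(-2.0, 2.0, 4001)
    x2_star = grid[lam(x1, grid, z).argmax()]

# Envelope theorem: differentiate Lambda at the fixed x2_star.
lam(x1, x2_star, z).backward()
envelope_grad = x1.grad.item()

# Finite-difference check on g(X1) = max_{x2} Lambda(X1, x2, Z).
def g(v):
    with torch.no_grad():
        return lam(torch.tensor(v), grid, z).max().item()

eps = 1e-4
fd_grad = (g(0.3 + eps) - g(0.3 - eps)) / (2 * eps)
print(envelope_grad, fd_grad)   # the two should agree closely
```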