Accurate Uncertainty Estimation and Decomposition in Ensemble Learning
Jeremiah Zhe Liu, Google Research

A model's predictive uncertainty arises from three sources: the aleatoric uncertainty inherent in the data-generating mechanism, the parametric uncertainty about the model parameters, and the structural uncertainty that reflects the uncertainty about whether a given model specification is sufficient for describing the data, i.e., whether there exists a systematic discrepancy between the model CDF $F(y|x, \omega)$ and the data-generating distribution $F(y|x)$. The goal of uncertainty estimation is to properly characterize both a model's aleatoric and epistemic uncertainties [24, 42]. In regions that are well represented by the training data, a model's aleatoric uncertainty should accurately estimate the data-generating distribution by flexibly capturing the stochastic pattern in the data (i.e., calibration [19]), while in regions unexplored by the training data, the model's epistemic uncertainty should increase to capture the model's lack of confidence in the resulting predictions (i.e., out-of-distribution generalization [24]). Within the epistemic uncertainty, the structural uncertainty needs to be estimated to identify the sources of structural biases in the ensemble model, and to quantify how these structural biases may impact the model output, both of which are necessary for the continuous model validation and refinement of a running ensemble system [40, 34].

A comprehensive framework for quantifying these three types of uncertainty is currently lacking in the ensemble learning literature. We refer readers to Supplementary Section A for a full review of how our work relates to the existing literature. Briefly, existing methods typically handle the aleatoric uncertainty using an assumed distribution family (e.g., Gaussian) [24, 48] that may not capture the stochastic patterns in the data (e.g., asymmetry, heavy-tailedness, multimodality, or their combinations). Work exists on quantifying epistemic uncertainty, although ensemble methods mainly work with collections of base models of the same class, and usually do not explicitly characterize the model's structural uncertainty [6, 9, 10, 50, 24, 51, 27]. In this work, we develop an ensemble model that addresses all three sources of predictive uncertainty.

Our specific contributions are: 1) We propose the Bayesian Nonparametric Ensemble (BNE), an augmentation framework that mitigates misspecification in the original ensemble model and flexibly quantifies all three sources of predictive uncertainty (Section 2). 2) We establish BNE's model properties in uncertainty characterization, including its theoretical guarantee with respect to consistent estimation of aleatoric uncertainty, and its ability to decompose different sources of epistemic uncertainty (Section 3). 3) We demonstrate through experiments that the proposed method achieves accurate uncertainty estimation under complex observational noise and improves predictive accuracy (Section 4), and we illustrate our method by predicting ambient fine-particle pollution in Eastern Massachusetts, USA, ensembling three different existing prediction models developed by multiple research groups (Section 5).
2 Bayesian Nonparametric Ensemble

In this section, we introduce the Bayesian Nonparametric Ensemble (BNE), an augmentation framework for ensemble learning. We focus on the application of BNE to regression tasks. Given an ensemble model, BNE mitigates the original model's misspecification in the prediction function and in the distribution function using Bayesian nonparametric machinery. As a result, BNE enables an ensemble to flexibly quantify the aleatoric uncertainty in the data, and to account for both the parametric and the structural uncertainties.

We build the full BNE model by starting from the classic ensemble model. Denote by $F(y|x)$ the CDF of the data-generating distribution for a continuous outcome. Given an observation pair $(x, y) \in \mathbb{R}^p \times \mathbb{R}$ where $y \sim F(y|x)$ and a set of base model predictors $\{f_k\}_{k=1}^K$, a classic ensemble model assumes the form

$$Y = \sum_{k=1}^K f_k(x)\,\omega_k + \epsilon, \qquad (1)$$

where $\omega = \{\omega_k\}_{k=1}^K$ are the ensemble weights assigned to each base model, and $\epsilon$ is a random variable describing the distribution of the outcome. For simplicity of exposition, in the rest of this section we assume $\omega$ and $\epsilon$ follow independent Gaussian priors, which corresponds to a classic stacking model assuming a Gaussian outcome [10].
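
To make (1) concrete, below is a minimal sketch of a stacking-style ensemble in this Gaussian setting. The base models, data, and variance values are hypothetical stand-ins, and the weights are computed as the conjugate Gaussian posterior mean (a ridge-regression form) rather than by full posterior sampling.

```python
import numpy as np

# Hypothetical base predictors f_1, f_2, f_3 standing in for expert-built models.
base_models = [np.sin, np.cos, lambda x: 0.1 * x]

def ensemble_features(x):
    """Stack base-model predictions f_k(x) into an (n, K) design matrix."""
    return np.column_stack([f(x) for f in base_models])

# Toy training data y ~ F(y|x).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=100)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(100)

# Conjugate posterior mean for omega under omega ~ N(0, s_w^2 I), eps ~ N(0, s_e^2):
# omega_hat = (F'F + (s_e^2 / s_w^2) I)^{-1} F'y.
F_mat = ensemble_features(x_train)
s_e2, s_w2 = 0.3**2, 1.0
K = F_mat.shape[1]
omega_hat = np.linalg.solve(
    F_mat.T @ F_mat + (s_e2 / s_w2) * np.eye(K),
    F_mat.T @ y_train,
)

# Ensemble prediction sum_k f_k(x) * omega_k at new locations.
x_test = np.linspace(-3, 3, 5)
y_pred = ensemble_features(x_test) @ omega_hat
print(omega_hat, y_pred)
```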

In practice, given a set of predictors $\{f_k\}_{k=1}^K$ built by domain experts, a practitioner needs to first specify a distribution family for $\epsilon$ (e.g., Gaussian, such that $\epsilon \sim N(0, \sigma_\epsilon^2)$), then estimate $\omega$ and $\sigma_\epsilon$ using the collected data. During this process, two types of model bias can arise: bias in the prediction function $\mu = \sum_{k=1}^K f_k(x)\,\omega_k$ caused by the systematic bias shared among all the base predictors $f_k$; and bias in the distribution specification caused by assuming a distribution family for $\epsilon$ that fails to capture the stochastic pattern in the data, producing inaccurate estimates of aleatoric uncertainty.

BNE mitigates these two types of biases that exist in (1) using Bayesian nonparametric machinery.

Mitigate prediction bias using residual process   To mitigate the model's structural bias in prediction, BNE first adds to (1) a flexible residual process $\delta(x)$, so the ensemble becomes a semiparametric model [11, 39]:

$$Y = \sum_{k=1}^K f_k(x)\,\omega_k + \delta(x) + \epsilon. \qquad (2)$$

In this work, we model $\delta(x)$ nonparametrically using a Gaussian process (GP) with zero mean function $\mu_0(x) = 0$ and kernel function $k_\delta(x, x')$. The residual process $\delta(x)$ adds flexibility to the model's mean function $E(Y|x)$, and domain experts can select a flexible kernel for $\delta$ to best approximate the data-generating function of interest (e.g., an RBF kernel to approximate arbitrary continuous functions over a compact support [33]). As a result, in densely-sampled regions that are well captured by the training data, $\delta(x)$ will confidently mitigate the prediction bias between the observation $y$ and the prediction function $\sum_{k=1}^K f_k(x)\,\omega_k$. However, in sparsely-sampled regions, the posterior mean of $\delta(x)$ will be shrunk back towards $\mu_0(x) = 0$, so as to leave the predictions of the original ensemble (1) intact (since these expert-built base models presumably have been specially designed for the problem being considered), and the posterior uncertainty of $\delta(x)$ will be larger, reflecting the model's increased structural uncertainty in its prediction function at location $x$.
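
This shrinkage behavior follows from the standard GP posterior formulas applied to the ensemble residuals $y - \sum_k f_k(x)\,\omega_k$. A minimal sketch, assuming a hypothetical RBF kernel, noise level, and toy residuals rather than the paper's configuration:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Shift-invariant RBF kernel k(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def residual_posterior(x_train, resid, x_new, noise_var=0.1):
    """Posterior mean/variance of the GP residual process delta(x),
    fit to the residuals y - sum_k f_k(x) omega_k."""
    K_nn = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_sn = rbf_kernel(x_new, x_train)
    K_ss = rbf_kernel(x_new, x_new)
    alpha = np.linalg.solve(K_nn, resid)
    mean = K_sn @ alpha                       # shrinks to the prior mean 0 far from data
    cov = K_ss - K_sn @ np.linalg.solve(K_nn, K_sn.T)
    return mean, np.diag(cov)                 # variance grows away from the data

x_train = np.random.default_rng(1).uniform(-1, 1, size=50)
resid = 0.5 * x_train**2 - 0.1                # toy systematic bias of the ensemble
x_new = np.array([0.0, 5.0])                  # in-sample vs. far out-of-sample
mean, var = residual_posterior(x_train, resid, x_new)
print(mean, var)  # delta(5.0) ~ 0 with near-prior variance
```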

We recommend selecting $k_\delta$ from the shift-invariant kernel family $k(x, x') = g(x - x')$. Shift-invariant kernels are well suited for characterizing a model's epistemic uncertainty, since the resulting predictive variances are explicitly characterized by the distance from the training data, which yields predictive uncertainty that increases as the prediction location of interest moves farther away from the data [36]. We write the model CDF of (2) as $\Phi(y|x, \omega, \delta)$. In the case $\epsilon \sim N(0, \sigma_\epsilon^2)$, $\Phi$ is a Gaussian CDF with mean $\sum_{k=1}^K f_k(x)\,\omega_k + \delta(x)$ and variance $\sigma_\epsilon^2$. Notice that since $\delta(x)$ is a Gaussian process, (2) specifies $Y$ as a hierarchical Gaussian process with mean function $\sum_{k=1}^K f_k(x)\,\omega_k$ and kernel function $k_\delta(x, x') + \sigma_\epsilon^2$.

Mitigate distribution bias using calibration function G   Although flexible in its mean prediction, the model in (2) can still be restrictive in its distributional assumptions. That is, at a given location $x \in \mathbb{R}^p$, because the model corresponds to a Gaussian process specification for $Y$, the posterior of (2) still follows a Gaussian distribution [36]. Consequently, when the data distribution is multi-modal, non-symmetric, or heavy-tailed, the model in (2) can still fail to capture the underlying data-generating distribution $F(y|x)$, resulting in a systematic discrepancy between $\Phi(y|x, \omega, \delta)$ and $F(y|x)$. To mitigate this bias in the specification of the data distribution, BNE further augments $\Phi(y|x, \omega, \delta)$ by using a nonparametric function $G$ to calibrate the model's distributional assumption against the observed data $z = (y, x)$, i.e., BNE models its CDF as $F(y|x, \omega, \delta) = G\big(\Phi(y|x, \omega, \delta)\big)$. As a result, the full BNE model CDF is a flexible nonparametric function capable of modeling a wide range of complex distributions. In this work, we model $G$ using a Gaussian process with identity mean function $I(x) = x$ and kernel function $k_G$, and we impose probit-based likelihood constraints on $G$ so it respects the mathematical properties of a CDF (i.e., monotonic and bounded between $[0, 1]$; see Section B for detail). As a result, the full BNE model CDF follows a constrained Gaussian process (CGP) [29, 30, 38]:

$$F(y|x, \omega, \delta) \sim \mathrm{CGP}\big(\Phi(y|x, \omega, \delta),\; k_G(z, z')\big), \qquad (3)$$

where $z = (y, x)$. In this work, we set $k_G$ to the Matérn 3/2 kernel $k_{\text{Matérn }3/2}(d) = (1 + \sqrt{3}\,d/l)\exp(-\sqrt{3}\,d/l)$, where $d = \|x - x'\|_2$.
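
The Matérn 3/2 kernel translates directly into code. A small self-contained version, assuming Euclidean distance between generic vector inputs and an illustrative unit length scale:

```python
import numpy as np

def matern32(z1, z2, lengthscale=1.0):
    """Matern 3/2 kernel: k(d) = (1 + sqrt(3) d / l) * exp(-sqrt(3) d / l),
    where d = ||z - z'||_2. Inputs are (n, p) and (m, p) arrays."""
    d = np.linalg.norm(z1[:, None, :] - z2[None, :, :], axis=-1)
    s = np.sqrt(3.0) * d / lengthscale
    return (1.0 + s) * np.exp(-s)

z = np.array([[0.0, 0.0], [1.0, 2.0]])
print(matern32(z, z))  # 1s on the diagonal, decaying off-diagonal
```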

The sample space of a Matérn 3/2 Gaussian process corresponds to the space of Hölder-continuous functions that are at least once differentiable, allowing $F$ to flexibly model the space of (Lipschitz) continuous CDFs $F(y|x)$ whose probability density functions (PDFs) exist [46]. Consequently, in regions well represented by the training data, the BNE model CDF will flexibly capture the complex patterns in the data distribution. In regions outside the training data, the BNE model CDF will fall back to $\Phi(y|x, \omega, \delta)$, not interfering with the generalization behavior of the original ensemble model. Additionally, the posterior uncertainty in (3) will reflect the model's additional structural uncertainty with respect to its distribution specification.

Figure 2: Illustrative example of the impact of $G$ on the model's posterior predictive distribution. Dashed line: true distribution $F(y|x)$; black ticks: observations; red shade: predictive density of $\sum_k f_k \omega_k$; grey shade: predictive density of $\Phi(y|x, \omega, \delta)$ (Gaussian assumption); blue shade: predictive density of $G(\Phi(y|x, \omega, \delta))$ (nonparametric noise correction).

To further illustrate the role $G$ plays in the BNE's ability to flexibly characterize an outcome distribution, we consider an illustrative example where we run the BNE model both with and without $G$ to predict $y$ at a fixed location $x$ (i.e., estimating the conditional distribution $F(y|x)$ at a fixed location $x$), where $y|x \sim \mathrm{Gamma}(1.5, 2)$ (Figure 2). As shown, the posterior distribution of $\Phi(y|x, \omega, \delta)$ (grey shade) fails to capture the skewness in the data's empirical distribution, and consequently yields a biased maximum a posteriori (MAP) estimate due to its restrictive distributional assumptions. On the other hand, the full BNE model $F(y|x, \omega, \delta) = G\big(\Phi(y|x, \omega, \delta)\big)$ is able to calibrate its predictive distribution (blue shade) toward the data distribution using $G$, and consequently produces an improved characterization of $F(y|x)$ and an improved MAP estimate.
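
The mechanism behind this correction can be imitated with a much simpler, non-Bayesian stand-in for $G$: recalibrating a misspecified Gaussian CDF through the empirical CDF of its probability integral transform (PIT) values. This is quantile recalibration, not the paper's CGP construction, but it shows why composing a flexible monotone $G$ with $\Phi$ recovers a skewed distribution such as the Gamma above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.gamma(shape=1.5, scale=1 / 2, size=2000)   # y|x ~ Gamma(1.5, rate=2)

# Misspecified Gaussian model Phi(y): right mean/variance, wrong shape.
phi = stats.norm(loc=y.mean(), scale=y.std())

# Stand-in calibration G: empirical CDF of the PIT values u_i = Phi(y_i).
# If Phi were correct, the u_i would be uniform and G would be the identity.
u = np.sort(phi.cdf(y))

def G(p):
    return np.searchsorted(u, p) / len(u)

# The calibrated CDF G(Phi(y)) tracks the true Gamma CDF far better.
grid = np.linspace(0.01, 3.0, 5)
print(stats.gamma(1.5, scale=1 / 2).cdf(grid))  # truth
print(phi.cdf(grid))                             # Gaussian alone
print(G(phi.cdf(grid)))                          # recalibrated
```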

Model Summary   To recap, given a classic ensemble model (1), BNE nonparametrically augments the model's prediction function with a residual process $\delta$, and augments the model's distribution function with a calibration function $G$. Specifically, for data $y|x$ generated from the distribution $F(y|x)$, the full BNE assumes the following model:

$$F(y|x) = G\big(\Phi(y|x, \omega, \delta)\big), \qquad \mu = \sum_{k=1}^K f_k(x)\,\omega_k + \delta(x). \qquad (4)$$

The priors are defined to be

$$G \sim \mathrm{CGP}(I, k_G), \qquad \delta \sim \mathrm{GP}(0, k_\delta), \qquad \epsilon \sim N(0, \sigma_\epsilon^2 I),$$

where $k_G$ is the Matérn 3/2 kernel, and $k_\delta$ is a shift-invariant kernel to be chosen by the domain expert (we set it to Matérn 3/2 in this work). The zero-mean GP ensures the ensemble bias term $\delta$ reverts to zero out of sample, while the identity-mean GP allows the noise process to be white Gaussian noise out of sample. In other words, this prior structure allows BNE to flexibly capture the data distribution where data exist, and revert to the classic ensemble otherwise. BNE's hyperparameters are the Matérn length-scale parameters $l_\delta$ and $l_G$, and the prior variances $\sigma_\delta$ and $\sigma_\epsilon$. Consistent with existing GP approaches, we place inverse-Gamma priors on $l_\delta$ and $l_G$ and Half-Normal priors on $\sigma_\delta$ and $\sigma_\epsilon$ [43].

Posterior sampling is performed using Hamiltonian Monte Carlo (HMC) [2], for which we pre-orthogonalize the kernel matrices with respect to their mean functions to avoid parameter non-identifiability [31, 37]. The time complexity for sampling from the BNE posterior is $O(N^3)$ due to the need to invert the $N \times N$ kernel matrices. For large datasets, we can consider the parallel MCMC scheme proposed in [26], which partitions the data into $K$ subsets and estimates the predictive intervals with reduced complexity $O(N^3/K^2)$ (each subset requires inverting only an $(N/K) \times (N/K)$ matrix, so the $K$ inversions together cost $K \cdot O((N/K)^3) = O(N^3/K^2)$). Section C describes posterior inference in further detail.

3 Characterizing Model Uncertainties with BNE

3.1 Mitigating Model Bias under Uncertainty

In this section we study the contribution of BNE's model components to an ensemble's prediction and predictive uncertainty estimation. For a model with predictive CDF $F(y|x)$, we notice that the model's predictive behavior is completely characterized by $F(y|x)$: the model's predictive mean can be expressed as $E(y|x) = \int_{y \in \mathbb{R}} \big[\,\mathbb{I}(y \geq 0) - F(y|x)\,\big]\,dy$, and the model's $100q\%$ predictive interval can be expressed as $U_q(y|x) = \big[\,F^{-1}(\tfrac{1-q}{2}\,|\,x),\; F^{-1}(\tfrac{1+q}{2}\,|\,x)\,\big]$ [12]. Consequently, BNE improves upon an ensemble model's prediction and uncertainty estimation by building a flexible model for $F$ that better captures the data-generating $F(y|x)$.
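
Both identities are easy to verify numerically. The sketch below checks them for an arbitrary Gaussian using plain grid integration; the grid width and the distribution are illustrative choices only:

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=1.3, scale=0.7)   # an arbitrary predictive CDF F(y|x)
y = np.linspace(-20.0, 20.0, 200001)
dy = y[1] - y[0]

# Predictive-mean identity: E(y|x) = integral of [I(y >= 0) - F(y|x)] dy.
integrand = (y >= 0).astype(float) - dist.cdf(y)
mean_from_cdf = np.sum(integrand) * dy
print(mean_from_cdf, dist.mean())       # both ~1.3

# 100q% predictive interval from the inverse CDF: [F^{-1}((1-q)/2), F^{-1}((1+q)/2)].
q = 0.95
print(dist.ppf([(1 - q) / 2, (1 + q) / 2]))   # ~1.3 -/+ 1.96 * 0.7
```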

Bias Correction for Prediction and Uncertainty Estimation   We can express the predictive mean of BNE as:

$$E(y|x, \omega, \delta, G) = \sum_{k=1}^K f_k(x)\,\omega_k \;+\; \underbrace{\delta(x)}_{\text{due to } \delta} \;+\; \underbrace{\int_{y \in \mathcal{Y}} \Big[\, \Phi(y|x, \omega, \delta) - G\big(\Phi(y|x, \omega, \delta)\big) \,\Big]\, dy}_{\text{due to } G}. \qquad (5)$$

See Supplementary Section E for the derivation. As shown, the predictive mean of the full BNE is composed of three parts: 1) the predictive mean of the original ensemble $\sum_{k=1}^K f_k(x)\,\omega_k$; 2) the term $\delta(x)$ representing BNE's direct correction to the prediction function; and 3) the term $\int \big[\Phi(y|x,\omega,\delta) - G(\Phi(y|x,\omega,\delta))\big]\,dy$ representing BNE's indirect correction to the prediction obtained upon the relaxation of the original Gaussian assumption in the model CDF. We denote these two error-correction terms as $D_\delta(y|x)$ and $D_G(y|x)$.
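
For a single posterior draw, the indirect correction $D_G$ is a one-dimensional integral that can be evaluated numerically. In this sketch $\Phi$ is a Gaussian CDF and $G$ is a hypothetical placeholder (a Beta CDF) used only to make the integral concrete; in BNE itself $G$ would be a draw from the constrained GP posterior:

```python
import numpy as np
from scipy import stats

# One posterior draw (hypothetical values): Phi is Gaussian with mean mu, sd sigma_e.
mu, sigma_e = 2.0, 0.5
phi = stats.norm(loc=mu, scale=sigma_e)

# Placeholder calibration G: any increasing map [0, 1] -> [0, 1] with G(0)=0, G(1)=1.
G = stats.beta(a=2.0, b=1.5).cdf

# D_G = integral over y of [Phi(y) - G(Phi(y))] dy, on a grid covering
# essentially all of Phi's mass.
y = np.linspace(mu - 8 * sigma_e, mu + 8 * sigma_e, 4001)
dy = y[1] - y[0]
p = phi.cdf(y)
D_G = np.sum(p - G(p)) * dy
print(D_G)  # mean shift induced by relaxing the Gaussian CDF assumption
```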

To express BNE's estimated predictive uncertainty, we denote by $\Phi_{\omega,\epsilon}$ the predictive CDF of the original ensemble (1) (i.e., with mean $\sum_k f_k\,\omega_k$ and variance $\sigma_\epsilon^2$). Then BNE's predictive interval is:

$$U_q(y|x, \omega, \delta, G) = \Big[\, \Phi_{\omega,\epsilon}^{-1}\big(G^{-1}(\tfrac{1-q}{2}\,|\,x)\big) + \delta(x), \;\; \Phi_{\omega,\epsilon}^{-1}\big(G^{-1}(\tfrac{1+q}{2}\,|\,x)\big) + \delta(x) \,\Big]. \qquad (6)$$

Comparing (6) to the predictive interval of the original ensemble $\big[\Phi_{\omega,\epsilon}^{-1}(\tfrac{1-q}{2}),\; \Phi_{\omega,\epsilon}^{-1}(\tfrac{1+q}{2})\big]$, we see that the locations of the BNE predictive interval endpoints are adjusted by the residual process $\delta$, while the spread of the predictive interval (i.e., the predictive uncertainty) is calibrated by $G$.
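
Equation (6) is likewise mechanical to evaluate for one posterior draw: compose the inverse CDFs and shift by $\delta(x)$. The sketch below contrasts the BNE interval with the original ensemble's interval, again using a hypothetical Beta-CDF stand-in for $G$ and a scalar placeholder for $\delta(x)$:

```python
import numpy as np
from scipy import stats

mu, sigma_e = 2.0, 0.5                  # mean/sd of the original ensemble CDF
phi = stats.norm(loc=mu, scale=sigma_e)
G_inv = stats.beta(a=2.0, b=1.5).ppf    # hypothetical G^{-1}
delta_x = 0.3                           # one posterior draw of delta(x)

q = 0.95
lo_q, hi_q = (1 - q) / 2, (1 + q) / 2

# Original ensemble interval: [Phi^{-1}(0.025), Phi^{-1}(0.975)].
orig = phi.ppf([lo_q, hi_q])

# BNE interval (6): endpoints re-spread by G, then shifted by delta(x).
bne = phi.ppf(G_inv([lo_q, hi_q])) + delta_x
print(orig, bne)
```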

Quantifying Uncertainty in Bias Correction   A salient feature of BNE is that it can quantify its uncertainty in the bias correction. This is because the bias-correction terms $D_\delta$ and $D_G$ are random quantities that have posterior distributions (since they are functions of $\delta$ and $G$). Specifically, we can quantify the posterior uncertainty in whether $D_\delta$ and $D_G$ are different from zero by estimating $P\big(D_\delta(y|x) \leq 0\big)$ and $P\big(D_G(y|x) \leq 0\big)$, i.e., the percentiles of $0$ in the posterior distributions of $D_\delta$ and $D_G$. Values close to 0 or 1 indicate strong evidence that model bias impacts the model prediction. Values close to 0.5 indicate a lack of evidence of this impact, since the posterior distributions of these error-correction terms are roughly centered around zero. This approach can be generalized to describe the impact of the distribution biases on other properties of the predictive distribution (e.g., skewness, multi-modality, etc.; see Section E for detail).
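
Given MCMC draws of the correction terms, this evidence measure is simply the fraction of posterior samples at or below zero. A sketch with simulated placeholder draws:

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder posterior draws of D_delta(y|x) and D_G(y|x) at one location x.
D_delta_samples = rng.normal(loc=0.8, scale=0.3, size=5000)   # clearly nonzero bias
D_G_samples = rng.normal(loc=0.02, scale=0.3, size=5000)      # indistinguishable from 0

def prob_nonpositive(samples):
    """Percentile of 0 in the posterior: values near 0 or 1 indicate strong
    evidence of bias; values near 0.5 indicate a lack of evidence."""
    return np.mean(samples <= 0)

print(prob_nonpositive(D_delta_samples))  # ~0.004 -> strong evidence of bias
print(prob_nonpositive(D_G_samples))      # ~0.47  -> no clear evidence
```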

3.2 Consistent Estimation of Aleatoric Uncertainty

Recall that a model characterizes the aleatoric uncertainty in the data through its model CDF. As is clear from the expression of the predictive interval $U_q(y|x) = \big[F^{-1}(\tfrac{1-q}{2}\,|\,x),\; F^{-1}(\tfrac{1+q}{2}\,|\,x)\big]$, for a model to reliably estimate its predictive uncertainty, the model CDF $F$ should be estimated consistently with the data-generating CDF $F(y|x)$, such that, for example, the $95\%$ predictive interval $U_{0.95}(y|x)$ indeed contains the observations $y \sim F(y|x)$ $95\%$ of the time. This consistency property is known in the probabilistic forecast literature as calibration [19], and defines a mathematically rigorous condition for a model to achieve reliable estimation of its predictive uncertainty.
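
Calibration in this sense can be checked empirically on held-out data: compute the model's $95\%$ interval at each $x$ and count how often the observed $y$ falls inside. The sketch below uses a synthetic Gaussian truth with a correctly specified model CDF, so the empirical coverage should land near the nominal $95\%$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 20000
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.4 * rng.standard_normal(n)   # truth: y|x ~ N(sin(x), 0.4^2)

# A perfectly specified model CDF F(y|x); its 95% interval should cover ~95%.
q = 0.95
lo = stats.norm.ppf((1 - q) / 2, loc=np.sin(x), scale=0.4)
hi = stats.norm.ppf((1 + q) / 2, loc=np.sin(x), scale=0.4)
coverage = np.mean((y >= lo) & (y <= hi))
print(coverage)  # ~0.95; a miscalibrated F would miss this target
```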

To this end, using the flexible calibration function $G$, BNE enables its model CDF to consistently capture the data-generating $F(y|x)$:

Theorem 1 (Posterior Consistency). Let $\mathcal{F} = G \circ \Phi$ be a realization of the CGP prior defined in (3). Suppose that the true data-generating CDF $F(y|x)$ is contained in the support of $\mathcal{F}$. Given $\{y_i, x_i\}_{i=1}^n$, a random sample from $F(y|x)$, denote the expectation with respect to $F$ as $E$ and denote the posterior distribution as $\Pi_n$. Then there exist a sequence $\epsilon_n \to 0$ and a sufficiently large $M$ such that

$$E\, \Pi_n\Big( \|\mathcal{F} - F\|_2 \geq M \epsilon_n \;\Big|\; \{y_i, x_i\}_{i=1}^n \Big) \to 0.$$

We defer the full proof to Section D. This result states that, as the sample size grows, the BNE posterior distribution of $\mathcal{F}$ concentrates around the true data-generating CDF $F$, therefore consistently capturing the aleatoric uncertainty in the data distribution. By setting $k_G$ to the Matérn 3/2 kernel, the prior support of BNE is large and contains the space of compactly supported, Lipschitz-continuous $F$'s whose PDFs exist [5, 46]. The convergence speed of the posterior $\mathcal{F}$ depends both on the distance of $F$ relative to the prior distribution, and on how closely the smoothness of the Matérn prior matches the smoothness of $F$ [44, 45]. To this end, BNE improves its speed of convergence by centering $\mathcal{F}$'s prior mean at $\Phi(y|x, \omega, \delta)$ and by estimating the kernel hyperparameter $l_G$ adaptively through an inverse-Gamma prior.

3.3 Uncertainty Decomposition

For an ensemble model that is augmented by BNE, the goal of uncertainty decomposition is to understand how different sources of uncertainty combine to impact the ensemble model's predictive distribution, and to distinguish the contribution of each source in driving the overall predictive uncertainty. As shown in Figure 1, the posterior uncertainty in each of BNE's model parameters $\{\omega, \delta, G\}$ accounts for an important source of model uncertainty. Consequently, both the aleatoric and epistemic uncertainties are quantified by the BNE's posterior distribution, and can be distinguished through a careful decomposition of the model posterior. We first show how to separate the aleatoric and epistemic uncertainties in BNE's posterior predictive distribution.
