iccv2019论文全集8907-high-dimensional-multivariate-forecasting-with-low-rank-gaussian-copula-processes_第1页
iccv2019论文全集8907-high-dimensional-multivariate-forecasting-with-low-rank-gaussian-copula-processes_第2页
iccv2019论文全集8907-high-dimensional-multivariate-forecasting-with-low-rank-gaussian-copula-processes_第3页
iccv2019论文全集8907-high-dimensional-multivariate-forecasting-with-low-rank-gaussian-copula-processes_第4页
iccv2019论文全集8907-high-dimensional-multivariate-forecasting-with-low-rank-gaussian-copula-processes_第5页
已阅读5页,还剩6页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、High-Dimensional Multivariate Forecasting with Low-Rank Gaussian Copula Processes David Salinas Naverlabs david.salinas Michael Bohlke-Schneider Amazon Research bohlkem Laurent Callot Amazon Research lcallot Roberto Medico Ghent University roberto.medico91 Jan Gasthaus Amazon Research gasthaus Abstr

2、act Predicting the dependencies between observations from multiple time series is critical for applications such as anomaly detection, fi nancial risk management, causal analysis, or demand forecasting. However, the computational and numeri- cal diffi culties of estimating time-varying and high-dime

3、nsional covariance matri- ces often limits existing methods to handling at most a few hundred dimensions or requires making strong assumptions on the dependence between series. We pro- pose to combine an RNN-based time series model with a Gaussian copula process output model with a low-rank covarian

4、ce structure to reduce the computational complexity and handle non-Gaussian marginal distributions. This permits to dras- tically reduce the number of parameters and consequently allows the modeling of time-varying correlations of thousands of time series. We show on several real- world datasets tha

5、t our method provides signifi cant accuracy improvements over state-of-the-art baselines and perform an ablation study analyzing the contribu- tions of the different components of our model. 1Introduction The goal of forecasting is to predict the distribution of future time series values. Forecastin

6、g tasks frequently require predicting several related time series, such as multiple metrics for a compute fl eet or multiple products of the same category in demand forecasting. While these time series are often dependent, they are commonly assumed to be (conditionally) independent in high-dimension

7、al settings because of the hurdle of estimating large covariance matrices. Assumingindependence, however, makessuchmethodsunsuitedforapplicationsinwhichthecorre- lations between time series play an important role. This is the case in fi nance, where risk minimizing portfolios cannot be constructed w

8、ithout a forecast of the covariance of assets. In retail, a method providing a probabilistic forecast for different sellers should take competition relationships and can- nibalization effects into account. In anomaly detection, observing several nodes deviating from their expected behavior can be ca

9、use for alarm even if no single node exhibits clear signs of anomalous behavior. Multivariate forecasting has been an important topic in the statistics and econometrics literature. Several multivariate extensions of classical univariate methods are widely used, such as vector au- toregressions (VAR)

10、 extending autoregressive models 19, multivariate state-space models 7, or Work done while being at Amazon Research 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. Figure 1: Top: covariance matrix predicted by our model for taxi traffi c time series for 12

11、14 locations in New-York at 4 different hours of a Sunday (only a neighborhood of 103 series is shown here, for clearer visualization). Bottom: Correlation graph obtained by keeping only pairs with covariance above a fi xed threshold at the same hours. Both spatial and temporal relations are learned

12、 from the data as the covariance evolves over time and edges connect locations that are close to each other. multivariate generalized autoregressive conditional heteroskedasticity (MGARCH) models 2. The rapid increase in the diffi culty of estimating these models due to the growth in number of param

13、eters with the dimension of the problem have been binding constraints to move beyond low-dimensional cases. To alleviate these limitations, researchers have used dimensionality reduction methods and regularization, see for instance 3, 5 for VAR models and 34, 9 for MGARCH models, but these models re

14、main unsuited for applications with more than a few hundreds dimensions 23. Forecasting can be seen as an instance of sequence modeling, a topic which has been intensively studied by the machine learning community. Deep learning-based sequence models have been suc- cessfully applied to audio signals

15、 33, language modeling 13, 30, and general density estimation of univariate sequences 22, 21. Similar sequence modeling techniques have also been used in the context of forecasting to make probabilistic predictions for collections of real or integer-valued time series 26, 36, 16. These approaches fi

16、 t a global (i.e. shared) sequence-to-sequence model to a collection of time series, but generate statistically independent predictions. Outside the forecasting domain, similar methods have also been applied to (low-dimensional) multivariate dependent time series, e.g. two-dimensional time series of

17、 drawing trajectories 13, 14. Two main issues prevent the estimation of high-dimensional multivariate time series models. The fi rst one is the O(N2) scaling of the number of parameters required to express the covariance matrix where N denotes the dimension. Using dimensionality reduction techniques

18、 like PCA as a pre- processing step is a common approach to alleviate this problem, but it separates the estimation of the model from the preprocessing step, leading to decreased performance. This motivated 27 to perform such a factorization jointly with the model estimation. In this paper we show h

19、ow the low rank plus diagonal covariance structure of the factor analysis model 29, 25, 10, 24 can be used in combination with Gaussian copula processes 37 to an LSTM-RNN 15 to jointly learn the tempo- ral dynamics and the (time-varying) covariance structure, while signifi cantly reducing the number

20、 of parameters that need to be estimated. The second issue affects not only multivariate models, but all global time series models, i.e. models that estimate a single model for a collection of time series: In real-world data, the magnitudes of the time series can vary drastically between different s

21、eries of the same data set, often spanning several orders of magnitude. In online retail demand forecasting, for example, item sales follow a power- law distribution, with a large number of items selling only a few units throughout the year, and a few popular items selling thousands of units per day

22、 26. The challenge posed by this for estimating global models across time series has been noted in previous work 26, 37, 27. Several approaches have been proposed to alleviate this problem, including simple, fi xed invertible transformations such as the square-root or logarithmic transformations, an

23、d the data-adaptive Box-Cox transform 4, 2 that aims to map a potentially heavy-tailed distribution to a Normal distribution. Other approaches includes removing scale with mean-scaling 26, or with a separate network 27. Here, we propose to address this problem by modeling each time series marginal d

24、istribution sepa- rately using a non-parametric estimate of its cumulative distribution function (CDF). Using this CDF estimate as the marginal transformation in a Gaussian copula (following 18, 17, 1) effectively ad- dresses the challenges posed by scaling, as it decouples the estimation of margina

25、l distributions from the temporal dynamics and the dependency structure. The work most closely related to ours is the recent work 32, which also proposes to combine deep autoregressive models with copula to model correlated time series. Their approach uses a nonpara- metric estimate of the copula, w

26、hereas we employ a Gaussian copula with low-rank structure that is learned jointly with the rest of the model. The nonparametric copula estimate requires splitting a N-dimensional cube into Nmany pieces (where N is the time series dimension and is a desired precision), making it diffi cult to scale

27、that approach to large dimensions. The method also requires the marginal distributions and the dependency structure to be time-invariant, an assumption which is often violated is practice as shown in Fig. 1. A concurrent approach was proposed in 35 which also uses Copula and estimates marginal quant

28、ile functions with the approach proposed in 11 and models the Cholesky factor as the output of a neural network. Two important differences are that this approach requires to estimate O(N2) parameters to model the covariance matrix instead of O(N) with the low-rank approach that we propose, another d

29、ifference is the use of a non-parametric esti- mator for the marginal quantile functions. The main contributions of this paper are: a method for probabilistic high-dimensional multivariate forecasting (scaling to dimensions up to an order of magnitude larger than previously reported in 23), a parame

30、trization of the output distribution based on a low-rank-plus-diagonal covariance matrix enabling this scaling by signifi cantly reducing the number of parameters, a copula-based approach for handling different scales and modeling non-Gaussian data, an empirical study on artifi cial and six real-wor

31、ld datasets showing how this method im- proves accuracy over the state of the art while scaling to large dimensions. The rest of the paper is structured as follows: In Section 2 we introduce the probabilistic forecast- ing problem and describe the overall structure of our model. We then describe how

32、 we can use the empirical marginal distributions in a Gaussian copula to address the scaling problem and handle non-Gaussian distributions in Section 3. In Section 4 we describe the parametrization of the covari- ance matrix with low-rank-plus-diagonal structure, and how the resulting model can be v

33、iewed as a low-rank Gaussian copula process. Finally, we report experiments with real-world datasets that demonstrate how these contributions combine to allow our model to generate correlated predictions that outperform state-of-the-art methods in terms of accuracy. 2Autoregressive RNN Model for Pro

34、babilistic Multivariate Forecasting Let us denote the values of a multivariate time series by zi,t D, where i 1,2,.,N indexes the individual univariate component time series, and t indexes time. The domain D is assumed to be either R or N. We will denote the multivariate observation vector at time t

35、 by zt DN. Given T observations z1,.,zT, we are interested in forecasting the future values for time units, i.e. we want to estimate the joint conditional distribution P(zT+1,.,zT+|z1,.,zT). In a nutshell, our model takes the form of a non-linear, deterministic state space model whose state hi,t Rke

36、volves independently for each time series i according to transition dynamics , hi,t= h(hi,t1,zi,t1)i = 1,.,N,(1) where the transition dynamics are parametrized using a LSTM-RNN 15. Note that the LSTM is unrolled for each time series separately, but parameters are tied across time series. Given the s

37、tate values hi,tfor all time series i = 1,2,.,N and denoting by htthe collection of state values for all series at time t, we parametrize the joint emission distribution using a Gaussian copula, p(zt|ht) = N(f1(z1,t),f2(z2,t),.,fN(zN,t)T| (ht),(ht).(2) 3 N 1t 2t 4t , d1t00 0d2t0 00d4t + v1t v2t v4t

38、v1t v2t v4t T z1t h1t 1t,d1t,v1t z2t h2t 2t,d2t,v2t z4t h4t 4t,d4t,v4t N 1t 3t 4t , d1t00 0d3t0 00d4t + v1t v3t v4t v1t v3t v4t T z1t h1t 1t,d1t,v1t z3t h3t 3t,d3t,v3t z4t h4t 4t,d4t,v4t Figure 2: Illustration of our model parametrization. During training, dimensions are sampled at random and a loca

39、l LSTM is unrolled on each of them individually (1, 2, 4, then 1, 3, 4 in the example). The parameters governing the state updates and parameter projections are shared for all time series. This parametrization can express the Low-rank Gaussian distribution on sets of series that varies during traini

40、ng or prediction. The transformations fi: D R here are invertible mappings of the form 1 Fi, where denotes the cumulative distribution function of the standard normal distribution, and Fidenotes an estimate of the marginal distribution of the i-th time series zi,1,.,zi,T. The role of these functions

41、 fiis to transform the data for the i-th time series such that marginally it follows a standard normal distribution. The functions () and () map the state htto the mean and covariance of a Gaussian distribution over the transformed observations (described in more detail in Section 4). Under this mod

42、el, we can factorize the joint distribution of the observations as p(z1,.zT+) = T+ Y t=1 p(zt|z1,.,zt1) = T+ Y t=1 p(zt|ht).(3) Both the state update function and the mappings () and () have free parameters that are learned from the data. We denote the vector of all free parameters, consisting of th

43、e parameters of the state update has well as and which denote the free parameters in () and (). Given and hT+1, we can produce Monte Carlo samples from the joint predictive distribution p(zT+1,.zT+|z1,.,zT) = p(zT+1,.zT+|hT+1) = T+ Y t=T+1 p(zt|ht)(4) by sequentially sampling from P(zt|ht) and updat

44、ing htusing 2. We learn the parameters from the observed data z1,.,zTusing maximum likelihood estimation by i.e. by minimizing the loss function logp(z1,z2,.,zT) = T X t=1 logp(zt|ht),(5) using stochastic gradient descent-based optimization. To handle long time series, we employ a data augmentation

45、strategy which randomly samples fi xed-size slices of length T0+ from the time series during training, where we fi x the context length hyperparameter T0to . During prediction, only the last T0time steps are used in computing the initial state for prediction. 3Gaussian Copula A copula function C : 0

46、,1N 0,1 is the CDF of a joint distribution of a collection of real random variables U1,.,UNwith uniform marginal distribution 8, i.e. C(u1,.,uN) = P(U1 u1,.,UN uN). Sklars theorem 28 states that any joint cumulative distribution F admits a representation in terms of its univariate marginals Fiand a

47、copula function C, F(z1,.,zN) = C(F1(z1),.,FN(zN). 2Note that the model complexity scales linearly with the number of Monte Carlo samples. 4 When the marginals are continuous the copula C is unique and is given by the joint distribution of the probability integral transforms of the original variable

48、s, i.e. u C where ui= Fi(zi). Furthermore, if ziis continuous then ui U(0,1). A common modeling choice for C is to use the Gaussian copula, defi ned by: C(F1(z1),.,Fd(zN) = ,(1(F1(z1),.,1(FN(zN), where : R 0,1 is the CDF of the standard normal and ,is a multivariate normal distribu- tion parametrize

49、d with RNand RNN. In this model, the observations z, the marginally uniform random variables u and the Gaussian random variables x are related as follows: x u F1 zz F u 1 x. Setting fi= 1 Firesults in the model in Eq. (2). The marginal distributions Fiare not given a priori and need to be estimated

50、from data. We use the non-parametric approach of 18 proposed in the context of estimating high-dimensional distri- butions with sparse covariance structure. In particular, they use the empirical CDF of the marginal distributions, Fi(v) = 1 m m X t=1 1zitv, where m observations are considered. As we

51、require the transformations fito be differentiable, we use a linearly-interpolated version of the empirical CDF resulting in a piecewise-constant derivative F0(u). This allow us to write the log-density of the original observations under our model as logp(z;,) = log,(1(F(z) + log d dz 1( F(z) = log,

52、(1(F(z) + log d du 1(u) + log d dz F(z) = log,(1(F(z) log(1(F(z) + log F0(z) which are the individual terms in the total loss (5) where is the probability density function of the standard normal. The number of past observations m used to estimate the empirical CDFs is an hyperparameter and left cons

53、tant in our experiments with m = 100 3. 4Low-rank Gaussian Process Parametrization After applying the marginal transformations fi() our model assumes a joint multivariate Gaussian distribution over the transformed data. In this section we describe how the parameters (ht) and (ht) of this emission di

54、stribution are obtained from the LSTM state ht. We begin by describing how a low-rank-plus-diagonal parametrization of the covariance matrix can be used to keep the computational complexity and the number of parameters manageable as the number of time series N grows. We then show how, by viewing the

55、 emission distribution as a time-varying low-rank Gaussian Process gt GP( t(),kt(,), we can train the model by only considering a subset of time series in each mini-batch further alleviating memory constraints and allowing the model to be applied to very high-dimensional sets of time series. Let us

56、denote the vector of transformed observations by xt= f(zt) = f1(z1,t),f2(z2,t),.,fN(zN,t)T, so that p(xt|ht) = N(xt|(ht),(ht). The covariance matrix (ht) is a N N symmetric positive defi nite matrix with O(N2) free parameters. Evaluating the Gaussian likelihood navely 3This makes the underlying assu

57、mption that the marginal distributions are stationary, which is violated e.g. in case of a linear trend. Standard time series techniques such as de-trending or differencing can be used to pre-process the data such that this assumption is satisfi ed. 5 requires O(N3) operations. Using a structured pa

58、rametrization of the covariance matrix as the sum of a diagonal matrix and a low rank matrix, = D + V V T where D RNNis diagonal and V RNr, results in a compact representation with O(N r) parameters. This allows the likelihoodto beevaluatedusingO(Nr2+r3)operations. As therankhyperparameterr cantypically be chosen to be much smaller than N, this leads to a signifi cant speedup. In all our low-rank experiments we use r = 10.

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论