Abstract
This paper describes the use of cross‐learning with panel data modeling for stacking regressions of different predictive models for time series of employment across occupations in Europe. ARIMA and state space models were used for the predictions in the first‐level model ensemble. On the second level, the time series predictions of these models were combined by stacking, using panel data estimators as a cross‐learner and also exploiting the strongly hierarchical data structure (time series nested in occupational groups). Very few methods adopt stacking to generate ensembles for time series regressions. Indeed, to the best of our knowledge, panel data modeling has never before been used as a cross‐learner in stacking strategies. The empirical application fits employment by occupation in 30 European countries between 2010 Q1 and 2022 Q4, with the last year used as the test set. The empirical results show that using panel data modeling as a multivariate time series cross‐learner that stacks univariate time series base models (especially when these do not produce accurate predictions) is an alternative worth considering, also relative to classical aggregation schemes such as optimal and equal weighting.
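The following is a minimal sketch of the two-level scheme described above, on simulated quarterly series: univariate base models (ARIMA, plus additive exponential smoothing standing in for a state space model) are fit per occupation, and a pooled OLS regression with occupation fixed effects serves as the panel-data cross-learner. All names, data, and model orders are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(1)
T, H = 52, 4                                   # quarters; last year = test
occupations = ["managers", "clerks", "technicians"]
series = {g: 100 + np.cumsum(rng.normal(0.3, 1.0, T)) for g in occupations}

def base_forecasts(y, h):
    """First level: univariate base-model forecasts, h steps ahead."""
    f1 = ARIMA(y, order=(1, 1, 0)).fit().forecast(h)
    f2 = ExponentialSmoothing(y, trend="add").fit().forecast(h)
    return np.column_stack([f1, f2])

val_rows, test_rows = [], []
for g, y in series.items():
    fit_part, val, test = y[:-2 * H], y[-2 * H:-H], y[-H:]
    for f, target in zip(base_forecasts(fit_part, H), val):
        val_rows.append({"group": g, "y": target, "arima": f[0], "ets": f[1]})
    for f, target in zip(base_forecasts(y[:-H], H), test):
        test_rows.append({"group": g, "y": target, "arima": f[0], "ets": f[1]})

val_df, test_df = pd.DataFrame(val_rows), pd.DataFrame(test_rows)

def design(df):
    # Occupation dummies supply the fixed-effects part of the panel estimator.
    return pd.get_dummies(df[["arima", "ets", "group"]],
                          columns=["group"], dtype=float).to_numpy()

# Second level: stacking weights are estimated on the validation window only,
# pooled across all occupations at once (the cross-learning step).
beta, *_ = np.linalg.lstsq(design(val_df), val_df["y"].to_numpy(), rcond=None)
test_df["stacked"] = design(test_df) @ beta
print(test_df.round(2))
```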
Forecast combinations have flourished remarkably in the forecasting community and, in recent years, have become part of the mainstream of forecasting research and activities. Combining multiple forecasts produced from single (target) series is now widely used to improve accuracy through the integration of information gleaned from different sources, thereby mitigating the risk of identifying a single "best" forecast. Combination schemes have evolved from simple combination methods without estimation, to sophisticated methods involving time-varying weights, nonlinear combinations, correlations among components, and cross-learning. They include combining point forecasts and combining probabilistic forecasts. This paper provides an up-to-date review of the extensive literature on forecast combinations, together with reference to available open-source software implementations. We discuss the potential and limitations of various methods and highlight how these ideas have developed over time. Some important issues concerning the utility of forecast combinations are also surveyed. Finally, we conclude with current research gaps and potential insights for future research.
In this work, we propose a novel framework for density forecast combination by constructing time-varying weights based on time-varying features. Our framework estimates weights in the forecast combination via Bayesian log predictive scores, in which the optimal forecast combination is determined by time series features from historical information. In particular, we use an automatic Bayesian variable selection method to identify the importance of different features. As a result, our approach has better interpretability compared to other black-box forecast combination schemes. We apply our framework to stock market data and M3 competition data. Based on our structure, a simple maximum-a-posteriori scheme outperforms benchmark methods, and Bayesian variable selection can further enhance the accuracy of both point forecasts and density forecasts.
Combination and aggregation techniques can significantly improve forecast accuracy. This also holds for probabilistic forecasting methods where predictive distributions are combined. There are several time-varying and adaptive weighting schemes such as Bayesian model averaging (BMA). However, the quality of different forecasts may vary not only over time but also within the distribution. For example, some distribution forecasts may be more accurate in the center of the distributions, while others are better at predicting the tails. Therefore, we introduce a new weighting method that considers the differences in performance over time and within the distribution. We discuss pointwise combination algorithms based on aggregation across quantiles that optimize with respect to the continuous ranked probability score (CRPS). After analyzing the theoretical properties of pointwise CRPS learning, we discuss B- and P-Spline-based estimation techniques for batch and online learning, based on quantile regression and prediction with expert advice. We prove that the proposed fully adaptive Bernstein online aggregation (BOA) method for pointwise CRPS online learning has optimal convergence properties. They are confirmed in simulations and a probabilistic forecasting study for European emission allowance (EUA) prices.
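For intuition, here is a bare-bones illustration of pointwise combination (not the paper's Bernstein online aggregation): each quantile level keeps its own expert weights, updated by exponentiated pinball loss. The experts, learning rate, and data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
levels = np.array([0.05, 0.5, 0.95])
eta, T = 2.0, 500
# Static expert quantile forecasts: one too narrow, one too wide.
experts = np.stack([norm.ppf(levels, scale=0.6),
                    norm.ppf(levels, scale=1.6)])        # shape (K, levels)
weights = np.full((len(levels), experts.shape[0]), 0.5)  # per-level weights

def pinball(q, y, tau):
    return np.where(y >= q, tau * (y - q), (1 - tau) * (q - y))

for _ in range(T):
    y = rng.normal()                       # truth is N(0, 1)
    loss = pinball(experts, y, levels)     # (K, levels) pinball losses
    weights *= np.exp(-eta * loss.T)       # exponentiated-loss update
    weights /= weights.sum(axis=1, keepdims=True)

combined = (weights * experts.T).sum(axis=1)
print(np.round(weights, 3))   # tail levels should favour the wider expert
print(np.round(combined, 3))  # pointwise-combined quantile forecast
```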
Forecast combinations have been widely applied in the last few decades to improve forecasting. Estimating optimal weights that can outperform simple averages is not always an easy task. In recent years, the idea of using time series features for forecast combinations has flourished. Although this idea has proven beneficial in several forecasting competitions, it may not be practical in many situations. For example, the task of selecting appropriate features to build forecasting models is often challenging. Even if there were an acceptable way to define the features, existing features are estimated based on historical patterns, which are likely to change in the future. In other cases, the estimation of the features is infeasible due to limited historical data. In this work, we suggest a change of focus from the historical data to the produced forecasts to extract features. We use out-of-sample forecasts to obtain weights for forecast combinations by amplifying the diversity of the pool of methods being combined. A rich set of time series is used to evaluate the performance of the proposed method. Experimental results show that our diversity-based forecast combination framework not only simplifies the modeling process but also achieves superior forecasting performance in terms of both point forecasts and prediction intervals. The value of our proposition lies in its simplicity, transparency, and computational efficiency, elements that are important from both an optimization and a decision analysis perspective.
Forecasting is an indispensable element of operational research (OR) and an important aid to planning. The accurate estimation of the forecast uncertainty facilitates several operations management activities, predominantly in supporting decisions in inventory and supply chain management and effectively setting safety stocks. In this paper, we introduce a feature-based framework, which links the relationship between time series features and the interval forecasting performance into providing reliable interval forecasts. We propose an optimal threshold ratio searching algorithm and a new weight determination mechanism for selecting an appropriate subset of models and assigning combination weights for each time series tailored to the observed features. We evaluate our approach using a large set of time series from the M4 competition. Our experiments show that our approach significantly outperforms a wide range of benchmark models, both in terms of point forecasts as well as prediction intervals.
The M4 competition identified innovative forecasting methods, advancing the theory and practice of forecasting. One of the most promising innovations of M4 was the utilization of cross-learning approaches that allow models to learn from multiple series how to accurately predict individual ones. In this paper, we investigate the potential of cross-learning by developing various neural network models that adopt such an approach, and we compare their accuracy to that of traditional models that are trained in a series-by-series fashion. Our empirical evaluation, which is based on the M4 monthly data, confirms that cross-learning is a promising alternative to traditional forecasting, at least when appropriate strategies for extracting information from large, diverse time series data sets are considered. Ways of combining traditional with cross-learning methods are also examined in order to initiate further research in the field.
Accurate forecasts are vital for supporting the decisions of modern companies. Forecasters typically select the most appropriate statistical model for each time series. However, statistical models usually presume some data generation process while making strong assumptions about the errors. In this paper, we present a novel data-centric approach, 'forecasting with similarity', which tackles model uncertainty in a model-free manner. Existing similarity-based methods focus on identifying similar patterns within the series, i.e., 'self-similarity'. In contrast, we propose searching for similar patterns from a reference set, i.e., 'cross-similarity'. Instead of extrapolating, the future paths of the similar series are aggregated to obtain the forecasts of the target series. Building on the cross-learning concept, our approach allows the application of similarity-based forecasting on series with limited lengths. We evaluate the approach using a rich collection of real data and show that it yields competitive accuracy in both point forecasts and prediction intervals.
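A toy version of the cross-similarity idea, on assumed simulated data: standardize the short target series, find the nearest scaled subsequences in a reference collection, and average their rescaled continuations rather than extrapolating a fitted model. The distance choice and pool are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
L, H = 12, 4                               # matching window, horizon
reference = [np.cumsum(rng.normal(0, 1, 60)) for _ in range(50)]
target = np.cumsum(rng.normal(0.2, 1, L))  # short series, limited history

def scaled(x):
    return (x - x.mean()) / (x.std() + 1e-9)

candidates = []
for series in reference:
    for start in range(len(series) - L - H):
        window = series[start:start + L]
        dist = np.linalg.norm(scaled(window) - scaled(target))
        future = series[start + L:start + L + H]
        # Map the continuation back onto the target's level and scale.
        rescaled = (future - window.mean()) / (window.std() + 1e-9) \
                   * target.std() + target.mean()
        candidates.append((dist, rescaled))

candidates.sort(key=lambda c: c[0])
k = 10                                     # aggregate the k closest paths
forecast = np.mean([path for _, path in candidates[:k]], axis=0)
print(np.round(forecast, 2))
```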
Retail sales forecasting often requires forecasts for thousands of products across many stores. We present a meta-learning framework based on newly developed deep convolutional neural networks, which can first learn a feature representation from raw sales time series data automatically, and then link the learnt features with a set of weights which are used to combine a pool of base-forecasting methods. Experiments based on IRI weekly data show that the proposed meta-learner provides superior forecasting performance compared with a number of state-of-the-art benchmarks, though the accuracy gains over some more sophisticated meta-ensemble benchmarks are modest and the learnt features lack interpretability. When designing a meta-learner for forecasting retail sales, we recommend building a pool of base-forecasters that includes both individual and pooled forecasting methods, and targeting the best combination of forecasts rather than the best individual method.
The M4 Competition follows on from the three previous M competitions, the purpose of which was to learn from empirical evidence both how to improve the forecasting accuracy and how such learning could be used to advance the theory and practice of forecasting. The aim of M4 was to replicate and extend the three previous competitions by: (a) significantly increasing the number of series, (b) expanding the number of forecasting methods, and (c) including prediction intervals in the evaluation process as well as point forecasts. This paper covers all aspects of M4 in detail, including its organization and running, the presentation of its results, the top-performing methods overall and by categories, its major findings and their implications, and the computational requirements of the various methods. Finally, it summarizes its main conclusions and states the expectation that its series will become a testing ground for the evaluation of new methods and the improvement of the practice of forecasting, while also suggesting some ways forward for the field.
This paper presents the winning submission of the M4 forecasting competition. The submission utilizes a Dynamic Computational Graph Neural Network system that enables mixing of a standard Exponential Smoothing model with advanced Long Short Term Memory networks into a common framework. The result is a hybrid and hierarchical forecasting method.
Keywords: Forecasting competitions, M4, Dynamic Computational Graphs, Automatic Differentiation, Long Short Term Memory (LSTM) networks, Exponential Smoothing
The M4 competition is the continuation of three previous competitions started more than 45 years ago whose purpose was to learn how to improve forecasting accuracy, and how such learning can be applied to advance the theory and practice of forecasting. The purpose of M4 was to replicate the results of the previous ones and extend them in three directions: first, significantly increase the number of series; second, include Machine Learning (ML) forecasting methods; and third, evaluate both point forecasts and prediction intervals. The five major findings of the M4 Competition are: 1. Of the 17 most accurate methods, 12 were “combinations” of mostly statistical approaches. 2. The biggest surprise was a “hybrid” approach that utilized both statistical and ML features. This method's average sMAPE was close to 10% more accurate than the combination benchmark used to compare the submitted methods. 3. The second most accurate method was a combination of seven statistical methods and one ML one, with the weights for the averaging being calculated by a ML algorithm that was trained to minimize the forecasting error. 4. The two most accurate methods also achieved remarkable success in specifying the 95% prediction intervals correctly. 5. The six pure ML methods performed poorly, with none of them being more accurate than the combination benchmark and only one being more accurate than Naïve2. This paper presents some initial results of M4, its major findings, and a logical conclusion. Finally, it outlines what the authors consider to be the way forward for the field of forecasting.
Forecast selection and combination are regarded as two competing alternatives. In the literature there is substantial evidence that forecast combination is beneficial, in terms of reducing the forecast errors, as well as mitigating modelling uncertainty as we are not forced to choose a single model. However, whether all forecasts to be combined are appropriate or not is typically overlooked, and various weighting schemes have been proposed to lessen the impact of inappropriate forecasts. We argue that selecting a reasonable pool of forecasts is fundamental in the modelling process, and in this context both forecast selection and combination can be seen as two extreme pools of forecasts. We evaluate forecast pooling approaches and find them beneficial in terms of forecast accuracy. We propose a heuristic to automatically identify forecast pools, irrespective of their source or the performance criteria, and demonstrate that in various conditions it performs at least as well as alternative pools that require additional modelling decisions, and better than selection or combination.
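The pooling idea can be sketched as follows, with simulated forecast errors and a deliberately simple ranking criterion: order methods by validation MAE, evaluate every "top-m" average, and keep the pool size with the lowest error; m = 1 recovers selection and m = K recovers combining everything. This is an illustration, not the paper's heuristic.

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 8, 100
# Simulated validation errors: methods differ in quality and share noise.
bias = rng.normal(0, 0.5, K)
errors = bias + rng.normal(0, 1.0, (T, K)) + 0.5 * rng.normal(0, 1.0, (T, 1))

order = np.argsort(np.abs(errors).mean(axis=0))   # best method first
pool_mae = {m: np.abs(errors[:, order[:m]].mean(axis=1)).mean()
            for m in range(1, K + 1)}
best_m = min(pool_mae, key=pool_mae.get)
print({m: round(v, 3) for m, v in pool_mae.items()})
print(f"selected pool size: {best_m} (1 = selection, {K} = combine all)")
```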
This article introduces this JBR Special Issue on simple versus complex methods in forecasting. Simplicity in forecasting requires that (1) method, (2) representation of cumulative knowledge, (3) relationships in models, and (4) relationships among models, forecasts, and decisions are all sufficiently uncomplicated as to be easily understood by decision-makers. Our review of studies comparing simple and complex methods - including those in this special issue - found 97 comparisons in 32 papers. None of the papers provide a balance of evidence that complexity improves forecast accuracy. Complexity increases forecast error by 27 percent on average in the 25 papers with quantitative comparisons. The finding is consistent with prior research to identify valid forecasting methods: all 22 previously identified evidence-based forecasting procedures are simple. Nevertheless, complexity remains popular among researchers, forecasters, and clients. Some evidence suggests that the popularity of complexity may be due to incentives: (1) researchers are rewarded for publishing in highly ranked journals, which favor complexity; (2) forecasters can use complex methods to provide forecasts that support decision-makers’ plans; and (3) forecasters’ clients may be reassured by incomprehensibility. Clients who prefer accuracy should accept forecasts only from simple evidence-based procedures. They can rate the simplicity of forecasters’ procedures using the questionnaire at simple-forecasting.com.
A new class of longitudinal data has emerged with the use of technological devices for scientific data collection. This class of data is called intensive longitudinal data (ILD). This book features applied statistical modelling strategies developed by leading statisticians and methodologists working in conjunction with behavioural scientists.
We develop a framework for forecasting multivariate data that follow known linear constraints. This is particularly common in forecasting where some variables are aggregates of others, commonly referred to as hierarchical time series, but also arises in other prediction settings. For point forecasting, an increasingly popular technique is reconciliation, whereby forecasts are made for all series (so-called base forecasts) and subsequently adjusted to cohere with the constraints. We extend reconciliation from point forecasting to probabilistic forecasting. A novel definition of reconciliation is developed and used to construct densities and draw samples from a reconciled probabilistic forecast. In the elliptical case, we prove that true predictive distributions can be recovered using reconciliation even when the location and scale of base predictions are chosen arbitrarily. Reconciliation weights are estimated to optimise energy or variogram score. The log score is not considered since it is improper when comparing unreconciled to reconciled forecasts, a result also proved in this paper. Due to randomness in the objective function, optimisation uses stochastic gradient descent. This method improves upon base forecasts in simulated and empirical data, particularly when the base forecasting models are severely misspecified. For milder misspecification, extending popular reconciliation methods for point forecasting results in similar performance to score optimisation.
The U.S. COVID-19 Forecast Hub aggregates forecasts of the short-term burden of COVID-19 in the United States from many contributing teams. We study methods for building an ensemble that combines forecasts from these teams. These experiments have informed the ensemble methods used by the Hub. To be most useful to policy makers, ensemble forecasts must have stable performance in the presence of two key characteristics of the component forecasts: (1) occasional misalignment with the reported data, and (2) instability in the relative performance of component forecasters over time. Our results indicate that in the presence of these challenges, an untrained and robust approach to ensembling using an equally weighted median of all component forecasts is a good choice to support public health decision makers. In settings where some contributing forecasters have a stable record of good performance, trained ensembles that give those forecasters higher weight can also be helpful.
This paper constructs individual-specific density forecasts for a panel of firms or households using a dynamic linear model with common and heterogeneous coefficients as well as cross-sectional heteroskedasticity. The panel considered in this paper features a large cross-sectional dimension N but short time series T. Due to the short T, traditional methods have difficulty in disentangling the heterogeneous parameters from the shocks, which contaminates the estimates of the heterogeneous parameters. To tackle this problem, I assume that there is an underlying distribution of heterogeneous parameters, model this distribution nonparametrically allowing for correlation between heterogeneous parameters and initial conditions as well as individual-specific regressors, and then estimate this distribution by combining information from the whole panel. Theoretically, I prove that in cross-sectional homoskedastic cases, both the estimated common parameters and the estimated distribution of the heterogeneous parameters achieve posterior consistency, and that the density forecasts asymptotically converge to the oracle forecast. Methodologically, I develop a simulation-based posterior sampling algorithm specifically addressing the nonparametric density estimation of unobserved heterogeneous parameters. Monte Carlo simulations and an empirical application to young firm dynamics demonstrate improvements in density forecasts relative to alternative approaches.
This article provides guidance on how to evaluate and improve the forecasting ability of models in the presence of instabilities, which are widespread in economic time series. Empirically relevant examples include predicting the financial crisis of 2007–08, as well as, more broadly, fluctuations in asset prices, exchange rates, output growth, and inflation. In the context of unstable environments, I discuss how to assess models’ forecasting ability; how to robustify models’ estimation; and how to correctly report measures of forecast uncertainty. Importantly, and perhaps surprisingly, breaks in models’ parameters are neither necessary nor sufficient to generate time variation in models’ forecasting performance: thus, one should not test for breaks in models’ parameters, but rather evaluate their forecasting ability in a robust way. In addition, local measures of models’ forecasting performance are more appropriate than traditional, average measures. (JEL C51, C53, E31, E32, E37, F37)
Proper scoring rules are used to assess the out-of-sample accuracy of probabilistic forecasts, with different scoring rules rewarding distinct aspects of forecast performance. Herein, we re-investigate the practice of using proper scoring rules to produce probabilistic forecasts that are ‘optimal’ according to a given score and assess when their out-of-sample accuracy is superior to alternative forecasts, according to that score. Particular attention is paid to relative predictive performance under misspecification of the predictive model. Using numerical illustrations, we document several novel findings within this paradigm that highlight the important interplay between the true data generating process, the assumed predictive model and the scoring rule. Notably, we show that only when a predictive model is sufficiently compatible with the true process to allow a particular score criterion to reward what it is designed to reward, will this approach to forecasting reap benefits. Subject to this compatibility, however, the superiority of the optimal forecast will be greater, the greater is the degree of misspecification. We explore these issues under a range of different scenarios and using both artificially simulated and empirical data.
The COVID-19 pandemic has placed forecasting models at the forefront of health policy making. Predictions of mortality, cases and hospitalisations help governments meet planning and resource allocation challenges. In this paper, we consider the weekly forecasting of the cumulative mortality due to COVID-19 at the national and state level in the U.S. Optimal decision-making requires a forecast of a probability distribution, rather than just a single point forecast. Interval forecasts are also important, as they can support decision making and provide situational awareness. We consider the case where probabilistic forecasts have been provided by multiple forecasting teams, and we combine the forecasts to extract the wisdom of the crowd. We use a dataset that has been made publicly available from the COVID-19 Forecast Hub. A notable feature of the dataset is that the availability of forecasts from participating teams varies greatly across the 40 weeks in our study. We evaluate the accuracy of combining methods that have been previously proposed for interval forecasts and predictions of probability distributions. These include the use of the simple average, the median, and trimming methods. In addition, we propose several new weighted combining methods. Our results show that, although the median was very useful for the early weeks of the pandemic, the simple average was preferable thereafter, and that, as a history of forecast accuracy accumulates, the best results can be produced by a weighted combining method that uses weights that are inversely proportional to the historical accuracy of the individual forecasting teams.
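A minimal sketch of the inverse-accuracy weighting described above, with simulated team errors: each team's weight is proportional to the reciprocal of its historical mean absolute error, and the combined forecast is the weighted average of the current submissions.

```python
import numpy as np

rng = np.random.default_rng(5)
teams, weeks = 5, 30
# Historical MAE per team, simulated with differing accuracy levels.
history_mae = np.abs(rng.normal(0, [1, 2, 3, 4, 5], (weeks, teams))).mean(axis=0)

weights = 1.0 / history_mae          # inversely proportional to past error
weights /= weights.sum()

current_forecasts = rng.normal(100, 5, teams)   # this week's submissions
combined = weights @ current_forecasts
print(np.round(weights, 3), round(combined, 2))
```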
A series of recent papers introduce the concept of Forecast Reconciliation, a process by which independently generated forecasts of a collection of linearly related time series are reconciled via the introduction of accounting aggregations that naturally apply to the data. Aside from its clear presentational and operational virtues, the reconciliation approach generally improves the accuracy of the combined forecasts. In this paper, we examine the mechanisms by which this improvement is generated by re-formulating the reconciliation problem as a combination of direct forecasts of each time series with additional indirect forecasts derived from the linear constraints. Our work establishes a direct link between the nascent Forecast Reconciliation literature and the extensive work on Forecast Combination. In the original hierarchical setting, our approach clarifies for the first time how unbiased forecasts for the entire collection can be generated from base forecasts made at any level of the hierarchy, and we illustrate more generally how simple robust combined forecasts can be generated in any multivariate setting subject to linear constraints. In an empirical example, we show that simple combinations of such forecasts generate significant improvements in forecast accuracy where it matters most: where noise levels are highest and the forecasting task is at its most challenging.
Combining forecasts is an established approach for improving forecast accuracy. So-called optimal weights (OWs) estimate combination weights by minimizing errors on past forecasts. Yet the most successful and common approach ignores all training data and assigns equal weights (EWs) to forecasts. We analyze this phenomenon by relating forecast combination to statistical learning theory, which decomposes forecast errors into three components: bias, variance, and irreducible error. In this framework, EWs minimize the variance component (errors resulting from estimation uncertainty) but ignore the bias component (errors from under-sensitivity to training data). OWs, in contrast, minimize the bias and ignore the variance component. Reducing one component in general increases the other. To address this trade-off between bias and variance, we first derive the expected squared error of a combination using weights between EWs and OWs (technically, OWs shrunk toward EWs) and decompose it into the three error components. We then use the components to derive the shrinkage factor between EWs and OWs that minimizes the expected error. We evaluate the approach on forecasts from the Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters. For these forecasts, we first show that assumptions regarding the error distribution that are commonly used in theoretical analyses are likely to be violated in practice. We then demonstrate that our approach improves over EWs and OWs if the assumptions are met, for instance, as the result of using a standardization procedure for the training data.
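The trade-off can be illustrated with a small sketch: "optimal" weights minimize w'Σw subject to the weights summing to one, with Σ estimated from past errors, and are then shrunk toward equal weights by a factor lambda. Here lambda is simply scanned over a grid, whereas the paper derives the error-minimizing factor analytically; the covariance matrix and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
K, T = 4, 40
true_cov = np.diag([1.0, 1.5, 2.0, 2.5]) + 0.3    # true error covariance
train = rng.multivariate_normal(np.zeros(K), true_cov, T)   # past errors

sigma = np.cov(train, rowvar=False)               # estimated covariance
ones = np.ones(K)
ow = np.linalg.solve(sigma, ones)
ow /= ones @ ow                                   # optimal weights, sum to one
ew = ones / K                                     # equal weights

def combo_var(w):
    # Population variance of the combined forecast error.
    return w @ true_cov @ w

for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    w = lam * ew + (1 - lam) * ow                 # shrink OW toward EW
    print(f"lambda={lam:4.2f}  combined error variance={combo_var(w):.3f}")
```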
Building electric energy consumption forecasting is essential in establishing an energy operation strategy for building energy management systems. Because of recent developments in artificial intelligence hardware, deep neural network (DNN)-based electric energy consumption forecasting models yield excellent performance. However, constructing an optimal forecasting model using DNNs is difficult and time-consuming because several hyperparameters, including the activation function and number of hidden layers, must be determined to obtain the best combination of neural networks. The determination of the number of hidden layers in the DNN model is challenging because it greatly affects the forecasting performance of DNN models. In addition, the best number of hidden layers for one situation or domain is often not optimal for another. Hence, many efforts have been made to combine multiple DNN models with different numbers of hidden layers to achieve a better forecasting performance than that of an individual DNN model. In this study, we propose a novel scheme for the combination of short-term load forecasting models using a stacking ensemble approach (COSMOS), which enables more accurate prediction of building electric energy consumption. For this purpose, we first collected 15-min interval electric energy consumption data for a typical office building and split them into training, validation, and test datasets. We constructed diverse four-layer DNN-based forecasting models based on the training set, considering the input variable configuration and training epochs. We selected optimal DNN parameters using the validation set and constructed four DNN-based forecasting models with various numbers of hidden layers. On the test set, we used sliding window-based principal component regression to compute the final forecast from the forecasts of the four DNN models. To demonstrate the performance of our approach, we conducted several experiments using actual electric energy consumption data and verified that our model yields a better prediction performance than other forecasting methods.
This paper considers the problem of forecasting a collection of short time series using cross‐sectional information in panel data. We construct point predictors using Tweedie's formula for the posterior mean of heterogeneous coefficients under a correlated random effects distribution. This formula utilizes cross‐sectional information to transform the unit‐specific (quasi) maximum likelihood estimator into an approximation of the posterior mean under a prior distribution that equals the population distribution of the random coefficients. We show that the risk of a predictor based on a nonparametric kernel estimate of the Tweedie correction is asymptotically equivalent to the risk of a predictor that treats the correlated random effects distribution as known (ratio optimality). Our empirical Bayes predictor performs well compared to various competitors in a Monte Carlo study. In an empirical application, we use the predictor to forecast revenues for a large panel of bank holding companies and compare forecasts that condition on actual and severely adverse macroeconomic conditions.
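A sketch of the Tweedie-correction idea under assumed Gaussian noise: unit-level estimates are shifted by the noise variance times the score of the marginal density of those estimates, with the marginal estimated by a kernel density over the cross-section. Data, noise level, and the kernel choice are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
N, sigma = 500, 1.0
lam = rng.normal(2.0, 0.7, N)              # true heterogeneous coefficients
lam_hat = lam + rng.normal(0, sigma, N)    # unit-specific (noisy) estimates

# Tweedie's formula: E[lam | lam_hat] = lam_hat + sigma^2 * d/dx log p(lam_hat),
# where p is the *marginal* density of the estimates across the panel.
kde = gaussian_kde(lam_hat)
eps = 1e-4
score = (np.log(kde(lam_hat + eps)) - np.log(kde(lam_hat - eps))) / (2 * eps)
tweedie = lam_hat + sigma**2 * score       # empirical-Bayes point predictor

print("MSE raw estimate :", round(np.mean((lam_hat - lam) ** 2), 3))
print("MSE Tweedie corr.:", round(np.mean((tweedie - lam) ** 2), 3))
```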
The investigation of the accuracy of methods employed to forecast agricultural commodities prices is an important area of study. In this context, the development of effective models is necessary. Regression ensembles can be used for this purpose. An ensemble is a set of combined models which act together to forecast a response variable with lower error. Faced with this, the general contribution of this work is to explore the predictive capability of regression ensembles by comparing ensembles among themselves, as well as with approaches that consider a single model (reference models) in the agribusiness area to forecast prices one month ahead. In this aspect, monthly time series referring to the price paid to producers in the state of Parana, Brazil for a 60 kg bag of soybean (case study 1) and wheat (case study 2) are used. The ensembles bagging (random forests - RF), boosting (gradient boosting machine - GBM and extreme gradient boosting machine - XGB), and stacking (STACK) are adopted. The support vector machine for regression (SVR), multilayer perceptron neural network (MLP) and K-nearest neighbors (KNN) are adopted as reference models. Performance measures such as mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE) are used for model comparison. Friedman and Wilcoxon signed rank tests are applied to evaluate the models' absolute percentage errors (APE). From the comparison of test set results, MAPE lower than 1% is observed for the best ensemble approaches. In this context, the XGB/STACK (LASSO-KNN-XGB-SVR) and RF models showed better performance for short-term forecasting tasks for case studies 1 and 2, respectively. Statistically smaller APE is observed for XGB/STACK and RF relative to the reference models. Moreover, approaches based on boosting are consistent, providing good results in both case studies. Overall, the ranking by performance is: XGB, GBM, RF, STACK, MLP, SVR, and KNN. It can be concluded that the ensemble approach presents statistically significant gains, reducing prediction errors for the price series studied. The use of ensembles is recommended to forecast agricultural commodities prices one month ahead, since more reliable performance is observed, increasing the accuracy of the constructed model and reducing decision-making risk.
We propose an automated method for obtaining weighted forecast combinations using time series features. The proposed approach involves two phases. First, we use a collection of time series to train a meta-model for assigning weights to various possible forecasting methods with the goal of minimizing the average forecasting loss obtained from a weighted forecast combination. The inputs to the meta-model are features that are extracted from each series. Then, in the second phase, we forecast new series using a weighted forecast combination, where the weights are obtained from our previously trained meta-model. Our method outperforms a simple forecast combination, as well as all of the most popular individual methods in the time series forecasting literature. The approach achieved second position in the M4 competition.
Mean square forecast error loss implies a bias–variance trade-off that suggests that structural breaks of small magnitude should be ignored. In this paper, we provide a test to determine whether modeling a structural break improves forecast accuracy. The test is near optimal even when the date of a local-to-zero break is not consistently estimable. The results extend to forecast combinations that weight the post-break sample and the full sample forecasts by our test statistic. In a large number of macroeconomic time series, we find that structural breaks that are relevant for forecasting occur much less frequently than existing tests indicate.
We present new methodology and a case study in use of a class of Bayesian predictive synthesis (BPS) models for multivariate time series forecasting. This extends the foundational BPS framework to the multivariate setting, with detailed application in the topical and challenging context of multistep macroeconomic forecasting in a monetary policy setting. BPS evaluates—sequentially and adaptively over time—varying forecast biases and facets of miscalibration of individual forecast densities for multiple time series, and—critically—their time-varying inter-dependencies. We define BPS methodology for a new class of dynamic multivariate latent factor models implied by BPS theory. Structured dynamic latent factor BPS is here motivated by the application context—sequential forecasting of multiple U.S. macroeconomic time series with forecasts generated from several traditional econometric time series models. The case study highlights the potential of BPS to improve forecasts of multiple series at multiple forecast horizons, and its use in learning dynamic relationships among forecasting models or agents.
We propose four different estimators that take into account the autocorrelation structure when reconciling forecasts in a temporal hierarchy. Combining forecasts from multiple temporal aggregation levels exploits information differences and mitigates model uncertainty, while reconciliation ensures a unified prediction that supports aligned decisions at different horizons. In previous studies, weights assigned to the forecasts were given by the structure of the hierarchy or the forecast error variances without considering potential autocorrelation in the forecast errors. Our first estimator considers the autocovariance matrix within each aggregation level. Since this can be difficult to estimate, we propose a second estimator that blends autocorrelation and variance information, but only requires estimation of the first-order autocorrelation coefficient at each aggregation level. Our third and fourth estimators facilitate information sharing between aggregation levels using robust estimates of the cross-correlation matrix and its inverse. We compare the proposed estimators in a simulation study and demonstrate their usefulness through an application to short-term electricity load forecasting in four price areas in Sweden. We find that by taking account of auto- and cross-covariances when reconciling forecasts, accuracy can be significantly improved uniformly across all frequencies and areas.
The Makridakis Competitions seek to identify the most accurate forecasting methods for different types of predictions. The M4 competition was the first in which a model of the type commonly described as “machine learning” has outperformed the more traditional statistical approaches, winning the competition. However, many approaches that were self-labeled as “machine learning” failed to produce accurate results, which generated discussion about the respective benefits and drawbacks of “statistical” and “machine learning” approaches. Both terms have remained ill-defined in the context of forecasting. This paper introduces the terms “structured” and “unstructured” models to better define what is intended by the use of the terms “statistical” and “machine learning” in the context of forecasting based on the model’s data generating process. The mechanisms that underlie specific challenges to unstructured modeling are examined in the context of forecasting, along with common solutions. Finally, the innovations in the winning model that allowed it to overcome these challenges and produce highly accurate results are highlighted.
Forecasting a large set of time series with hierarchical aggregation constraints is a central problem for many organizations. However, it is particularly challenging to forecast these hierarchical structures. In fact, it requires not only good forecast accuracy at each level of the hierarchy, but also the coherency between different levels, i.e. the forecasts should satisfy the hierarchical aggregation constraints. Given some incoherent base forecasts, the state-of-the-art methods compute revised forecasts based on forecast combination which ensures that the aggregation constraints are satisfied. However, these methods assume the base forecasts are unbiased and constrain the revised forecasts to be also unbiased. We propose a new forecasting method which relaxes these unbiasedness conditions, and seeks the revised forecasts with the best tradeoff between bias and forecast variance. We also present a regularization method which allows us to deal with high-dimensional hierarchies, and provide its theoretical justification. Finally, we compare the proposed method with the state-of-the-art methods both theoretically and empirically. The results on both simulated and real-world data indicate that our methods provide competitive results compared to the state-of-the-art methods.
The evidence from the literature on forecast combination shows that combinations generally perform well. We discuss here how the accuracy and diversity of the methods being combined and the robustness of the combination rule can influence performance, and illustrate this by showing that a simple, robust combination of a subset of the nine methods used in the M4 competition’s best combination performs almost as well as that forecast, and is easier to implement. We screened out methods with low accuracy or highly correlated errors and combined the remaining methods using a trimmed mean. We also investigated the accuracy risk (the risk of a bad forecast), proposing two new accuracy measures for this purpose. Our trimmed mean and the trimmed mean of all nine methods both had lower accuracy risk than either the best combination in the M4 competition or the simple mean of the nine methods.
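In miniature, and with simulated errors, the screening-plus-trimming recipe might look as follows; the accuracy threshold, correlation threshold, and trimming fraction are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(8)
K, T = 9, 60
errors = rng.normal(0, np.linspace(1, 3, K), (T, K))   # validation errors
errors[:, 8] = errors[:, 7] * 0.95    # method 9 nearly duplicates method 8

# Screen out methods with poor accuracy...
mae = np.abs(errors).mean(axis=0)
keep = [k for k in range(K) if mae[k] <= np.median(mae) * 1.5]

# ...and one of any pair with highly correlated errors.
corr = np.corrcoef(errors[:, keep], rowvar=False)
dropped = set()
for i in range(len(keep)):
    for j in range(i + 1, len(keep)):
        if corr[i, j] > 0.9 and j not in dropped:
            dropped.add(j)
pool = [k for idx, k in enumerate(keep) if idx not in dropped]

forecasts = rng.normal(50, 2, K)      # current forecasts per method
combined = trim_mean(forecasts[pool], proportiontocut=0.1)
print("pool:", pool, " trimmed-mean forecast:", round(combined, 2))
```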
We present a detailed description of our submission for the M4 forecasting competition, in which it ranked 3rd overall. Our solution utilizes several commonly used statistical models, which are weighted according to their performance on historical data. We cluster series within each type of frequency with respect to the existence of trend and seasonality. Every class of series is assigned a different set of models to combine. Combination weights are chosen separately for each series. We conduct experiments with a holdout set to manually pick pools of models that perform best for a given series type, as well as to choose the combination approaches.
Forecast combinations were big winners in the M4 competition. This note reflects on and analyzes the reasons for the success of forecast combination. We illustrate graphically how and in what cases forecast combinations produce good results. We also study the effects of forecast combination on the bias and the variance of the forecast.
This paper considers estimating the slope parameters and forecasting in potentially heterogeneous panel data regressions with a long time dimension. We propose a novel optimal pooling averaging estimator that makes an explicit trade‐off between efficiency gains from pooling and bias due to heterogeneity. By theoretically and numerically comparing various estimators, we find that a uniformly best estimator does not exist and that our new estimator is superior in non‐extreme cases and robust in extreme cases. Our results provide practical guidance for the best estimator and forecast depending on features of data and models. We apply our method to examine the determinants of sovereign credit default swap spreads and forecast future spreads.
Despite the clear success of forecast combination in many economic environments, several important issues remain incompletely resolved. The issues relate to the selection of the set of forecasts to combine, and whether some form of additional regularization (e.g., shrinkage) is desirable. Against this background, and also considering the frequently-found good performance of simple-average combinations, we propose a LASSO-based procedure that sets some combining weights to zero and shrinks the survivors toward equality (“partially-egalitarian LASSO”). Ex post analysis reveals that the optimal solution has a very simple form: the vast majority of forecasters should be discarded, and the remainder should be averaged. We therefore propose and explore direct subset-averaging procedures that are motivated by the structure of partially-egalitarian LASSO and the lessons learned, which, unlike LASSO, do not require the choice of a tuning parameter. Intriguingly, in an application to the European Central Bank Survey of Professional Forecasters, our procedures outperform simple average and median forecasts; indeed, they perform approximately as well as the ex post best forecaster.
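A sketch in the spirit of partially-egalitarian LASSO, on simulated forecasts: a LASSO regression of outcomes on the individual forecasts discards most forecasters, and the survivors are then combined by a simple average (the "egalitarian" step). The penalty strength is an illustrative assumption rather than the paper's tuning-free subset-averaging procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
T, K = 120, 20
y = rng.normal(0, 1, T)
# Forecasters observe the target plus idiosyncratic noise of varying size.
forecasts = y[:, None] + rng.normal(0, np.linspace(0.5, 3.0, K), (T, K))

# Step 1: LASSO sets most combining weights to zero.
lasso = Lasso(alpha=0.05, fit_intercept=False).fit(forecasts, y)
survivors = np.flatnonzero(lasso.coef_)
print("surviving forecasters:", survivors)

# Step 2: equal-weight the survivors.
combined = forecasts[:, survivors].mean(axis=1)
print("MAE subset average:", round(np.abs(combined - y).mean(), 3))
print("MAE grand average :", round(np.abs(forecasts.mean(axis=1) - y).mean(), 3))
```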
Large collections of time series often have aggregation constraints due to product or geographical groupings. The forecasts for the most disaggregated series are usually required to add-up exactly to the forecasts of the aggregated series, a constraint we refer to as “coherence.” Forecast reconciliation is the process of adjusting forecasts to make them coherent.
The reconciliation algorithm proposed by Hyndman et al. (2011) is based on a generalized least squares estimator that requires an estimate of the covariance matrix of the coherency errors (i.e., the errors that arise due to incoherence). We show that this matrix is impossible to estimate in practice due to identifiability conditions.
We propose a new forecast reconciliation approach that incorporates the information from a full covariance matrix of forecast errors in obtaining a set of coherent forecasts. Our approach minimizes the mean squared error of the coherent forecasts across the entire collection of time series under the assumption of unbiasedness. The minimization problem has a closed-form solution. We make this solution scalable by providing a computationally efficient representation.
We evaluate the performance of the proposed method compared to alternative methods using a series of simulation designs which take into account various features of the collected time series. This is followed by an empirical application using Australian domestic tourism data. The results indicate that the proposed method works well with artificial and real data.
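The reconciliation step itself has a compact closed form. Below is a minimal sketch for a two-level hierarchy (total = A + B): given incoherent base forecasts y_hat, the revised forecasts are S (S' W^-1 S)^-1 S' W^-1 y_hat; taking W = I gives OLS reconciliation, whereas the estimator described above uses the full covariance matrix of the base forecast errors. The numbers are illustrative.

```python
import numpy as np

S = np.array([[1, 1],        # total = A + B
              [1, 0],        # A
              [0, 1]])       # B
y_hat = np.array([102.0, 45.0, 52.0])   # incoherent: 45 + 52 != 102

def reconcile(y_hat, W):
    # y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat
    Winv = np.linalg.inv(W)
    P = np.linalg.inv(S.T @ Winv @ S) @ S.T @ Winv
    return S @ (P @ y_hat)

print(reconcile(y_hat, np.eye(3)))            # OLS reconciliation
W = np.diag([2.0, 1.0, 1.0])                  # e.g. a noisier top level
print(reconcile(y_hat, W))                    # WLS variant
```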
In a recent study, Bergmeir, Hyndman and Benítez (2016) successfully employed a bootstrap aggregation (bagging) technique for improving the performance of exponential smoothing. Each series is Box-Cox transformed, and decomposed by Seasonal and Trend decomposition using Loess (STL); then bootstrapping is applied on the remainder series before the trend and seasonality are added back, and the transformation reversed to create bootstrapped versions of the series. Subsequently, they apply automatic exponential smoothing on the original series and the bootstrapped versions of the series, with the final forecast being the equal-weight combination across all forecasts. In this study we attempt to address the question: why does bagging for time series forecasting work? We assume three sources of uncertainty (model uncertainty, data uncertainty, and parameter uncertainty) and we separately explore the benefits of bagging for time series forecasting for each one of them. Our analysis considers 4004 time series (from the M- and M3-competitions) and two families of models. The results show that the benefits of bagging predominantly originate from the model uncertainty: the fact that different models might be selected as optimal for the bootstrapped series. As such, a suitable weighted combination of the most suitable models should be preferred to selecting a single model.
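The bagging recipe reads almost directly as pseudocode. A minimal sketch (omitting the Box-Cox step for brevity, with an assumed moving-block bootstrap of the STL remainder and an assumed simulated monthly series):

```python
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(10)
t = np.arange(120)
y = 10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 120)

stl = STL(y, period=12).fit()     # trend + seasonal + remainder
block, H, B = 8, 12, 20           # block length, horizon, bootstrap count

def block_bootstrap(resid):
    starts = rng.integers(0, len(resid) - block, size=len(resid) // block + 1)
    sample = np.concatenate([resid[s:s + block] for s in starts])
    return sample[:len(resid)]

forecasts = []
for _ in range(B):
    # Re-assemble a bootstrapped series and fit exponential smoothing to it.
    boot = stl.trend + stl.seasonal + block_bootstrap(stl.resid)
    fit = ExponentialSmoothing(boot, trend="add", seasonal="add",
                               seasonal_periods=12).fit()
    forecasts.append(fit.forecast(H))

print(np.round(np.mean(forecasts, axis=0), 2))   # equal-weight combination
```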
This paper proposes a framework for the analysis of the theoretical properties of forecast combination, with the forecast performance being measured in terms of mean squared forecast errors (MSFE). Such a framework is useful for deriving all existing results with ease. In addition, it also provides insights into two forecast combination puzzles. Specifically, it investigates why a simple average of forecasts often outperforms forecasts from single models in terms of MSFEs, and why a more complicated weighting scheme does not always perform better than a simple average. In addition, this paper presents two new findings that are particularly relevant in practice. First, the MSFE of a forecast combination decreases as the number of models increases. Second, the conventional approach to the selection of optimal models, based on a simple comparison of MSFEs without further statistical testing, leads to a biased selection.
The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
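In its simplest form, stacking of predictive distributions chooses mixture weights over the models' leave-one-out predictive densities to maximize the held-out log score. The sketch below uses two fixed Gaussian candidates as stand-ins for genuine leave-one-out densities, and a softmax parametrization to keep the weights on the simplex; all of it is illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(11)
y = rng.standard_t(df=5, size=200)             # truth: heavy-tailed
# Stand-ins for leave-one-out predictive densities of two candidate models.
dens = np.column_stack([norm.pdf(y, scale=1.0),
                        norm.pdf(y, scale=2.0)])

def neg_log_score(theta):
    w = np.exp(theta) / np.exp(theta).sum()    # softmax onto the simplex
    return -np.log(dens @ w).sum()             # negative held-out log score

res = minimize(neg_log_score, x0=np.zeros(2))
w = np.exp(res.x) / np.exp(res.x).sum()
print("stacking weights:", np.round(w, 3))
```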
Organizational science has increasingly recognized the need for integrating time into its theories. In parallel, innovations in longitudinal designs and analyses have allowed these theories to be tested. To promote these important advances, the current article introduces time series analysis for organizational research, a set of techniques that has proved essential in many disciplines for understanding dynamic change over time. We begin by describing the various characteristics and components of time series data. Second, we explicate how time series decomposition methods can be used to identify and partition these time series components. Third, we discuss periodogram and spectral analysis for analyzing cycles. Fourth, we discuss the issue of autocorrelation and how different structures of dependency can be identified using graphics and then modeled as autoregressive moving-average (ARMA) processes. Finally, we conclude by describing more time series patterns, the issue of data aggregation, and more sophisticated techniques that were not able to be given proper coverage. Illustrative examples based on topics relevant to organizational research are provided throughout, and a software tutorial in R for these analyses accompanies each section.
Numerous forecast combination techniques have been proposed. However, these do not systematically outperform a simple average (SA) of forecasts in empirical studies. Although it is known that this is due to instability of learned weights, managers still have little guidance on how to solve this “forecast combination puzzle”, i.e., which combination method to choose in specific settings. We introduce a model determining the yet unknown asymptotic out-of-sample error variance of the two basic combination techniques: SA, where no weightings are learned, and so-called optimal weights that minimize the in-sample error variance. Using the model, we derive multi-criteria boundaries (considering training sample size and changes of the parameters which are estimated for optimal weights) to decide when to choose SA. We present an empirical evaluation which illustrates how the decision rules can be applied in practice. We find that using the decision rules is superior to all other considered combination strategies.
This paper offers a theoretical explanation for the stylized fact that forecast combinations with estimated optimal weights often perform poorly in applications. The properties of the forecast combination are typically derived under the assumption that the weights are fixed, while in practice they need to be estimated. If the fact that the weights are random rather than fixed is taken into account during the optimality derivation, then the forecast combination will be biased (even when the original forecasts are unbiased), and its variance will be larger than in the fixed-weight case. In particular, there is no guarantee that the 'optimal' forecast combination will be better than the equal-weight case, or even improve on the original forecasts. We provide the underlying theory, some special cases, and a numerical illustration.
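A small Monte Carlo makes the point concrete: with weights estimated from a short window, the "optimal" combination can have a larger error variance than equal weights, even though it wins when the true covariance is known. The covariance matrix and window length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
K = 3
Sigma = np.array([[1.0, 0.6, 0.5],
                  [0.6, 1.3, 0.6],
                  [0.5, 0.6, 1.6]])       # true forecast-error covariance
ones = np.ones(K)

def ow(cov):
    w = np.linalg.solve(cov, ones)
    return w / (ones @ w)                 # variance-minimizing weights

def combined_var(w):
    return w @ Sigma @ w                  # true variance of the combination

results = {"equal weights": [], "estimated OW": []}
for _ in range(2000):
    sample = rng.multivariate_normal(np.zeros(K), Sigma, size=15)
    results["estimated OW"].append(combined_var(ow(np.cov(sample, rowvar=False))))
    results["equal weights"].append(combined_var(ones / K))

print("true-OW variance:", round(combined_var(ow(Sigma)), 3))
for name, v in results.items():
    print(f"{name:13s} variance: {np.mean(v):.3f}")
```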
We discuss model and forecast comparison, calibration, and combination from a foundational perspective. Bayesian predictive synthesis (BPS) defines a coherent theoretical basis for combining multiple forecast densities, whether from models, individuals, or other sources, and extends existing forecast pooling and Bayesian model mixing methods. Time series extensions are implicit dynamic latent factor models, allowing adaptation to time-varying biases, mis-calibration, and dependencies among models or forecasters. Bayesian simulation-based computation enables implementation. A macroeconomic time series study highlights insights into dynamic relationships among synthesized forecast densities, as well as the potential for improved forecast accuracy at multiple horizons.
Empirical Bayes methods for Gaussian compound decision problems involving longitudinal data are considered. The new convex optimization formulation of the nonparametric (Kiefer-Wolfowitz) maximum likelihood estimator for mixture models is employed to construct nonparametric Bayes rules for compound decisions. The methods are first illustrated with some simulation examples and then with an application to models of income dynamics. Using PSID data we estimate a simple dynamic model of earnings that incorporates bivariate heterogeneity in intercept and variance of the innovation process. Profile likelihood is employed to estimate an AR(1) parameter controlling the persistence of the innovations. We find that persistence is relatively modest when we permit heterogeneity in variances. Evidence of negative dependence between individual intercepts and variances is revealed by the nonparametric estimation of the mixing distribution, and has important consequences for forecasting future income trajectories.
Multilevel regression and poststratification (MRP) is a method to estimate public opinion across geographic units from individual-level survey data. If it works with samples the size of typical national surveys, then MRP offers the possibility of analyzing many political phenomena previously believed to be outside the bounds of systematic empirical inquiry. Initial investigations of its performance with conventional national samples produce generally optimistic assessments. This article examines a larger number of cases and a greater range of opinions than in previous studies and finds substantial variation in MRP performance. Through empirical and Monte Carlo analyses, we develop an explanation for this variation. The findings suggest that the conditions necessary for MRP to perform well will not always be met. Thus, we draw a less optimistic conclusion than previous studies do regarding the use of MRP with samples of the size found in typical national surveys.
A number of procedures for forecasting a time series from its own current and past values are surveyed. Forecasting performances of three methods--Box-Jenkins, Holt-Winters and stepwise autoregression--are compared over a large sample of economic time series. The possibility of combining individual forecasts in the production of an overall forecast is explored, and we present empirical results which indicate that such a procedure can frequently be profitable.
This work presents a novel approach to multivariate time series classification. The method exploits the multivariate structure of the time series and the possibilities of the stacking ensemble method. The basics of the method may be described in three steps: first, decompose the multivariate time series into its constituent univariate time series; second, induce a classifier for each univariate time series plus an additional multivariate classifier for the whole time series; third, create the final multivariate time series classifier by stacking the previous classifiers. The ensemble obtained has the potential to improve the accuracy of the single multivariate time series classifier. Several configurations of the stacking method have been tested on seven multivariate time series data sets. In five out of seven data sets, the proposed method obtains the smallest error rate. Moreover, in two out of seven data sets, stacking only the univariate time series classifiers provides the best results. The experimental results show that when a multivariate time series method does not produce an accurate classifier, stacking it with univariate time series classifiers is an alternative worthy of consideration.
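The three-step scheme can be sketched with scikit-learn: one classifier per univariate channel (each channel summarized here by simple features as a stand-in for a genuine time series classifier), stacked by a meta-classifier. The synthetic two-channel data and feature choice are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(13)
n, T = 200, 50
labels = rng.integers(0, 2, n)
# Channel 1 differs in mean, channel 2 in volatility, depending on the class.
ch1 = rng.normal(labels[:, None] * 0.5, 1.0, (n, T))
ch2 = rng.normal(0.0, 1.0 + labels[:, None] * 0.5, (n, T))
X = np.hstack([ch1, ch2])            # flattened multivariate series

def channel_features(lo, hi):
    # Mean and std of one channel: a stand-in for a per-channel TS classifier.
    return FunctionTransformer(
        lambda Z: np.column_stack([Z[:, lo:hi].mean(axis=1),
                                   Z[:, lo:hi].std(axis=1)]))

estimators = [
    ("uni1", make_pipeline(channel_features(0, T), LogisticRegression())),
    ("uni2", make_pipeline(channel_features(T, 2 * T), LogisticRegression())),
]
stack = StackingClassifier(estimators, final_estimator=LogisticRegression())
# In-sample accuracy of the stacked classifier (sketch only).
print("stacked accuracy:", round(stack.fit(X, labels).score(X, labels), 2))
```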