Article

The M2-competition: A real-time judgmentally based forecasting study

Abstract

The purpose of the M2-Competition is to determine the post-sample accuracy of various forecasting methods. It is an empirical study organized in such a way as to avoid the major criticism of the M-Competition: that forecasters in real situations can use additional information to improve the predictive accuracy of quantitative methods. Such information might involve inside knowledge (e.g. a machine breakdown, a forthcoming strike at a major competitor, a steep price increase, etc.), be related to the expected state of the industry or economy that might affect the product(s) involved, or be the outcome of a careful study of the historical data and special care in the procedures/methods employed while forecasting. The M2-Competition consisted of distributing 29 actual series (23 of these series came from four companies and six were of a macroeconomic nature) to five forecasters. The data covered information up to and including the September figures of the year involved. The objective was to make monthly forecasts covering 15 months, starting from October and extending through December of the next year. A year later the forecasters were provided with the new data as they had become available, and the process of predicting 15 months ahead was repeated. In addition to being able to incorporate their predictions about the state of the economy and that of the industry, the participating forecasters could ask for any additional information they wanted from the collaborating companies. Although the forecasters had additional information about the series being predicted, the results show few or no differences in post-sample forecasting accuracy when compared to those of the M-Competition or the earlier Makridakis and Hibon empirical study.


... 6 An instance of CTF consists of a publicly available benchmark data set, a number of competitors working on the same (predictive) task, and an objective referee. Through the M competitions [1,15,16,17], forecasting has had a long-lasting tradition in CTF. ...
... Conversely, a number of forecasting competitions are available on Kaggle. 16 Conclusions from these competitions, such as the success of ensemble and data-driven methods, have foreshadowed the results of the M4 competition and could be taken more seriously in the academic literature. ...
... The requirements of modern forecasting scenarios via high-dimensional, streaming or big data use cases (e.g., via internet of things applications) amplify the importance of such practical considerations. 16 For example: https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data and https://www.kaggle.com/c/walmart-sales-forecasting/data ...
Preprint
Classifying forecasting methods as being either of a "machine learning" or "statistical" nature has become commonplace in parts of the forecasting literature and community, as exemplified by the M4 competition and the conclusion drawn by the organizers. We argue that this distinction does not stem from fundamental differences in the methods assigned to either class. Instead, this distinction is probably of a tribal nature, which limits the insights into the appropriateness and effectiveness of different forecasting methods. We provide alternative characteristics of forecasting methods which, in our view, allow meaningful conclusions to be drawn. Further, we discuss areas of forecasting which could benefit most from cross-pollination between the ML and the statistics communities.
... We find this interesting keeping the T error formula in mind and the empirical findings of several studies (Makridakis et al., 1982; Makridakis et al., 1993; Makridakis & Hibon, 2000; Makridakis et al., 2018), where ensembles seem to outperform single models; one explanation could be that E_loss(m) ≥ E_err(M) for every m ∈ M ...
... It is evident that Period is the most important statistical feature for prediction for the Fanciful weighting model. Seasonal effects are also important for our weight net, as seasonal strength is the second most important statistical feature, a result which coincides with the findings of Makridakis et al. (1993). Furthermore, arch R², which is the R² of an AR model, is also an important statistical feature. ...
Preprint
Full-text available
This paper presents an ensemble forecasting method that shows strong results on the M4 Competition dataset by decreasing feature and model selection assumptions, termed DONUT (DO Not UTilize human assumptions). Our assumption reductions, consisting mainly of auto-generated features and a more diverse model pool for the ensemble, significantly outperform the statistical-feature-based ensemble method FFORMA by Montero-Manso et al. (2020). Furthermore, we investigate feature extraction with a Long Short-Term Memory (LSTM) network autoencoder and find that such features contain crucial information not captured by traditional statistical feature approaches. The ensemble weighting model uses both LSTM features and statistical features to combine the models accurately. Analysis of feature importance and interaction shows a slight superiority of LSTM features over the statistical ones alone. Clustering analysis shows that the essential LSTM features differ from most statistical features and from each other. We also find that increasing the solution space of the weighting model by augmenting the ensemble with new models is something the weighting model learns to use, explaining part of the accuracy gains. Lastly, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying differences through linear optimization on the M4 dataset. We also include a short proof that model combination is superior to model selection, a posteriori.
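The closing claim of this abstract, that model combination is superior to model selection a posteriori, follows from a short convexity argument; here is a minimal sketch in generic notation (the loss L, weight simplex Δ and model pool M are symbols introduced for illustration, not taken from the paper):

\Delta = \{ w \in \mathbb{R}^{|M|} : w_i \ge 0,\ \textstyle\sum_i w_i = 1 \}, \qquad \hat{y}_t(w) = \sum_{i \in M} w_i\, \hat{y}_{t,i},

\min_{w \in \Delta} L\bigl(\hat{y}(w)\bigr) \;\le\; \min_{i \in M} L\bigl(\hat{y}(e_i)\bigr),

since every single model corresponds to a vertex e_i of the simplex and is therefore one of the candidates over which the left-hand minimum is taken. Note that this is an in-sample (a posteriori) statement; out of sample, the chosen weights need not remain optimal.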
... Thus, the sum of independent ARMA processes is again ARMA. Equation (9) can also be expressed as: ...
... In order to select the appropriate form of an ARMA model from a set of possible models, there is a need to examine the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the stationary series. Makridakis et al. (1993) and Box and Jenkins (1976) provided a theoretical framework and guiding principles for determining the appropriate model. An important criterion, developed by Akaike, compares the quality of a set of possible models and ranks them in order, starting from the best; the "best" or "optimum" model is the one that best fits the data. ...
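As a concrete illustration of the AIC-based ranking described in this excerpt, the sketch below fits a few candidate ARIMA orders and sorts them by AIC. It is a minimal example using statsmodels; the candidate orders and the synthetic series are placeholders, not the data or models of the cited study.

# Minimal sketch: rank candidate ARIMA orders by AIC (lower is better).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=200)))        # placeholder series

candidates = [(1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 2)]
results = []
for order in candidates:
    fit = ARIMA(y, order=order).fit()                 # maximum likelihood estimation
    results.append((order, fit.aic))

# Rank from best (smallest AIC) to worst, as the excerpt describes.
for order, aic in sorted(results, key=lambda r: r[1]):
    print(order, round(aic, 2))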
Article
Full-text available
This research work modeled chicken production in Nigeria using the least squares approach as well as the univariate Box-Jenkins Autoregressive Moving Average (ARMA) method. The objective is to investigate the production and make logical forecasts for poultry farmers so they can meet any future challenges that may arise due to the increase in demand for poultry meat. The maximum likelihood method of estimation was used to obtain the parameters of the fitted Autoregressive Integrated Moving Average (ARIMA) model. Yearly chicken production data for the period 1961 to 2017, obtained from the Food and Agricultural Organisation (FAO), were used to investigate the performance of the model. An Augmented Dickey-Fuller (ADF) test carried out to check for stationarity of the series shows that the original data were nonstationary and that stationarity was attained after the first differencing. The selected ARIMA model emerged as the best among the other fitted models, with an Akaike Information Criterion (AIC) value of 1168.42 and a log-likelihood of 579.21, and has estimated coefficients 0.3169, 0.3948 and 0.9377. Diagnostic checks via standardised residuals, the ACF of the residuals and p-values revealed that the model captures the data well enough. Forecast values for the years 2018 to 2023 were then obtained, which show considerable variation over the years. The ARIMA model clearly performed excellently in studying the behavior of chicken production data and forecasting its future values.
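A short sketch of the stationarity check described in this abstract (an ADF test on the level series, then on the first difference); this is a generic illustration with statsmodels and a placeholder series, not the FAO chicken-production data.

# Minimal sketch: Augmented Dickey-Fuller test before and after first differencing.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(loc=0.5, size=57)))   # placeholder yearly series

stat, pvalue, *_ = adfuller(y)
print(f"level series:       ADF stat={stat:.2f}, p-value={pvalue:.3f}")   # likely non-stationary

dy = y.diff().dropna()                                   # first difference
stat, pvalue, *_ = adfuller(dy)
print(f"differenced series: ADF stat={stat:.2f}, p-value={pvalue:.3f}")   # expected to be stationary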
... (Hyndman, 2020; Makridakis et al., 2021). From these, the M competitions (Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000; Makridakis et al., 2020d) are probably the most influential and widely cited in the field of forecasting, the most recent being the M5 competition that took place in the period of March-June, 2020. ...
... The competition also placed particular emphasis on benchmarking, considering a variety of methods, both traditional and state-of-the-art, that can be classified as statistical, ML, and combinations. In the first three M competitions (Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000), and for many years in the forecasting literature (Bates & Granger, 1969; Claeskens et al., 2016), combinations and relatively simple methods were regarded as being at least as accurate as sophisticated ones. The M4 competition (Makridakis et al., 2020d), however, despite confirming the value of combining, indicated that more sophisticated ML methods could provide significantly more accurate results. ...
Article
Full-text available
The M5 competition follows the previous four M competitions, whose purpose is to learn from empirical evidence how to improve forecasting performance and advance the theory and practice of forecasting. M5 focused on a retail sales forecasting application with the objective to produce the most accurate point forecasts for 42,840 time series that represent the hierarchical unit sales of the largest retail company in the world, Walmart, as well as to provide the most accurate estimates of the uncertainty of these forecasts. Hence, the competition consisted of two parallel challenges, namely the Accuracy and Uncertainty forecasting competitions. M5 extended the results of the previous M competitions by: (a) significantly expanding the number of participating methods, especially those in the category of machine learning; (b) evaluating the performance of the uncertainty distribution along with point forecast accuracy; (c) including exogenous/explanatory variables in addition to the time series data; (d) using grouped, correlated time series; and (e) focusing on series that display intermittency. This paper describes the background, organization, and implementations of the competition, and it presents the data used and their characteristics. Consequently, it serves as introductory material to the results of the two forecasting challenges to facilitate their understanding.
... Several forecasting competitions covering a range of real-world time series have shown that more sophisticated models do not necessarily outperform simpler ones [2,3,6,100,104,105]. ...
... During several forecasting competitions, Makridakis et al. [105] found that exponential smoothing performed well in time-series forecasting. When investigating strategies to select models to forecast time series, Fildes [115] found that exponential smoothing with damped trends was superior, especially when choosing a model that is applied to all series in an aggregate manner. ...
Article
Full-text available
This paper's top-level goal is to provide an overview of research conducted in the many academic domains concerned with forecasting. By providing a summary encompassing these domains, this survey connects them, establishing a common ground for future discussions. To this end, we survey literature on human judgement and quantitative forecasting as well as hybrid methods that involve both humans and algorithmic approaches. The survey starts with key search terms that identified more than 280 publications in the fields of computer science, operations research, risk analysis, decision science, psychology and forecasting. Results show an almost 10-fold increase in the application-focused forecasting literature between the 1990s and the current decade, with a clear rise of quantitative, data-driven forecasting models. Comparative studies of quantitative methods and human judgement show that (1) neither method is universally superior, and (2) the better method varies as a function of factors such as availability, quality, extent and format of data, suggesting that (3) the two approaches can complement each other to yield more accurate and resilient models. We also identify four research thrusts in the human/machine-forecasting literature: (i) the choice of the appropriate quantitative model, (ii) the nature of the interaction between quantitative models and human judgement, (iii) the training and incentivization of human forecasters, and (iv) the combination of multiple forecasts (both algorithmic and human) into one. This review surveys current research in all four areas and argues that future research in the field of human/machine forecasting needs to consider all of them when investigating predictive performance. We also address some of the ethical dilemmas that might arise due to the combination of quantitative models with human judgement.
... Humans versus algorithms is an ongoing debate in decision making. In the forecasting field in particular, judgment has found its way into several different aspects of the forecasting process (Arvan et al., 2019; Makridakis et al., 2020a), from directly producing judgmental point forecasts or prediction intervals (Lawrence et al., 1985; Lawrence and Makridakis, 1989; Makridakis et al., 1993; Harvey and Bolger, 1996; Petropoulos et al., 2017) to adjusting the formal (statistical) forecasts of forecasting support systems (Fildes et al., 2009, 2019; Franses and Legerstee, 2009a,b), or, more recently, using judgment to select between statistical forecasting models (Han et al., 2019; De Baets and Harvey, 2020). ...
... Various studies [9]–[11] underscore the inadequacy of independent levels' forecasts in capturing the intricate relationships within organizational and business hierarchies. Additionally, many time series in this domain contain substantial amounts of intermittent data, making traditional models less applicable, and studies from large-scale forecasting competitions suggest that combining models instead of selecting one "best" model can improve accuracy and robustness [12]–[14]. ...
Preprint
Full-text available
Ads demand forecasting for Walmart's ad products plays a critical role in enabling effective resource planning, allocation, and management of ads performance. In this paper, we introduce a comprehensive demand forecasting system that tackles hierarchical time series forecasting in business settings. Though traditional hierarchical reconciliation methods ensure forecasting coherence, they often trade off accuracy for coherence, especially at lower levels, and fail to capture the seasonality unique to each time series in the hierarchy. Thus, we propose a novel framework, "Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment (Multi-Stage HiFoReAd)", to address the challenges of preserving seasonality, ensuring coherence, and improving accuracy. Our system first utilizes diverse models, ensembled through Bayesian Optimization (BO), to achieve base forecasts. The generated base forecasts are then passed into the Multi-Stage HiFoReAd framework. The initial stage refines the hierarchy using Top-Down forecasts and "harmonic alignment." The second stage aligns the higher levels' forecasts using the MinTrace algorithm, following which the last two levels undergo "harmonic alignment" and "stratified scaling", to eventually achieve accurate and coherent forecasts across the whole hierarchy. Our experiments on Walmart's internal Ads-demand dataset and 3 other public datasets, each with 4 hierarchical levels, demonstrate that the average Absolute Percentage Error from the cross-validation sets improves by 3% to 40% across levels against the BO ensemble of models (LGBM, MSTL+ETS, Prophet), as well as by 1.2% to 92.9% against state-of-the-art models. In addition, the forecasts at all hierarchical levels are shown to be coherent. The proposed framework has been deployed and leveraged by Walmart's ads, sales and operations teams to track future demands, make informed decisions and plan resources.
... Fair evaluation and comparison of the output of different forecasting methods have remained an open question. Three competitions, named the Makridakis Competitions (M-Competitions), were held in 1982, 1993, and 2000, and were intended to evaluate and compare the performance and accuracy of different time-series forecasting methods [13,14]. In their analysis, the accuracy of the different methods is evaluated by calculating different error measures on business and economic time series, which may be applicable to other disciplines. ...
Preprint
Background: Over the past few decades, numerous forecasting methods have been proposed in the field of epidemic forecasting. Such methods can be classified into different categories such as deterministic vs. probabilistic, comparative methods vs. generative methods, and so on. In some of the more popular comparative methods, researchers compare observed epidemiological data from early stages of an outbreak with the output of proposed models to forecast the future trend and prevalence of the pandemic. A significant problem in this area is the lack of standard well-defined evaluation measures to select the best algorithm among different ones, as well as for selecting the best possible configuration for a particular algorithm. Results: In this paper, we present an evaluation framework which allows for combining different features, error measures, and ranking schema to evaluate forecasts. We describe the various epidemic features (Epi-features) included to characterize the output of forecasting methods and provide suitable error measures that could be used to evaluate the accuracy of the methods with respect to these Epi-features. We focus on long-term predictions rather than short-term forecasting and demonstrate the utility of the framework by evaluating six forecasting methods for predicting influenza in the United States. Our results demonstrate that different error measures lead to different rankings even for a single Epi-feature. Further, our experimental analyses show that no single method dominates the rest in predicting all Epi-features, when evaluated across error measures. As an alternative, we provide various consensus ranking schema that summarizes individual rankings, thus accounting for different error measures. We believe that a comprehensive evaluation framework, as presented in this paper, will add value to the computational epidemiology community.
... Examples of forecast competitions include Makridakis, Chatfield et al. (1993) and Montgomery, Zarnowitz et al. (1992), amongst others. 18 Important methodological DL and ARDL studies after Fisher include: Koopmans (1941), Alt (1942), Koyck (1954), Almon (1965), Dhrymes (1971), and more recently Pesaran and Shin (1998). ...
Article
Full-text available
The formal targeting of monetary aggregates by central banks, once a hallmark of Monetarist policy, has become nearly obsolete in the 21st century. Despite the appeal of money supply targeting during periods like the Volcker M1 experiment in the United States, few central banks today openly emphasize monetary aggregates in their policy frameworks. This study conducts a one-quarter-ahead forecasting "horse race" among 15 predictive models—spanning naïve models, atheoretical ARIMA, basic distributed lag, hybrid ARDL, and mixed-frequency MIDAS frameworks—to assess the near-term accuracy of models that include M2 versus those that do not. Our findings highlight the predictive appeal of monetary data, yet they also contextualize the Federal Reserve's current embargo on high-frequency weekly M2 data. When compared against historical US Treasury rate volatility, both during and post-Volcker, our analysis suggests that while monetary aggregates provide some predictive value, the additional gains in out-of-sample forecast precision associated with higher-frequency MIDAS may be marginal or even detrimental. These findings underscore the complexity of incorporating high-frequency monetary data into policy without clear gains in forecast accuracy.
... The only exception to the above approach was that taken in the M2 competition (Makridakis et al., 1993), where a three-phase approach was adopted. The organizers first provided participants with a first batch of data, which participants used to produce their first set of forecasts. ...
... Frequently, much time is spent by the analyst to select different methods and predictive algorithms, carefully train the methods, and craft an extensive framework for their evaluation, whether forecasting time series (see, e.g., Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000; Makridakis et al., 2018, 2022) or forecasting the vote shares obtained by different parties in political elections (Selb & Munzert, 2016). There are considerable pitfalls when evaluating forecasting systems based on predictive performance only. ...
Article
Full-text available
Election polls are frequently employed to reflect voter sentiment with respect to a particular election (or fixed-event). Despite their widespread use as forecasts and inputs for predictive algorithms, there is substantial uncertainty regarding their efficiency. This uncertainty is amplified by judgment in the form of pollsters applying unpublished weighting schemes to ensure the representativeness of the sampled voters for the underlying population. Efficient forecasting systems incorporate past information instantly, which renders a given fixed-event unpredictable based on past information. This results in all sequential adjustments of the fixed-event forecasts across adjacent time periods (or forecast revisions) being martingale differences. This paper illustrates the theoretical conditions related to weak efficiency of fixed-event forecasting systems based on traditional least squares loss and asymmetrically weighted least absolute deviations (or quantile) loss. Weak efficiency of poll-based multi-period forecasting systems for all German federal state elections since the year 2000 is investigated. The inefficiency of almost all considered forecasting systems is documented and alternative explanations for the findings are discussed.
... Seasonal simple exponential smoothing. Exponential smoothing can be considered one of the most popular forecasting techniques since the 1950s (Osman & King, 2015). Many empirical studies and forecasting competitions show the usefulness of the exponential smoothing technique, for instance Makridakis et al. (1993). Formally, the multiplicative seasonal simple exponential smoothing with no trend takes the form of ...
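The equation this excerpt breaks off before cannot be recovered from the snippet, but a standard textbook form of simple exponential smoothing with multiplicative seasonality and no trend (generic notation with season length m and smoothing parameters α and γ, which may differ from the cited paper's notation) is:

\ell_t = \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)\,\ell_{t-1}, \qquad
s_t = \gamma\,\frac{y_t}{\ell_t} + (1-\gamma)\,s_{t-m}, \qquad
\hat{y}_{t+h\mid t} = \ell_t\, s_{t+h-m(\lfloor (h-1)/m \rfloor + 1)},

where \ell_t is the level, s_t the seasonal index, and the forecast reuses the most recently estimated seasonal index for the required calendar position.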
Article
Full-text available
Population aging occurs worldwide, especially in Japan, which has one of the highest average ages. Therefore, long-stay tourism is an alternative form of tourism for elderly Japanese tourists. The aim of this research was to construct an appropriate forecasting model for the number of long-stay Japanese tourist arrivals in Chiang Mai, Thailand. The data in this study were gathered from the Chiang Mai Immigration Office, recorded monthly from January 2014 to July 2017, a total of 43 months. The data were then split into two sets. The first data set, from January 2014 to December 2016 (36 months), was used to build the forecasting model using the methods of Classical decomposition, Seasonal simple exponential smoothing, Box-Jenkins and Combining. The second data set, from January 2017 to July 2017 (7 months), was used to compare the forecasting accuracy of these methods via the Root Mean Square Error (RMSE) criterion. The results indicated that combining forecasts was the most suitable approach for forecasting the number of long-stay Japanese tourist arrivals in Chiang Mai.
... Holt and Winters [31,32] introduced exponential smoothing as a simple method for capturing trend and seasonality patterns in time series data. This method has been a competitive alternative to the seasonal ARIMA and is widely used in applications (see [18], [33]–[40]) because of its robustness and accuracy [41]. An overview of the development of the exponential smoothing model for time series with trend and seasonal patterns can be seen in [3]. ...
Article
Full-text available
The importance of forecasting in the energy sector as part of electrical power equipment maintenance encourages researchers to obtain accurate electrical forecasting models. This study investigates simple to complex automatic methods and proposes two weighted ensemble approaches. The automated methods are the autoregressive integrated moving average; the exponential smoothing error–trend–seasonal method; the double seasonal Holt–Winters method; the trigonometric Box–Cox transformation, autoregressive, error, trend, and seasonal model; Prophet and neural networks. All accommodate trend and seasonal patterns commonly found in monthly, daily, hourly, or half-hourly electricity data. In comparison, the proposed ensemble approaches combine linearly (EnL) or nonlinearly (EnNL) the forecasting values obtained from all the single automatic methods by considering each model component's weight. In this work, four electrical time series with different characteristics are examined to demonstrate the effectiveness and applicability of the proposed ensemble approach; the model performances are compared based on root mean square error (RMSE) and mean absolute percentage errors (MAPEs). The experimental results show that compared to the existing average weighted ensemble approach, the proposed nonlinear weighted ensemble approach successfully reduces the RMSE and MAPE of the testing data by between 28% and 82%.
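To make the idea of a linearly weighted ensemble concrete, here is a minimal sketch that combines the forecasts of several base models with fixed weights and scores the result with RMSE and MAPE. The weights, forecasts and actuals are illustrative placeholders, and this is not the EnL/EnNL implementation from the paper.

# Minimal sketch: linear weighted combination of base forecasts, scored by RMSE and MAPE.
import numpy as np

actual = np.array([102.0, 98.0, 110.0, 105.0])           # placeholder test values
base_forecasts = {                                        # placeholder base-model forecasts
    "arima": np.array([100.0, 99.0, 108.0, 104.0]),
    "ets":   np.array([103.0, 97.0, 112.0, 106.0]),
    "nnet":  np.array([101.0, 96.0, 109.0, 107.0]),
}
weights = {"arima": 0.4, "ets": 0.4, "nnet": 0.2}         # assumed weights summing to 1

combined = sum(w * base_forecasts[name] for name, w in weights.items())

rmse = np.sqrt(np.mean((actual - combined) ** 2))
mape = 100.0 * np.mean(np.abs((actual - combined) / actual))
print(f"RMSE={rmse:.2f}, MAPE={mape:.2f}%")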
... In business specifically, improved forecasting translates into better production planning (leading to less waste) and less transportation (reducing CO2 emissions) (Kahn 2003; Kerkkänen, Korpela, and Huiskonen 2009; Nguyen, Ni, and Rossetti 2010). The progress made in univariate forecasting in the past four decades is well reflected in the results and methods considered in the associated competitions over that period (Makridakis et al. 1982, 1993; Makridakis and Hibon 2000; Athanasopoulos et al. 2011; Makridakis, Spiliotis, and Assimakopoulos 2018a). Recently, growing evidence has started to emerge suggesting that machine learning approaches could improve on classical forecasting methods, in contrast to some earlier assessments (Makridakis, Spiliotis, and Assimakopoulos 2018b). ...
Article
Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.
... The seasonal period s considered in this study was a time lag of seven periods, since the day-of-the-week pattern proved to be relevant in the PACF analysis. Even though both methods seem simple at first, they serve as baseline models for accuracy results and can even generate satisfactory results under some conditions, according to Makridakis et al. (1982), Makridakis et al. (1993) and Lawrence et al. (2000). ...
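The excerpt mentions simple baseline models with a seasonal period of seven; one common such baseline is the seasonal naive forecast, which repeats the value observed s periods earlier. Below is a generic sketch of that baseline (an assumption about which baseline is meant, not the study's own code).

# Minimal sketch: seasonal naive forecast, repeating the value observed s periods earlier.
import numpy as np

def seasonal_naive(history: np.ndarray, horizon: int, s: int = 7) -> np.ndarray:
    """Forecast `horizon` steps ahead by cycling through the last s observations."""
    last_season = history[-s:]
    reps = int(np.ceil(horizon / s))
    return np.tile(last_season, reps)[:horizon]

daily_sales = np.arange(1.0, 29.0)    # placeholder: four weeks of daily data
print(seasonal_naive(daily_sales, horizon=10, s=7))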
Article
This case study compares the forecasting accuracy obtained for four daily Brazilian retail sales indexes at four time prediction horizons. The performance of traditional time series forecasting models, artificial neural network architectures and machine learning algorithms was compared in order to evaluate the existence of a single best-performing model. Afterwards, ensemble methods were added to the model comparison to verify whether accuracy improvements could be obtained. Evidence found in this case study suggests that a consistent forecasting strategy exists for the Brazilian retail indexes by applying both seasonality treatment for holidays and calendar effects and by using an ensemble method whose main inputs are the predictions of all models with calendar variables. This strategy was consistent across all 16 index and time horizon combinations, since ensemble methods either outperformed the best single models or showed no statistically significant difference from them in a Diebold-Mariano test.
... The M2 competition (Makridakis et al., 1993) owes its origin to the belief that the only reason for the superiority of the simple methods over the sophisticated ones in M1 was that no human judgment could be applied to the statistical forecasts. This assumption, however, had to be tested on real-time data so that judgment could be applied to the statistical forecasts, making M2 the first live forecasting competition from the 18 major ones conducted until now (Makridakis et al., 2021). ...
... Williams and Kavanagh (2016) identify two forms of empirical forecast research focusing on numerous time series to find evidence of effectiveness with real time series: forecast comparisons, which contrast results from various methods; and forecast competitions, which contrast results from various forecasters (and may also reflect various methods). While a significant portion of this literature focuses on generalized improvement in forecasting methods (Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000; Makridakis et al., 2018b), much of the literature examines the domain-specific applications of forecasting techniques (Athanasopoulos et al., 2011; Chen et al., 2003, 2008; Gencay & Yang, 1996; P. R. Hansen & Lunde, 2005; Hong et al., 2014; Imhoff & Paré, 1982; Sahu & Kumar, 2013, 2014; Williams & Kavanagh, 2016). ...
Article
The recent rapid development of artificial intelligence (AI) is expected to transform how governments work by enhancing the quality of decision-making. Despite rising expectations and the growing use of AI by governments, scholarly research on AI applications in public administration has lagged. In this study, we fill gaps in the current literature on the application of machine learning (ML) algorithms with a focus on revenue forecasting by local governments. Specifically, we explore how different ML models perform on predicting revenue for local governments and compare the relative performance of revenue forecasting by traditional forecasters and several ML algorithms. Our findings reveal that traditional statistical forecasting methods outperform ML algorithms overall, while one of the ML algorithms, KNN, is more effective in predicting property tax revenue. This result is particularly salient for public managers in local governments seeking to handle foreseeable fiscal challenges through more accurate revenue predictions.
... There are several publications regarding empirical comparisons of forecasting approaches, e.g. the influential M-competitions that started in 1982 (Bojer & Meldgaard, 2021; Hong et al., 2019; Lloyd, 2014; Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000; Makridakis et al., 2018, 2020, 2021; Stepnicka & Burda, 2017). Nevertheless, these competitions do not show a general superiority of a specific approach. ...
Article
Full-text available
Forecasting future demand is of high importance for many companies as it affects operational decisions. This is especially relevant for products with a short shelf life due to the potential disposal of unsold items. Horticultural products are highly influenced by this, however with limited attention in forecasting research so far. Beyond that, many forecasting competitions show a competitive performance of classical forecasting methods. For the first time, we empirically compared the performance of nine state-of-the-art machine learning and three classical forecasting algorithms for horticultural sales predictions. We show that machine learning methods were superior in all our experiments, with the gradient boosted ensemble learner XGBoost being the top performer in 14 out of 15 comparisons. This advantage over classical forecasting approaches increased for datasets with multiple seasons. Further, we show that including additional external factors, such as weather and holiday information, as well as meta-features led to a boost in predictive performance. In addition, we investigated whether the algorithms can capture the sudden increase in demand of horticultural products during the SARS-CoV-2 pandemic in 2020. For this special case, XGBoost was also superior. All code and data is publicly available on GitHub: https://github.com/grimmlab/HorticulturalSalesPredictions.
... There are many studies on the numerical and theoretical comparison of Box-Jenkins and ES methods. Several empirical studies have been published in turn by Reid (1969), Newbold and Granger (1974), Makridakis and Hibon (1979), Makridakis et al. (1982), Makridakis et al. (1993), Makridakis and Hibon (2000), and Makridakis et al. (2018). ...
Article
Full-text available
The Ata method is a new univariate time series forecasting method which provides innovative solutions to issues faced during the initialization and optimization stages of existing methods. The Ata method's forecasting performance is superior to existing methods in terms of both easy implementation and accurate forecasting. It can be applied to non-seasonal or deseasonalized time series, where the deseasonalization can be performed via any preferred decomposition method. The R package ATAforecasting was developed as a comprehensive toolkit for automatic time series forecasting. It focuses on modelling all types of time series components with any preferred Ata method and on handling seasonality patterns by utilizing popular decomposition techniques. The ATAforecasting package lets researchers model seasonality with STL, STLplus, TBATS, stR and TRAMO/SEATS, apply power-family transformations, and analyse any time series with the simple Ata method as well as additive, multiplicative, damped-trend and level-fixed trended Ata methods. It offers functions for researchers and data analysts to model any type of time series data set without requiring specialization, while an expert user may use functions that can model all possible time series behaviours. The package also incorporates different types of model specifications and their graphs, and uses different accuracy measures that increase the Ata method's performance.
... The main finding of the first M competition (Makridakis et al., 1982), dating back to the early '80s, was that combining simple time series forecasting methods (e.g., Simple, Holt, and Damped exponential smoothing) results in superior forecasts that outperform both the individual methods used for combining and other, more sophisticated methods (e.g., ARIMA models). The same finding was confirmed in all M competitions that followed (Makridakis et al., 1993; Makridakis & Hibon, 2000; Makridakis et al., 2020b), where the vast majority of the top-performing submissions utilized combinations of various forecasting methods. For instance, the Theta method (Assimakopoulos & Nikolopoulos, 2000), which won the M3 competition, is based on a framework that decomposes data into two Theta lines which are individually extrapolated using Simple exponential smoothing (SES) and linear regression on a time-trend indicator and combined using equal weights to enhance forecasting accuracy. ...
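As a brief aside on the Theta decomposition described in this excerpt, a commonly used equivalent formulation of the theta lines is the following (generic textbook notation, where \hat{a} + \hat{b}t is the linear trend fitted to the series; a sketch rather than the exact specification of Assimakopoulos & Nikolopoulos, 2000):

Z_t(\theta) = \theta\, y_t + (1-\theta)\,(\hat{a} + \hat{b}\,t), \qquad
Z_t(0) = \hat{a} + \hat{b}\,t, \qquad
Z_t(2) = 2\,y_t - (\hat{a} + \hat{b}\,t),

with the classical forecast obtained by extrapolating Z(0) as the extended trend line, extrapolating Z(2) by simple exponential smoothing, and averaging the two with equal weights: \hat{y}_{t+h} = \tfrac{1}{2}\,(\hat{Z}_{t+h}(0) + \hat{Z}_{t+h}(2)).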
Article
Full-text available
The scientific method consists of making hypotheses or predictions and then carrying out experiments to test them once the actual results have become available, in order to learn from both successes and mistakes. This approach was followed in the M4 competition with positive results and has been repeated in the M5, with its organizers submitting their ten predictions/hypotheses about its expected results five days before its launch. The present paper presents these predictions/hypotheses and evaluates their realization according to the actual findings of the competition. The results indicate that well-established practices, like combining forecasts, exploiting explanatory variables, and capturing seasonality and special days, remain critical for enhancing forecasting performance, re-confirming also that relatively new approaches, like cross-learning algorithms and machine learning methods, display great potential. Yet, we show that simple, local statistical methods may still be competitive for forecasting high granularity data and estimating the tails of the uncertainty distribution, thus motivating future research in the field of retail sales forecasting.
... These univariate models are easy to compute and were historically able to achieve similar levels of forecast accuracy to more complex methods (Makridakis et al., 2020a). This observation from the first forecasting competition by Makridakis and Hibon (1979) was reconfirmed in the first three M-competitions (Makridakis et al., 1982, 1993; Makridakis and Hibon, 2000). The M3 winning model, for instance, combined linear regression with simple exponential smoothing with drift (Assimakopoulos and Nikolopoulos, 2000). ...
Article
Full-text available
The winning machine learning methods of the M5 Accuracy competition demonstrated high levels of forecast accuracy compared to the top-performing benchmarks in the history of the M-competitions. Yet, large-scale adoption is hampered due to the significant computational requirements to model, tune, and train these state-of-the-art algorithms. To overcome this major issue, we discuss the potential of transfer learning (TL) to reduce the computational effort in hierarchical forecasting and provide a proof of concept that TL can be applied on M5 top-performing methods. We demonstrate our easy-to-use TL framework on the recursive store-level LightGBM models of the M5 winning method and attain similar levels of forecast accuracy with roughly 25% less training time. Our findings provide evidence for a novel application of TL to facilitate the practical applicability of the M5 winning methods in large-scale settings with hierarchically structured data.
... Citation and links to data and/or winning methods:
M or M1 (1982), Makridakis et al. (1982): https://forecasters.org/resources/time-series-data/ and https://cran.r-project.org/package=Mcomp
M2 (1993), Makridakis et al. (1993): https://forecasters.org/resources/time-series-data/
M3 (2000), Makridakis and Hibon (2000)
(2018): https://www.kaggle.com/c/recruit-restaurant-visitorforecasting ...
Article
Full-text available
Forecasting competitions are the equivalent of laboratory experimentation widely used in physical and life sciences. They provide useful, objective information to improve the theory and practice of forecasting, advancing the field, expanding its usage, and enhancing its value to decision and policymakers. We describe 10 design attributes to be considered when organizing forecasting competitions, taking into account trade-offs between optimal choices and practical concerns, such as costs, as well as the time and effort required to participate in them. Consequently, we map all major past competitions in respect to their design attributes, identifying similarities and differences between them, as well as design gaps, and making suggestions about the principles to be included in future competitions, putting a particular emphasis on learning as much as possible from their implementation in order to help improve forecasting accuracy and uncertainty. We discuss that the task of forecasting often presents a multitude of challenges that can be difficult to capture in a single forecasting contest. To assess the caliber of a forecaster, we, therefore, propose that organizers of future competitions consider a multicontest approach. We suggest the idea of a forecasting-“athlon” in which different challenges of varying characteristics take place.
... The M-competitions, named after their initiator Spyros Makridakis, are a series of time series forecasting competitions that aim to improve empirical evidence and advance the theory and practice of statistical forecasting (Makridakis et al., 1982; Makridakis et al., 1993; Makridakis and Hibon, 2000; Makridakis et al., 2020). The most recent competition, M4, included 100,000 time series on topics such as finance, industry, and demographics. ...
Article
Full-text available
This CSS Risk and Resilience Report reviews methods in strategic foresight that can help organizations to deal with and reduce uncertainty. The report uses examples from the Chemical, Biological, Radiological, and Nuclear (CBRN) domains and discusses some caveats, such as information hazards, that are particularly relevant to this context. However, the overview it provides can be useful across a wide range of strategic decision-making processes. A few key points highlighted by the report: 1) Horizon scanning aims to find leading indicators for future developments, and it profits from the organization of networks that cross disciplinary and institutional boundaries. 2) The short-term forecasting of technological and political events contains clear elements of skill. Prediction challenges are an attractive approach to such questions because they establish individual track records. 3) Longer-term (five years or more) forecasting of events is not feasible. However, structural shifts and their implications can and should be analyzed. 4) Scenarios are a tool to explore the sequences, consequences, and interplay between different actors in a possible future that has been identified as relevant.
... Therefore, it is critical to investigate the distribution of SMAPE errors instead of providing only a table of averages. However, in many time series forecasting evaluations, SMAPE is applied without an evaluation of the SMAPE distribution, and therefore the averages are also discussed here [37,38]. The median is statistically more robust than the mean and is therefore applied here. ...
Article
Full-text available
The forecasting of univariate time series poses challenges in industrial applications if the seasonality varies. Typically, a non-varying seasonality of a time series is treated with a model based on Fourier theory or the aggregation of forecasts from multiple resolution levels. If the seasonality changes with time, various wavelet approaches for univariate forecasting are proposed with promising potential but without accessible software or a systematic evaluation of different wavelet models compared to state-of-the-art methods. In contrast, the advantage of the specific multiresolution forecasting proposed here is the convenience of a swiftly accessible implementation in R and Python combined with coefficient selection through evolutionary optimization which is evaluated in four different applications: scheduling of a call center, planning electricity demand, and predicting stocks and prices. The systematic benchmarking is based on out-of-sample forecasts resulting from multiple cross-validations with the error measure MASE and SMAPE for which the error distribution of each method and dataset is estimated and visualized with the mirrored density plot. The multiresolution forecasting performs equal to or better than twelve comparable state-of-the-art methods but does not require users to set parameters contrary to prior wavelet forecasting frameworks. This makes the method suitable for industrial applications.
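Since the excerpt above stresses looking at the whole error distribution rather than only at averages, a small sketch of the two error measures this work relies on (SMAPE and MASE), together with their mean and median across several series, may help; the formulas follow their common definitions and the arrays are placeholders, not the benchmark data.

# Minimal sketch: SMAPE and MASE per series, then mean and median across series.
import numpy as np

def smape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return 100.0 * np.mean(2.0 * np.abs(forecast - actual) /
                           (np.abs(actual) + np.abs(forecast)))

def mase(actual: np.ndarray, forecast: np.ndarray,
         insample: np.ndarray, m: int = 1) -> float:
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))   # in-sample (seasonal) naive MAE
    return np.mean(np.abs(actual - forecast)) / scale

rng = np.random.default_rng(2)
smapes, mases = [], []
for _ in range(5):                                           # placeholder: five series
    insample = rng.normal(100.0, 5.0, size=48)
    actual = rng.normal(100.0, 5.0, size=12)
    forecast = actual + rng.normal(0.0, 3.0, size=12)
    smapes.append(smape(actual, forecast))
    mases.append(mase(actual, forecast, insample))

print("SMAPE mean/median:", np.mean(smapes), np.median(smapes))
print("MASE  mean/median:", np.mean(mases), np.median(mases))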
... The last datum is the forecast origin o (Tashman 2000 Fig. 1a; Appendix S1). For fitting and predicting, we use data in hand to validate our models, iterating the evaluation over time via a probabilistic and sequential (prequential sensu Dawid [1984]) approach to testing existing data, compared to validating models only after future data are collected (Makridakis et al. 1993). Prequential methods are well defined, preferred in established fields (Dawid 1984), and implementable in ecological forecasting (Dietze et al. 2018). ...
Article
Full-text available
Probabilistic near‐term forecasting facilitates evaluation of model predictions against observations and is of pressing need in ecology to inform environmental decision‐making and effect societal change. Despite this imperative, many ecologists are unfamiliar with the widely used tools for evaluating probabilistic forecasts developed in other fields. We address this gap by reviewing the literature on probabilistic forecast evaluation from diverse fields including climatology, economics, and epidemiology. We present established practices for selecting evaluation data (end‐sample hold out), graphical forecast evaluation (times‐series plots with uncertainty, probability integral transform plots), quantitative evaluation using scoring rules (log, quadratic, spherical, and ranked probability scores), and comparing scores across models (skill score, Diebold–Mariano test). We cover common approaches, highlight mathematical concepts to follow, and note decision points to allow application of general principles to specific forecasting endeavors. We illustrate these approaches with an application to a long‐term rodent population time series currently used for ecological forecasting and discuss how ecology can continue to learn from and drive the cross‐disciplinary field of forecasting science.
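As a small illustration of the scoring rules listed in this abstract, the sketch below computes the log score and the CRPS (the continuous analogue of the ranked probability score) for Gaussian predictive distributions, plus a CRPS-based skill score against a baseline. The distributions, observations and the climatology-style baseline are assumptions made for the example, not material from the cited review.

# Minimal sketch: log score and CRPS for Gaussian predictive distributions,
# plus a CRPS-based skill score against a simple baseline forecast.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a N(mu, sigma^2) predictive distribution."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

obs = np.array([12.0, 9.5, 14.2, 11.1])                        # placeholder observations
mu_model, sd_model = np.array([11.5, 10.0, 13.8, 11.5]), 1.0   # placeholder model forecast
mu_base, sd_base = obs.mean(), 2.0 * obs.std(ddof=1)           # crude "climatology" baseline

log_score = -np.mean(norm.logpdf(obs, loc=mu_model, scale=sd_model))   # lower is better
crps_model = np.mean(crps_gaussian(obs, mu_model, sd_model))
crps_base = np.mean(crps_gaussian(obs, mu_base, sd_base))

# Skill score: 1 = perfect, 0 = no better than the baseline, negative = worse.
skill = 1.0 - crps_model / crps_base
print(f"log score={log_score:.3f}, CRPS={crps_model:.3f}, skill vs baseline={skill:.3f}")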
... However, the results of such studies have been inconclusive. For instance, while Lawrence, Edmundson, and O'Connor (1985) and Makridakis et al. (1993) found that unaided human judgment can be as good as the best statistical methods from the M1 forecasting competition, Carbone and Gorr (1985) and Sanders (1992) found judgmental point forecasts to be less accurate than statistical methods. The reason for these results is the fact that well-known biases govern judgmental forecasts, such as the tendency of forecasters to dampen trends (Lawrence, Goodwin, O'Connor, & Önkal, 2006;Lawrence & Makridakis, 1989), as well as anchoring and adjustment (O'Connor, Remus, & Griggs, 1993) and the confusion of the signal with noise (Harvey, 1995;Reimers & Harvey, 2011). ...
... The sequence of forecasting 'competitions' commencing with [1] analysing 111 time series, through [2]–[4] with M3, then M4 (see [5], held by the M Open Forecasting Center at the University of Nicosia in Cyprus), created realistic, immutable, and shared data sets that could be used as testbeds for forecasting methods. Ref. [6] provides a more general history and [7] consider their role in improving forecasting practice and research. ...
Article
Full-text available
Economic forecasting is difficult, largely because of the many sources of nonstationarity influencing observational time series. Forecasting competitions aim to improve the practice of economic forecasting by providing very large data sets on which the efficacy of forecasting methods can be evaluated. We consider the general principles that seem to be the foundation for successful forecasting, and show how these are relevant for methods that did well in the M4 competition. We establish some general properties of the M4 data set, which we use to improve the basic benchmark methods, as well as the Card method that we created for our submission to that competition. A data generation process is proposed that captures the salient features of the annual data in M4.
... Moreover, the average percentage errors and mean absolute deviations were calculated but not reported. In contrast to its predecessor, the second M-Competition (Makridakis et al., 1993) considered only 26 time series and the measures MAPE, average ranking, percentage better, and mean percentage error. However, this competition lasted almost four years, as participants, starting in 1987, received real-time data and feedback on their submitted forecasts as new data became available. ...
Thesis
Full-text available
These days, we are living in a digitalized world. Both our professional and private lives are pervaded by various IT services, which are typically operated using distributed computing systems (e.g., cloud environments). Due to the high level of digitalization, the operators of such systems are confronted with fast-paced and changing requirements. In particular, cloud environments have to cope with load fluctuations and respective rapid and unexpected changes in the computing resource demands. To face this challenge, so-called auto-scalers, such as the threshold-based mechanism in Amazon Web Services EC2, can be employed to enable elastic scaling of the computing resources. However, despite this opportunity, business-critical applications are still run with highly overprovisioned resources to guarantee a stable and reliable service operation. This strategy is pursued due to the lack of trust in auto-scalers and the concern that inaccurate or delayed adaptations may result in financial losses. To adapt the resource capacity in time, the future resource demands must be "foreseen", as reacting to changes once they are observed introduces an inherent delay. In other words, accurate forecasting methods are required to adapt systems proactively. A powerful approach in this context is time series forecasting, which is also applied in many other domains. The core idea is to examine past values and predict how these values will evolve as time progresses. According to the "No-Free-Lunch Theorem", there is no algorithm that performs best for all scenarios. Therefore, selecting a suitable forecasting method for a given use case is a crucial task. Simply put, each method has its benefits and drawbacks, depending on the specific use case. The choice of the forecasting method is usually based on expert knowledge, which cannot be fully automated, or on trial-and-error. In both cases, this is expensive and prone to error. Although auto-scaling and time series forecasting are established research fields, existing approaches cannot fully address the mentioned challenges: (i) In our survey on time series forecasting, we found that publications on time series forecasting typically consider only a small set of (mostly related) methods and evaluate their performance on a small number of time series with only a few error measures while providing no information on the execution time of the studied methods. Therefore, such articles cannot be used to guide the choice of an appropriate method for a particular use case; (ii) Existing open-source hybrid forecasting methods that take advantage of at least two methods to tackle the "No-Free-Lunch Theorem" are computationally intensive, poorly automated, designed for a particular data set, or they lack a predictable time-to-result. Methods exhibiting a high variance in the time-to-result cannot be applied for time-critical scenarios (e.g., auto-scaling), while methods tailored to a specific data set introduce restrictions on the possible use cases (e.g., forecasting only annual time series); (iii) Auto-scalers typically scale an application either proactively or reactively. Even though some hybrid auto-scalers exist, they lack sophisticated solutions to combine reactive and proactive scaling. 
For instance, resources are only released proactively while resource allocation is entirely done in a reactive manner (inherently delayed); (iv) The majority of existing mechanisms do not take the provider's pricing scheme into account while scaling an application in a public cloud environment, which often results in excessive charged costs. Even though some cost-aware auto-scalers have been proposed, they only consider the current resource demands, neglecting their development over time. For example, resources are often shut down prematurely, even though they might be required again soon. To address the mentioned challenges and the shortcomings of existing work, this thesis presents three contributions: (i) The first contribution-a forecasting benchmark-addresses the problem of limited comparability between existing forecasting methods; (ii) The second contribution-Telescope-provides an automated hybrid time series forecasting method addressing the challenge posed by the "No-Free-Lunch Theorem"; (iii) The third contribution-Chamulteon-provides a novel hybrid auto-scaler for coordinated scaling of applications comprising multiple services, leveraging Telescope to forecast the workload intensity as a basis for proactive resource provisioning. In the following, the three contributions of the thesis are summarized: Contribution I - Forecasting Benchmark To establish a level playing field for evaluating the performance of forecasting methods in a broad setting, we propose a novel benchmark that automatically evaluates and ranks forecasting methods based on their performance in a diverse set of evaluation scenarios. The benchmark comprises four different use cases, each covering 100 heterogeneous time series taken from different domains. The data set was assembled from publicly available time series and was designed to exhibit much higher diversity than existing forecasting competitions. Besides proposing a new data set, we introduce two new measures that describe different aspects of a forecast. We applied the developed benchmark to evaluate Telescope. Contribution II - Telescope To provide a generic forecasting method, we introduce a novel machine learning-based forecasting approach that automatically retrieves relevant information from a given time series. More precisely, Telescope automatically extracts intrinsic time series features and then decomposes the time series into components, building a forecasting model for each of them. Each component is forecast by applying a different method and then the final forecast is assembled from the forecast components by employing a regression-based machine learning algorithm. In more than 1300 hours of experiments benchmarking 15 competing methods (including approaches from Uber and Facebook) on 400 time series, Telescope outperformed all methods, exhibiting the best forecast accuracy coupled with a low and reliable time-to-result. Compared to the competing methods that exhibited, on average, a forecast error (more precisely, the symmetric mean absolute forecast error) of 29%, Telescope exhibited an error of 20% while being 2556 times faster. In particular, the methods from Uber and Facebook exhibited an error of 48% and 36%, and were 7334 and 19 times slower than Telescope, respectively. Contribution III - Chamulteon To enable reliable auto-scaling, we present a hybrid auto-scaler that combines proactive and reactive techniques to scale distributed cloud applications comprising multiple services in a coordinated and cost-effective manner. 
More precisely, proactive adaptations are planned based on forecasts of Telescope, while reactive adaptations are triggered based on actual observations of the monitored load intensity. To solve occurring conflicts between reactive and proactive adaptations, a complex conflict resolution algorithm is implemented. Moreover, when deployed in public cloud environments, Chamulteon reviews adaptations with respect to the cloud provider's pricing scheme in order to minimize the charged costs. In more than 400 hours of experiments evaluating five competing auto-scaling mechanisms in scenarios covering five different workloads, four different applications, and three different cloud environments, Chamulteon exhibited the best auto-scaling performance and reliability while at the same time reducing the charged costs. The competing methods provided insufficient resources for (on average) 31% of the experimental time; in contrast, Chamulteon cut this time to 8% and the SLO (service level objective) violations from 18% to 6% while using up to 15% less resources and reducing the charged costs by up to 45%. The contributions of this thesis can be seen as major milestones in the domain of time series forecasting and cloud resource management. (i) This thesis is the first to present a forecasting benchmark that covers a variety of different domains with a high diversity between the analyzed time series. Based on the provided data set and the automatic evaluation procedure, the proposed benchmark contributes to enhance the comparability of forecasting methods. The benchmarking results for different forecasting methods enable the selection of the most appropriate forecasting method for a given use case. (ii) Telescope provides the first generic and fully automated time series forecasting approach that delivers both accurate and reliable forecasts while making no assumptions about the analyzed time series. Hence, it eliminates the need for expensive, time-consuming, and error-prone procedures, such as trial-and-error searches or consulting an expert. This opens up new possibilities especially in time-critical scenarios, where Telescope can provide accurate forecasts with a short and reliable time-to-result. Although Telescope was applied for this thesis in the field of cloud computing, there is absolutely no limitation regarding the applicability of Telescope in other domains, as demonstrated in the evaluation. Moreover, Telescope, which was made available on GitHub, is already used in a number of interdisciplinary data science projects, for instance, predictive maintenance in an Industry 4.0 context, heart failure prediction in medicine, or as a component of predictive models of beehive development. (iii) In the context of cloud resource management, Chamulteon is a major milestone for increasing the trust in cloud auto-scalers. The complex resolution algorithm enables reliable and accurate scaling behavior that reduces losses caused by excessive resource allocation or SLO violations. In other words, Chamulteon provides reliable online adaptations minimizing charged costs while at the same time maximizing user experience.
... Whilst the water study again supports the conclusions of Makridakis et al. 28 that exponential smoothing typically outperforms the Box-Jenkins methodology, for gas the relationship is in the opposite direction. Arguments also persist concerning the relative virtues of the transfer function approach compared to dynamic regression. ...
Article
Full-text available
Machine learning forecasting methods are compared to more traditional parametric statistical models. This comparison is carried out across a number of different situations and settings. A survey of the most used parametric models is given. Machine learning methods, such as convolutional networks, TCNs, LSTM, transformers, random forest, and gradient boosting, are briefly presented. The practical performance of the various methods is analyzed by discussing the results of the Makridakis forecasting competitions (M1–M6). I also look at probability forecasting via GARCH-type modeling for integer time series and continuous models. Furthermore, I briefly comment on entropy as a volatility measure. Cointegration and panels are mentioned. The paper ends with a section on weather forecasting and the potential of machine learning methods in such a context, including the very recent GraphCast and GenCast forecasts.
Article
Machine learning (ML) has evolved into a crucial tool in supply chain management, effectively addressing the complexities associated with decision-making by leveraging available data. The utilization of ML has markedly surged in recent years, extending its influence across various supply chain operations, ranging from procurement to product distribution. In this paper, based on a systematic search, we provide a comprehensive literature review of the research dealing with the use of ML in supply chain management. We present the major contributions to the literature by classifying them into five classes using the five processes of the supply chain operations reference framework. We demonstrate that the applications of ML in supply chain management have significantly increased in both trend and diversity over recent years, with substantial expansion since 2019. The review also reveals that demand forecasting has attracted most of the applications, followed by inventory management and transportation. The paper identifies the research gaps in the literature and provides some avenues for further research.
Chapter
Many economic and financial time series do not have a constant mean. Rather, they show growth or decay. The type of growth–whether deterministic or stochastic–has important implications for policy. In this chapter we examine the different ways to detrend the data. We pay particular attention to the effects of “differencing” (and over-differencing) the data, especially in the context of random walk models with and without drift. We also introduce the ARIMA class of models.
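The effect of differencing a random walk with drift, and the cost of over-differencing, can be illustrated with a short simulation; the drift value and the informal autocorrelation check below are illustrative choices, not taken from the chapter.

```python
# Sketch: first-differencing a random walk with drift y_t = c + y_{t-1} + e_t.
# The level series is nonstationary; its first difference is stationary with mean c.
import numpy as np

rng = np.random.default_rng(42)
c, n = 0.5, 500                       # drift and sample size (illustrative)
eps = rng.normal(0, 1, n)
y = np.cumsum(c + eps)                # random walk with drift

dy = np.diff(y)                       # first difference: dy_t = c + e_t
d2y = np.diff(y, n=2)                 # over-differencing introduces an MA(1) component

print("mean of dy (estimates the drift c):", dy.mean().round(3))
print("lag-1 autocorrelation of dy :", np.corrcoef(dy[:-1], dy[1:])[0, 1].round(3))
print("lag-1 autocorrelation of d2y:", np.corrcoef(d2y[:-1], d2y[1:])[0, 1].round(3))
# dy is roughly white noise (lag-1 autocorrelation near 0), while d2y shows the
# strong negative lag-1 autocorrelation (about -0.5) typical of over-differencing.
```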
Thesis
Full-text available
The growing degradation and ongoing loss of forests in Mexico and worldwide, caused mainly by anthropogenic activities, pests, non-native species, and climate change, have become increasingly relevant topics for the international community because of their consequences: extreme weather events, depletion of forest resources, loss of biodiversity, environmental pollution, and a decline in water quality and quantity. In this context, the objective was to project the key variables of Mexico's forest sector to the year 2100: timber production, imports and exports, and forested area. A literature review was carried out to describe the current situation and select the study variables. Annual data were downloaded from official databases and the time series were constructed. The projections to 2100 were estimated using deep learning Recurrent Neural Networks (RNN) of the Long Short-Term Memory (LSTM) type, which yielded low errors after training prior to each of the annual projections obtained. The analysis of the projections showed that the growth and needs of the population of Mexico and the world will continue to affect the more than 137 million hectares (ha) of forest area until 2100 if current trends continue: 31% of Mexican forests show modified or altered vegetation, and illegal timber production covers more than half of national demand. It is estimated that by 2100 the forested area (66 million ha) will decrease by about 6 million ha, timber production will remain at an average of 44 million m3, and exports and imports will range between 40-50 and 25-50 thousand m3, respectively. It is concluded that, if forest degradation in Mexico continues, it will be necessary to evaluate, create, and implement public policies and sustainable management plans, focused on each region, to satisfy the present and future needs of society.
Chapter
While the Autoregressive Integrated Moving Average (ARIMA) model has long been the dominant tool for capturing the linear component of time series data in economic forecasting, Artificial Neural Networks (ANNs) are increasingly being applied to the harder problem posed by time series that contain both linear and nonlinear patterns. Most studies have applied either an ARIMA model or ANNs alone to produce multi-step predictions, with limited accuracy. This paper therefore proposes a hybrid model that combines the advantages of ARIMA and ANNs to analyze the linear and nonlinear relationships in the Vietnamese CPI time series from January 1995 to July 2022. The results show that the multi-step predictions of the hybrid model are more accurate than those of the ARIMA model and ANNs alone.
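A hybrid of this kind can be sketched as follows, assuming an ARIMA(1,1,1) for the linear part and a small MLP trained on lagged ARIMA residuals for the nonlinear part; the paper's actual model orders, network architecture, and data differ, so everything below is illustrative.

```python
# Sketch of an ARIMA + ANN hybrid: ARIMA models the linear structure, an MLP
# models the nonlinear structure left in the ARIMA residuals (illustrative choices).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.neural_network import MLPRegressor

def hybrid_forecast(y, horizon, n_lags=3):
    # 1) Linear component: ARIMA fit and forecast.
    arima_res = ARIMA(y, order=(1, 1, 1)).fit()
    linear_fc = np.asarray(arima_res.forecast(steps=horizon))

    # 2) Nonlinear component: MLP on lagged ARIMA residuals.
    resid = np.asarray(arima_res.resid)
    X = np.column_stack([resid[i:len(resid) - n_lags + i] for i in range(n_lags)])
    target = resid[n_lags:]
    mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, target)

    # 3) Iterate the residual model forward and add it to the ARIMA forecast.
    window = list(resid[-n_lags:])
    resid_fc = []
    for _ in range(horizon):
        nxt = mlp.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
        resid_fc.append(nxt)
        window.append(nxt)
    return linear_fc + np.array(resid_fc)

# Toy usage on a series with both linear and nonlinear structure.
rng = np.random.default_rng(1)
t = np.arange(200)
y = 0.05 * t + np.sin(t / 5) ** 2 + rng.normal(0, 0.1, 200)
print(hybrid_forecast(y, horizon=12))
```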
Article
Full-text available
Univariate forecasting methods are fundamental for many different application areas. M-competitions provide important benchmarks for scientists, researchers, statisticians, and engineers in the field, for evaluating and guiding the development of new forecasting techniques. In this paper, the Dynamic Time Scan Forecasting (DTSF) method, a new univariate forecasting method based on scan statistics, is presented. DTSF scans an entire time series, identifies past patterns that are similar to the last available observations, and forecasts based on the median of the observations that followed the most similar windows in the past. In order to evaluate the performance of this method, a comparison with other statistical forecasting methods applied in the M4 competition is provided. In the hourly time domain, an average sMAPE of 12.9% was achieved using the method with its default parameters, while the competition benchmark (the simple average of the forecasts of the Holt, Damped, and Theta methods) achieved 22.1%. The method proved to be competitive on longer time series, with high repeatability.
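Read literally, the scan idea amounts to the following sketch: slide a window over the history, score its similarity to the most recent observations, and forecast with the median of what followed the most similar windows. This is a simplified reading of the description above, not the authors' implementation; the window length, similarity score, and level re-anchoring are illustrative choices.

```python
# Simplified sketch of a scan-based forecast in the spirit of DTSF.
import numpy as np

def scan_forecast(y, window=24, horizon=12, k=5):
    """Median of what followed the k past windows most similar to the last one."""
    query = y[-window:]
    scores, followers = [], []
    for start in range(len(y) - window - horizon):
        seg = y[start:start + window]
        scores.append(np.corrcoef(seg, query)[0, 1])    # shape similarity
        followers.append(y[start + window:start + window + horizon])
    best = np.argsort(scores)[-k:]                      # indices of the most similar windows
    # Re-anchor each follower to the current level, then take the point-wise median.
    adjusted = [followers[i] + (y[-1] - y[i + window - 1]) for i in best]
    return np.median(np.array(adjusted), axis=0)

# Toy usage: an hourly-like series with a 24-step cycle.
rng = np.random.default_rng(2)
t = np.arange(24 * 60)
y = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, len(t))
print(scan_forecast(y, window=24, horizon=12, k=5))
```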
Article
Full-text available
In this study, we present the results of the M5 “Accuracy” competition, which was the first of two parallel challenges in the latest M competition with the aim of advancing the theory and practice of forecasting. The main objective in the M5 “Accuracy” competition was to accurately predict 42,840 time series representing the hierarchical unit sales for the largest retail company in the world by revenue, Walmart. The competition required the submission of 30,490 point forecasts for the lowest cross-sectional aggregation level of the data, which could then be summed up accordingly to estimate forecasts for the remaining upward levels. We provide details of the implementation of the M5 “Accuracy” challenge, as well as the results and best performing methods, and summarize the major findings and conclusions. Finally, we discuss the implications of these findings and suggest directions for future research.
Article
Full-text available
Research background: Demand forecasting helps companies to anticipate purchases and plan delivery or production. To address this complex problem, many statistical methods, artificial intelligence-based methods, and hybrid methods are currently being developed. However, all these methods share similar problems, including complexity, long computing times, and the need for high-performance IT infrastructure.
Purpose of the article: This study aims to verify and evaluate the possibility of using Google Trends data for poetry book demand forecasting and to compare the results of applying statistical methods, neural networks, and a hybrid model against the alternative of using technical analysis methods to achieve immediate and accessible forecasting. Specifically, it aims to verify the possibility of immediate demand forecasting for poetry books in the European Quartet countries based on an alternative approach using the Pbands technical indicator.
Methods: The study forecasts demand based on technical analysis of Google Trends search data for the keyword poetry in the European Quartet countries using several statistical methods, including the commonly used ETS methods, the ARIMA method, the ARFIMA method, the BATS method (which combines the Box-Cox transformation with ARMA), artificial neural networks, the Theta model, a hybrid model, and an alternative forecasting approach using the Pbands indicator. The study uses MAPE and RMSE to measure accuracy.
Findings & value added: Although most currently available demand prediction models are either slow or complex, entrepreneurial practice requires fast, simple, and accurate ones. The results show that the alternative Pbands approach is easily applicable and can predict short-term demand changes. Due to its simplicity, the Pbands method is suitable and convenient for monitoring short-term data describing demand. Demand prediction methods based on technical indicators represent a new approach to demand forecasting, and their application could be a further direction for forecasting research. The future of theoretical research in forecasting should be devoted mainly to simplification and speed. Creating an automated model based on primary data parameters with easily interpretable results is a challenge for further research.
Article
Full-text available
Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.
Article
Demand forecasts are the basis of most decisions in supply chain management. The granularity of these decisions leads to different forecast requirements. For example, inventory replenishment decisions require forecasts at the individual SKU level over lead time, whereas forecasts at higher levels, over longer horizons, are required for supply chain strategic decisions. The most accurate forecasts are not always obtained from data at the 'natural' level of aggregation. In some cases, forecast accuracy may be improved by aggregating data or forecasts at lower levels, or disaggregating data or forecasts at higher levels, or by combining forecasts at multiple levels of aggregation. Temporal and cross-sectional aggregation approaches are well established in the literature. More recently, it has been argued that these two approaches do not make the fullest use of data available at the different hierarchical levels of the supply chain. Therefore, consideration of forecasting hierarchies (over time and other dimensions), and combinations of forecasts across hierarchical levels, have been recommended. This paper provides a comprehensive review of research dealing with aggregation and hierarchical forecasting in supply chains, based on a systematic search. The review enables the identification of major research gaps and the presentation of an agenda for further research.
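The basic aggregation alternatives mentioned above can be illustrated for a two-level hierarchy (total and SKUs): bottom-up sums SKU-level forecasts, while top-down forecasts the total and splits it by historical proportions. The forecasting method and proportioning rule in the sketch are illustrative choices, not recommendations from the review.

```python
# Sketch of bottom-up vs. top-down hierarchical forecasting for a two-level
# hierarchy (total -> SKUs), using a seasonal naive forecaster as a stand-in.
import numpy as np

rng = np.random.default_rng(3)
n, horizon, season = 48, 6, 12
skus = np.array([20 + 5 * np.sin(2 * np.pi * np.arange(n) / season) + rng.normal(0, 1, n),
                 10 + 3 * np.cos(2 * np.pi * np.arange(n) / season) + rng.normal(0, 1, n),
                 5 + rng.normal(0, 0.5, n)])
total = skus.sum(axis=0)

def seasonal_naive(y, horizon, season=12):
    """Forecast by repeating the last observed seasonal cycle."""
    last_cycle = y[-season:]
    return np.array([last_cycle[h % season] for h in range(horizon)])

# Bottom-up: forecast each SKU, then sum to obtain the total forecast.
sku_fc = np.array([seasonal_naive(s, horizon, season) for s in skus])
bottom_up_total = sku_fc.sum(axis=0)

# Top-down: forecast the total, then split it by historical SKU proportions.
total_fc = seasonal_naive(total, horizon, season)
proportions = skus.sum(axis=1) / total.sum()
top_down_sku = np.outer(proportions, total_fc)

print("bottom-up total forecast:", bottom_up_total.round(1))
print("top-down SKU forecasts:\n", top_down_sku.round(1))
```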
Article
When forecasting a large number of time series, data analysts regularly employ one of the following methodological approaches: either select a single forecasting method for the entire dataset (aggregate selection), or use the best forecasting method for each time series (individual selection). There is evidence in the predictive analytics literature that the former is more robust than the latter, as individual selection tends to overfit models to the data. A third approach is to first identify homogeneous clusters within the dataset, and then select a single forecasting method for each cluster (cluster selection). To that end, we examine three machine learning clustering methods: k-medoids, k-NN and random forests. The evaluation is performed on the 645 yearly series of the M3 competition. The empirical evidence suggests: (a) random forests provide the best clusters for the sequential forecasting task, and (b) cluster selection has the potential to outperform aggregate selection.
Highlights
• We compare aggregate selection versus cluster selection for forecasting a dataset
• The evaluation is performed on the 645 yearly series of the M3 competition
• We first use one of three clustering techniques: k-medoids, k-NN and random forests
• We then forecast every cluster with the best possible forecasting method
• Random forests provide the best clusters for the sequential forecasting task
• Cluster selection has the potential to outperform aggregate selection
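A minimal sketch of cluster selection follows, using k-means from scikit-learn as a stand-in for the clustering methods examined in the paper and two toy candidate forecasters; all feature and method choices below are illustrative.

```python
# Sketch of "cluster selection": group series by simple features, then pick one
# forecasting method per cluster on a holdout (illustrative stand-ins throughout).
import numpy as np
from sklearn.cluster import KMeans

def naive(y, h):            # last-value (random walk) forecast
    return np.full(h, y[-1])

def drift(y, h):            # random walk with drift
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

def features(y):            # crude series features: level, trend, volatility
    return [y.mean(), np.polyfit(np.arange(len(y)), y, 1)[0], np.std(np.diff(y))]

rng = np.random.default_rng(4)
series = [np.cumsum(rng.normal(m, s, 40)) for m, s in [(0.5, 1), (0.0, 1), (1.0, 2)] for _ in range(10)]

X = np.array([features(y[:-6]) for y in series])            # features from the training part only
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

methods = {"naive": naive, "drift": drift}
for c in set(labels):
    idx = [i for i, l in enumerate(labels) if l == c]
    # choose the method with the lowest mean absolute holdout error in this cluster
    errors = {name: np.mean([np.abs(m(series[i][:-6], 6) - series[i][-6:]).mean() for i in idx])
              for name, m in methods.items()}
    best = min(errors, key=errors.get)
    print(f"cluster {c}: {len(idx)} series -> use '{best}' (MAE {errors[best]:.2f})")
```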
Conference Paper
Full-text available
In many areas of decision making, forecasting is an essential pillar. Consequently, there are many different forecasting methods. According to the "No-Free-Lunch Theorem", there is no single forecasting method that performs best for all time series. In other words, each method has its advantages and disadvantages depending on the specific use case. Therefore, the choice of the forecasting method remains a mandatory expert task. However, expert knowledge cannot be fully automated. To establish a level playing field for evaluating the performance of time series forecasting methods in a broad setting, we propose Libra, a forecasting benchmark that automatically evaluates and ranks forecasting methods based on their performance in a diverse set of evaluation scenarios. The benchmark comprises four different use cases, each covering 100 heterogeneous time series taken from different domains. The data set was assembled from publicly available time series and was designed to exhibit much higher diversity than existing forecasting competitions. Based on this benchmark, we perform a comprehensive evaluation to compare different existing time series forecasting methods.
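At its core, such a benchmark is a loop over series and methods with a common error measure and a ranking. The sketch below is a drastic simplification of Libra (one toy use case, three naive methods, a single measure, fixed-origin evaluation) and is not the benchmark's actual code.

```python
# Minimal sketch of an automatic evaluate-and-rank benchmark loop
# (illustrative; Libra uses four use cases, 400 series, and several measures).
import numpy as np

def naive(y, h):   return np.full(h, y[-1])
def mean_fc(y, h): return np.full(h, y.mean())
def drift(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

def smape(actual, forecast):
    return 100 * np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))

rng = np.random.default_rng(5)
dataset = [np.cumsum(rng.normal(0.2, 1, 60)) + 50 for _ in range(20)]   # stand-in "use case"
methods = {"naive": naive, "mean": mean_fc, "drift": drift}
horizon = 8

scores = {name: [] for name in methods}
for y in dataset:
    train, test = y[:-horizon], y[-horizon:]
    for name, method in methods.items():
        scores[name].append(smape(test, method(train, horizon)))

ranking = sorted(methods, key=lambda name: np.mean(scores[name]))
for rank, name in enumerate(ranking, 1):
    print(f"{rank}. {name}: mean sMAPE {np.mean(scores[name]):.1f}%")
```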
Article
Full-text available
This article presents guidelines for making forecasts and draws inferences about research techniques. Inertia produces highly autocorrelated time series in which random events have lasting effects. Such series make it easy to draw incorrect inferences about causal processes. They also make it easy to predict accurately over the short run, using variants of linear extrapolation. In forecasting, simplicity usually works better than complexity. Complex forecasting methods mistake random noise for information. Moderate expertise proves as effective as great expertise. Linear functions make better judgments than people. Analogous principles probably apply to research. Three common myths do not stand up to scrutiny: One, using fewer categories does not reduce the effects of observational errors. Two, least-squares regression does not produce reliable findings. Three, better fitting models do not predict better, even in the very short run, if researchers use squared errors to measure fits to historical data and forecasting accuracies. However, better fitting models would predict better if researchers would replace squared-error criteria with more reliable measures.
Article
Full-text available
In this study, the authors used 111 time series to examine the accuracy of various forecasting methods, particularly time-series methods. The study shows, at least for time series, why some methods achieve greater accuracy than others for different types of data. The authors offer some explanation of the seemingly conflicting conclusions of past empirical research on the accuracy of forecasting. One novel contribution of the paper is the development of regression equations expressing accuracy as a function of factors such as randomness, seasonality, trend-cycle and the number of data points describing the series. Surprisingly, the study shows that for these 111 series simpler methods perform well in comparison to the more complex and statistically sophisticated ARMA models.
Article
Full-text available
A review of the literature indicates that linear models are frequently used in situations in which decisions are made on the basis of multiple codable inputs. These models are sometimes used (a) normatively to aid the decision maker, (b) as a contrast with the decision maker in the clinical vs statistical controversy, (c) to represent the decision maker "paramorphically" and (d) to "bootstrap" the decision maker by replacing him with his representation. Examination of the contexts in which linear models have been successfully employed indicates that the contexts have the following structural characteristics in common: each input variable has a conditionally monotone relationship with the output; there is error of measurement; and deviations from optimal weighting do not make much practical difference. These characteristics ensure the success of linear models, which are so appropriate in such contexts that random linear models (i.e., models whose weights are randomly chosen except for sign) may perform quite well. 4 examples involving the prediction of such codable output variables as GPA and psychiatric diagnosis are analyzed in detail. In all 4 examples, random linear models yield predictions that are superior to those of human judges. (52 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Clinical psychologists, physicians, and other professionals are typically called upon to combine cues to arrive at some diagnostic or prognostic decision. Mathematical representations of such clinical judges can often be constructed to capture critical aspects of their judgmental strategies. An analysis of the characteristics of such models permits a specification of the conditions under which the model itself will be a more valid predictor than will the man from whom it was derived. To ascertain whether such conditions are met in natural clinical decision making, data were reanalyzed from P. E. Meehl's (see 34:3) study of the judgments of 29 clinical psychologists attempting to differentiate psychotic from neurotic patients on the basis of their MMPI profiles. Results of these analyses indicate that for this diagnostic task models of the men are generally more valid than the men themselves. Moreover, the finding occurred even when the models were constructed on a small set of cases, and then man and model competed on a completely new set. (29 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
This paper proposes a new approach to time series forecasting based upon three premises. First, a model is selected not by how well it fits historical data but on its ability to accurately predict out-of-sample actual data. Second, a model/method is selected among several run in parallel using out-of-sample information. Third, models/methods are optimized for each forecasting horizon separately, making it possible to have different models/methods to predict each of the m horizons. This approach outperforms the best method of the M-Competition by a large margin when tested empirically with the 111 series subsample of the M-Competition data.
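The three premises translate directly into code: hold out the last m observations, run the candidate methods in parallel, and select, for each horizon separately, whichever method predicted that horizon best out of sample. The candidate methods and error measure below are illustrative, not those used in the paper.

```python
# Sketch of out-of-sample, per-horizon method selection (illustrative candidates).
import numpy as np

def naive(y, h):   return np.full(h, y[-1])
def drift(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)
def ses(y, h, alpha=0.3):                      # simple exponential smoothing
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(h, level)

methods = {"naive": naive, "drift": drift, "ses": ses}

def select_per_horizon(y, m):
    """Pick the best method for each of the m horizons on a holdout, then
    refit on the full series and combine the per-horizon winners."""
    train, holdout = y[:-m], y[-m:]
    errors = {name: np.abs(f(train, m) - holdout) for name, f in methods.items()}
    winners = [min(methods, key=lambda n: errors[n][h]) for h in range(m)]
    final = np.array([methods[winners[h]](y, m)[h] for h in range(m)])
    return winners, final

rng = np.random.default_rng(6)
y = np.cumsum(rng.normal(0.3, 1, 80)) + 100
winners, forecast = select_per_horizon(y, m=6)
print(winners)            # possibly a different method per horizon
print(forecast.round(1))
```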
Article
Full-text available
Extrapolative forecasting methods are widely used in production and inventory decisions. Typically many hundreds of series are forecast and the cost-effectiveness of the decisions depends on the accuracy of the forecasting method(s) used. This paper examines how a forecasting method should be chosen based on analyzing alternative loss functions. It is argued that a population of time series must be evaluated by time period and by series. Of the alternative loss functions considered, only the geometric root mean squared error is well-behaved and has a straightforward interpretation. The paper concludes that exponential smoothing and ‘naive’ models, previously thought to be ‘robust’ performers, forecast poorly for the particular set of time series under analysis, whatever error measure is used. As a consequence, forecasters should carry out a detailed evaluation of the data series, as described in the paper, rather than relying on a priori analysis developed from earlier forecasting competitions.
Article
Full-text available
Generalization and communication issues in the use of error measures: A reply, Fred Collopy, The Weatherhead School, Case Western Reserve University, Cleveland, Ohio 44118, USA and J. Scott Armstrong, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA. We agree with most of what the commentators say about Armstrong and Collopy (1992), hereafter referred to as “AC,” and Fildes (1992), hereafter referred to as “F.” Here, we address three issues where we do not agree entirely: (1) Can the results from the M-competition be generalized? (2) Is Theil's U2 easy to communicate? (3) Would a richer set of measures lead to improvements in the selection and development of forecasting methods? Our own answers to these questions are “yes,” “no,” and “probably not,” respectively.
Article
Full-text available
This study evaluated measures for making comparisons of errors across time series. We analyzed 90 annual and 101 quarterly economic time series. We judged error measures on reliability, construct validity, sensitivity to small changes, protection against outliers, and their relationship to decision making. The results lead us to recommend the Geometric Mean of the Relative Absolute Error (GMRAE) when the task involves calibrating a model for a set of time series. The GMRAE compares the absolute error of a given method to that from the random walk forecast. For selecting the most accurate methods, we recommend the Median RAE (MdRAE) when few series are available and the Median Absolute Percentage Error (MdAPE) otherwise. The Root Mean Square Error (RMSE) is not reliable, and is therefore inappropriate for comparing accuracy across series. Keywords: Forecast accuracy, M-Competition, Relative absolute error, Theil's U.
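The recommended measures follow directly from the definitions in the abstract: the relative absolute error (RAE) divides a method's absolute error by that of the random walk, GMRAE and MdRAE aggregate the RAEs by geometric mean and median, and MdAPE is the median absolute percentage error. A sketch with toy numbers (not data from the study):

```python
# Sketch of the error measures discussed above.
import numpy as np

def rae(actual, forecast, naive_forecast):
    """Relative absolute error vs. the random walk (naive) forecast."""
    return np.abs(actual - forecast) / np.abs(actual - naive_forecast)

def gmrae(raes):
    return np.exp(np.mean(np.log(raes)))        # geometric mean of the RAEs

def mdrae(raes):
    return np.median(raes)

def mdape(actuals, forecasts):
    return 100 * np.median(np.abs((actuals - forecasts) / actuals))

# Toy example: one-step-ahead values for 5 series.
actual = np.array([100.0, 52.0, 210.0, 33.0, 80.0])
method = np.array([ 98.0, 55.0, 200.0, 30.0, 84.0])
naive  = np.array([ 95.0, 50.0, 220.0, 35.0, 75.0])   # random walk: last observed value

raes = rae(actual, method, naive)
print("GMRAE:", round(gmrae(raes), 3), "MdRAE:", round(mdrae(raes), 3),
      "MdAPE:", round(mdape(actual, method), 2), "%")
```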
Article
In 1982, the Journal of Forecasting published the results of a forecasting competition organized by Spyros Makridakis (Makridakis et al., 1982). In this, the ex ante forecast errors of 21 methods were compared for forecasts of a variety of economic time series, generally using 1001 time series. Only extrapolative methods were used, as no data were available on causal variables. The accuracies of methods were compared using a variety of accuracy measures for different types of data and for varying forecast horizons. The original paper did not contain much interpretation or discussion. Partly this was by design, to be unbiased in the presentation. A more important factor, however, was the difficulty in gaining consensus on interpretation and presentation among the diverse group of authors, many of whom have a vested interest in certain methods. In the belief that this study was of major importance, we decided to obtain a more complete discussion of the results. We do not believe that “the data speak for themselves.”
Article
A step-by-step account is given of a Box-Jenkins analysis of some sales figures showing high multiplicative seasonal variation. Various practical problems are encountered and discussed. A critical appraisal is made of the Box-Jenkins procedure and some general remarks are made on short-term sales forecasting.
Article
In the last few decades many methods have become available for forecasting. As always, when alternatives exist, choices need to be made so that an appropriate forecasting method can be selected and used for the specific situation being considered. This paper reports the results of a forecasting competition that provides information to facilitate such choice. Seven experts in each of the 24 methods forecasted up to 1001 series for six to eighteen time horizons. The results of the competition are presented in this paper, whose purpose is to provide empirical evidence about differences found to exist among the various extrapolative (time series) methods used in the competition.
Article
Results of a simulation study of the short-range forecasting effectiveness of exponentially smoothed and selected Box-Jenkins models for sixty-three monthly sales series are presented. Brown's exponentially smoothed constant, linear and six- and eight-term harmonic models are compared with two seasonal factor models and ten Box-Jenkins models. The seasonal factor models are Winters' three parameter model and a single parameter model developed by the author. The forecasting errors of the best of the Box-Jenkins models that were tested are either approximately equal to or greater than the errors of the corresponding exponentially smoothed models. Test results indicate that the four exponentially smoothed seasonal models yield approximately equivalent accuracy for most data series. If a sharply defined pattern exists, the seasonal factor models can yield smaller forecast errors than the harmonic models. If no seasonal pattern exists, the errors of the exponentially smoothed models with seasonal components are only slightly greater than those of the constant and linear models.
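As a point of reference for the exponentially smoothed seasonal models compared in the study, the following is a minimal additive Holt-Winters (Winters'-type three-parameter) smoother; the initialization scheme and smoothing constants are illustrative simplifications, not those used in the simulation study.

```python
# Minimal additive Holt-Winters smoother (illustrative initialization and parameters).
import numpy as np

def holt_winters_additive(y, m, horizon, alpha=0.3, beta=0.1, gamma=0.2):
    level = y[:m].mean()
    trend = (y[m:2 * m].mean() - y[:m].mean()) / m
    season = list(y[:m] - level)
    for t in range(m, len(y)):
        prev_level = level
        level = alpha * (y[t] - season[t - m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season.append(gamma * (y[t] - level) + (1 - gamma) * season[t - m])
    # forecast: linear trend plus the most recent seasonal index for each step
    return np.array([level + (h + 1) * trend + season[len(y) - m + (h % m)]
                     for h in range(horizon)])

# Toy usage: five years of monthly-like data with trend and seasonality.
rng = np.random.default_rng(7)
t = np.arange(60)
y = 100 + 0.5 * t + 8 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 60)
print(holt_winters_additive(y, m=12, horizon=12).round(1))
```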
Article
The paper presents the results of a survey designed to discover how business firms prepare sales forecasts, what methods they prefer, and the accuracy of their predictions. The survey showed that subjective, extrapolation and naive techniques are widely used by American business firms in various forecasting situations. Also, some business firms are reducing forecasting errors by making greater use of computers and seasonal adjustments.
Article
This paper examines a strategy for structuring one type of domain knowledge for use in extrapolation. It does so by representing information about causality and using this domain knowledge to select and combine forecasts. We use five categories to express causal impacts upon trends: growth, decay, supporting, opposing, and regressing. An identification of causal forces aided in the determination of weights for combining extrapolation forecasts. These weights improved average ex ante forecast accuracy when tested on 104 annual economic and demographic time series. Gains in accuracy were greatest when (1) the causal forces were clearly specified and (2) stronger causal effects were expected, as in longer-range forecasts. One rule suggested by this analysis was: “Do not extrapolate trends if they are contrary to causal forces.” We tested this rule by comparing forecasts from a method that implicitly assumes supporting trends (Holt’s exponential smoothing) with forecasts from the random walk. Use of the rule improved accuracy for 20 series where the trends were contrary; the MdAPE (Median Absolute Percentage Error) was 18% less for the random walk on 20 one-year ahead forecasts and 40% less for 20 six-year-ahead forecasts. We then applied the rule to four other data sets. Here, the MdAPE for the random walk forecasts was 17% less than Holt’s error for 943 short-range forecasts and 43% less for 723 long-range forecasts. Our study suggests that the causal assumptions implicit in traditional extrapolation methods are inappropriate for many applications.
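The quoted rule can be expressed directly in code: given an analyst-supplied causal-force label, fall back to the random walk whenever the estimated trend and the causal force point in opposite directions. Holt's method is simplified below, and the label encoding is an illustrative assumption.

```python
# Sketch of the rule "do not extrapolate trends if they are contrary to causal forces".
import numpy as np

def holt(y, horizon, alpha=0.3, beta=0.1):
    """Simplified Holt's linear exponential smoothing; returns forecasts and final trend."""
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend * np.arange(1, horizon + 1), trend

def causal_force_forecast(y, horizon, causal_force):
    """causal_force: +1 for growth/supporting forces, -1 for decay/opposing forces."""
    holt_fc, est_trend = holt(y, horizon)
    if np.sign(est_trend) != 0 and np.sign(est_trend) != np.sign(causal_force):
        return np.full(horizon, y[-1])        # contrary trend: use the random walk
    return holt_fc                            # otherwise extrapolate as usual

# Toy usage: a series that has drifted upward although the causal forces point down.
rng = np.random.default_rng(8)
y = 100 + np.cumsum(rng.normal(0.4, 1, 40))
print(causal_force_forecast(y, horizon=6, causal_force=-1).round(1))
```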