Article (PDF available)

Abstract

The M4 competition is the continuation of three previous competitions, started more than 45 years ago, whose purpose was to learn how to improve forecasting accuracy and how such learning can be applied to advance the theory and practice of forecasting. The purpose of M4 was to replicate the results of the previous ones and to extend them in three directions: first, by significantly increasing the number of series; second, by including Machine Learning (ML) forecasting methods; and third, by evaluating both point forecasts and prediction intervals. The five major findings of the M4 Competition are: 1. Out of the 17 most accurate methods, 12 were “combinations” of mostly statistical approaches. 2. The biggest surprise was a “hybrid” approach that utilized both statistical and ML features. This method's average sMAPE was close to 10% more accurate than that of the combination benchmark used to compare the submitted methods. 3. The second most accurate method was a combination of seven statistical methods and one ML one, with the weights for the averaging being calculated by an ML algorithm that was trained to minimize the forecasting error. 4. The two most accurate methods also achieved remarkable success in specifying the 95% prediction intervals correctly. 5. The six pure ML methods performed poorly, with none of them being more accurate than the combination benchmark and only one being more accurate than Naïve2. This paper presents some initial results of M4, its major findings and a logical conclusion. Finally, it outlines what the authors consider to be the way forward for the field of forecasting.
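The accuracy findings above rest on the sMAPE measure and on simple combinations (averages) of individual methods' forecasts. The sketch below illustrates both, following the M4-style sMAPE definition; the toy numbers and the equal-weight combination are illustrative only and do not reproduce any competitor's actual method.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE as defined in M4 (reported in percent)."""
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return float(np.mean(np.abs(actual - forecast) / denom) * 100.0)

def combine(forecasts, weights=None):
    """Weighted average of the individual methods' forecasts (rows = methods)."""
    if weights is None:                           # equal-weight combination
        weights = np.full(len(forecasts), 1.0 / len(forecasts))
    return np.average(forecasts, axis=0, weights=weights)

# toy example: three methods forecasting a 6-step horizon
actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])
method_forecasts = np.array([
    [110.0, 117.0, 130.0, 128.0, 122.0, 133.0],
    [115.0, 120.0, 129.0, 131.0, 119.0, 138.0],
    [108.0, 116.0, 134.0, 127.0, 123.0, 131.0],
])
print("sMAPE of the combination:", smape(actual, combine(method_forecasts)))
```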
... Deep learning networks, decision trees, and random forests have become instrumental in processing complex financial datasets and identifying subtle correlations that human analysts might miss. Natural Language Processing (NLP) capabilities have extended the scope of financial analysis by enabling the interpretation of unstructured data from news sources, social media, and market reports, providing additional context for financial decision-making [2]. The M4 forecasting competition demonstrated that advanced ML approaches improved accuracy by about 10-20% over traditional benchmarks [3]. Additionally, NLP techniques, such as those analyzed by Loughran & McDonald, provide richer context from unstructured data, further refining forecasts [4]. ...
... The methodology incorporates multiple variables including historical performance metrics, market indicators, and macroeconomic factors to generate more accurate forecasts. Particular attention is paid to handling non-linear relationships and temporal dependencies that characterize financial time series data [3]. From ARIMA to LSTM, ML-based time series models outperformed traditional methods in M4 by ~10-15% [3]. ...
... For example, a monthly time series spanning 20 years will only have 240 historical observations for training. Consequently, complex deep networks are prone to overfitting on forecasting problems, and in many real-world forecasting problems their superiority, if it exists at all, is not as pronounced as in other domains (Elsayed et al., 2021; Makridakis & Hibon, 2000; Makridakis et al., 2018). Arguably, the reliance on large amounts of accurately labeled data and the inability to incorporate explicit knowledge are among the biggest limitations of DNNs. ...
... We first evaluated the MSE achieved by KDS and DNN independently. KDS achieved a significantly better result than DNN, which is also in line with the findings of Elsayed et al. (2021), Makridakis and Hibon (2000), and Makridakis et al. (2018). However, the results achieved by KENN were significantly better than the results of both KDS and DNN. ...
Article
Full-text available
Purely data-driven deep learning methods often require impractical amounts of high-quality data, which is one of their major weaknesses. This particularly impacts their performance in time series domains, which intrinsically suffer from a scarcity of input features. Furthermore, deep learning methods lack the ability to incorporate explicitly defined human knowledge, which can be crucial for finding effective solutions. To address these challenges, we propose a novel fusion framework, Knowledge Enhanced Neural Network (KENN), for time series forecasting. KENN combines knowledge- and data-driven approaches in a novel residual fusion scheme, where the information in knowledge and data is combined in a complementary manner. We evaluate KENN in a variety of constrained settings with limited data and inaccurate knowledge models. Even when utilizing only 10% of the data for training, KENN outperforms the underlying DNN trained on the complete training set. KENN specifically alleviates the data and accuracy constraints of the constituent data- and knowledge-driven domains while simultaneously improving the overall accuracy. We also compare KENN with recent State-of-the-Art (SotA) methods on 5 real-world forecasting datasets. KENN outperforms the SotA by an average of 42.2% when utilizing the complete training set, and by 39.7% when utilizing only 50% of the training set. A fusion framework that reduces the dependency of DNNs on large datasets and enables harnessing the benefits of knowledge-driven systems will prove useful in many real-world applications.
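The abstract describes a residual fusion of a knowledge-driven model and a DNN without spelling out the implementation, so the sketch below shows one generic way such a scheme can be wired, assuming a seasonal-naive rule as the "knowledge" model and a small scikit-learn MLP that learns only the rule's residuals. The data, model choices, and hyperparameters are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(400)
series = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)

LAGS, SEASON = 12, 12

def make_supervised(y, lags):
    """Lagged windows as inputs, next value as target."""
    X = np.array([y[i - lags:i] for i in range(lags, len(y))])
    return X, y[lags:]

def rule_forecast(y, lags):
    """Knowledge-driven rule (here: seasonal naive), one step ahead."""
    return np.array([y[i - SEASON] for i in range(lags, len(y))])

X, y = make_supervised(series, LAGS)
rule = rule_forecast(series, LAGS)

# residual fusion: the network only learns what the rule cannot explain
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=0)
net.fit(X[:-50], (y - rule)[:-50])
fused = rule[-50:] + net.predict(X[-50:])      # rule + learned correction
print("fused MSE on the last 50 points:", np.mean((fused - y[-50:]) ** 2))
```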
... The solar dataset consists of hourly power generation data from 137 photovoltaic plants in Alabama in 2006 [45]. The M4-hourly dataset, from the M4 Competition, includes 414 hourly time series [46]. The tourism-monthly and tourism-quarterly datasets capture tourism data from Australia, Hong Kong, and New Zealand with 366 and 427 series, respectively [47]. ...
... − solar [45] (https://www.nrel.gov/grid/solar-power-data.html) − M4-hourly [46] (https://github.com/Mcompetitions/M4-methods/tree/master) − tourism-monthly, tourism-quarterly [47] (https://robjhyndman.com/publications/the-tourism-forecasting-competition) ...
Article
Full-text available
Probabilistic forecasting offers insights beyond point estimates, supporting more informed decision-making. This paper introduces the Neural Quantile Function with Recurrent Neural Networks (NQF-RNN), a model for multistep-ahead probabilistic time series forecasting. NQF-RNN combines neural quantile functions with recurrent neural networks, enabling applicability across diverse time series datasets. The model uses a monotonically increasing neural quantile function and is trained with a continuous ranked probability score (CRPS)-based loss function. NQF-RNN’s performance is evaluated on synthetic datasets generated from multiple distributions and six real-world time series datasets with both periodicity and irregularities. NQF-RNN demonstrates competitive performance on synthetic data and outperforms benchmarks on real-world data, achieving lower average forecast errors across most metrics. Notably, NQF-RNN surpasses benchmarks in CRPS, a key probabilistic metric, and tail-weighted CRPS, which assesses tail event forecasting with a narrow prediction interval. The model outperforms other deep learning models by 5% to 41% in CRPS, with improvements of 5% to 53% in left tail-weighted CRPS and 6% to 34% in right tail-weighted CRPS. Against its baseline model, DeepAR, NQF-RNN achieves a 41% improvement in CRPS, indicating its effectiveness in generating reliable prediction intervals. These results highlight NQF-RNN’s robustness in managing complex and irregular patterns in real-world forecasting scenarios.
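CRPS, the key metric in this evaluation, can be approximated from quantile forecasts as twice the average pinball (quantile) loss over a dense grid of quantile levels. The sketch below shows that standard relation; the Gaussian predictive quantiles and the scale of 0.3 are illustrative assumptions and do not represent the NQF-RNN model itself.

```python
import numpy as np
from scipy.stats import norm

def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) loss for a single quantile level tau."""
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def crps_from_quantiles(y, quantile_preds, taus):
    """Approximate CRPS as twice the average pinball loss over quantile levels."""
    losses = [pinball_loss(y, quantile_preds[:, i], tau) for i, tau in enumerate(taus)]
    return 2.0 * float(np.mean(losses))

# toy example: Gaussian predictive quantiles for 5 observations (scale assumed)
taus = np.linspace(0.01, 0.99, 99)
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
quantile_preds = norm.ppf(taus, loc=(y + 0.1)[:, None], scale=0.3)   # shape (5, 99)
print("approximate CRPS:", crps_from_quantiles(y, quantile_preds, taus))
```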
... Moreover, hybridizations of SVR with ES and ARIMA were among the best-performing methods in the M4 competition [8]. Inspired by these ideas, we decided to design a hybrid forecasting method for the COVID-19 time series. ...
Article
Full-text available
We know that SARS-CoV-2 causes the new COVID-19 disease, which is one of the most dangerous pandemics of modern times. This pandemic has critical health and economic consequences, and even the health services of large, powerful nations may become saturated. Thus, forecasting the number of infected persons in any country is essential for controlling the situation. In the literature, different forecasting methods have been published in attempts to solve the problem. However, a simple and accurate forecasting method is required for implementation in any part of the world. This paper presents a precise and straightforward forecasting method named SVR-ESAR (Support Vector Regression hybridized with classical Exponential Smoothing and ARIMA). We applied this method to the time series of infections in four scenarios taken from the GitHub repository: the whole world, China, the US, and Mexico. We compared our results with those in the literature, showing that the proposed method has the best accuracy.
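The exact SVR-ESAR design is not reproduced on this page, so the sketch below shows one plausible way to hybridize SVR with exponential smoothing and ARIMA: both statistical models are fitted, an SVR learns to map their in-sample fitted values to the observations, and the two models' out-of-sample forecasts are then fed through that SVR. The ARIMA order, SVR hyperparameters, and toy series are assumptions for illustration, not the authors' method.

```python
import numpy as np
from sklearn.svm import SVR
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

def hybrid_forecast(y, horizon):
    """Fit ES and ARIMA, then let an SVR map their outputs to the final forecast."""
    es = ExponentialSmoothing(y, trend="add").fit()
    ar = ARIMA(y, order=(2, 1, 1)).fit()

    # in-sample fitted values of both models become the SVR features
    X_train = np.column_stack([es.fittedvalues, ar.fittedvalues])
    svr = SVR(C=10.0, epsilon=0.01).fit(X_train, y)

    # out-of-sample: combine the two models' forecasts through the SVR
    X_future = np.column_stack([es.forecast(horizon), ar.forecast(horizon)])
    return svr.predict(X_future)

# toy cumulative series standing in for an infection count
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(5, 2, 200)) + 100
print(hybrid_forecast(y, 7))
```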
Article
Full-text available
Time series forecasting (TSF) has long been a crucial task in both industry and daily life. Most classical statistical models may have certain limitations when applied to practical scenarios in fields such as energy, healthcare, traffic, meteorology, and economics, especially when high accuracy is required. With the continuous development of deep learning, numerous new models have emerged in the field of time series forecasting in recent years. However, existing surveys have not provided a unified summary of the wide range of model architectures in this field, nor have they given detailed summaries of works on feature extraction and datasets. To address this gap, in this review we comprehensively study the previous works and summarize the general paradigms of Deep Time Series Forecasting (DTSF) in terms of model architectures. In addition, we take an innovative approach by focusing on the composition of time series and systematically explain important feature extraction methods. We also provide an overall compilation of datasets from various domains used in existing works. Finally, we systematically highlight the significant challenges faced and future research directions in this field.
Preprint
Full-text available
The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets (the number most studies report), 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.
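The cherry-picking effect described above can be illustrated with a small simulation: given a matrix of errors (methods by datasets), count how many methods can be made to look best by choosing some favourable subset of datasets. The random error matrix below is purely hypothetical and does not reproduce the authors' data or protocol.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)
n_methods, n_datasets, k = 12, 10, 4

# hypothetical error matrix (rows = methods, columns = datasets); lower is better
errors = rng.gamma(shape=2.0, scale=1.0, size=(n_methods, n_datasets))

def can_look_best(method, errors, k):
    """Is there a k-dataset subset on which `method` has the lowest mean error?"""
    for subset in combinations(range(errors.shape[1]), k):
        if errors[:, list(subset)].mean(axis=1).argmin() == method:
            return True
    return False

best_somewhere = sum(can_look_best(m, errors, k) for m in range(n_methods))
print(f"{best_somewhere}/{n_methods} methods can appear best on some {k}-dataset subset")
```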
Article
Full-text available
This editorial has two parts. The first one describes a personal experience about our attempt to replicate a forecasting study, as well as the rejection of a submitted paper, in our view due to lack of objectivity. The second part discusses the need for reproducibility and replicability in forecasting research and provides suggestions for promoting them in academic journals.
Article
Full-text available
Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions.
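Comparisons of this kind typically rely on scaled error measures such as MASE alongside sMAPE. As a point of reference, a minimal sketch of the standard MASE definition is shown below; the toy monthly numbers are illustrative only.

```python
import numpy as np

def mase(insample, actual, forecast, m=1):
    """Mean Absolute Scaled Error; m is the seasonal period (1 = non-seasonal)."""
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))   # naive in-sample error
    return np.mean(np.abs(actual - forecast)) / scale

# toy monthly example (m = 12): two years in-sample, six months out-of-sample
insample = np.array([100, 102, 98, 105, 110, 108, 112, 115, 111, 118, 120, 119,
                     103, 105, 101, 108, 113, 111, 115, 118, 114, 121, 123, 122], float)
actual   = np.array([106, 108, 104, 111, 116, 114], float)
forecast = np.array([105, 107, 105, 110, 117, 113], float)
print("MASE:", mase(insample, actual, forecast, m=12))
```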
Article
Full-text available
It is common practice to evaluate the strength of forecasting methods using collections of well-studied time series datasets, such as the M3 data. The question is, though, how diverse and challenging are these time series, and do they enable us to study the unique strengths and weaknesses of different forecasting methods? This paper proposes a visualisation method for collections of time series that enables a time series to be represented as a point in a two-dimensional instance space. The effectiveness of different forecasting methods across this space is easy to visualise, and the diversity of the time series in an existing collection can be assessed. Noting that the diversity of the M3 dataset has been questioned, this paper also proposes a method for generating new time series with controllable characteristics in order to fill in and spread out the instance space, making our generalisations of forecasting method performances as robust as possible.
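As a rough illustration of the instance-space idea, the sketch below computes a few simple features per series and projects them to two dimensions with PCA. The actual paper uses a specific, richer feature set and projection, so the features, synthetic series, and PCA step here are assumed stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

def simple_features(y):
    """A few illustrative features; the paper's feature set is richer and specific."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    trend_strength = 1.0 - resid.var() / y.var()        # share of variance due to trend
    acf1 = np.corrcoef(y[:-1], y[1:])[0, 1]             # lag-1 autocorrelation
    cv = y.std() / abs(y.mean())                        # coefficient of variation
    return [trend_strength, acf1, cv]

rng = np.random.default_rng(7)
collection = [np.cumsum(rng.normal(0.1 * i, 1.0, 120)) + 50 for i in range(1, 40)]
features = np.array([simple_features(y) for y in collection])

# standardize the features, then project each series to a point in 2-D
coords = PCA(n_components=2).fit_transform(
    (features - features.mean(axis=0)) / features.std(axis=0))
print(coords[:5])
```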
Article
Full-text available
Forecasting as a scientific discipline has progressed a lot in the last 40 years, with Nobel prizes being awarded for seminal work in the field, most notably to Engle, Granger and Kahneman. Despite these advances, even today we are unable to answer a very simple question, the one that is always the first tabled during discussions with practitioners: “what is the best method for my data?”. In essence, as there are horses for courses, there must also be forecasting methods that are more tailored to some types of data, and, therefore, enable practitioners to make informed method selection when facing new data. The current study attempts to shed light on this direction via identifying the main determinants of forecasting accuracy, through simulations and empirical investigations involving 14 popular forecasting methods (and combinations of them), seven time series features (seasonality, trend, cycle, randomness, number of observations, inter-demand interval and coefficient of variation) and one strategic decision (the forecasting horizon). Our main findings indicate that forecasting accuracy is influenced as follows: (a) for fast-moving data, cycle and randomness have the biggest (negative) effect and the longer the forecasting horizon, the more accuracy decreases; (b) for intermittent data, inter-demand interval has a bigger (negative) impact than the coefficient of variation; and (c) for all types of data, increasing the length of a series has a small positive effect.
Article
Full-text available
In this study, the authors used 111 time series to examine the accuracy of various forecasting methods, particularly time-series methods. The study shows, at least for time series, why some methods achieve greater accuracy than others for different types of data. The authors offer some explanation of the seemingly conflicting conclusions of past empirical research on the accuracy of forecasting. One novel contribution of the paper is the development of regression equations expressing accuracy as a function of factors such as randomness, seasonality, trend-cycle and the number of data points describing the series. Surprisingly, the study shows that for these 111 series simpler methods perform well in comparison to the more complex and statistically sophisticated ARMA models.
Article
In the last few decades many methods have become available for forecasting. As always, when alternatives exist, choices need to be made so that an appropriate forecasting method can be selected and used for the specific situation being considered. This paper reports the results of a forecasting competition that provides information to facilitate such a choice. Seven experts in each of the 24 methods forecasted up to 1001 series over forecasting horizons of six to eighteen periods. The results of the competition are presented in this paper, whose purpose is to provide empirical evidence about differences found to exist among the various extrapolative (time series) methods used in the competition.
Article
In this work we present a large scale comparison study of the major machine learning models for time series forecasting. Specifically, we apply the models to the monthly M3 time series competition data (around a thousand time series). There have been very few, if any, large scale comparison studies of machine learning models for regression or time series forecasting problems, so we hope this study will fill this gap. The models considered are multilayer perceptron, Bayesian neural networks, radial basis functions, generalized regression neural networks (also called kernel regression), K-nearest neighbor regression, CART regression trees, support vector regression, and Gaussian processes. The study reveals significant differences between the methods. The best two methods turned out to be the multilayer perceptron and Gaussian process regression. In addition to the model comparisons, we have tested different preprocessing methods and have shown that they have different impacts on performance.
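A preprocessing step common to all the ML models in such comparisons is turning each series into a supervised regression problem via lagged inputs. The sketch below shows that embedding, with a k-nearest-neighbour regressor standing in for the models studied; the synthetic series, lag length, and hyperparameters are illustrative assumptions, not the study's setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def lag_embed(y, n_lags):
    """Turn a univariate series into (lagged inputs, next value) regression pairs."""
    X = np.array([y[i - n_lags:i] for i in range(n_lags, len(y))])
    return X, y[n_lags:]

rng = np.random.default_rng(3)
t = np.arange(300)
y = 50 + 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

X, target = lag_embed(y, n_lags=12)
split = len(X) - 24                      # hold out the last 24 points
model = KNeighborsRegressor(n_neighbors=5).fit(X[:split], target[:split])
mae = np.mean(np.abs(model.predict(X[split:]) - target[split:]))
print(f"one-step-ahead MAE on the hold-out (untuned demo): {mae:.3f}")
```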
Article
This paper describes the M3-Competition, the latest of the M-Competitions. It explains the reasons for conducting the competition and summarizes its results and conclusions. In addition, the paper compares such results/conclusions with those of the previous two M-Competitions as well as with those of other major empirical studies. Finally, the implications of these results and conclusions are considered, their consequences for both the theory and practice of forecasting are explored and directions for future research are contemplated.
Article
The purpose of the M2-Competition is to determine the post-sample accuracy of various forecasting methods. It is an empirical study organized in such a way as to avoid the major criticism of the M-Competition: that forecasters in real situations can use additional information to improve the predictive accuracy of quantitative methods. Such information might involve inside knowledge (e.g. a machine breakdown, a forthcoming strike at a major competitor, a steep price increase, etc.), be related to the expected state of the industry or economy that might affect the product(s) involved, or be the outcome of a careful study of the historical data and special care in the procedures/methods employed while forecasting. The M2-Competition consisted of distributing 29 actual series (23 of these series came from four companies and six were of a macroeconomic nature) to five forecasters. The data covered the period up to and including the September figures of the year involved. The objective was to make monthly forecasts covering 15 months, starting from October and extending through December of the next year. A year later the forecasters were provided with the new data as they had become available, and the process of predicting 15 months ahead was repeated. In addition to being able to incorporate their predictions about the state of the economy and that of the industry, the participating forecasters could ask for any additional information they wanted from the collaborating companies. Although the forecasters had additional information about the series being predicted, the results show few or no differences in post-sample forecasting accuracy when compared with those of the M-Competition or the earlier Makridakis and Hibon empirical study.
M4 Team (2018). M4 competitor's guide: prizes and rules. https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-CompetitorsGuide.pdf