The Accuracy of Machine Learning (ML) Forecasting Methods
versus Statistical Ones: Extending the Results of the
M3-Competition
Spyros Makridakis^a,*, Evangelos Spiliotis^b, Vassilios Assimakopoulos^b
^a Director, Institute For the Future (IFF), University of Nicosia, Cyprus
^b Forecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou Str, 15773 Zografou Athens, Greece
Abstract
Machine Learning (ML) methods have been proposed in the academic literature as alterna-
tives to statistical ones for forecasting. Yet, scant evidence is available about their perfor-
mance in terms of accuracy and computational requirements. The purpose of this paper is
to evaluate such performance using a large subset of 1045 monthly time series selected from
the M3-Competition. In addition to a brief review of published studies dealing with the
forecasting accuracy of ML methods, the main part of this paper compares the post-sample
accuracy of popular ML methods with eight traditional statistical ones. Such a compari-
son shows the dominance of the statistical over the ML methods for all the 18 forecasting
horizons examined across both the accuracy measures used. Moreover, the study finds that
the computational requirements of the ML methods are considerably greater than those
of the statistical ones. The paper also discusses the results and attempts to explain why
the forecasting accuracy of ML models is below expectations while proposing some possible
ways to improve it. The paper concludes by stressing the need for objective and unbiased ways to test the performance of forecasting methods, which must be achieved through open competitions involving a sizable number of series and allowing meaningful comparisons and definite conclusions.
Keywords: Forecasting, Machine learning methods, Statistical methods, M3-competition
∗Corresponding author
Email address: makridakis.s@unic.ac.cy (Spyros Makridakis)
Preprint submitted to Neural Networks July 6, 2017
1. Introduction
Artificial Intelligence (AI) has gained considerable prominence during the last decade, fueled by a number of high-profile applications in Autonomous Vehicles (AV), intelligent robots, image and speech recognition, automatic translation, medical and legal applications, as well as beating champions in games like Jeopardy, GO and poker (Makridakis, 2017). The successes of AI are based on the utilization of algorithms capable of learning by trial and error and improving their performance over time, rather than on step-by-step coded instructions based on logic, if-then rules and decision trees, the sphere of traditional programming. The purpose of this paper is to evaluate a special class of AI models utilizing Machine Learning (ML) algorithms, also called Neural Networks (NNs) in the literature, for forecasting time series. As forecasting is of considerable practical value, improving its performance can provide substantial benefits to end users and needs to be investigated in depth.
This paper consists of three sections. The first briefly reviews published empirical studies and investigates the performance of ML (or, alternatively, NN) methods in comparison to statistical ones, also deliberating on some major issues related to forecasting accuracy. The
second, main part of the paper, uses a subset of 1045 monthly series, selected from the 3003
of the M3-Competition (Makridakis and Hibon,2000), to calculate the performance of eight
popular ML methods and eight traditional statistical ones. Consequently, comparisons are
made between the ML and statistical methods as well as with the results of Ahmed et al.
(2010) that have also estimated the accuracy of these eight ML methods using the same
time series. All accuracy comparisons were made using the first n−18 observations to train the forecasting model, where n is the length of the series; 18 forecasts were then produced and their accuracy evaluated against the actual values withheld from training (the post-sample accuracy). In addition, the computational complexity of the methods used was recorded, as well as the accuracy of fitting the models to the n−18 historical data (Model Fit).
The third section discusses the outcome of the comparisons and attempts to explain why
the forecasting accuracy of ML models is below expectations, while also proposing possible
ways to improve it. A critical question being asked is whether ML methods can actually be made to "learn" more efficiently using more information about the future and unknown errors,
rather than the past ones. The conclusion stresses the need for realism by accepting the
inferior accuracy of ML methods versus that of statistical ones and the necessity to devise
ways for improving it. Finally, the requirement to obtain objective, unbiased information
about the performance of ML and other forecasting methods through open competitions
made up of a sizable number of series, is emphasized as the best way of improving their
performance by analyzing and understanding the reasons why some methods are less/more
accurate than others.
2. The Accuracy of ML Methods: A Brief Review and Discussion
The first application of NNs (as ML methods were called at that time, and often still are) to forecasting goes back to 1964, but attracted little follow-up until the technique of backpropagation was introduced almost 20 years later (Zhao, 2009). Since then there have been numerous studies utilizing NN methods, some of them comparing their accuracy with traditional, statistical ones. A good number of these studies, going back to 1995, are summarized in the work of Ahmed et al. (2010), who concluded: "The outcome of all of these studies has been somewhat mixed". A similar conclusion was reached by Adya and Collopy
(1998) who evaluated 48 NN studies and also stated that their accuracy in comparison to
statistical methods provided mixed results. What characterized all these studies, however,
was the limited number of series employed for the comparisons.
The first large-scale study, using 3003 time series, dates back to the M3-Competition, published in 2000 by Makridakis and Hibon (2000), which included an Automatic NN (ANN) method whose accuracy was about average compared with the traditional statistical methods included in the Competition, and below that of the most accurate ones (see Table 1). Eleven years
later, Crone, Hibon and Nikolopoulos (C-H-N) published the results of a specialized NN
competition, using a subset of the M3 monthly data (Crone et al.,2011). In this competi-
tion they compared 22 NN and CI (Computational Intelligence) methods, in addition to 11
statistical ones. Their conclusion was that no ML method outperformed the Theta method
(Assimakopoulos and Nikolopoulos,2000), the most accurate one in the M3-Competition,
and only one (Ilies et al., 2007) was more accurate than Damped trend exponential smoothing (Gardner, 2006) when the symmetric mean absolute percentage error (sMAPE), averaged over all 18 forecasting horizons, was used. However, four NN methods did better than the ANN of the M3-Competition, indicating improvements in the accuracy of newer NN methods. Overall, however, the accuracy of the NN methods was not exceptional vis-à-vis those of the M3-Competition, or the 11 statistical ones included in the C-H-N study (see Table 2).
Table 1: Average sMAPE across the 3003 time series of the M3-competition: statistical methods and the ANN method (the NN entry, shaded gray in the original).

Method | Forecasting horizon: 1, 2, 3, 4, 5, 6, 8, 12, 15, 18 | Average of horizons: 1-4, 1-6, 1-8, 1-12, 1-15, 1-18
Theta 8.4 9.6 11.3 12.5 13.2 14.0 12.0 13.2 16.2 18.2 10.4 11.5 11.6 12.0 12.4 13.0
Damped 8.8 10.0 12.0 13.5 13.7 14.3 12.5 13.9 17.5 18.9 11.1 12.0 12.1 12.4 13.0 13.6
Box-Jenkins 9.2 10.4 12.2 13.9 14.0 14.8 13.0 14.1 17.8 19.3 11.4 12.4 12.5 12.8 13.4 14.0
ANN 9.0 10.4 11.8 13.8 13.8 15.5 13.4 14.6 17.3 19.6 11.2 12.4 12.6 13.0 13.5 14.1
Single 9.5 10.6 12.7 14.1 14.3 15.0 13.3 14.5 18.3 19.4 11.7 12.7 12.8 13.1 13.7 14.4
Holt 9.0 10.4 12.8 14.5 15.1 15.8 13.9 14.8 18.8 20.2 11.7 12.9 13.1 13.4 14.0 14.6
Naive 2 10.5 11.3 13.6 15.1 15.1 15.9 14.5 16.0 19.3 20.7 12.6 13.6 13.8 14.2 14.8 15.5
ML techniques have been gaining prominence over time as interest in AI has been rising.
They are used to predict financial series (Hamid and Habib,2014;Wang and Wang,2017),
the direction of the stock market (Qiu and Song,2016), macroeconomic variables (Kock and
Ter¨asvirta,2016), accounting balance sheet information (Gabor and Dorgo,2017), and a
good number of other applications, covering a wide range of areas (Marr,2016). A major
purpose of this study is to determine, empirically, if their performance exceeds that of
statistical methods and how their advantages could be exploited to improve forecasting
accuracy. What seems certain is that Chatfield's (1993) question of whether NNs would prove a "breakthrough or passing fad" has not been settled either way: their performance cannot yet be classified as a breakthrough, but they are still used to a great extent, and there are indications that such usage will increase over time as newer ML methods are introduced and more efficient ways are devised to improve their accuracy (Ahmed et al., 2010; Zhang and Qi, 2005) and their computational efficiency.
Table 2: sMAPE and ranks of errors on the complete dataset of the C-H-N study (NN methods were shaded gray in the original).

Method | Average errors: sMAPE(%), MdRAE, MASE, AR | Rank across all methods: sMAPE(%), MdRAE, MASE, AR
Theta 14.89 0.88 1.13 17.8 2 3 1 2
Illies 15.18 0.84 1.25 18.4 3 2 11 4
ForecastPro 15.44 0.89 1.17 18.2 4 4 3 3
DES 15.90 0.94 1.17 18.9 5 14 3 6
Comb S-H-D 15.93 0.90 1.21 18.8 6 5 7 5
Autobox 15.95 0.93 1.18 19.2 7 11 5 7
Flores 16.31 0.93 1.20 19.3 8 11 6 8
SES 16.42 0.96 1.21 19.6 9 16 7 12
Chen 16.55 0.94 1.34 19.5 11 14 18 9
D’yakonov 16.57 0.91 1.26 20.0 12 7 12 15
ANN (M3) 16.81 0.91 1.21 19.5 13 7 7 9
Kamel 16.92 0.90 1.28 19.6 14 5 13 12
3. The Major Contribution of this paper: Comparing the Accuracy of ML Fore-
casting Methods with Traditional Statistical Ones
As highlighted in the previous section, Ahmed et al. (2010) compared in their study the accuracy of eight families of ML models:
1. Multi-Layer Perceptron (MLP)
2. Bayesian Neural Network (BNN)
3. Radial Basis Functions (RBF)
4. Generalized Regression Neural Networks (GRNN), also called kernel regression
5. K-Nearest Neighbor regression (KNN)
6. CART regression trees (CART)
7. Support Vector Regression (SVR), and
8. Gaussian Processes (GP)
(for more information regarding these models, see the work of Ahmed et al. (2010), Alpaydin (2004) and Hastie et al. (2009), as well as our own descriptions in section 3.3 below).
To do so, Ahmed and co-authors used a subset of 1045 series (the same ones being used in
our study), selected from the monthly ones of the M3-Competition, having a length between
81 and 126 months. However, before computing the 18 forecasts, they preprocessed the
series in order to achieve stationarity in their mean and variance. This was done using the log transformation, then deseasonalization and finally scaling, while first differences were also considered for removing the trend component. They then calculated one-step-ahead forecasts for each of the 1045 series. The sMAPE and the ranking of the eight ML methods can be seen in Table 3 (for details of how the preprocessing was done, how the forecasts were produced and how the accuracy measures were computed, see the paper by Ahmed et al. (2010)). As seen, the most accurate ML method is the MLP, the next
one is the BNN and the third the GP. The sMAPE of the remaining methods is in double digits, indicating a distinct difference in accuracy. It would be of considerable research value to investigate the reasons for the differences in accuracy among the eight ML methods and to derive guidelines for selecting the most appropriate one for new types of forecasting applications.
Table 3: The overall performance (sMAPE) of the ML methods tested in the study of Ahmed et al.
Rank Method sMAPE(%)
1 MLP 8.34
2 BNN 8.58
3 GP 9.62
4 GRNN 10.33
5 KNN 10.34
6 SVR 10.40
7 CART 11.72
8 RBF 15.79
In this respect, the major contribution of this paper is to extend Ahmed and co-authors' study in five directions. First, we included eight statistical methods in the comparisons, to conclude whether modern ML methods outperform typical time series methods. Second, we introduced an additional accuracy measure, the MASE, to ensure that the same conclusions stand for more robust types of metrics. Third, we produced forecasts for 18 horizons ahead using three alternative approaches, not just the one-step-ahead forecasts computed by Ahmed and co-authors, to test the forecasting performance of ML methods over longer horizons. Fourth, we computed a measure of computational complexity to determine the amount of computer time required by each method to develop the model and obtain the forecasts. Finally, we estimated a Model Fit measure to determine how well each method fitted the n−18 historical data used for training and parameter estimation, the purpose of this measure being to detect possible over-fitting that may decrease the accuracy of
the post-sample forecasts.
3.1. Accuracy Measures
Two accuracy measures are used in this paper: the symmetric Mean Absolute Percentage Error (sMAPE), which was originally used in the M3-competition for evaluating the participating methods, and the Mean Absolute Scaled Error (MASE). The first measure
is defined as follows:

\[ \text{sMAPE} = \frac{2}{k}\sum_{t=1}^{k}\frac{\left|Y_t - \hat{Y}_t\right|}{|Y_t| + |\hat{Y}_t|} \times 100\%, \qquad (1) \]

where k is the forecasting horizon, Y_t are the actual observations and Ŷ_t the forecasts produced by the model at point t.
Moreover, to estimate the average forecasting accuracy more reliably, the forecasts Ŷ_t were estimated ten times and the average of the resulting errors was used, to avoid problems caused by the selection of the initial values chosen for the parameterization of the ML methods (Hansen and Salamon, 1990). In this paper the first n−18 observations were used for training/validating the models, and the last 18 for testing their forecasting accuracy (following the same procedure as that of the M-Competitions).
In addition, since sMAPE penalizes large positive errors more than negative ones (Goodwin and Lawton, 1999), the Mean Absolute Scaled Error (MASE) was also estimated to complement the sMAPE (Hyndman and Koehler, 2006). The MASE is defined as follows:

\[ \text{MASE} = \frac{\frac{1}{k}\sum_{t=1}^{k}\left|Y_t - \hat{Y}_t\right|}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t - Y_{t-m}\right|}, \qquad (2) \]

where n is the number of available historical observations and m is the frequency of the time series.
MASE, among its other characteristics, is independent of the scale of the data; its value is less than one if the forecast is more accurate than the average in-sample prediction of the (seasonal) Naive benchmark, and greater than one if it is less accurate.
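For concreteness, equations (1) and (2) can be computed along the following lines; this is a minimal R sketch (the function and argument names are our own illustration, not code from the study):

```r
# Minimal R implementations of equations (1) and (2).
# insample: the n historical observations, outsample: the k withheld actuals,
# forecasts: the k predictions, m: the frequency of the series (12 for monthly data).

smape <- function(outsample, forecasts) {
  # symmetric Mean Absolute Percentage Error, in percent
  mean(2 * abs(outsample - forecasts) / (abs(outsample) + abs(forecasts))) * 100
}

mase <- function(insample, outsample, forecasts, m = 12) {
  # denominator: in-sample mean absolute error of the seasonal Naive method
  n <- length(insample)
  scale <- mean(abs(insample[(m + 1):n] - insample[1:(n - m)]))
  mean(abs(outsample - forecasts)) / scale
}
```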
3.2. Computational Complexity and Model Fitting
Computational complexity (CC) is used to determine the time needed to train a given
model and use it for extrapolation. Thus, CC can be simply defined as the mean compu-
tational time required by the model to predict a time series, divided by the corresponding
time needed by the Naive method to achieve the same task. In this regard, we end up with a
relative metric indicating the additional, proportional time required for obtaining the fore-
casts from the more complex methods. Computational time was estimated using a system
with the following characteristics: Intel Core i7-4790 CPU @ 3.60GHz, 8.00 GB RAM, x64
based processor.
\[ \text{Computational Complexity (CC)} = \frac{\text{Computational Time of Model}}{\text{Computational Time of Naive}} \qquad (3) \]
Finally, how well a model fits (MF) the historical data is defined as

\[ \text{Model Fitting (MF)} = \frac{n \sum_{t=1}^{n}\left(Y_t - \hat{Y}_t\right)^2}{\left(\sum_{t=1}^{n} Y_t\right)^2} \times 100\%, \qquad (4) \]
Expression (4) is actually the Mean Squared Error (MSE) of the n−k model-fit forecasts, normalized by the squared mean value of the time series being used.
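Equations (3) and (4) translate directly into code; in the sketch below, fit_model and fit_naive are hypothetical routines standing in for whichever methods are being timed:

```r
# Relative computational complexity, eq. (3): time of the model over time of Naive.
cc <- function(fit_model, fit_naive, series) {
  t_model <- system.time(fit_model(series))["elapsed"]
  t_naive <- system.time(fit_naive(series))["elapsed"]
  as.numeric(t_model / t_naive)
}

# Model fit, eq. (4): in-sample MSE normalized by the squared mean of the series.
mf <- function(actuals, fitted_values) {
  n <- length(actuals)
  n * sum((actuals - fitted_values)^2) / sum(actuals)^2 * 100
}
```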
3.3. Statistical and ML Methods Utilized
To compare the performance of ML methods to traditional statistical ones, we included
the six most accurate methods of the M3 competition plus a naive benchmark, Naive 2,
which is actually a random walk model adjusted for seasonality. The second method is
Simple Exponential Smoothing (SES) (Gardner,1985), aimed at predicting series without a
trend. The third and fourth (Holt and Damped exponential smoothing (Gardner,2006)), are
most appropriate for time series with trend. The fifth is a combination (average) of the three
exponential smoothing methods just described: SES, Holt and Damped, (Comb), aimed at
achieving the possible benefits of averaging the errors of multiple forecasts (Andrawis et al.,
2011). The sixth model is the Theta method (Assimakopoulos and Nikolopoulos,2000),
that achieved the best overall sMAPE in the original M3-competition. Finally, the seventh
and eighth models are an automatic ARIMA (Hyndman and Khandakar,2008) and an
automatic exponential smoothing (Hyndman et al., 2002), which recent studies have shown to be of considerable accuracy. Moreover, the former serves as a good benchmark to compare
ML models, being the linear form of the most popular neural network, the Multi-Layer
Perceptron (Connor et al.,1994). A brief description of the same eight ML methods used
by Ahmed and co-authors as well as by this study is provided next, while for the description
of the statistical ones see Makridakis et al. (1998).
3.3.1. Multi-Layer Perceptron (MLP)
First, a single-hidden-layer NN is constructed. Second, the best number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process, with the inputs being the observations Y_{t-5}, Y_{t-4}, Y_{t-3}, Y_{t-2} and Y_{t-1} for predicting the time series at point t, and doing so for all the n−18 data. Third, the number of hidden nodes is set to 2N + 1, following the practical guidelines suggested by Lippmann (1987), aimed at decreasing the computational time needed for constructing the NN model (the number of hidden layers used is typically of secondary importance (Zhang et al., 1998)). The Scaled Conjugate Gradient method (Møller, 1993) is then used instead of standard backpropagation for estimating the optimal weights. The method, which is an alternative to the well-known Levenberg-Marquardt algorithm, has been found to perform better in many applications and is considered more appropriate for weight
optimization. The learning rate is selected between 0.1 and 1, using random initial weights
for starting the training process with a maximum of 500 iterations. Finally, to maximize the
flexibility of the method, although the activation function of the hidden layer is a logistic
one, a linear function is used for the output nodes. This is crucial since, if a logistic output
activation function is used for optimizing trended time series, it is bounded and doomed
to fail (Zhang and Qi, 2005). In addition, due to the nonlinear activation functions, the data are scaled between 0 and 1 to avoid computational problems, meet the algorithm's requirements and facilitate faster network learning (Zhang et al., 1998). The linear transformation is as follows:
\[ Y' = \frac{Y - Y_{\min}}{Y_{\max} - Y_{\min}} \qquad (5) \]
Once all predictions are made, the forecasts are then rescaled back to the original scale.
Having defined the architecture of the optimal neural network, 100 additional MLP models were trained and used to extrapolate the series. The mean, median and mode of the individual forecasts were then used as the final forecasts. This was done to evaluate the possible benefits of forecast combination, extensively reported in the literature, especially in the case of neural networks, which are characterized by great variations when different initial parameters are used (Kourentzes et al., 2014). Yet, given that the gains in accuracy from combining multiple forecasts were negligible and the complexity was almost double, we do not present these results, as doing so is unnecessary.
The MLP method was constructed using the mlp function of the RSNNS R statistical
package (Bergmeir and Benitez,2012).
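The setup just described might be sketched in R as follows; this is not the authors' actual code, and the example series, the SCG parameter vector and the single one-step forecast at the end are illustrative assumptions, but the N = 5 lagged inputs, the 2N + 1 hidden nodes, the linear output and the scaling of equation (5) follow the text:

```r
library(RSNNS)

N <- 5                                  # number of lagged inputs (chosen by CV in the paper)
y <- as.numeric(AirPassengers)          # example series; a deseasonalized series in practice
y_min <- min(y); y_max <- max(y)
y_sc <- (y - y_min) / (y_max - y_min)   # linear scaling to [0, 1], equation (5)

# Lag-embedded training data: inputs Y[t-N], ..., Y[t-1], target Y[t]
E <- embed(y_sc, N + 1)                 # row i: Y[t], Y[t-1], ..., Y[t-N]
trainX <- E[, (N + 1):2]                # reorder inputs oldest first
trainY <- E[, 1]

model <- mlp(trainX, trainY,
             size = 2 * N + 1,          # hidden nodes, per Lippmann's guideline
             maxit = 500,               # maximum training iterations
             learnFunc = "SCG",         # Scaled Conjugate Gradient
             learnFuncParams = c(0, 0, 0, 0),  # default SCG parameters (assumption)
             linOut = TRUE)             # linear output activation

# One-step-ahead forecast from the last N observations, rescaled to the original units
n <- length(y_sc)
fc <- predict(model, matrix(y_sc[(n - N + 1):n], nrow = 1)) * (y_max - y_min) + y_min
```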
3.3.2. Bayesian Neural Network (BNN)
The BNN is similar to the MLP method but optimizes the network parameters according
to the Bayesian concept, meaning that the weights are estimated assuming some a priori
distributions of errors. The method was constructed according to the suggestions provided
by MacKay (1992) and Dan Foresee and Hagan (1997). It uses the Nguyen and Widrow
algorithm (Nguyen and Widrow,1990) to assign initial weights and the Gauss-Newton algo-
rithm to perform the optimization. Similarly to the MLP method, the best number of input
nodes N= [1,2, ..., 5] was defined using a 10-fold validation process and the number of the
hidden nodes was set to 2N+ 1. A total number of 500 iterations were considered and the
data was linearly scaled.
The BNN method was constructed exploiting the brnn function of the brnn R statistical
package (Rodriguez and Gianola,2016).
3.3.3. Radial Basis Functions (RBF)
RBF is a feed-forward network with one hidden layer and is similar to the MLP method.
Yet, instead of using a sigmoid activation function, it performs a linear combination of n
basis functions that are radially symmetric around a center. Thus, information is repre-
sented locally in the network, which allows the method to be more interpretable and faster
to compute. Like the previous approaches, the best number of input nodes N= [1,2, ..., 5]
is defined using a 10-fold validation process and the number of the hidden nodes is auto-
matically set to 2N+ 1. A total number of 500 iterations were considered and the data were
linearly scaled. The output activation function is the linear one.
The RBF method was constructed exploiting the rbf function of the RSNNS R statistical
package (Bergmeir and Benitez,2012).
3.3.4. Generalized Regression Neural Networks (GRNN)
The GRNN method, also called the Nadaraya-Watson estimator or the kernel regression
estimator, is implemented by the algorithm proposed by Specht (1991). In contrast to the
previous methods, GRNN is nonparametric and the predictions are found by averaging the
target outputs of the training data points according to their distance from the observation
provided each time. The sigma parameter, which determines the smoothness of the fit, is
selected together with the number of inputs N using the 10-fold validation process. The
inputs, linearly scaled, varied from 1 to 5 and the sigma from 0.05 to 1, with a step of 0.05.
The GRNN method was constructed exploiting the guess, learn and smooth functions of
the grnn R statistical package (P.-O. Chasset,2013).
3.3.5. K-Nearest Neighbor regression (KNN)
KNN is a nonparametric regression method basing its forecasts on a similarity measure,
the Euclidean distance between the points used for training and testing the method. Thus,
given Ninputs, the method picks the closest K training data points and sets the prediction as
the average of the target output values for these points. The K parameter, which determines the smoothness of the fit, is once again optimized together with the number of inputs using the 10-fold validation process. The inputs, which are linearly scaled, may vary from 1 to 5 and K from 2 to 10.
The KNN method was constructed exploiting the knn function of the class R statistical
package (Venables and Ripley,2002).
3.3.6. CART regression trees (CART)
CART is a regression method based on tree-like recursive partitioning of the input space
(Breiman,1993). The space specified by the training sample is divided into regions, called
the terminal leaves. Then, a sequence of tests is introduced and applied to decision nodes in order to define in which leaf node an object should be classified based on the input provided. The tests are applied serially from the root node to the leaves, until a final decision is made. Like the previous approaches, the total number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process, and the inputs are then linearly scaled.
The CART method was constructed exploiting the rpart function of the rpart R statistical
package (Therneau et al.,2015).
3.3.7. Support Vector Regression (SVR)
SVR is the regression process performed by a Support Vector Machine, which tries to identify the hyperplane that maximizes the margin between two classes and minimizes the total error under tolerance (Schölkopf and Smola, 2001). For an efficient SVM to be constructed, a complexity penalty is also introduced, balancing forecasting accuracy and computational performance. Since in the present study accuracy is far more important than complexity, forecasts were produced using an ε-regression SVM, which maximizes the borders of the margin under suitable conditions to avoid including outliers, letting the SVM decide the number of support vectors needed. The kernel used in training and predicting is the radial basis one, mainly due to its good general performance and the small number of parameters it requires. Following the suggestions of Ahmed et al. (2010), ε is set equal to the noise level of the training sample, while the cost of constraint violation C is fixed to the maximum of the target output values, which is 1. Then, the γ parameter is optimized together with the total number of inputs N set for the method, using a 10-fold validation process. The inputs are linearly scaled, as in the previously described methods.
The SVR method was constructed exploiting the svm function of the e1071 R statistical
package (Meyer et al.,2017).
3.3.8. Gaussian Processes (GP)
According to GP, every target variable can be associated with one or more normally
distributed random variables which form a multivariate normal distribution, emerging by
combining the individual distributions of the independent ones (Rasmussen and Williams,
2006). In this respect, Gaussian processes can serve as a nonparametric regression method
which assumes an a priori distribution for the input variables provided during training,
and then combines them appropriately using a measure of similarity between points (the
kernel function) to predict the future value of the variable of interest. In our case, the input
variables are the past observations of the time series, linearly scaled, while their total number
N= [1,2, ..., 5] is defined using a 10-fold validation process. The kernel function used is
the radial basis one, while the initial noise variance and the tolerance of termination were set to 0.001 given that, as suggested by Ahmed et al. (2010), it would be computationally prohibitive to use a three-dimensional 10-fold validation approach to define them.
The GP method was constructed exploiting the gausspr function of the kernlab R sta-
tistical package (Karatzoglou et al.,2004).
3.4. Preprocessing the Original Data
In contrast to sophisticated time series forecasting methods where achieving stationarity
in both the mean and variance is considered essential, the literature of ML is divided with
some studies claiming that ML methods are capable of effectively modeling any type of data
pattern and can be applied, therefore, to the original data (Gorr,1994). Other studies,
however, have concluded the opposite claiming that without appropriate preprocessing, ML
methods may become unstable and yield suboptimal results (Zhang and Qi,2005).
Preprocessing can be applied in three forms: Seasonal adjustments, log or power trans-
formations, and removing the trend. For instance, Sharda and Patil (1992) found that
MLP cannot capture seasonality adequately, while Nelson et al. (1994) claim exactly the opposite. There is greater agreement concerning the trend, given the bounded nature of the MLP's activation functions, which become more stable once the series has been detrended (Cottrell et al., 1995). Yet, more empirical results are needed to support the conclusions related to preprocessing, including the most appropriate way to eliminate the trend in the data (Nelson and Plosser, 1982).
3.4.1. Preprocessing Alternatives
There are various preprocessing alternatives when it comes to time series forecasting.
Some indicative ones are the following:
•Original data: No pre-processing is applied.
•Transforming the data: The log or the Box-Cox (Box and Cox,1964) power trans-
formation is applied to the original data in order to achieve stationarity in the variance.
•Deseasonalizing the data: The data are considered seasonal if a significant auto-
correlation coefficient at lag 12 exists. In such case the data are deseasonalized using
the classical, multiplicative decomposition approach (Makridakis et al.,1998). The
training of the ML weights, or the optimization of statistical methods, is subsequently
done on the seasonally adjusted data. The forecasts obtained are then reseasonalized
to determine the final predictions. This is not done in the case of ETS and ARIMA
methods since they include seasonal models, selected using relative tests and informa-
tion criteria that take care of seasonality and model complexity directly.
•Detrending the data: A Cox-Stuart test (Cox and Stuart,1955) is performed to de-
termine if a deterministic linear trend should be used, or alternatively first differencing,
to eliminate the trend from the data and achieve stationarity in the mean.
•Combination of the above three: The benefits of the individual preprocessing techniques are applied simultaneously to adjust the original data (a small sketch combining them follows this list).
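As an illustration of the combined alternative, the following sketch applies a Box-Cox transformation, multiplicative deseasonalization and linear detrending using the forecast R package; the example series and the exact ordering of the steps are our own assumptions rather than the authors' exact procedure:

```r
library(forecast)

y <- AirPassengers                       # example monthly series, frequency 12

# 1. Box-Cox transformation to stabilize the variance
lambda <- BoxCox.lambda(y)
y_bc <- BoxCox(y, lambda)

# 2. Classical multiplicative decomposition to remove seasonality
dec <- decompose(y_bc, type = "multiplicative")
y_des <- y_bc / dec$seasonal

# 3. Remove a deterministic linear trend (where the Cox-Stuart test indicates one)
t_idx <- seq_along(y_des)
trend_fit <- lm(as.numeric(y_des) ~ t_idx)
y_stat <- residuals(trend_fit)           # stationary series fed to the ML method

# Forecasts made on y_stat are then re-trended, re-seasonalized and
# back-transformed with InvBoxCox() to obtain the final predictions.
```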
In order to speed up computations, we first determine the best preprocessing alternative for improving the post-sample one-step-ahead forecasting performance of the MLP method (the most popular among the ML ones) and then apply it to the remaining ML models. Table 4 summarizes the results of the various preprocessing possibilities, as well as their combinations, when predicting the 18 unknown observations of the 1045 time series, as applied in the study of Ahmed et al. (2010). The best combination according to sMAPE is number 7 (Box-Cox transformation and deseasonalization), while the best one according to MASE is number 10 (Box-Cox transformation, deseasonalization and detrending). Some interesting observations from studying the results of Table 4 are the following:
Table 4: Average forecasting performance using MLP and various preprocessing approaches

No. Approach sMAPE(%) MASE CC MF
1 Original data 9.15 0.67 90.24 2.80
2 Transformation: Log 8.99 0.67 88.04 3.06
3 Transformation: Box-Cox 8.97 0.67 88.07 2.99
4 Detrending: Linear deterministic function 10.43 0.65 87.05 2.67
5 Detrending: First differencing 11.87 0.86 85.02 2.95
6 Deseasonalisation 8.16 0.57 93.31 2.10
7 Combination 3 & 6 7.96 0.56 88.54 2.16
8 Combination 6 & 4 9.56 0.56 84.44 2.01
9 Combination 3 & 4 9.07 0.64 88.78 2.82
10 Combination 3 & 6 & 4 8.39 0.55 83.49 2.11
•Transforming the data with the Box-Cox method is a slightly better option than using logs in terms of the sMAPE. But neither alternative improves accuracy much, and neither provides any improvement in MASE.
•Seasonal adjustments provide significantly better results in both sMAPE and MASE.
•Removing the trend from the data using a linear function provides more accurate re-
sults (in both the sMAPE and MASE) than the alternative of using the first difference.
•There are important differences in sMAPE and MASE with each type of preprocessing.
More work on the most appropriate preprocessing approach is required first to confirm
the results found using a subset of the series of the M3-Competition and second to determine
if the conclusions using the MLP method will be similar to those of other ML ones.
Transforming the data also seems beneficial when using traditional forecasting methods. This can be seen in Tables 5 and 6, which display the forecasting performance of the eight statistical methods included in this work according to sMAPE and MASE. The sMAPE accuracies show a consistent improvement, while those of MASE are about the same. Moreover, after transformation, the differences between the various methods become smaller, meaning that simpler methods, such as Damped, can be used instead of ETS, which may be more accurate but is also the most time-intensive.
Table 5: Average forecasting performance using statistical forecasting methods: Original data
Method sMAPE(%) MASE CC MF
Naive 2 8.59 0.56 1.00 3.63
SES 7.36 0.49 1.53 2.37
Holt 7.41 0.48 2.31 2.35
Damped 7.30 0.48 3.96 2.34
Comb 7.27 0.48 6.88 2.32
Theta 7.31 0.48 5.84 2.34
ARIMA 7.34 0.47 43.96 2.53
ETS 7.19 0.47 34.07 2.28
Table 7 compares the one-step-ahead forecasts of the eight ML methods used by Ahmed and colleagues and by our own study once the most appropriate preprocessing has been applied. This consists of the Box-Cox transformation, deseasonalization and detrending, since evaluating forecasting performance through MASE instead of sMAPE is considered, as mentioned in section 3.1, a more reliable choice.
Table 6: Average forecasting performance using statistical forecasting methods: Box-Cox transformation
Method sMAPE(%) MASE CC MF
Naive 2 8.58 0.56 1.28 3.66
SES 7.25 0.49 1.69 2.38
Holt 7.32 0.48 2.45 2.35
Damped 7.19 0.48 4.54 2.33
Comb 7.20 0.48 7.23 2.32
Theta 7.23 0.48 5.75 2.36
ARIMA 7.19 0.47 46.56 2.59
ETS 7.12 0.47 35.55 2.30
There are some important similarities in the overall sMAPE (see column 2) between the results of Ahmed et al. (2010) and our own, indicating that the forecasts of ML methods are consistent over the period of seven years. There are also some important differences that can be justified given the long time that has passed between the two studies and the computational advancements in utilizing the ML algorithms. At the same time, the reasons for the huge difference in RBF (9.57% vs 15.79%), as well as the smaller ones in CART and SVR, need to be investigated. Undoubtedly, the use of slightly different parameterisations for applying the individual methods, as well as the exploitation of different functions for their implementation (R instead of Matlab), might explain part of the variations.
The results in Table 7 show that MLP and BNN outperform the remaining ML methods. Thus, these two methods will be the only ones to be investigated further, by comparing their forecasting accuracy beyond one-step-ahead predictions to multiple horizons, which is important for all those interested in predicting more than one period ahead.
3.5. Multiple-horizon forecasts
There are three alternative ways of producing multiple-period forecasts with ML models.
Table 7: Forecasting performance of the eight ML methods included in the study for one-step-ahead forecasts
having applied the most appropriate preprocessing alternative. The corresponding accuracies of Ahmed and
coauthors, from their Table 3 p. 611, are also shown for reasons of comparison.
Method sMAPE(%) MASE CC MF
MLP 8.39 0.55 83.49 2.11
MLP (Ahmed et al.) 8.34
BNN 8.17 0.53 47.44 2.11
BNN (Ahmed et al.) 8.56
RBF 9.57 0.71 146.11 1.66
RBF (Ahmed et al.) 15.79
GRNN 9.49 0.67 388.73 1.80
GRNN (Ahmed et al.) 10.33
KNN 11.49 0.80 12.01 3.30
KNN (Ahmed et al.) 10.34
CART 10.28 0.74 8.89 1.74
CART (Ahmed et al.) 11.72
SVR 8.88 0.61 9.79 2.11
SVR (Ahmed et al.) 10.40
GP 9.14 0.62 29.39 2.09
GP (Ahmed et al.) 9.62
3.5.1. Iterative forecasting
The first forecast in this approach is found in exactly the same way as the one-step-ahead forecast described and used previously by Ahmed et al. and by our own study. It is possible, however, to obtain forecasts for more than one step ahead by using the first forecast produced by the model, instead of the actual value, to get a forecast for horizon two, then using the two forecasts to estimate the one for horizon three, and so on until predictions for all 18 horizons have been found (this is also the approach used by the ARIMA models in the M-Competitions). This means that we obtain 18 forecasts using exclusively the first n−18 data points. As the forecasting horizon increases, the new forecasts depend on the accuracy of the previous ones, meaning that longer-term forecasts may deteriorate. The advantage of this approach, however, is its simplicity and computational ease.
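A minimal R sketch of the iterative approach follows; one_step is a hypothetical one-step-ahead predictor (e.g. a wrapper around the trained MLP) that takes the last N observations, oldest first:

```r
iterative_forecast <- function(one_step, history, N = 5, h = 18) {
  fcs <- numeric(h)
  window <- tail(history, N)            # start from the last N actual observations
  for (i in 1:h) {
    fcs[i] <- one_step(window)          # one-step-ahead forecast
    window <- c(window[-1], fcs[i])     # slide: drop the oldest value, append the forecast
  }
  fcs
}
```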
3.5.2. Direct forecasting
The direct approach produces multiple-step-ahead forecasts, instead of one-step-ahead ones, by training and exploiting an 18-output-node neural network capable of producing 18 forecasts simultaneously, one for each forecasting horizon. The first output node is responsible for forecasting horizon h = 1, output node 2 for horizon h = 2, ending with output node 18 for horizon 18. It is interesting to see whether this multi-forecast approach will be more accurate than the iterative one, knowing well that the direct approach is more complex and computationally much more demanding, while fewer observations are available for training.
3.5.3. Multi-neural network forecasting
In this approach, single-output-node NNs are trained to produce the forecasts. Yet, instead of training a single NN for all horizons simultaneously, 18 separate NNs are trained, each one predicting a single h-step-ahead forecast. In this respect, if we wish to forecast the value of the time series one horizon ahead, we use the first NN, trained on the n−18 data; for two horizons, the second NN, again trained on the n−18 data; and so on, eighteen times in total (a small sketch follows).
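A minimal R sketch of the multi-NN approach is given below; train_nn is a hypothetical training routine (e.g. a wrapper around mlp or brnn) and the lag-embedding details are our own illustration:

```r
multi_nn_forecast <- function(train_nn, y, N = 5, h = 18) {
  n <- length(y)
  sapply(1:h, function(horizon) {
    # for this horizon: inputs Y[t-N+1..t], target Y[t + horizon]
    idx <- seq(N, n - horizon)
    X <- t(sapply(idx, function(t) y[(t - N + 1):t]))
    target <- y[idx + horizon]
    model <- train_nn(X, target)                     # one separate NN per horizon
    predict(model, matrix(tail(y, N), nrow = 1))     # its single h-step-ahead forecast
  })
}
```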
The three options for multi-horizon forecasts can be visualized in Figure 1.
Figure 1: The three possible multi-step-ahead forecasting approaches used by ML methods: (a) the iterative, (b) the direct and (c) the multi-neural network.
The results of each of the three approaches for predicting 18 months ahead are displayed in Tables 8 and 9, for both the ML methods and the eight statistical ones, using sMAPE and MASE. To simplify the presentation, the results are grouped into three forecasting horizons: short-term (1 to 6 months ahead), medium-term (7 to 12 months ahead) and, finally, long-term (13 to 18 months ahead), while the accuracies for all horizons can be found in Tables A1 and A2 in the Appendix at the end of this paper. Tables 8 and 9, as well as A1 and A2, allow us to evaluate the accuracy achieved by each method across multiple horizons and to decide about their appropriateness for various applications. We also note that for the case of the BNN, the direct approach was excluded, since it is not supported by the brnn function exploited for obtaining the forecasts in our study.
In brief, statistical models seem to generally outperform ML methods across all forecasting horizons, with Theta, Comb and ARIMA being the dominant ones among the competitors according to both error metrics examined. It is also notable that good forecasting accuracy comes with great efficiency, meaning that CC is not significantly increased for the best-performing methods. Moreover, the more complex approaches of extrapolation through ML methods, such as the Direct and Multi ones, display less accurate results, indicating that complex is not always better and that ML methods fail to learn how best to predict for each individual forecasting horizon.
Table 8: Forecasting performance of ML and Statistical methods across various horizons using sMAPE.
Method Short Medium Long Average CC
MLP Iterative 9.53 12.34 15.00 12.29 245.58
MLP Direct 10.72 13.55 16.20 13.49 438.53
MLP Multi 9.53 12.69 16.08 12.77 4006.82
BNN Iterative 9.39 12.08 14.80 12.09 141.91
BNN Multi 9.48 12.70 15.96 12.71 2046.49
Naive 2 10.78 12.46 15.08 12.77 1.48
SES 9.17 10.85 13.77 11.26 1.60
Holt 9.07 11.18 14.29 11.51 1.75
Damped 8.96 10.63 13.46 11.02 2.07
Comb 8.95 10.57 13.38 10.97 2.65
Theta 8.96 10.53 13.19 10.89 1.70
ARIMA 8.93 11.08 13.84 11.28 73.50
ETS 9.07 10.98 13.74 11.26 56.66
4. Accuracy of ML and Statistical forecasting methods
Figure 2 shows the overall sMAPE for all the statistical and ML methods included in
this paper as well as the ML accuracies reported by Ahmed and colleagues for performing
one-step-ahead forecasts. As seen, the six most accurate methods are statistical, confirming
their dominance over the ML ones. Even Naive 2 (a seasonal Random Walk (RW) bench-
mark) is more accurate than half of the ML methods. The most interesting question and
biggest challenge is to find out the reasons for their poor performance with the objective of
improving their accuracy and exploiting their huge potential. AI learning algorithms have
Table 9: Forecasting performance of ML and Statistical methods across various horizons using MASE.
Method Short Medium Long Average CC
MLP Iterative 0.66 0.98 1.24 0.96 245.58
MLP Direct 0.76 1.10 1.38 1.08 438.53
MLP Multi 0.65 1.02 1.37 1.01 4006.82
BNN Iterative 0.64 0.94 1.20 0.93 141.91
BNN Multi 0.65 1.02 1.35 1.01 2046.49
Naive 2 0.76 1.05 1.35 1.05 1.48
SES 0.67 0.96 1.29 0.97 1.60
Holt 0.64 0.92 1.25 0.94 1.75
Damped 0.64 0.91 1.21 0.92 2.07
Comb 0.64 0.90 1.20 0.91 2.65
Theta 0.64 0.89 1.17 0.90 1.70
ARIMA 0.61 0.89 1.17 0.89 73.50
ETS 0.64 0.92 1.21 0.92 56.66
revolutionized a wide range of applications in diverse fields and there is no reason that the
same thing cannot be achieved with the ML methods in forecasting.
Figure 2: Forecasting performance of Statistical and ML methods according to sMAPE: One-step-ahead
forecasts.
ML models are nonlinear functions connecting the inputs and outputs of neurons. The
goal of the network is to learn by solving an optimization problem in order to choose a
set of parameters, or weights, that minimizes an error function, typically the sum of squared errors. However, the same type of optimization is done in ARIMA (or regression) models
except that in the latter case the functional form used is linear. There is no obvious reason, therefore, to justify the more than 1.15% higher sMAPE of MLP, the best ML method, in comparison to that of ARIMA, or the fact that this MLP is only 0.24% more accurate than Naive 2, the seasonally adjusted random walk model. Clearly, if there was any form of learning, the accuracy of ML methods should have exceeded that of the ARIMA models and greatly outperformed Naive 2. Thus, it is imperative to investigate the reasons why this is not happening, e.g. by comparing the accuracy of ML and statistical methods series by series, explaining the differences observed and identifying the reasons involved.
The more serious issue, simply put, is how ML methods can be made to learn about the unknown future rather than about how well a model fits past data. For this to be done, the ML methods must have access to information about the future, and their objective must be to minimize future errors rather than those of fitting a model to the available data. Until more advanced ML methods become available, and in order to simplify things, we suggest that the data be deseasonalized before an ML model is utilized, as research (Sharda and Patil, 1992) has shown small or no differences between the post-sample accuracy of models applied to original and seasonally adjusted data.
A practical way to allow learning about the unknown future errors is by dividing the
n−18 data into two parts, with the first one containing 1/3 of the n−18 data and the
second the remaining 2/3. If the data is first deseasonalized, a much simpler model can be
developed using the first (n−18)/3 data and then trained to learn how to best predict the
next 18 observations. Then, the first (n−18)/3 + 1 data can be used to let the method learn
how to best predict the next 18 observations and continue using the first (n−18)/3 + 2,
the first (n−18)/3 + 3 and so on, until all the available observations have been used. Clearly, such a sliding simulation, attempting to predict future values based on post-sample accuracy optimization, will probably be a step in the right direction, even though its performance needs to be empirically tested (a small sketch follows).
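A minimal R sketch of this sliding simulation is given below; forecast_fn is a hypothetical forecasting routine, and the value returned is simply the mean absolute post-sample error that the learning process would aim to minimize:

```r
sliding_errors <- function(forecast_fn, y_train, h = 18) {
  start <- floor(length(y_train) / 3)        # begin with the first third of the data
  errs <- c()
  for (origin in start:(length(y_train) - h)) {
    fc <- forecast_fn(y_train[1:origin], h)  # forecast the next 18 points
    actual <- y_train[(origin + 1):(origin + h)]
    errs <- c(errs, abs(actual - fc))        # accumulate post-sample absolute errors
  }
  mean(errs)                                 # the criterion to minimize while learning
}
```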
Another possible idea is to provide ML methods with alternative forecasts (e.g. the ones produced by the best statistical methods) and ask them to learn to select the most accurate one (or a combination) for each forecasting horizon and series, in such a way as to minimize the post-sample errors. This may require clustering the data into various categories (micro, macro, demographic etc.) or types of series (seasonal/non-seasonal, trended/non-trended, of high, medium or low randomness etc.) and developing different models for each category/type. In Table 6 of Ahmed et al. (2010), for instance, accuracy varies significantly depending on the category of the series, with the best results in demographic and macro data, the worst in micro and industry time series, and finance in between. This may indicate that ML methods under-perform, among other reasons, because they are confused when attempting to optimize over heterogeneous data patterns.
An additional concern could be the extent of randomness in the series and the ability of ML models to distinguish the patterns from the noise in the data, avoiding over-fitting. This can be a challenging problem since, in contrast to linear statistical methods, where over-fitting can be directly controlled by some information criterion (e.g. the AIC) taking into account the number of parameters utilized, ML methods are nonlinear and training is performed dynamically, meaning that different forecasts may arise according, e.g., to the maximum number of iterations considered, even if the complexity of the network's architecture is identical. Since the importance of possible over-fitting by ML methods is critical, the topic is covered in detail in section 4.1 below.
A final concern with ML methods could be the need for preprocessing, which requires individual attention to select the most appropriate transformation, possible deseasonalization, as well as trend removal. Effective ML methods must, however, be able to learn and decide on their own the most appropriate preprocessing, as there are only a few possibilities available. If, for example, the Box-Cox criterion can be used to determine the most appropriate transformation for statistical methods, it makes no sense that something similar cannot be applied by ML methods to automate preprocessing, simplify the modeling process and probably improve accuracy by doing so.
4.1. Over-fitting
Tables 4, 5 and 6 report, among other things, the goodness of fit, indicating how well the trained model fitted the n−18 observations available for each series. Yet, model fit is not a good predictor of post-sample forecasting accuracy, meaning that methods with low fitting errors might result in higher post-sample ones and vice versa. One would expect, for instance, that the MLP method, displaying a model fitting error of 2.11%, would forecast more accurately than ARIMA, whose corresponding error is higher (2.59%). However, this is not the case, as the post-sample sMAPEs of the two methods are 8.39% and 7.19%, respectively. A possible reason for the improved accuracy of the ARIMA models is that their parameterization is done through the minimization of the AIC criterion (Sakamoto et al., 1986), which avoids over-fitting by considering both goodness of fit and model complexity. In contrast, the MLP method specifies its complexity (input nodes) through cross-validation, but no additional criteria are applied for mitigating over-fitting, e.g. by specifying when training should stop. The maximum number of iterations serves that purpose, yet there is no global optimum: in some time series over-fitting might occur after a few iterations, while in others only after many hundreds.
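One simple way of approximating such a stopping criterion, sketched below under our own assumptions (train_fn is a hypothetical training routine and the 80/20 split is illustrative), is to select the number of iterations on a validation split rather than on the in-sample fit:

```r
select_maxit <- function(train_fn, X, y, candidates = c(50, 100, 200, 500)) {
  n <- nrow(X)
  tr <- 1:floor(0.8 * n)                     # first 80% of the data for training
  va <- (floor(0.8 * n) + 1):n               # last 20% held out for validation
  val_err <- sapply(candidates, function(it) {
    model <- train_fn(X[tr, , drop = FALSE], y[tr], maxit = it)
    mean(abs(y[va] - predict(model, X[va, , drop = FALSE])))
  })
  candidates[which.min(val_err)]             # iteration count with lowest validation error
}
```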
Figure 3 shows the sMAPE (horizontal axis) and the accuracy of the model fit (vertical axis). It is clear from this Figure that the old belief that minimizing the model fit errors would guarantee more accurate post-sample predictions does not hold, and that some criterion similar to the AIC would be required to indicate to ML methods when to stop the optimization process and avoid treating part of the noise in the data as pattern. In our view, considerable improvements can result from such an action.
Figure 3: Forecasting accuracy (sMAPE) versus Model Fit.
There are several possible, additional improvements that need to be tested empirically
with the purpose of improving the forecasting accuracy of ML methods and exploiting their
great potential to learn and improve their performance. Even a few years ago it would have seemed impossible that the AlphaGo algorithm would be able to learn to play Go by itself and consequently beat the world champion. In our view, there is no doubt that similar improvements can be introduced to ML forecasting algorithms, making them capable of achieving breakthroughs through a learning process aimed at minimizing future rather than past errors.
4.2. Computational complexity
As forecasting methods are used in various applications, the computational time required
to obtain the forecasts becomes critical. It would be impractical for example to utilize the
ML GRNN method (the most computationally demanding) to predict hundreds of thousands
of inventory items, even though computers are getting faster and cheaper. For this reason, the information provided in Figure 4 is of high value, as it demonstrates the low computational requirements of all statistical methods, which lie in the lower-left part of the Figure, and additionally shows that they achieve superior accuracy compared with the more time-intensive methods. In particular, the five inside the square box (Damped, Comb, Theta, SES and Holt) are not only the most accurate ones (apart from ETS) but also the least computationally demanding.
For practical reasons, if ML methods are to be applied by businesses and non-profit organizations, their computational requirements must be reduced considerably. This can be done by deseasonalizing the data first, utilizing simpler models, limiting the number of training iterations, or choosing the initial values of the weights not in a completely arbitrary manner but through some guided search that would provide values not too far from the optimal ones. Alternatively, the speed of moving towards the optimum can be increased in order to reduce the computational time needed to reach it. These improvements would require testing to determine the trade-offs between the lower accuracy resulting from the reduction in computational time and the savings from such a reduction.
5. Conclusions: The state of the art and future directions in forecasting
The field of quantitative forecasting has progressed a great deal since the early days when Brown (1959) used exponential smoothing, in the late 1940s, for predicting the inventory demand for many thousands of items in navy shipyards. The introduction of the Box-Jenkins
Figure 4: Forecasting accuracy (sMAPE) versus Computational Complexity.
methodology to ARIMA models (Box and Jenkins,1970) brought academic respectability
to a field dominated by practitioners until then, while the extensive usage of regression and
econometric models (Pearl,2000) further enlarged the field. Finally, multivariate GARCH
models were also made available (Bauwens et al.,2006;Laurent et al.,2012) broadening the
coverage of the field (for an excellent survey of the latest developments see Special Issue on
”Simple Versus Complex Forecasting” (Green and Armstrong,2015)).
A major innovation that has distinguished forecasting from other fields has been the
good number of empirical studies aimed at both the academic community and the practitioners interested in utilizing the most accurate methods for their various applications
and reducing costs or maximizing benefits by doing so. These studies contributed to establishing two major changes in attitudes towards forecasting. First, it was established that methods, or models, that best fitted the available data did not necessarily result in more accurate predictions (a common belief until then). Second, the post-sample predictions of simple statistical methods were found to be at least as accurate as those of sophisticated ones. This finding was furiously disputed by theoretical statisticians (Makridakis et al., 1979), who claimed that a simple method, being a special case of, e.g., ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite. These two findings have fundamentally changed the field of forecasting and are also evident in this paper, both in Figure 3, showing post-sample versus in-sample accuracy, and in Figure 2, displaying the accuracy levels of the various statistical and ML methods, with the latter being much more sophisticated and computationally demanding than the former.
Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view, as complex methods require a great deal of academic expertise and ample computer time to be applied. At the same time, understanding the reasons for their underperformance is the only way to improve them. This has certainly been the case with ARIMA models, whose accuracy with monthly data (not the same as those used in this study) was 17.9% in the 1982 M-Competition and decreased to 11.28% in the present study, tying the accuracy of damped exponential smoothing, one of the most accurate methods of the M-Competitions. ARIMA's improved performance is mainly due to the utilization of the AIC criterion and other optimization processes which enable effective automatic model selection and parameterization.
Will ML theorists working on forecasting applications accept that their methods are considerably less accurate than statistical ones and do something to improve them? For instance, the only thing exponential smoothing methods do is smooth the most recent errors exponentially and then extrapolate the latest pattern in order to forecast. Given their ability to learn, it is imperative that ML methods beat simple benchmarks like the one just mentioned. As psychologists believe, accepting a problem is the first step in devising workable solutions, and we hope that those in the field of AI and ML will accept the empirical findings and, instead of finding justifications for the poor performance of their methods, will work to improve their forecasting accuracy.
A notable problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with alternatives (e.g. traditional statistical methods) or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case. In our view, this situation is the same as what was happening in the statistical literature in the late 1970s and 1980s. At that time, it was thought that forecasting methods were of superior accuracy simply because of their sophistication and mathematical elegance. Now it is obvious that their value must be proven empirically, in an objective, indisputable manner, through large-scale competitions. Thus, when it comes to papers proposing new ML methods or more effective ways to use them, academic journals must demand comparisons with statistical methods, or at least benchmarks, and require that the data of the articles being published be made available to those who want to replicate the results. In our experience this is not the case at present, making replications practically impossible and allowing conclusions that may not hold.
In addition to empirical testing, research work is needed to help users understand how the forecasts of ML methods are generated (this is the same problem with all AI models whose output cannot be explained), as getting numbers from a black box is not acceptable, in particular to business users and practitioners who need to know how forecasts arise and how they can be influenced or adjusted.
A final, equally important concern is that, in addition to point forecasts, ML methods must also be capable of specifying the uncertainty around them, or alternatively of providing confidence intervals. At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled, as estimating the uncertainty of future predictions is as important as the forecasts themselves. To overcome this issue, many researchers propose simulating the intervals by iteratively generating multiple future sample paths. Yet, even in that case, the forecast distribution of the methods is derived empirically rather than analytically, raising a lot of doubts about its quality.
In summary, ML methods have a long way to go to become more accurate, less time
consuming and less of a black box. The major contribution of this paper is in showing that
traditional statistical methods are considerably more accurate than ML ones, in pointing
out the urgent need to discover the reasons involved, and in devising ways to reverse
the situation. At this point, the following suggestions (speculations that must be verified
empirically) can be made about the way forward for ML methods:
• Getting more information about the unknown future values of the data, rather than
their past ones, and basing the optimization/learning on such future values as much as
possible.
• Deseasonalizing the data before applying ML methods. This will result in a simpler
series, reducing the computational time required to arrive at optimal weights and,
therefore, allowing the method to learn faster (see the sketch after this list).
• Using a sliding-simulation approach to gain as much information as possible about
future values and the resulting uncertainty, and to learn more effectively how to
minimize forecast errors.
• Clustering the series into various homogeneous categories and/or types of data and
developing ML methods that optimally extrapolate each of them.
• Avoiding over-fitting, as it is not clear whether ML models can correctly distinguish
the noise from the pattern of the data.
• Automating preprocessing and avoiding the extra decisions required on the part of
the user.
• Allowing the estimation of uncertainty for the point forecasts and providing informa-
tion for the construction of confidence intervals around such forecasts.
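As an illustration of the deseasonalization suggestion, the following sketch removes the seasonal component with a classical multiplicative decomposition, trains an MLP (via the RSNNS package used in this study) on the adjusted series, and reseasonalizes the one-step-ahead forecast; the number of lags, network size and training settings are arbitrary assumptions rather than the configuration of our experiments:

    # Hedged sketch: deseasonalize, train an MLP on the adjusted series, then
    # reseasonalize the forecast. Lags, scaling, network size and iterations
    # are illustrative assumptions.
    library(RSNNS)

    y   <- as.numeric(AirPassengers)
    s   <- as.numeric(decompose(ts(y, frequency = 12),
                                type = "multiplicative")$seasonal)
    adj <- y / s                                 # seasonally adjusted series

    rng <- range(adj)
    nrm <- (adj - rng[1]) / (rng[2] - rng[1])    # scale to [0,1] for the logistic MLP

    X   <- embed(nrm, 4)                         # target in column 1, 3 lags after it
    fit <- mlp(X[, 2:4], X[, 1], size = 5, maxit = 500)

    # One-step-ahead forecast: unscale, then multiply back the seasonal index
    # of the same calendar month one year earlier (period n + 1 - 12).
    n  <- length(nrm)
    p  <- predict(fit, matrix(nrm[n:(n - 2)], nrow = 1))
    f1 <- (p * (rng[2] - rng[1]) + rng[1]) * s[n - 11]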
Although the conclusion of our paper, that the forecasting accuracy of ML models is
inferior to that of statistical methods, may seem disappointing, we are extremely positive
about the great potential of ML methods for forecasting applications. Clearly a lot of work is
needed to improve such methods, but the same has been the case with all AI techniques and
even advanced forecasting methods, which have improved considerably over time. Who could
have believed even ten years ago that we would have AVs, personal assistants on our mobile
phones that understand and speak natural languages, automatic translation in Skype,
or AlphaGo beating the world GO champion? There is no reason that the same type of
breakthrough could not be achieved in ML methods applied to forecasting. Nevertheless,
we must realize that applying AI to forecasting is quite different from doing so in games
or in image and speech recognition, and may require different, specialized algorithms to be
successful. In contrast to other applications, the future is never identical to the past, and
the training of AI methods cannot depend exclusively on it.
Table 10: Features of Various Artificial Intelligence (AI) Applications.

Type of Application | Rules are known and do not change | The environment is known and stable | Predictions can influence the future | Extent of Uncertainty (or amount of noise) | Examples
Games | Yes | Yes | No | None | Chess, GO
Image and speech recognition | Yes | Yes | No | Minimal (can be minimized by big data) | Face Recognition, Siri, Cortana, Google AI
Predictions based on the Law of Large Numbers | Yes | Yes | Minimally | Measurable (normally distributed) | Forecasting the sales of beer, coffee, soft drinks, weather, etc.
Autonomous Functions | Yes | Yes | No | Can be assessed and minimized | Self-Driving Vehicles
Strategy, Competition, Investments | No | No | Yes, often to a great extent | Cannot be measured (fat tails) | Decisions, Anticipations, Forecasts
Combinations of the above | It may be the ultimate challenge, moving towards GAI (General AI), but also increasing the level of complexity and sophistication of algorithms (spanning the four middle columns) | Eventually it can cover everything
Table 10 is our attempt to show that not all applications can be modeled equally well
using AI algorithms. Games are the easiest, as the rules are known and do not change, the
environment is also known and stable, the predictions cannot influence the future, and there
is no uncertainty. The exact opposite holds for forecasting applications, where not
only are the rules unknown but they can also change, there are structural instabilities in the
data, and there is plenty of uncertainty (noise) that can confuse the search for the optimal
weights. Moreover, in certain applications, the forecasts themselves can influence, or even
change, the future, creating self-fulfilling or self-defeating prophecies and expanding the
level of noise and uncertainty. It may be necessary, therefore, to adapt the
algorithms to these conditions and to make sure that there is no over-fitting. Judging from the
results of this paper, there is little doubt that ML algorithms applied to forecasting will
require considerable research, experimentation with innovative ideas, and significant adjustments
in order to produce more accurate predictions.
Appendices: Detailed results
These are the analytical results of the forecasting models used in the present study. The
accuracy is evaluated per forecasting horizon, first according to sMAPE and then according to MASE.
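For reference, the two measures are commonly defined as follows (we state the standard formulations, which we assume match the definitions given in the main text; $h$ is the forecasting horizon, $n$ the number of in-sample observations and $m = 12$ the seasonal period for monthly data):

$$\text{sMAPE} = \frac{200}{h}\sum_{t=1}^{h}\frac{|y_t-\hat{y}_t|}{|y_t|+|\hat{y}_t|}, \qquad \text{MASE} = \frac{\frac{1}{h}\sum_{t=1}^{h}|y_t-\hat{y}_t|}{\frac{1}{n-m}\sum_{t=m+1}^{n}|y_t-y_{t-m}|}$$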
Table A2: The sMAPE for each of the 18 forecasting horizons and their overall average.
Model 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Mean
MLP Iterative 7.98 8.62 9.47 9.81 10.75 10.56 12.07 11.52 12.03 12.86 12.35 13.23 13.44 13.65 14.49 15.53 15.71 17.20 12.29
MLP Direct 8.92 9.88 10.82 11.07 11.82 11.83 13.18 12.81 13.28 14.10 13.59 14.37 14.51 14.93 16.06 16.73 16.76 18.21 13.49
MLP Multi 7.87 8.33 9.39 9.77 10.88 10.92 12.30 11.65 12.25 13.34 12.86 13.77 14.02 14.52 16.01 16.69 16.83 18.38 12.77
BNN Iterative 7.92 8.37 9.46 9.67 10.54 10.39 11.86 11.28 11.73 12.53 12.04 13.03 13.28 13.54 14.33 15.28 15.55 16.82 12.09
BNN Multi 7.95 8.19 9.34 9.75 10.74 10.90 12.36 11.63 12.38 13.20 12.86 13.79 13.93 14.64 15.77 16.46 16.68 18.25 12.71
Naïve2 10.75 9.07 11.50 11.92 10.38 11.07 11.31 12.59 13.24 12.28 12.98 12.38 12.89 14.02 15.87 16.01 15.18 16.49 12.77
SES 8.57 7.58 9.43 9.83 9.62 9.98 10.54 10.34 11.24 10.80 10.87 11.29 11.29 12.78 14.12 14.92 14.06 15.46 11.26
Holt 7.97 7.75 8.92 9.51 10.16 10.08 11.11 10.45 11.33 11.67 11.02 11.53 12.34 12.72 14.14 15.21 15.21 16.12 11.51
Damped 8.14 7.44 9.00 9.52 9.72 9.97 10.60 10.10 10.83 10.72 10.62 10.93 11.27 12.25 13.67 14.46 13.93 15.19 11.02
Comb 8.18 7.50 9.00 9.50 9.70 9.84 10.54 10.00 10.87 10.75 10.47 10.82 11.18 12.15 13.53 14.51 13.96 14.99 10.97
Theta 8.20 7.59 9.02 9.44 9.68 9.82 10.54 9.98 10.73 10.77 10.25 10.91 11.06 12.04 13.21 14.14 13.81 14.86 10.89
ARIMA 7.78 7.56 9.00 9.31 9.84 10.11 10.88 10.62 11.27 11.40 10.88 11.41 11.85 12.38 13.71 14.34 14.72 16.03 11.28
ETS 8.06 7.57 9.17 9.60 9.94 10.08 10.91 10.66 11.09 11.21 10.82 11.16 11.80 12.37 13.83 14.65 14.35 15.45 11.26
Table A3: The MASE for each of the 18 forecasting horizons and their overall average.
Model 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Mean
MLP Iterative 0.51 0.54 0.61 0.72 0.76 0.81 0.90 0.91 0.97 1.05 1.02 1.05 1.10 1.14 1.20 1.29 1.28 1.43 0.96
MLP Direct 0.61 0.65 0.72 0.82 0.85 0.91 1.01 1.03 1.08 1.18 1.14 1.18 1.23 1.29 1.35 1.42 1.42 1.56 1.08
MLP Multi 0.50 0.53 0.61 0.71 0.76 0.81 0.93 0.91 0.98 1.11 1.07 1.12 1.18 1.24 1.33 1.42 1.43 1.62 1.01
BNN Iterative 0.50 0.53 0.60 0.70 0.73 0.78 0.87 0.88 0.93 1.00 0.97 1.01 1.06 1.09 1.15 1.24 1.30 1.37 0.93
BNN Multi 0.50 0.52 0.60 0.71 0.75 0.81 0.91 0.93 1.00 1.09 1.08 1.11 1.19 1.26 1.32 1.40 1.41 1.55 1.01
Naïve2 0.66 0.59 0.72 0.87 0.82 0.91 0.96 1.03 1.07 1.11 1.08 1.04 1.16 1.19 1.36 1.45 1.42 1.55 1.05
SES 0.53 0.51 0.62 0.76 0.76 0.84 0.90 0.92 0.96 1.02 0.96 1.01 1.08 1.13 1.27 1.39 1.37 1.50 0.97
Holt 0.51 0.50 0.58 0.72 0.74 0.80 0.86 0.88 0.92 1.00 0.92 0.95 1.06 1.07 1.21 1.33 1.34 1.47 0.94
Damped 0.51 0.49 0.59 0.73 0.73 0.81 0.86 0.87 0.90 0.97 0.91 0.93 1.02 1.04 1.18 1.30 1.30 1.42 0.92
Comb 0.51 0.49 0.59 0.72 0.73 0.80 0.85 0.86 0.90 0.96 0.89 0.92 1.01 1.03 1.17 1.29 1.28 1.40 0.91
Theta 0.51 0.50 0.59 0.72 0.72 0.80 0.85 0.85 0.88 0.95 0.87 0.91 0.99 1.02 1.14 1.25 1.24 1.36 0.90
ARIMA 0.48 0.48 0.56 0.68 0.70 0.78 0.84 0.85 0.90 0.96 0.89 0.92 0.99 1.03 1.14 1.23 1.25 1.40 0.89
ETS 0.51 0.49 0.60 0.73 0.74 0.81 0.87 0.90 0.92 0.98 0.91 0.95 1.03 1.04 1.19 1.29 1.30 1.42 0.92
References
Adya, M., Collopy, F.. How effective are neural networks at forecasting and prediction? A review and
evaluation. Journal of Forecasting 1998;17(5-6):481–495.
Ahmed, N.K., Atiya, A.F., Gayar, N.E., El-Shishiny, H.. An Empirical Comparison of Machine Learning
Models for Time Series Forecasting. Econometric Reviews 2010;29(5-6):594–621.
Alpaydin, E.. Introduction to Machine Learning. The MIT Press, 2004.
Andrawis, R.R., Atiya, A.F., El-Shishiny, H.. Forecast combinations of computational intelligence and
linear models for the NN5 time series forecasting competition. International Journal of Forecasting
2011;27(3):672–688.
Assimakopoulos, V., Nikolopoulos, K.. The theta model: a decomposition approach to forecasting. Inter-
national Journal of Forecasting 2000;16(4):521–530.
Bauwens, L., Laurent, S., Rombouts, J.V.K.. Multivariate garch models: a survey. Journal of Applied
Econometrics 2006;21(1):79–109.
Bergmeir, C., Benitez, J.. Neural Networks in R Using the Stuttgart Neural Network Simulator: RSNNS.
Journal of Statistical Software 2012;46(7):1–26.
Box, G., Jenkins, G.. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day, 1970.
Box, G.E.P., Cox, D.R.. An Analysis of Transformations. Journal of the Royal Statistical Society Series
B (Methodological) 1964;26(2):211–252.
Breiman, L.. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall, 1993.
Brown, R.. Statistical forecasting for inventory control. New York: McGraw-Hill, 1959.
Chatfield, C.. Neural networks: Forecasting breakthrough or passing fad? International Journal of
Forecasting 1993;9(1):1–3.
Connor, J.T., Martin, R.D., Atlas, L.E.. Recurrent neural networks and robust time series prediction.
IEEE Transactions on Neural Networks 1994;5(2):240–254.
Cottrell, M., Girard, B., Girard, Y., Mangeas, M., Muller, C.. Neural Modeling for Time Series: A Sta-
tistical Stepwise Method for Weight Elimination. IEEE Transactions on Neural Networks 1995;6(6):1355–
1364.
Cox, D.R., Stuart, A.. Some Quick Sign Tests for Trend in Location and Dispersion. Biometrika 1955;42(1-
2):80–95.
Crone, S.F., Hibon, M., Nikolopoulos, K.. Advances in forecasting with neural networks? Empiri-
cal evidence from the NN3 competition on time series prediction. International Journal of Forecasting
2011;27(3):635–660.
Dan Foresee, F., Hagan, M.T.. Gauss-Newton approximation to Bayesian learning. In: IEEE International
Conference on Neural Networks - Conference Proceedings. volume 3; 1997. p. 1930–1935.
Gabor, M.R., Dorgo, L.A.. Neural Networks Versus Box-Jenkins Method for Turnover Forecasting: a Case
Study on the Romanian Organisation. Transformations in Business and Economics 2017;16(1):187–211.
Gardner, E.S.. Exponential smoothing: the state of the art. Journal of Forecasting 1985;4(1):1–28.
Gardner, E.S.. Exponential smoothing: The state of the art-Part II. International Journal of Forecasting
2006;22(4):637–666.
Goodwin, P., Lawton, R.. On the asymmetry of the symmetric MAPE. International Journal of Forecasting
1999;15(4):405–408.
Gorr, W.. Research prospective on neural network forecasting. International Journal of Forecasting
1994;10(1):1–4.
Green, K.C., Armstrong, J.S.. Simple versus complex forecasting: The evidence. Journal of Business
Research 2015;68(8):1678 – 1685. Special Issue on Simple Versus Complex Forecasting.
Hamid, S.A., Habib, A.. Financial forecasting with neural networks. Academy of Accounting and Financial
Studies Journal 2014;18(4):37–55.
Hansen, L.K., Salamon, P.. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and
Machine Intelligence 1990;12(10):993–1001.
Hastie, T., Tibshirani, R., Friedman, J.. The elements of statistical learning: Data mining, Inference, and
Prediction, Second Edition. Springer New York, 2009.
Hyndman, R., Khandakar, Y.. Automatic time series forecasting: the forecast package for R. Journal of
Statistical Software 2008;26(3):1 – 22.
Hyndman, R.J., Koehler, A.B.. Another look at measures of forecast accuracy. International Journal of
Forecasting 2006;22(4):679–688.
Hyndman, R.J., Koehler, A.B., Snyder, R.D., Grose, S.. A state space framework for automatic forecasting
using exponential smoothing methods. International Journal of Forecasting 2002;18(3):439–454.
Ilies, I., Jaeger, H., Kosuchinas, O., Rincon, M., Vaknas, V., Vaskevicius, N.. Stepping forward through
echoes of the past: Forecasting with Echo State Networks, Technical Report: Jacobs University Bremen.
2007.
Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.. kernlab – an S4 package for kernel methods in R.
Journal of Statistical Software 2004;11(9):1–20.
Kock, A.B., Teräsvirta, T.. Forecasting macroeconomic variables using neural network models and three
automated model selection techniques. Econometric Reviews 2016;35(8-10):1753–1779.
Kourentzes, N., Barrow, D.K., Crone, S.F.. Neural network ensemble operators for time series forecasting.
Expert Systems with Applications 2014;41(9):4235–4244.
Laurent, S., Rombouts, J.V.K., Violante, F.. On the forecasting accuracy of multivariate garch models.
Journal of Applied Econometrics 2012;27(6):934–955.
Lippmann, R.P.. An Introduction to Computing with Neural Nets. IEEE ASSP Magazine 1987;4(2):4–22.
MacKay, D.J.C.. Bayesian Interpolation. Neural Computation 1992;4(3):415–447.
Makridakis, S.. The forthcoming Artificial Intelligence (AI) revolution: Its impact on society and firms.
Futures 2017; Article in Press.
Makridakis, S., Hibon, M.. The M3-Competition: results, conclusions and implications. International
Journal of Forecasting 2000;16(4):451–476.
Makridakis, S., Hibon, M., Moser, C.. Accuracy of forecasting: An empirical investigation. Journal of the
Royal Statistical Society Series A (General) 1979;142(2):97–145.
Makridakis, S.G., Wheelwright, S.C., Hyndman, R.J.. Forecasting: Methods and applications (Third
Edition). New York: Wiley, 1998.
Marr, B.. The Top 10 AI And Machine Learning Use Cases Everyone Should Know About, Forbes. 2016.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F.. e1071: Misc Functions of the
Department of Statistics, Probability Theory Group, TU Wien; 2017.
Møller, M.. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks
1993;6:525–533.
Nelson, C.R., Plosser, C.R.. Trends and random walks in macroeconomic time series: Some evidence and
implications. Journal of Monetary Economics 1982;10(2):139–162.
Nelson, M., Hill, T., Remus, B., O'Connor, M.. Can neural networks applied to time series forecasting
learn seasonal patterns: an empirical investigation. In: Proceedings of the Twenty-Seventh Hawaii
International Conference on System Sciences 1994;3:649–655.
Nguyen, D., Widrow, B.. Improving the learning speed of 2-layer neural networks by choosing initial values
of the adaptive weights. IJCNN Int Joint Conf Neural Networks 1990;13:C21.
Chasset, P.-O.. GRNN: General regression neural network for the statistical software R. Independent
scientist; Nancy, France; 2013.
Pearl, J.. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press, 2000.
Qiu, M., Song, Y.. Predicting the direction of stock market index movement using an optimized artificial
neural network model. PLOS ONE 2016;11(5):1–11.
Rasmussen, C.E., Williams, C.. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Rodriguez, P.P., Gianola, D.. brnn: Bayesian Regularization for Feed-Forward Neural Networks; 2016.
Sakamoto, Y., Ishiguro, M., Kitagawa, G.. Akaike Information Criterion Statistics. D. Reidel Publishing
Company, 1986.
Schölkopf, B., Smola, A.J.. Learning with Kernels: Support Vector Machines, Regularization, Optimization,
and Beyond. The MIT Press, 2001.
Sharda, R., Patil, R.B.. Connectionist approach to time series prediction: An empirical test. Journal of
Intelligent Manufacturing 1992;3(1):317–323.
Specht, D.F.. A general regression neural network. IEEE Transactions on Neural Networks 1991;2(6):568–
576.
Therneau, T., Atkinson, B., Ripley, B.. rpart: Recursive Partitioning and Regression Trees; 2015.
Venables, W.N., Ripley, B.D.. Modern Applied Statistics with S. 4th ed. New York: Springer, 2002.
Wang, J., Wang, J.. Forecasting stochastic neural network based on financial empirical mode decomposition.
Neural Networks 2017;90:8–20.
Zhang, G., Eddy Patuwo, B., Hu, M.Y.. Forecasting with artificial neural networks: The state of the
art. International Journal of Forecasting 1998;14(1):35–62.
Zhang, G.P., Qi, M.. Neural network forecasting for seasonal and trend time series. European Journal of
Operational Research 2005;160(2):501–514.
Zhao, L.. Neural Networks In Business Time Series Forecasting: Benefits And Problems. Review of Business
Information Systems (RBIS) 2009;13(3):57–62.