The M5 Accuracy competition: Results, findings and conclusions
Spyros Makridakis a, Evangelos Spiliotis b,∗, Vassilios Assimakopoulos b
a Institute For the Future, University of Nicosia, Cyprus
b Forecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, Greece
Abstract
This paper describes the M5 Accuracy competition, the first of two parallel challenges of the latest M com-
petition whose objective is to advance the theory and practice of forecasting. The M5 Accuracy competition
focused on a retail sales forecasting application and extended the results of the previous four competitions
by: (a) significantly expanding the number of participating methods, especially those in the category of Ma-
chine Learning, (b) including exogenous/explanatory variables in addition to the time series data, (c) using
grouped, correlated time series, and (d) focusing on series that display intermittency. The paper presents
the background, design, and implementation of the competition, its results, the top-performing methods,
as well as its major findings and conclusions. It also discusses the implications of its findings and suggests
directions for future research.
Keywords: Forecasting Competitions, M Competitions, Forecasting Accuracy, Time Series, Machine
Learning, Retail Sales Forecasting
∗Corresponding author
Email address: spiliotis@fsu.gr (Evangelos Spiliotis)
Preprint submitted to International Journal of Forecasting October 6, 2020
1. Introduction
The aim of forecasting competitions is to empirically evaluate the accuracy of existing and new forecasting
methods, providing the equivalent of the experimentation widely used in the hard sciences (Hyndman, 2020). Among
these, the M competitions (Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000; Makridakis et al.,
2020e) are probably the most influential and widely cited in the field of forecasting, the most recent being
the M5 Accuracy competition, which is described in detail in this paper. The competition’s objective was
to produce the most accurate point forecasts for 42,840 time series that represent the hierarchical unit sales
of the largest retail company in the world, Walmart. For more information regarding the second challenge
of the competition, which focuses on estimating the uncertainty distribution of the realized values of the
same series, see Makridakis et al. (2020f).
Following the various comments and discussions made about the previous M4 competition (Fry &
Brundage, 2020; Önkal, 2020; Gilliland, 2020; Fildes, 2020; Hyndman, 2020; Makridakis et al., 2020d),
the M5 was designed and conducted with the aim of addressing the majority of concerns raised while ex-
tending its achievements into several directions. First, the competition was hosted by Kaggle1, a large online
community of data scientists and Machine Learning (ML) practitioners, who compete and provide solutions
for various tasks, including forecasting (Bojer & Meldgaard, 2020). As such, the number of participating
teams² increased significantly. These teams also focused on state-of-the-art methods that can be classified
as ML or “unstructured” (Januschowski et al., 2020; Barker, 2020). Second, in contrast to the previous com-
petitions, besides time series data, teams were provided with exogenous/explanatory variables that could
be used to improve forecasting accuracy. Third, instead of forecasting mostly unrelated series (Spiliotis
et al., 2020a), M5 consisted of grouped, highly correlated series, organized in a hierarchical, cross-sectional
structure, thus representing the forecasting set-up of a typical retail company. Finally, for the first time
ever, the competition involved series that display intermittency, i.e., sporadic demand including many zeros
(Syntetos & Boylan, 2005; Syntetos et al., 2005). Such series, although difficult to predict with
conventional forecasting methods like the ones utilized in the previous M competitions, are typical when
forecasting unit retail sales at a store or product level. Therefore, identifying accurate methods to predict
such series can be highly beneficial for numerous companies (Ghobbar & Friend, 2003; Syntetos et al., 2010;
Pooya et al., 2019).
In addition, the M5 Accuracy competition continued the major innovation of the M4 competition, which
was to predict/hypothesize its findings prior to its completion (Makridakis et al., 2020b). Rather than
rationalizing its results through post-hoc reasoning, our ten predictions/hypotheses, submitted to the Editor-in-Chief of
the IJF five days before launching the competition, documented our expectations of its findings. We have now
¹ https://www.kaggle.com/c/m5-forecasting-accuracy/overview
² A team may consist of multiple members or a single participant.
evaluated our predictions/hypotheses in a separate paper (Makridakis et al., 2020c) and are pleased to find
that we were correct in the majority of those that referred to point forecast accuracy.
The rest of the paper consists of three main parts, along with two appendices that are provided as
supplementary material. The first part describes the background, organization, and implementation of the
M5 Accuracy competition, including the data and evaluation measures used, as well as information about
the participating teams and prizes. The second part presents the results of the competition and the top-
performing methods, summarizing its major findings. The last part concludes the paper and discusses the
implications of the competition’s findings, as well as directions for future research. The appendices provide
information regarding the benchmarks used and their forecasting accuracy, both overall and per aggregation
level.
2. The background
Forecasting competitions are vital for the objective evaluation of existing forecasting methods, intro-
ducing new, innovative ones, and determining empirically how to advance forecasting theory and practice.
Each competition introduces some new features or datasets that can be exploited for future research and
benchmarking, while also addressing limitations of the previous ones and focusing on a different forecasting
application.
Hyndman (2020) highlights the benefits and contributions of forecasting competitions, including the
M ones, concluding that: (i) a wider range of benchmarks and datasets that are regularly updated is
desired in order to mitigate over-fitting in published data used to evaluate forecasting methods, (ii) future
competitions should clearly define the domain to which they apply, (iii) objective measures that are based on
well-recognized attributes of the forecast distribution should be used, (iv) forecast distribution performance
should be assessed along with point forecast accuracy, (v) large-scale multivariate time series forecasting
should be considered to exploit possible cross-correlations between the series, (vi) high frequency data,
such as hourly, daily, and weekly should be introduced to investigate how multiple seasonal patterns and
irregularly spaced observations could be properly handled, as well as how data collected from sensors could
be optimally used, and finally, (vii) exogenous/explanatory variables should be provided along with time
series data to determine whether they contribute to more accurate forecasts. Similar remarks were made
by the discussants and commentators of the M4 competition, stressing the need for more representative,
high-frequency data that is hierarchically structured and accompanied by various exogenous/explanatory
variables (Fry & Brundage, 2020; Önkal, 2020; Gilliland, 2020; Fildes, 2020).
The M5 Accuracy competition tried to address these concerns and suggestions by introducing the fol-
lowing innovative features:
•A large dataset of 42,840 series was introduced along with 24 benchmarks. That way, existing and
new forecasting methods could be objectively evaluated and the results of previous studies effectively
tested for replicability.
•The competition focused on a specific forecasting application, namely accurately predicting the daily
unit sales of retail companies across various locations and product categories.
•An objective measure was used for evaluating forecasting accuracy, rewarding forecasts that approximate the mean of the
series being predicted.
•The series of the dataset were grouped and highly correlated, thus enabling the utilization of multi-
variate and “cross-learning” methods.
•The dataset involved daily data which requires accounting for multiple seasonal patterns, special days,
and holidays.
•The dataset included exogenous/explanatory variables, such as product prices, promotions, and special
events.
The M5 competition was initially announced at the end of December, 2019, first on the Makridakis
Open Forecasting Center (MOFC) website³ and then on the International Institute of Forecasters (IIF)
blog⁴. In addition, just as was done in the M4, invitation emails were sent to all those who had participated
in previous forecasting competitions and forecasting conferences, as well as to those who had published
articles in respected journals in the field of forecasting (for more information about the invitations, please
see Makridakis et al., 2020e). Social media (LinkedIn, Twitter, and Facebook) were also used to promote
the competition.
The competition started on March 3rd, 2020, when the initial train set became available, and ended
on June 30th, 2020, when the final leaderboard was announced. The rules of the competition, prizes, and
additional details were all made available on Kaggle, which hosted the competition, and the M5 competition’s
website⁵. The preliminary results of the competition were presented virtually on 28th October, 2020 at the
40th International Symposium on Forecasting, while the final results and winning methods were presented
at the M5 conference in NYC on 10th and 11th May, 2021.
Like all the previous M competitions, M5 was a completely open competition, encouraging the partici-
pation of both academics and practitioners to ensure fairness and objectivity, while also emphasizing that
each team was free to utilize its own preferred method. Moreover, as in the M4, teams were encouraged to
submit the code used for producing their forecasts, as well as the description of their methods, thus promot-
ing reproducibility and replicability (Makridakis et al., 2018a). The public discussions (about 600 topics)
³ https://mofc.unic.ac.cy/
⁴ www.forecasters.org
⁵ https://mofc.unic.ac.cy/m5-competition/
and notebooks (around 650 scripts) published on Kaggle further facilitated replicability and the exchange of
information, insights, and ideas among the participants and the forecasting community in general.
The competition also placed particular emphasis on benchmarking, considering a variety of methods,
both traditional and state-of-the-art, that can be classified as statistical, ML, and combinations. In the first
three M competitions (Makridakis et al., 1982, 1993; Makridakis & Hibon, 2000), and for many years in the
forecasting literature (Bates & Granger, 1969; Claeskens et al., 2016), combinations and relatively simple
methods were regarded as being equally accurate as sophisticated ones. The M4 competition (Makridakis
et al., 2020e), however, despite confirming the value of combining, indicated that more complex ML methods
could provide significantly more accurate results. These findings indicate that comparing the methods
submitted in the M5 Accuracy competition with various benchmarks would allow us to validate the findings
of the previous competitions and identify possible causes of improvements.
The benchmarks of the competition were selected based on their popularity, availability, ease of use,
and relatively low computational requirements. A description of the benchmarks is provided in
Appendix A, while the code used for implementing them is publicly available in the M5 GitHub repository⁶.
Note that most of the benchmarks considered in the competition were previously tested on a different
dataset, involving the daily product sales of a large Greek retail company (Spiliotis et al., 2020b). The main
conclusion of this study was that some ML methods provided less biased and more accurate forecasts than
well-established statistical methods, like Croston’s method and its variants, especially when “cross-
learning” approaches were considered, thus confirming the main finding of M4. The forecasting accuracy of
the benchmarks, both overall and per aggregation level, is provided in Appendix B along with an analysis
of their significance.
In the remainder of the paper we decided to use ES bu⁷ as the single most accurate benchmark against
which to compare the forecasting accuracy of all participating methods, as it is simple to compute and easy
to implement using most forecasting support systems. Given their simplicity, the Naive and sNaive (seasonal
Naive) methods are also considered as standards of comparison, as they are the simplest benchmark methods
that could be used for sales forecasting.
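To give a concrete picture of how this benchmark operates, the sketch below selects, for each Product-Store series, the exponential smoothing model with the lowest AIC from a small grid and then aggregates the resulting forecasts bottom-up. It is only an illustration of an ES bu-style approach, assuming a pandas DataFrame named `sales` with one row per Product-Store series; the benchmark implementations actually used in the competition are those in the M5 GitHub repository.

```python
# Illustrative ES_bu-style benchmark (not the official implementation).
# Assumes `sales` is a pandas DataFrame with one row per Product-Store series,
# identifier columns ["item_id", "store_id"], and daily unit-sale columns "d_1", "d_2", ...
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

H = 28  # forecasting horizon of the competition


def ets_forecast(y, h=H):
    """Fit a small grid of exponential smoothing models and keep the lowest-AIC one."""
    best_aic, best_fc = np.inf, np.full(h, y.mean())
    for trend in (None, "add"):
        for seasonal in (None, "add"):
            try:
                fit = ExponentialSmoothing(
                    y, trend=trend, seasonal=seasonal,
                    seasonal_periods=7 if seasonal else None,
                    initialization_method="estimated",
                ).fit()
                if fit.aic < best_aic:
                    best_aic, best_fc = fit.aic, fit.forecast(h)
            except Exception:
                continue  # skip model forms that fail to converge for a given series
    return np.asarray(best_fc)


day_cols = [c for c in sales.columns if c.startswith("d_")]
fc_cols = [f"F{i}" for i in range(1, H + 1)]
fc = np.vstack([ets_forecast(row[day_cols].to_numpy(float)) for _, row in sales.iterrows()])
bottom = pd.concat(
    [sales[["item_id", "store_id"]].reset_index(drop=True),
     pd.DataFrame(fc, columns=fc_cols)], axis=1)

# Bottom-up step: forecasts at any higher level are simple sums of the Product-Store
# forecasts, e.g. store-level totals:
store_level = bottom.groupby("store_id")[fc_cols].sum()
```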
3. Organization and implementation
3.1. Dates and phases
The M5 Accuracy competition began on March 3rd, 2020, when the initial train dataset became available
to download on the Kaggle platform. The rules and information regarding the competition were likewise
⁶ https://github.com/Mcompetitions/M5-methods
⁷ The most accurate exponential smoothing model is automatically selected for predicting the Product-Store series of the dataset and the bottom-up method is then used for forecasting the rest of the series.
posted on both Kaggle and the M5 competition’s website. The competition ended on June 30th, 2020, when
the final leaderboard was announced. The deadline for accepting the competition rules (entry deadline) and
joining or merging teams (team merger deadline) was June 23rd.
Chronologically, the competition was divided into two phases, namely the validation and the test phase, which
were used for the final evaluation of the teams. This was done in order to help the participating
teams assess their forecasting methods, receive feedback through a live leaderboard, and exchange ideas
and insights with the rest of the community, while also making sure that no information was provided about
the test set that would not already be available in a real-life forecasting setting.
The validation phase took place from March 3rd, 2020 to May 31st of the same year. During this phase,
the teams were allowed to train their forecasting methods with the data initially provided by the organizers
(consisting of 1,913 days) and validate the accuracy of their methods using a hidden sample of 28 days
(equal to the forecasting horizon considered by the competition), not made publicly available. This sample
corresponded to the four weeks succeeding the initial train set, i.e., days 1,914 to 1,941. By submitting
their forecasts to the Kaggle platform (a maximum of five entries per day), the teams were informed about
the accuracy of their submission, which was then published on Kaggle’s real-time leaderboard. Given this
feature, teams were allowed to effectively revise and resubmit their forecasts by learning from the received
feedback (Athanasopoulos & Hyndman, 2011). Note that, ideally, and in order to avoid over-fitting, the
leaderboard should be used for assessing the methods developed and not for indirectly optimizing their
settings (e.g. hyper-parameter tuning and feature selection).
After the end of the validation phase, i.e., June 1st, 2020, the teams were provided with the actual values
of the 28 days of data used for assessing their accuracy during the validation phase. They were then asked
to re-estimate or adjust their forecasting methods (if needed), in order to submit their final forecasts for
the following 28 days, i.e., the test data used for the final evaluation of the teams. During this time, there
was no leaderboard, meaning that no feedback was given to the teams about their actual performance after
submitting their forecasts. Therefore, although the teams were free to (re)submit their forecasts any time
they wished (a maximum of five entries per day), they were not aware of their absolute, as well as their
relative, performance.
The final ranks of the teams were only disclosed at the end of the competition, when the test data was
made available. For their evaluation, each team had to select a single set of forecasts (one submission).
If no particular forecasts were selected, the ones with the highest accuracy during the validation phase were
automatically selected by the system. This was done in order for the competition to simulate reality as
closely as possible, given that in real life forecasters do not know the future and have to provide a
single set of forecasts which they believe will approximate it as accurately as possible.
At this point we should note that making submissions during the validation phase of the competition was
completely optional and teams were free to decide whether they were going to exploit the public leaderboard
for validating their methods, or their own, privately constructed cross-validation (CV) tests (Tashman, 2000).
However, regardless of each team's preferences, effectively assessing the post-sample accuracy of the developed methods
was of critical importance for performing well in the M5 Accuracy competition, as CV can provide
useful insights about the models that are expected to derive more accurate forecasts, the optimal values of
their parameters, and the exogenous/explanatory variables that should be provided to them as input. This
is particularly true when dealing with flexible ML methods where the number of features that can potentially
be used as input, the training approaches that can possibly be adopted, and the hyper-parameters that can
be accordingly adjusted, are countless. In addition, CV strategies, as well as other modeling features like
information criteria (Sakamoto et al., 1986), can become useful for avoiding over-fitting and mitigating data,
parameter, and model uncertainty (Petropoulos et al., 2018).
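As an illustration of such privately constructed tests, the sketch below implements a basic rolling-origin evaluation (Tashman, 2000) with 28-day folds, matching the M5 horizon; the `fit_and_forecast` callable is a placeholder for whatever model a team wishes to validate, and the RMSE fold score is just one possible choice.

```python
# Rolling-origin cross-validation sketch (28-day folds, matching the M5 horizon).
# `y` is a 1-D numpy array of daily unit sales; `fit_and_forecast(train, h)` is a
# placeholder for the team's own model and must return an array of h forecasts.
import numpy as np


def rolling_origin_cv(y, fit_and_forecast, h=28, n_folds=3):
    scores = []
    for k in range(n_folds, 0, -1):          # oldest origin first, latest origin last
        cutoff = len(y) - k * h
        train, test = y[:cutoff], y[cutoff:cutoff + h]
        forecast = fit_and_forecast(train, h)
        scores.append(np.sqrt(np.mean((test - forecast) ** 2)))  # per-fold RMSE
    return float(np.mean(scores))


# Example with a naive "repeat the last observation" model:
# cv_score = rolling_origin_cv(y, lambda train, h: np.repeat(train[-1], h))
```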
3.2. Training, validation, and test dataset
The M5 dataset, generously made available by Walmart, involves the unit sales of various products sold
in the USA, organized in the form of grouped time series. More specifically, the dataset involves the unit
sales of 3,049 products, classified into 3 product categories (Hobbies, Foods, and Household) and 7 product
departments into which the above-mentioned categories are disaggregated. The products are sold across 10
stores, located in 3 States (CA, TX, and WI). In this respect, the most disaggregated data, i.e., product-
store unit sales, can be grouped based on either location (store and state) or product-related information
(department and category), as follows:
[Diagram: the grouped structure of the M5 series. Total unit sales split geographically into 3 States (CA, TX, WI) and 10 stores (CA 1 to CA 4, TX 1 to TX 3, WI 1 to WI 3), and, in parallel, into 3 product categories (Hobbies, Foods, Household), 7 departments (Hobbies 1-2, Foods 1-3, Household 1-2), and 3,049 individual products.]
Given that multiple meaningful hierarchies can be constructed from the M5 data, the organizers decided to
consider all possible cross-sectional levels of aggregation for the evaluation, as shown in Table 1. Although
the identifiers of the various levels (level id) do not indicate an actual hierarchical structure, they facilitate
reference, also highlighting the extent of aggregation that takes place: high levels of aggregation generally
correspond to low identification numbers (e.g., levels 1 to 5), while low levels of aggregation to higher
identification numbers (e.g., levels 10 to 12).
Table 1: Number of M5 series per aggregation level.
Level id Level description Aggregation level Number of series
1 Unit sales of all products, aggregated for all stores/states Total 1
2 Unit sales of all products, aggregated for each State State 3
3 Unit sales of all products, aggregated for each store Store 10
4 Unit sales of all products, aggregated for each category Category 3
5 Unit sales of all products, aggregated for each department Department 7
6 Unit sales of all products, aggregated for each State and category State-Category 9
7 Unit sales of all products, aggregated for each State and department State-Department 21
8 Unit sales of all products, aggregated for each store and category Store-Category 30
9 Unit sales of all products, aggregated for each store and department Store-Department 70
10 Unit sales of product i, aggregated for all stores/states Product 3,049
11 Unit sales of product i, aggregated for each State Product-State 9,147
12 Unit sales of product i, aggregated for each store Product-Store 30,490
Total 42,840
The data is daily and covers the period from 2011-01-29 to 2016-06-19 (1,969 days or approximately 5.4
years). As described in the previous subsection, the first 1,913 days of data (2011-01-29 to 2016-04-24) were
initially provided to the participating teams as a train set, days 1,914 to 1,941 (2016-04-25 to 2016-05-22)
served as a validation set, while the remaining 28 days, i.e., 1,942 to 1,969 (2016-05-23 to 2016-06-19), were
used as a test set.
The M5 competition dataset also involved exogenous/explanatory variables, including calendar-related
information and selling prices. Thus, apart from the past unit sales of the products and the corresponding
timestamps (e.g., date, weekday, week number, month, and year), there was also information available about:
•Special events and holidays (e.g. Super Bowl, Valentine’s Day, and Orthodox Easter), organized into
four classes, namely Sporting, Cultural, National, and Religious.
•Selling prices, provided at a week-store level (averaged across seven days). If a price is not available,
the product was not sold during the week examined. Although prices are constant on a weekly
basis, they may change over time.
•SNAP⁸ activities that serve as promotions. This is a binary variable (0 or 1) indicating whether
the stores of CA, TX, or WI allow SNAP purchases on the date examined (1 indicates that SNAP
purchases are allowed).

⁸ The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP). SNAP provides low-income families and individuals with an Electronic Benefits Transfer debit card to purchase food products. In many States, the monetary benefits are disbursed across 10 days of the month, and on each of these days 1/10 of the beneficiaries receive the benefit on their card. More information about the SNAP program can be found here: https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program
The forecasting horizon (forecasts 28 days ahead) was determined based on the nature of the decisions
that companies typically support when forecasting data similar to that of the M5, i.e., daily series
disaggregated across various locations and product categories. The test set was then randomly chosen by the
organizers from the original dataset provided by Walmart (around 6 years of data), with the only restrictions
being that (i) more than 5 years of data should be available for training and (ii) at least two special events
should be included in the validation and test set to account for possible deviations in demand. Therefore,
the test set involved three special events, namely Memorial Day, part of Ramadan, and the NBA finals, while
the validation set involved Pesach, Orthodox Easter, Cinco de Mayo, and Mother's Day.
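For concreteness, the sketch below assembles these inputs into a single long-format table; the file and column names are those used on the competition's Kaggle data page, while the paths are placeholders.

```python
# Sketch of assembling a long-format modeling table from the M5 input files.
# File and column names follow the Kaggle data page; paths are placeholders.
import pandas as pd

sales = pd.read_csv("sales_train_validation.csv")   # wide format: one column per day d_1 ... d_1913
calendar = pd.read_csv("calendar.csv")              # dates, special events, and SNAP flags per day d
prices = pd.read_csv("sell_prices.csv")             # weekly sell_price per store_id/item_id/wm_yr_wk

id_cols = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]
long = sales.melt(id_vars=id_cols, var_name="d", value_name="units")

# Attach calendar information (special events, SNAP days) and weekly selling prices.
long = long.merge(calendar, on="d", how="left")
long = long.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")

# Weeks with a missing sell_price correspond to weeks in which the product was not sold.
long["dollar_sales"] = long["units"] * long["sell_price"]
```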
3.3. Submission
All forecasts were submitted through the Kaggle platform using the template provided by the organizers.
The template for the M5 Accuracy competition only referred to the 30,490 series that constitute the lowest
aggregation level of the dataset (level 12) and not all 42,840 series of the competition. This was done because
M5 deals with, among other things, a real-life, hierarchically structured forecasting problem. This means that the
submitted forecasts must follow this hierarchical concept and, as a result, be coherent (forecasts at the lower
levels have to sum up to the ones at the higher levels so that decisions made using the forecasts across
different levels are aligned; Spiliotis et al., 2019b). In other words, it was assumed that the forecasting
approach used for forecasting all 42,840 series of the competition derived coherent forecasts and, therefore,
the forecasts of all levels could be automatically computed by aggregating (summing up) the ones at the
most disaggregated level.
Note that this format did not affect the way the forecasts were produced and that teams were completely
free to use their forecasting method of choice to forecast the individual series. However, having done that,
and by submitting just the forecasts of the lowest level, it was ensured that the derived forecasts were
coherent and, therefore, in an appropriate format to be evaluated. For instance, a team could just forecast
the series at the most disaggregated level of the competition (level 12) and derive the remaining forecasts
using the bottom-up method. Another team could just forecast the most aggregated series of the competition
(level 1) and derive the remaining ones using proportions (top-down method). A mix of the previous two
approaches was also possible (middle-out method). Finally, predicting the series of all levels and obtaining
the ones of the lowest level through an appropriate weighting scheme was another option (Hyndman et al.,
The benchmarks of the competition illustrate some of these options, including indicative forecasting
approaches that utilize the bottom-up and top-down methods, as well as combinations of the two.
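To make the aggregation step concrete, the sketch below derives all 12 levels from a table of Product-Store forecasts through simple group-by sums; the identifier columns follow the M5 naming, while the forecast columns F1 to F28 are an assumption of this illustration.

```python
# Bottom-up aggregation sketch: all 42,840 series from the 30,490 Product-Store forecasts.
# `bottom` is assumed to hold one row per Product-Store series, with its identifier
# columns (item_id, dept_id, cat_id, store_id, state_id) and 28 forecast columns F1 ... F28.
import pandas as pd

fc_cols = [f"F{i}" for i in range(1, 29)]

groupings = {
    1: [],                         # Total
    2: ["state_id"],               # State
    3: ["store_id"],               # Store
    4: ["cat_id"],                 # Category
    5: ["dept_id"],                # Department
    6: ["state_id", "cat_id"],
    7: ["state_id", "dept_id"],
    8: ["store_id", "cat_id"],
    9: ["store_id", "dept_id"],
    10: ["item_id"],               # Product
    11: ["item_id", "state_id"],
    12: ["item_id", "store_id"],   # Product-Store (identity)
}

levels = {
    lvl: (bottom[fc_cols].sum().to_frame().T if not keys
          else bottom.groupby(keys)[fc_cols].sum())
    for lvl, keys in groupings.items()
}
# levels[1] has 1 row, levels[3] has 10, levels[12] has 30,490; 42,840 rows in total.
```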
3.4. Performance measure
The literature includes various measures for evaluating point forecast accuracy (Hyndman & Koehler,
2006). The first three M competitions considered several of these measures, while the previous one, M4,
examined the overall weighted average of the symmetric mean absolute percentage error (sMAPE; Makridakis
(1993)) and a variant of the mean absolute scaled error (MASE; Hyndman & Koehler (2006)). Undoubtedly,
no measure is perfect and all come with both advantages and drawbacks (Goodwin & Lawton, 1999; Kolassa,
2020). The comments made about the measures utilized in all previous M competitions by the invited
commentators clearly demonstrate this lack of agreement, and also highlight that each forecaster has his/her
own preferences (Makridakis et al., 2020d).
Considering that, among the measures commonly used in the literature to assess forecasting accuracy,
those based on scaled errors probably display the most preferable statistical properties,
the M5 Accuracy competition utilized a variant of the MASE originally proposed by Hyndman & Koehler
(2006), to be called the Root Mean Squared Scaled Error (RMSSE). The measure is defined as follows:
$$\mathrm{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(y_t-\hat{y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(y_t-y_{t-1}\right)^2}},$$
where y_t is the actual future value of the examined time series at point t, ŷ_t the forecast of the method being
evaluated, n the length of the training sample (number of historical observations), and h the forecasting
horizon (28 days). Note that the denominator of RMSSE (in-sample, one-step-ahead mean squared error of
the Naive method) is computed only for the periods during which the examined product(s) are actively sold,
i.e., the periods following the first non-zero demand observed for the series under evaluation. This is done
since many of the products included in the dataset started being sold later than the first available date.
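In code, the computation for a single series can be sketched as follows; this is only an illustration, as the official evaluation scripts are available in the M5 GitHub repository.

```python
# RMSSE sketch for a single series (illustrative; the official scorer is in the M5 repository).
import numpy as np


def rmsse(train, actual, forecast):
    """train: historical unit sales; actual, forecast: arrays of length h (here h = 28)."""
    train = np.asarray(train, dtype=float)
    first_sale = np.argmax(train != 0)        # index of the first non-zero demand
    train = train[first_sale:]                # scale only over the active selling period
    scale = np.mean(np.diff(train) ** 2)      # in-sample one-step-ahead Naive MSE
    numerator = np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)
    return float(np.sqrt(numerator / scale))
```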
Like MASE, RMSSE is independent of the scale of the data, has a predictable behavior, i.e., becomes
infinite or undefined only when all the errors of the Naive method are equal to zero, has a defined mean
and a finite variance, and is symmetric in the sense that it penalizes positive and negative forecast errors,
as well as errors in large and small forecasts, equally. The choice for this particular measure can be further
justified as follows:
•Many of the competition’s series are characterized by intermittency, involving sporadic unit sales with
lots of zeros. This means that absolute errors, which are optimized for the median (Schwertman et al.,
1990), would assign lower scores (better accuracy) to forecasting methods that derive forecasts close
to zero. However, the objective of the competition is to accurately forecast the average sales. As a
result, the accuracy measure used builds on squared errors, which are optimized for the mean (Kolassa,
2016).
•In contrast to other measures with similar statistical properties, such as relative errors and relative measures
(Davydenko & Fildes, 2013), RMSSE can be safely computed for all M5 series, as it does not rely on
divisions by values that could be equal to or close to zero. For example, this is typically the case for
percentage errors when y_t is equal to zero, or for relative errors when the error of the benchmark used for
scaling is zero.
After estimating RMSSE for all the 42,840 time series of the competition (average accuracy reported
for each series across the complete forecasting horizon), the overall accuracy of the forecasting method is
computed by averaging the RMSSE scores across all the series of the dataset using appropriate weights. The
measure, to be called Weighted RMSSE (WRMSSE), is defined as follows:
$$\mathrm{WRMSSE} = \sum_{i=1}^{42{,}840} w_i \times \mathrm{RMSSE}_i, \qquad (1)$$
where w_i and RMSSE_i are the weight and the RMSSE score of the i-th series of the competition,
respectively. The weights are computed based on the last 28 observations of the training sample of the
dataset, specifically based on the cumulative actual dollar sales that each series displayed in that particular
period (sum of units sold multiplied by their respective price). Lower WRMSSE scores indicate more accurate
forecasts. Note that the estimation of WRMSSE differs from the approaches adopted in the previous M
competitions. In the first three competitions, all errors were computed both per series and per forecasting
horizon, and then equally averaged together. In the M4 competition, the errors were first averaged per
series, exactly as done in M5, but then averaged again using equal weights.
We believe that the weighting scheme adopted in the M5 competition, involving the unit sales of var-
ious products of different selling volumes and prices that are organized in a hierarchical fashion, is more
appropriate for successfully identifying forecasting methods that add significant value to retail companies
interested in accurately forecasting the series that mostly translate to relatively higher revenues. Business-
wise, in order for a forecasting method to be considered appropriate, it must provide accurate forecasts
across all aggregation levels, especially for series of high importance, i.e. series that represent significant
sales, measured in monetary terms. In other words, we expect the “best” performing forecasting methods
to derive lower forecasting errors for the series that are more valuable to the company.
Note that, according to WRMSSE, all aggregation levels are equally weighted. The reason is that the
total dollar sales of a product, measured across all three States, are equal to the sum of the dollar sales
of this product when measured across all ten stores. Similarly, the total dollar sales of a store’s
product category are equal to the sum of the dollar sales of the departments that this category consists of,
as well as to the sum of the dollar sales of the corresponding departments’ products. Moreover, given that
M5 does not focus on a particular decision-making problem, there is no obvious reason for weighting the
individual levels unequally.
An indicative example for computing WRMSSE can be found in the Competitors’ Guide of the compe-
tition, available on the M5 website⁹. The code for estimating WRMSSE, as well as the exact weight of each
series, can be found in the GitHub repository of the competition.
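For illustration, the sketch below combines per-series RMSSE scores with dollar-sales-based weights. The normalization used here (weights summing to one within each level, with the 12 levels weighted equally) reflects our reading of the scheme described above; the exact weights used in the competition are the ones published in the repository.

```python
# WRMSSE sketch (illustrative; the exact weights are published in the M5 repository).
# `scores` maps each series key to its RMSSE, `dollar_sales` maps the same keys to the
# series' cumulative dollar sales over the last 28 training days, and `level_of` gives
# the aggregation level (1-12) of each key.


def wrmsse(scores, dollar_sales, level_of):
    # Total dollar sales per aggregation level, used to normalize the weights.
    level_totals = {}
    for key, sales in dollar_sales.items():
        level_totals[level_of[key]] = level_totals.get(level_of[key], 0.0) + sales

    n_levels = len(level_totals)              # 12 in the M5 set-up
    total = 0.0
    for key, score in scores.items():
        weight = dollar_sales[key] / level_totals[level_of[key]] / n_levels
        total += weight * score               # weights sum to 1 across all series
    return total
```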
3.5. Prizes
In order for a team to be eligible for a prize, point forecasts had to be provided for all 30,490 series of
the competition’s 12th aggregation level (product-store level), which can be properly aggregated (summed
up) to produce forecasts for the rest of the levels. Moreover, teams had to provide code for reproducing
the forecasts originally submitted to the competition, as well as some documentation for understanding the
forecasting method used.
Just like in M4, objectivity and reproducibility were prerequisites for collecting any prize (Makridakis
et al., 2018a) and therefore, the winning teams, with the exception of companies providing forecasting
services and those claiming proprietary software, had to upload their code onto the Kaggle platform no
later than 14 days after the end of the competition (i.e., the 14th of July, 2020). This material was later
uploaded onto the M5 public GitHub repository, in order for individuals and companies interested in using
the winning methods to be able to do so, while crediting the team that had developed them. Companies
providing forecasting services and those claiming proprietary software had to provide the organizers with a
detailed description of how their forecasts were made and a source or execution file for reproducing their
forecasts.
After receiving the code and documentation from all the winning teams, the organizers evaluated the
reproducibility of their results. Since ML algorithms typically involve random initializations, the organizers
considered as fully reproducible any method that displayed a reproducibility rate, i.e., one minus the absolute
percentage difference in WRMSSE between the original and reproduced forecasts, higher than 98%. Although all
winning methods were found to be fully reproducible, if this was not true, the prizes would have been given
to the next best-performing and fully reproducible submission.
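Under this definition, the numerical part of the check reduces to a few lines (our interpretation of the threshold; the organizers' actual verification also required re-running the submitted code):

```python
# Reproducibility check sketch (our interpretation of the 98% threshold).
def is_reproducible(wrmsse_original, wrmsse_reproduced, threshold=0.98):
    rate = 1.0 - abs(wrmsse_reproduced - wrmsse_original) / wrmsse_original
    return rate >= threshold
```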
The prizes of the M5 Accuracy competition are listed in Table 2. Note that there were no restrictions
preventing a team from collecting both a regular and a student¹⁰ prize. Moreover, there were no restrictions
preventing a team from collecting a prize both at the M5 Accuracy and the M5 Uncertainty competitions.
The awards were given during the virtual, online M5 conference on October 29th, 2020.
⁹ https://mofc.unic.ac.cy/m5-competition/
¹⁰ A student team is one for which at least half of the team members are current full-time students. Teams that were eligible for the student prize have a name followed by “_STU”.
Table 2: The six prizes of the M5 Accuracy competition.
Prize name Description Amount
1st prize Best performing method according to WRMSSE $25,000
2nd prize Second-best performing method according to WRMSSE $10,000
3rd prize Third-best performing method according to WRMSSE $5,000
4th prize Fourth-best performing method according to WRMSSE $3,000
5th prize Fifth-best performing method according to WRMSSE $2,000
Student prize Best performing method among student teams according to WRMSSE. $5,000
Total $50,000
An amount of $40,000 was generously provided by Kaggle, which also waived the fees for hosting the
M5 competition. In addition, MOFC and Google generously provided $20,000 each, while Walmart, apart
from the M5 dataset, also generously provided an amount of $10,000. Finally, the global transportation
technology company Uber generously provided $5,000, while IIF generously provided another $5,000. The
total amount of $100,000 was equally distributed between the accuracy and uncertainty challenges of the
M5 competition.
3.6. Participating teams and submissions
The M5 Accuracy competition involved 7,092 participants in 5,507 teams from 101 countries. Of these
teams, 4,373 entered the competition during the validation phase and 1,134 during the test phase. Moreover,
1,434 teams made submissions during both the validation and the test phase of the competition, while 2,939
did so only during the validation phase. In total, the participating teams made 88,136 submissions, most of which
(about 78.3%) were submitted during the validation phase. Note that most of the teams made a single
submission, while the majority of the rest made between 3 and 20 submissions. It is worth mentioning
that for 1,563 participants, including 15 in the top 100, this was their first time participating in a Kaggle
competition.
Unfortunately, due to privacy regulations, no information was made available about the academic back-
ground of the participating teams, their experience and skills, and the type of methods utilized (e.g., sta-
tistical, ML, combination or hybrid), with the exception of the winning teams and a few more that were
willing to share this information with the organizers. However, based on the general characteristics of the
Kaggle community, we assume that most of the teams had (at least) an adequate background in statistics
and computer science, and were also familiar with ML forecasting methods, such as Neural Networks (NNs)
and Regression Trees (RTs).
Out of the participating teams, 2,666 (48.4%) managed to outperform the Naive benchmark, 1,972
(35.8%) outperformed the sNaive benchmark, and 415 (7.5%) beat the top-performing benchmark (ES bu).
However, it is important to note that these numbers refer to the forecasts selected by each team for the final
evaluation of their performance and not to the “best” submission made per case while the competition was
still running. In the latter case, 3,510 (63.7%), 2,685 (48.8%), and 672 (12.2%) teams would have managed
to outperform the Naive, sNaive, and ES bu benchmarks, respectively. This indicates that many teams
failed to choose the best method developed, probably due to misleading validation scores.
Figure 1 summarizes this information, presenting the daily number of submissions made and the cumu-
lative number of participating teams, the number of participants per country, the distribution of accuracy
of the teams that did better than the Naive benchmark, and the accuracy of the teams that did better than
the top-performing benchmark, along with their respective ranks.
[Figure 1 appears here. The top-right panel lists the top 20 countries by participation: US, JP, IN, CN, RU, GB, FR, DE, KR, TW, HK, NL, CA, DK, BR, AU, ES, UA, SG, PL, plus Other.]
Figure 1: Summary of the participating teams and submissions made. Top-left: The daily number of submissions made (black
line) and the cumulative number of participating teams (blue line). The red dotted line indicates the end of the validation
phase; Top-right: Number of participants per country (top 20 in terms of participation), as estimated based on their IP address;
Bottom-left: The distribution of the accuracy (WRMSSE) achieved by the teams that did better than the Naive benchmark.
The green dotted line indicates the accuracy of the ES bu benchmark, while the purple dotted line the accuracy of sNaive;
Bottom-right: The accuracy (WRMSSE) and ranks of the teams that did better than the top-performing benchmark (ES bu).
Percentage improvements over ES bu are also reported.
By observing Figure 1 we find that:
•The majority of the teams made most of their submissions during the validation phase, when the
public leaderboard was available and live feedback could be received. During the test phase, most
of the teams probably used their own, private CV strategies to fine-tune their methods, which were
mainly submitted four days before the competition ended.
•The majority of the participants originated from the USA (17%), Japan (17%), India (10%), China (10%),
and Russia (6%). Thus, we conclude that there is a large, active community interested in forecasting
in both developed and developing countries.
•Only a limited number of teams managed to outperform the top-performing benchmark of the com-
petition, with the majority of the teams being outperformed by ES bu by more than 13%.
•From the 415 teams that managed to outperform all the benchmarks of the competition, 5 displayed
an improvement greater than 20%, 42 greater than 15%, 106 greater than 10%, and 249 greater
than 5%. These improvements are substantial and demonstrate the superiority of these methods over
standard forecasting approaches. Moreover, the five winners of the competition were the only teams
to accomplish an accuracy improvement greater than 20%, thus achieving a clear victory over the rest.
The various tables presented in the remainder of this paper focus on the top 50 performing teams of the
competition, as well as the benchmarks considered by the organizers, where appropriate. The reasoning is
twofold: First, for practical reasons, as it would be impossible to analyze and report in detail the results
of all the teams that participated in the competition. Second, given that very few teams were willing to
share detailed information about the methods utilized, we feel that there is more to learn from the top
performers, for which adequate information is available, than from the rest, about which we know little.
Furthermore, given the complexity of the data and the competition in general, we believe it is safer to draw
conclusions from methods that worked well rather than rationalizing why some methods performed poorly.
4. Results, winning submissions, and key findings
4.1. Results
Table 3 presents the accuracy (WRMSSE) achieved by the top 50 teams of the competition, both overall
and across the 12 aggregation levels. The last column of the table displays the overall (42,840 series)
percentage improvement of the teams over the top-performing benchmark (ES bu).
By observing Table 3 we find that all top 50 submissions improve the overall forecasting accuracy of
the top-performing benchmark by more than 14%, while the improvements are higher than 20% for the
top five performing methods and an impressive 22.4% for the winning team. Taking into consideration
Table 3: The performance of the top 50 teams of the M5 Accuracy competition in terms of WRMSSE. The results are presented
both per aggregation level and overall. Overall percentage improvements are also reported in comparison to the top-performing
benchmark (ES bu).
Rank Team Aggregation level Average Improvement
1 2 3 4 5 6 7 8 9 10 11 12 over ES bu (%)
1 YJ STU 0.199 0.310 0.400 0.277 0.365 0.390 0.474 0.480 0.573 0.966 0.929 0.884 0.520 22.4
2 Matthias 0.186 0.294 0.416 0.246 0.349 0.381 0.481 0.497 0.594 1.023 0.964 0.907 0.528 21.3
3 mf 0.236 0.319 0.421 0.308 0.397 0.405 0.496 0.505 0.600 0.950 0.917 0.875 0.536 20.2
4 monsaraida 0.254 0.340 0.418 0.302 0.377 0.411 0.483 0.490 0.579 0.963 0.928 0.886 0.536 20.1
5 Alan Lahoud 0.213 0.324 0.414 0.272 0.361 0.416 0.494 0.503 0.595 0.995 0.950 0.897 0.536 20.1
6 wyzJack STU 0.248 0.367 0.431 0.319 0.396 0.436 0.502 0.502 0.584 0.953 0.918 0.875 0.544 18.9
7 RandomLearner 0.194 0.317 0.423 0.276 0.404 0.408 0.516 0.503 0.608 1.029 0.968 0.910 0.546 18.6
8 SHJ 0.279 0.357 0.419 0.336 0.406 0.429 0.498 0.497 0.586 0.956 0.922 0.878 0.547 18.5
9 gest 2 0.197 0.322 0.424 0.269 0.406 0.420 0.536 0.513 0.624 1.000 0.953 0.901 0.547 18.5
10 DenisKokosinskiy STU 0.294 0.363 0.419 0.341 0.401 0.429 0.493 0.494 0.581 0.955 0.921 0.878 0.547 18.4
11 XueWang 0.288 0.358 0.417 0.348 0.423 0.424 0.501 0.490 0.582 0.959 0.921 0.876 0.549 18.2
12 yq STU 0.226 0.320 0.456 0.294 0.403 0.399 0.496 0.526 0.614 1.010 0.954 0.899 0.550 18.1
13 PoHaoChou 0.212 0.317 0.459 0.322 0.402 0.413 0.504 0.539 0.630 0.968 0.940 0.898 0.550 18.0
14 Tsuru 0.257 0.335 0.402 0.325 0.416 0.421 0.506 0.503 0.608 0.994 0.951 0.900 0.552 17.8
15 bk 18 0.217 0.333 0.420 0.303 0.433 0.431 0.537 0.510 0.615 0.986 0.943 0.893 0.552 17.8
16 N60610 0.195 0.350 0.436 0.298 0.409 0.441 0.530 0.521 0.619 0.976 0.945 0.900 0.552 17.8
17 MonashSL STU 0.247 0.342 0.446 0.308 0.404 0.412 0.501 0.520 0.622 0.992 0.944 0.892 0.552 17.7
18 leo clement 0.270 0.354 0.410 0.322 0.415 0.434 0.526 0.498 0.603 0.986 0.945 0.896 0.555 17.3
19 minghui Tju 0.254 0.365 0.428 0.327 0.425 0.439 0.529 0.505 0.602 0.979 0.936 0.886 0.556 17.1
20 zfc613 0.236 0.343 0.470 0.291 0.387 0.412 0.506 0.539 0.627 1.008 0.959 0.907 0.557 17.0
21 No dalpoints 0.312 0.376 0.423 0.353 0.416 0.443 0.508 0.500 0.587 0.963 0.925 0.880 0.557 17.0
22 CPUkiller 0.263 0.367 0.427 0.333 0.431 0.438 0.529 0.504 0.601 0.979 0.935 0.885 0.558 16.9
23 dont overfit 0.263 0.367 0.427 0.333 0.431 0.438 0.529 0.504 0.601 0.979 0.935 0.885 0.558 16.9
24 Dan Hargreaves 0.269 0.350 0.439 0.316 0.405 0.431 0.517 0.522 0.616 0.984 0.946 0.900 0.558 16.9
25 M0T0 STU 0.252 0.346 0.425 0.345 0.431 0.445 0.530 0.526 0.623 0.967 0.932 0.886 0.559 16.7
26 Genryu 0.295 0.368 0.436 0.340 0.410 0.445 0.515 0.523 0.611 0.961 0.924 0.880 0.559 16.7
27 Moscow Five 0.245 0.349 0.442 0.309 0.435 0.436 0.542 0.526 0.624 0.988 0.941 0.889 0.560 16.5
28 Daniela A 0.162 0.333 0.483 0.278 0.441 0.412 0.569 0.558 0.679 0.988 0.942 0.892 0.561 16.3
29 shuheioka 0.267 0.354 0.440 0.313 0.419 0.430 0.524 0.517 0.616 1.002 0.954 0.903 0.562 16.3
30 sk 2 0.191 0.381 0.511 0.263 0.364 0.470 0.552 0.585 0.661 0.962 0.932 0.887 0.563 16.1
31 nagao 0.279 0.382 0.443 0.328 0.413 0.456 0.531 0.523 0.609 0.975 0.936 0.889 0.564 16.0
32 AjayNagar 0.221 0.324 0.518 0.285 0.412 0.400 0.515 0.573 0.660 1.010 0.954 0.900 0.564 15.9
33 cjwh 0.248 0.348 0.449 0.314 0.420 0.442 0.544 0.535 0.639 1.002 0.955 0.903 0.566 15.6
34 CWD75 0.237 0.326 0.422 0.330 0.452 0.442 0.551 0.526 0.637 1.004 0.960 0.912 0.567 15.6
35 Gro ot 0.278 0.384 0.443 0.342 0.432 0.458 0.540 0.519 0.611 0.979 0.937 0.887 0.567 15.4
36 Astral 0.299 0.381 0.453 0.342 0.401 0.453 0.520 0.528 0.611 0.984 0.945 0.896 0.568 15.4
37 Logistic 0.278 0.386 0.445 0.344 0.436 0.457 0.541 0.518 0.610 0.979 0.936 0.886 0.568 15.3
38 jdsc perceiving team 0.262 0.372 0.461 0.326 0.433 0.445 0.532 0.531 0.623 0.990 0.948 0.897 0.568 15.3
39 Abzal 0.314 0.373 0.434 0.351 0.420 0.447 0.519 0.515 0.603 0.998 0.956 0.906 0.570 15.1
40 Pianus 0.287 0.383 0.473 0.342 0.435 0.451 0.535 0.536 0.626 0.964 0.926 0.880 0.570 15.1
41 NAU 0.277 0.366 0.456 0.310 0.425 0.440 0.537 0.532 0.633 1.002 0.957 0.906 0.570 15.0
42 shirokane friends 0.300 0.387 0.454 0.347 0.429 0.461 0.540 0.534 0.619 0.965 0.926 0.880 0.570 15.0
43 Alexnet 0.301 0.390 0.444 0.353 0.435 0.463 0.540 0.520 0.610 0.975 0.934 0.885 0.571 14.9
44 Griffin Series 0.317 0.380 0.469 0.361 0.442 0.448 0.527 0.529 0.618 0.971 0.933 0.887 0.574 14.5
45 Hiromitsu Kigure 0.291 0.380 0.462 0.342 0.428 0.449 0.533 0.535 0.629 0.991 0.950 0.895 0.574 14.5
46 YK 0.247 0.369 0.464 0.314 0.438 0.453 0.551 0.542 0.644 1.011 0.958 0.904 0.575 14.4
47 PASSTA 0.339 0.396 0.460 0.366 0.421 0.457 0.521 0.532 0.614 0.970 0.933 0.886 0.575 14.4
48 golubyatniks 0.359 0.413 0.455 0.387 0.434 0.466 0.519 0.521 0.600 0.956 0.922 0.879 0.576 14.2
49 b elkasanek 0.184 0.329 0.538 0.260 0.427 0.416 0.549 0.608 0.701 1.028 0.964 0.905 0.576 14.2
50 Random prediction 0.249 0.348 0.455 0.347 0.457 0.460 0.563 0.558 0.655 0.986 0.943 0.890 0.576 14.2
that the improvements of the winning submissions of the M3 and M4 competitions over the corresponding
benchmarks were less than 10% (Makridakis et al., 2020e), we can conclude that M5 included more accurate
approaches that reduced the error over the most accurate benchmark by more than one fifth. This means
that retail and logistics companies could gain substantial benefits by utilizing such innovative forecasting
approaches in practice, where small improvements in accuracy lead to considerable inventory reductions
(Syntetos et al., 2010) and slight inaccuracies to higher stock holdings and lower service levels (Ghobbar &
Friend, 2003; Pooya et al., 2019).
Another interesting finding is that the winning team (YJ STU) does not display the most accurate
forecasts across all 12 aggregation levels; it is the best approach at only levels 3, 7, 8, and 9, and the second
best at levels 2 and 6. This is particularly true for the lowest three aggregation levels of the dataset (10,
11, and 12) where, out of the 50 submissions, YJ STU is ranked 13th, 12th, and 11th, respectively.
The same holds for the runner-up (Matthias), which is ranked 1st at levels 2, 4, 5, and 6, but displays
almost the worst performance out of the 50 methods examined at levels 10, 11, and 12, being ranked 48th,
49th, and 48th, respectively. Daniela A, ranked 28th in total, displays the best performance at level 1, mf,
ranked 3rd, displays the best performance at levels 10 and 11, while wyzJack STU, ranked 6th, displays
the best performance at level 12, which contains the vast majority of the series requested to be forecast.
We can therefore conclude that, depending on the aggregation level, different forecasting methods are more
appropriate and, as the literature suggests, there are indeed “horses for courses” (Petropoulos et al., 2014).
Thus, depending on the forecasting task and the nature of the data, different forecasting methods should
be used to support decisions and optimize forecasting performance at different aggregation levels.
We also find that the accuracy of the top-performing methods deteriorates at lower aggregation levels,
as uncertainty increases when forecasting more disaggregated data where sales are volatile and patterns like
trend and seasonality are difficult to capture (Kourentzes et al., 2014b). This finding can be better visualized
in Figure 2, which presents the distribution of WRMSSE for the top 50 performing teams per aggregation
level along with the accuracy of ES bu. As seen, although the top benchmark is outperformed by all teams
at levels 1 to 9, the improvements reported for the rest of the levels are less significant, with some teams
performing even worse than the benchmark. For example, the average improvement of the methods over the
benchmark is 40% at level 1, which drops to about 23% at levels 5, 6, and 7, and reaches 3% at levels 10, 11,
and 12. Therefore, we can conclude that the gains of the top-performing methods mainly refer to the top
and middle parts of the hierarchies, and are rather limited in terms of WRMSSE at product, product-State,
and product-store levels.
In order to further investigate the differences reported between the top 50 submissions, as well as the
top-performing benchmark, we employ the multiple comparisons with the best (MCB) test (Koning et al.,
2005). The test computes the average ranks of the forecasting methods according to RMSSE across the
complete dataset of the competition and concludes whether or not these are statistically different. Figure
3 presents the results of the analysis. If the intervals of two methods do not overlap, this indicates a
statistically different performance. Thus, methods that do not overlap with the gray interval of the figures
are considered significantly worse than the best, and vice versa.

[Figure 2 appears here.]
Figure 2: Forecasting accuracy (WRMSSE) of the top 50 performing teams of the M5 Accuracy competition. The results are reported per aggregation level and box-plots are used to display the distribution of the average errors recorded for the examined methods (minimum value, 1st quartile, median, 3rd quartile, maximum value, and outliers, noted with black dots). The red dots indicate the performance of the top-performing benchmark of the competition (ES bu), while the green dots the performance of the winning team (YJ STU).
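A sketch of how such a plot can be constructed is given below. It follows the Nemenyi-style construction that is common in the forecasting literature (mean ranks with a critical-distance interval) and should be read as an approximation of, rather than a reimplementation of, the Koning et al. (2005) procedure.

```python
# Nemenyi-style MCB sketch: mean RMSSE ranks per method with critical-distance intervals.
# `errors` is an (n_series x n_methods) array of RMSSE values; lower rank means better.
import numpy as np
from scipy.stats import rankdata, studentized_range


def mcb_intervals(errors, alpha=0.05):
    n_series, n_methods = errors.shape
    ranks = np.apply_along_axis(rankdata, 1, errors)   # rank the methods within each series
    mean_ranks = ranks.mean(axis=0)
    # Critical distance; a very large df approximates the asymptotic studentized-range quantile.
    q = studentized_range.ppf(1 - alpha, n_methods, 1e6) / np.sqrt(2)
    cd = q * np.sqrt(n_methods * (n_methods + 1) / (6 * n_series))
    return mean_ranks, mean_ranks - cd / 2, mean_ranks + cd / 2
```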
As seen, teams SHJ (ranked 8th), DenisKokosinskiy STU (ranked 10th), and XueWang (ranked 11th)
provide significantly better forecasts than the rest of the examined methods, and are more accurate for the
majority of the series. Note also that apart from mf (ranked 3rd in total), none of the five winning teams
performs as well as SHJ, while wyzJack STU, which ranked 1st at level 12 according to WRMSSE,
also displays a significantly worse performance. Based on this observation, we conclude that the winning
teams developed methods that mostly focused on expensive and fast-moving products for which WRMSSE
is minimized, thus providing less accurate results for the rest of the series which are probably less valued
by the company. This highlights that the objective of the competition (producing accurate forecasts across
all aggregation levels and especially for high-valued series), expressed through the accuracy measure used,
was critical for determining the winning submissions and optimizing their parameters. Consequently, we
find that, in such weighting settings, the “best” forecasts depend on the accuracy measure used (Kolassa,
2020), especially when developing flexible ML methods whose loss function can be adjusted accordingly to
optimize forecasts based on the selected measure.
[Figure 3 appears here: mean RMSSE ranks (roughly 22 to 30) of the top 50 teams plus the ES bu benchmark, ordered from best to worst.]
Figure 3: Average ranks and 95% confidence intervals of the top 50 performing teams of the M5 Accuracy competition, plus the
top-performing benchmark (ES bu) over all series: multiple comparisons with the best (RMSSE used for ranking the methods)
as proposed by Koning et al. (2005). The overall rank of the teams in terms of WRMSSE is displayed to the left of their names.
Finally, we investigate the impact of the forecasting horizon’s length on the accuracy achieved by the top
50 performing methods of the competition. To do so, we first compute the weighted root squared scaled error
(WRSSE) of these methods for each forecasting horizon and series separately and then aggregate the results
per aggregation level and horizon. A summary of the results is presented in Figure 4. As seen, although
in most of the cross-sectional levels the error remains rather constant, and is even slightly reduced in
some cases, this is not true for the lowest aggregation levels (10, 11, and 12), where the accuracy significantly
deteriorates as the forecasting horizon increases. This finding is closely related to the characteristics
that the series of each level display: At the higher levels, trend and seasonality dominate randomness and,
therefore, uncertainty does not significantly affect forecasting accuracy, at least for the relatively short
forecasting horizon that the competition considered (28 days). On the other hand, at lower aggregation
levels, intermittency, erraticness, and lack of trend increase the uncertainty and negatively affect forecasting
performance. Also, in many aggregation levels, and especially at the lowest ones, the errors display some
sort of periodicity (larger errors are observed during the weekends), indicating that part of the seasonality
present in the data was not appropriately captured by the forecasting methods of the competition, even the
top-ranked ones.
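The per-series, per-horizon scores underlying Figure 4 can be sketched as follows; the scaling term is the same in-sample Naive MSE used by RMSSE, and the weighting and aggregation per level proceed as for WRMSSE.

```python
# Per-horizon scaled errors sketch (used for the Figure 4 analysis).
import numpy as np


def per_horizon_rsse(train, actual, forecast):
    """Return a length-28 vector of root squared scaled errors, one per forecast horizon."""
    train = np.asarray(train, dtype=float)
    train = train[np.argmax(train != 0):]                # active selling period only
    scale = np.mean(np.diff(train) ** 2)                 # same denominator as RMSSE
    return np.sqrt((np.asarray(actual) - np.asarray(forecast)) ** 2 / scale)
```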
[Figure 4 appears here.]
Figure 4: Forecasting horizon length’s impact on forecasting accuracy. Top-left: Forecasting accuracy (WRSSE) of the top 50
performing methods of the competition per forecasting horizon for the top level of the dataset. The blue line represents LOESS
(locally estimated scatterplot smoothing). Top-right: Similar to the Top-left figure, but this time the results are reported for
the middle aggregation level of the dataset (State-category). Bottom-left: Similar to the Top-left figure, but this time the
results are reported for the lowest aggregation level of the dataset (product-store); Bottom-right: The average error of all top
50 performing methods across all 12 aggregation levels and forecasting horizons.
4.2. Winning submissions
Unfortunately, as previously mentioned, a very limited number of teams that participated in the M5
Accuracy competition were willing to share with the organizers and the Kaggle community the description
of their methods, and even fewer were willing to share their code. Although the organizers tried to reach at least the top 50 performing teams of the competition through e-mails (a template for describing the main features of the utilized methods was provided), such information was obtained for just 17 of them, either by receiving a direct reply or by observing the public discussions and notebooks posted by these teams on Kaggle. Nevertheless, we still believe that there are many lessons to be learned from these methods, as they all provided significantly more accurate forecasts than the benchmarks considered and than thousands of other participating teams.
Before presenting the five winning methods of the competition, we should first note that most of the methods examined utilized LightGBM (https://lightgbm.readthedocs.io/en/latest/index.html), an ML algorithm for performing non-linear regression using gradient boosted trees (Ke et al., 2017). LightGBM displays several advantages over other ML alternatives in forecasting tasks like the one considered by the M5 Accuracy competition: it allows the effective handling of multiple features (e.g., past sales and exogenous/explanatory variables) of various types (numeric, binary, and categorical), is fast to train compared to typical gradient boosting (GBM) implementations, does not require extensive data pre-processing and transformations, and requires the optimization of only a relatively small number of parameters (e.g., learning rate, number of iterations, maximum number of bins that feature values will be bucketed in, number of estimators, and loss function). In this regard, LightGBM is very convenient for experimenting with and for developing solutions that generalize accurately across a large number of cross-correlated series. In fact, LightGBM has arguably become the standard method of choice in Kaggle's recent forecasting competitions, given that the winners of the “Corporación Favorita Grocery Sales Forecasting” and “Recruit Restaurant Visitor Forecasting” competitions built their approaches on this method (Bojer & Meldgaard, 2020) and that the discussions and notebooks posted on Kaggle for the M5 Accuracy competition focused on LightGBM implementations and variants of such approaches.
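To illustrate this kind of setup, the following minimal sketch fits a single LightGBM model with the Tweedie objective on a long-format table of product-store-day rows. The tables `train` and `valid`, the feature names, and the parameter values are hypothetical placeholders chosen for illustration, not the pipelines of any participating team.

```python
import lightgbm as lgb

# `train` and `valid` are assumed pandas DataFrames with one row per
# product-store-day, a numeric target `sales`, and mixed-type features;
# categorical columns are assumed to be of pandas "category" dtype.
features = ["item_id", "store_id", "event_type", "sell_price", "day_of_week", "lag_28"]
categorical = ["item_id", "store_id", "event_type", "day_of_week"]

train_set = lgb.Dataset(
    train[features], label=train["sales"], categorical_feature=categorical
)

params = {
    "objective": "tweedie",         # suited to non-negative, zero-inflated sales
    "tweedie_variance_power": 1.1,  # between Poisson (1) and Gamma (2)
    "learning_rate": 0.05,
    "num_leaves": 128,
    "metric": "rmse",
}

model = lgb.train(params, train_set, num_boost_round=1000)
preds = model.predict(valid[features]).clip(0)  # non-negative point forecasts
```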
The forecasting methods of the five winning teams can be summarized as follows:
•First place (YJ STU; YeonJun Im): The winner of the competition, a senior undergraduate student at a South Korean university, considered an equally weighted combination (arithmetic mean) of various LightGBM models that were trained to produce forecasts for the product-store series using data per store (10 models), store-category (30 models), and store-department (70 models). Two variations were considered for each type of model, the first applying a recursive and the second a non-recursive forecasting approach (Bontempi et al., 2013); the two strategies are sketched after this list. In this respect, a total of 220 models were built and each series was forecast using the average of 6 models, each one exploiting a different learning approach and train set. The models were optimized without considering early stopping and by minimizing the negative log-likelihood of the Tweedie distribution (Zhou et al., 2020), which is considered an effective approach when dealing with data that have a probability mass at zero and a non-negative, highly right-skewed distribution. The method was fine-tuned using the last four 28-day-long windows of available data for CV and by measuring both the mean and the standard deviation of the errors produced by the individual models and their combinations. That way, the final solution was chosen so that it provided both accurate and robust forecasts. Regarding the features used, the models considered various identifiers, calendar-related information, special events, promotions, prices, and unit sales data, both in a recursive and a non-recursive format.
•Second place (Matthias; Matthias Anderer): This method was also based on an equally weighted combination of various LightGBM models; however, their forecasts were externally adjusted through multipliers according to the forecasts produced by N-BEATS, a deep-learning NN for time series forecasting (Oreshkin et al., 2019), for the top five aggregation levels of the dataset. Essentially, LightGBM models were first trained per store (10 models) and then five different multipliers were used to adjust their forecasts and properly capture the trend. In this regard, a total of 50 models were built and each series of the product-store level of the dataset was forecast using a combination of five different models. The loss function used was a custom, asymmetric one. The last four 28-day-long windows of available data were used for CV and model building. The LightGBM models were trained using only some basic features about calendar effects and prices (past unit sales were not considered), while the N-BEATS model was based solely on historical unit sales.
•Third place (mf; Yunho Jeon & Sihyeon Seong): This method involved an equally weighted combination of 43 deep-learning NNs (Salinas et al., 2020), each consisting of multiple LSTM layers that were used to recursively predict the product-store series. Of the models trained, 24 considered dropout, while the remaining 19 did not. Note that these models originated from just 12 base models and corresponded to the last, most accurate instances observed for these models during training, as specified through CV (last fourteen 28-day-long windows of available data). Similar to the winner, the method considered Tweedie regression, modified, however, to optimize weights based on sampled predictions instead of actual values. The Adam optimizer was used, with cosine annealing for the learning rate schedule. The NNs considered a total of 100 features of similar nature to those of the winning submission (sales data, calendar-related information, prices, promotions, special events, identifiers, and zero-sales periods).
•Fourth place (monsaraida; Masanori Miyahara): This method produced forecasts for the
product-store series of the dataset using non-recursive LightGBM models, trained per store (10 mod-
els). However, in contrast to the rest of the methods, each week of the forecasting horizon was forecast
separately using a different model (4 models per store). Thus, a total of 40 models were built to
produce the forecasts. The features used as inputs were similar to those of the winning submission,
with the exception of the recursive ones. Tweedie regression was considered for training the models,
with no early stopping, and no optimization was performed in terms of training parameters. The last
five 28-day-long windows of available data were used for CV.
•Fifth place (Alan Lahoud; Alan Lahoud): This method considered recursive LightGBM models, trained per department (7 models). After producing the forecasts for the product-store series, these were externally adjusted so that the mean of each series at the store-department level was the same as that of the previous 28 days. This was done using appropriate multipliers. The models were trained using Poisson regression with early stopping and validated using a random sample of 500 days. The features used as input were similar to those of the winning submission.
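As referenced in the first-place description above, recursive models feed their own predictions back as lag features, whereas non-recursive (direct) models rely only on lags that are still observed at prediction time (Bontempi et al., 2013). The sketch below contrasts the two strategies for a single daily series; the models are assumed to expose a scikit-learn-style `predict`, and the lag choices are illustrative rather than those of any submission.

```python
import numpy as np

def recursive_forecast(model, history, horizon=28, lags=(1, 7, 28)):
    """Recursive strategy: a single model whose own predictions are fed back
    into the lag features as the horizon unfolds."""
    y = list(history)
    for _ in range(horizon):
        x = np.array([[y[-l] for l in lags]])
        y.append(max(float(model.predict(x)[0]), 0.0))  # keep sales non-negative
    return y[-horizon:]

def direct_forecast(models, history, horizon=28, lags=(28, 35, 42)):
    """Non-recursive (direct) strategy: one model per step (or per week),
    trained only on lags that are already observed at forecast time."""
    history = list(history)
    x = np.array([[history[-l] for l in lags]])
    return [max(float(m.predict(x)[0]), 0.0) for m in models[:horizon]]
```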
Regarding the rest of the top 50 performing methods for which a method description was available, we should mention that almost all of them adopted approaches similar to that of the winning submission, training recursive and non-recursive LightGBM models per store, department, or store-department. The main exceptions were: N60610, ranked 16th, who predicted the product-store series of the dataset using both LightGBM and a Kalman filter and selected the most appropriate approach per series; MonashSL STU, ranked 17th, who used an equal-weighted combination of LightGBM and a Pooled Regression Model; Nodalpoints, ranked 21st, who employed a weighted combination of LightGBM and NNs trained across all series or per store; and Astral, ranked 36th, who considered a non-recursive Prophet-like model that mixes classical statistics practices with non-linear optimization ML techniques, namely XGBoost and LightGBM.
4.3. Key findings
Below is a summary of the findings related to the performance of the top five methods:
Finding 1: The superiority of simple ML methods. For many years it has been empirically found that simple methods are as accurate as complex or statistically sophisticated ones (Makridakis et al., 2020e). Limited data availability, inefficient algorithms, the need for preprocessing, and restricted computational power were just some of the factors that held back the accuracy of ML methods in comparison to statistical ones (Makridakis et al., 2018b). M4 was the first forecasting competition to identify two ML methods that were significantly more accurate than simple, statistical ones, highlighting the potential value of ML approaches for more accurate forecasting (Makridakis et al., 2020e). The method that won the M4 competition was a hybrid approach that mixed recurrent NNs and exponential smoothing (Smyl, 2020), while the method ranked 2nd used XGBoost to optimally weight the forecasts produced by standard time series forecasting methods (Montero-Manso et al., 2020). Although both of the M4 winning submissions were ML in nature, they built on statistical, series-specific functionalities, while also being similarly accurate to a simple combination (median) of four statistical methods (Petropoulos & Svetunkov, 2020). M5 is, therefore, the first competition where all top-performing methods were both “pure” ML ones and significantly better than all statistical benchmarks and their combinations. LightGBM proved that it can be used effectively to process numerous, correlated series and exogenous/explanatory variables and reduce forecast error. Moreover, deep learning methods like DeepAR and N-BEATS, providing advanced, state-of-the-art ML implementations, have shown forecasting potential, motivating further research in this direction.
Finding 2: The value of combining. The M5 Accuracy competition confirmed the findings of the previ-
ous four M competitions and those of numerous other studies, suggesting that combining forecasts of different
methods, even relatively simple ones (Petropoulos & Svetunkov, 2020), results in improved accuracy. The
winner of the M5 Accuracy competition employed a very simple, equal-weighted combination of 6 models, each one exploiting a different learning approach and train set. Similarly, the runner-up utilized an equal-weighted combination of 5 models, each one based on a different estimate of the trend, while the third best performing method was an equal-weighted combination of 43 NNs. Simple combinations of models were also reported for the methods ranked 14th, 17th, 21st, 24th, 25th, and 44th. Of these combination approaches, only the one ranked 25th considered unequally weighting the individual methods. The value of combining is also
supported by the comparisons made between the benchmarks of the competition. As shown in Appendix
B, the combination of exponential smoothing and ARIMA models performed better than the individual
methods, while the combination of a top-down and bottom-up reconciliation method outperformed both
top-down and bottom-up.
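For completeness, the equal-weighted combinations discussed above reduce to a simple average of the member forecasts. A minimal sketch with an assumed array layout (models × series × horizon) follows; the placeholder values are for illustration only.

```python
import numpy as np

# Assumed layout: point forecasts of the member models stacked along axis 0.
forecasts = np.random.rand(6, 30490, 28)      # placeholder values

combined = forecasts.mean(axis=0)             # equal-weighted (arithmetic mean)
median_comb = np.median(forecasts, axis=0)    # robust alternative (Petropoulos & Svetunkov, 2020)
```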
Finding 3: The value of “cross-learning”. In the previous M competitions, most of the series were uncorrelated, of different frequencies and domains, and chronologically unaligned. Therefore, although both of the top-performing submissions of M4 considered “cross-learning”, i.e., learning from multiple series concurrently instead of one series at a time, their approach was difficult to implement effectively in practice and did not demonstrate the full potential of the concept. In contrast, since the M5 consisted of aligned, highly correlated series structured in a hierarchical fashion, “cross-learning” was much easier to apply, achieving superior results when compared to methods that were trained in a series-by-series fashion. Note that, apart from
resulting in more accurate forecasts, “cross-learning” implies the use of a single model instead of multiple
ones, each trained on the data of a different series, thus reducing overall computational cost and mitigating
difficulties related to limited historical observations. Essentially, all top 50 performing methods in M5 uti-
lized “cross-learning”, exploiting all the information being offered by the dataset.
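In practice, “cross-learning” of the kind described above typically amounts to pooling all series into one long-format table, keeping the identifiers as categorical features, and fitting a single global model on it. A minimal sketch is shown below; the wide table `sales_wide` (identifier columns plus one column per day) and its column names follow the layout of the publicly released M5 files but should be treated as assumptions.

```python
import pandas as pd

# `sales_wide`: one row per product-store series, identifier columns plus one
# column per day (d_1, d_2, ...), as in the released M5 sales file (assumed).
long = sales_wide.melt(
    id_vars=["item_id", "dept_id", "cat_id", "store_id", "state_id"],
    var_name="d", value_name="sales",
)
long["d"] = long["d"].str.replace("d_", "", regex=False).astype(int)
long = long.sort_values(["item_id", "store_id", "d"])

# Lag features are computed within each series, but a single ("global") model is
# then fit on the pooled table, learning across all 30,490 series at once.
grp = long.groupby(["item_id", "store_id"])["sales"]
long["lag_28"] = grp.shift(28)
long["roll_mean_7"] = grp.transform(lambda s: s.shift(28).rolling(7).mean())
```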
Finding 4: The significant differences between the winning methods and benchmarks used for
sales forecasting. As noted, the M5 Accuracy competition considered 24 benchmarks of various types
that are typically used in sales forecasting applications, including traditional and state-of-the-art statistical
methods, ML methods, and combinations. As shown in Figure 3 and Table 3, the winning submissions
provided significantly more accurate forecasts in terms of ranks when compared to these benchmarks, and
were also, on average, more than 20% better in terms of WRMSSE. Although the differences were smaller at the lower aggregation levels, the results clearly demonstrate their superiority and motivate additional research in the area of ML forecasting methods that can be used to predict complex, non-linear relationships between the series, as well as to include exogenous/explanatory variables.
Finding 5: The beneficial effect of external adjustments. Forecast adjustments are typically used
when forecasters exploit external information, as well as inside knowledge and their expertise to improve
forecasting accuracy (Davydenko & Fildes, 2013). Such adjustments were considered in the M2 competition
where it was found that they did not improve the forecasts of pure statistical methods (Makridakis et al.,
1993). In the M5 Accuracy competition, some of the top-performing methods, namely the ones ranked 2nd
and 5th, utilized such adjustments in the form of multipliers to enhance the forecasts derived by the ML
models. Although these adjustments were not based on judgment, but rather on the analytical alignment of the forecasts produced at the lowest aggregation levels with those at the higher ones, they proved to be beneficial, possibly helping the forecasting models reduce bias and better account for the long-term trends that are easier to observe at higher aggregation levels (Kourentzes et al., 2014b). Even though the actual value of such adjustments requires further investigation, the concept of reconciling forecasts produced at different aggregation levels is not new in the field of forecasting, with numerous studies empirically proving its benefits, especially when forecasts and information from the complete hierarchy are exploited (Hyndman et al., 2011; Spiliotis et al., 2020c).
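A minimal sketch of the multiplier-based adjustment described above (our own illustration, not any team's actual code): bottom-level forecasts are rescaled so that their daily totals match a trusted forecast of a higher aggregation level, e.g., one produced by N-BEATS or taken from recent history.

```python
import numpy as np

def align_to_top(bottom_forecasts, top_reference):
    """Scale bottom-level forecasts so that their daily sum matches a trusted
    top-level forecast (one multiplier per day; totals assumed positive)."""
    bottom_forecasts = np.asarray(bottom_forecasts, dtype=float)  # (n_series, horizon)
    top_reference = np.asarray(top_reference, dtype=float)        # (horizon,)
    multipliers = top_reference / bottom_forecasts.sum(axis=0)
    return bottom_forecasts * multipliers  # broadcasts across series
```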
Finding 6: The value added by effective CV strategies. As discussed in section 3.1, when deal-
ing with complex forecasting tasks, adopting effective CV strategies is critical for objectively simulating
post-sample accuracy, avoiding over-fitting, and mitigating uncertainty. The importance of adopting such
strategies is demonstrated by the results of the M5 Accuracy competition, indicating that a significant num-
ber of teams failed to select the most accurate set of forecasts from those submitted while the competition
was still running (see section 3.6). Yet, various CV strategies can be adopted and, based on their design,
different conclusions can be drawn. Selecting the time period during which the CV will be performed, the
size of the validation windows, the way these windows will be updated, and the criteria that will be used to
summarize forecasting performance, are just some of the factors that forecasters have to take into account.
In the M5 Accuracy competition, the top four performing methods and the vast majority of the top 50
submissions considered a CV strategy where at least the last four 28-day-long windows of available data
were used to assess forecasting accuracy, thus providing a reasonable approximation of post-sample perfor-
mance. What the winner did in addition to this CV scheme was measure both the mean and the standard deviation of the errors of the models he had developed. According to his validations, the recursive models of his approach were found to be, on average, more accurate than the non-recursive ones, but also less stable. As such, he decided to combine the two types of models to make sure that the forecasts produced would be both accurate and stable. Spiliotis et al. (2019a) stressed the necessity of accounting for the full distributions of
forecasting errors and especially their tails when evaluating forecasting methods, indicating that robustness
is a prerequisite for achieving high accuracy. We hope that the results of M5 will encourage more research
in this area and contribute to the development of more powerful CV strategies.
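A minimal sketch of such a validation scheme, under our own naming: the last few 28-day windows are held out in turn, and each candidate model or ensemble is summarized by both the mean and the standard deviation of its window errors, so that a robust choice favors low values of both.

```python
import numpy as np

def rolling_origin_windows(n_obs, horizon=28, n_windows=4):
    """Yield (train_end, test_slice) pairs for the last `n_windows`
    non-overlapping 28-day validation windows, most recent last."""
    for k in range(n_windows, 0, -1):
        train_end = n_obs - k * horizon
        yield train_end, slice(train_end, train_end + horizon)

def score_candidate(errors_per_window):
    """Summarize a candidate by the mean and dispersion of its window errors,
    e.g., errors_per_window = [error in window 1, ..., error in window 4]."""
    errors = np.asarray(errors_per_window, dtype=float)
    return errors.mean(), errors.std(ddof=1)
```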
Finding 7: The importance of exogenous/explanatory variables. Time series methods are typically sufficient for identifying and capturing the historical patterns of the data (level, trend, and seasonality) and producing accurate forecasts by extrapolating such patterns. However, time series methods that rely solely on historical data fail to effectively account for the effect of holidays, special events, promotions, prices, and possibly the weather. Moreover, given that such factors can affect the values of the historical data, they can distort the time series patterns unless removed before the data is used for forecasting. In such settings, the information provided by exogenous/explanatory variables becomes of critical importance for improving forecasting accuracy (Ma et al., 2016). In the M5 Accuracy competition, all winning submissions utilized external information to improve the forecasting performance of their models. For example, monsaraida and
other top teams found that several price-related features were of significant importance for improving the
accuracy of their results. Furthermore, the importance of exogenous/explanatory variables is also supported
by the comparisons made between the benchmarks of the competition, as shown in Appendix B. For instance,
ESX, which used information about promotions and special events as exogenous variables within exponential
smoothing models, was 6% better than ES td, which employed the same exponential smoothing models but
without considering exogenous variables. The same was true in the case of the ARIMA models, where
ARIMAX was found to be 13% more accurate than ARIMA td.
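To illustrate the kind of exogenous features involved, the sketch below derives simple event, SNAP, and price features from files laid out like the publicly released M5 calendar and price data; the file and column names are taken from those files and should be treated as assumptions here, and the features themselves are only examples of the price- and event-related variables mentioned above.

```python
import pandas as pd

# Assumed to follow the layout of the released M5 files.
calendar = pd.read_csv("calendar.csv")
prices = pd.read_csv("sell_prices.csv")

# Event and SNAP indicators from the calendar.
calendar["is_event"] = calendar["event_name_1"].notna().astype(int)
calendar["snap_any"] = calendar[["snap_CA", "snap_TX", "snap_WI"]].max(axis=1)

# Price dynamics often matter more than the price level itself.
prices = prices.sort_values(["store_id", "item_id", "wm_yr_wk"])
grp = prices.groupby(["store_id", "item_id"])["sell_price"]
prices["price_change"] = grp.pct_change()
prices["price_rel_max"] = prices["sell_price"] / grp.transform("max")
```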
5. Discussion, limitations, and directions for future research
5.1. Discussion
What became clear from the M5 Accuracy competition is that ML methods have entered the mainstream
of forecasting applications, at least in the area of retail sales forecasting. The potential benefits are sub-
stantial and there is little doubt that retail firms will adopt them to improve the accuracy of their forecasts
and better support decisions related to their operations and supply chain management. More importantly,
these improvements are reported over methods which have been explicitly developed for forecasting demand
and traditionally used for many years by numerous firms as standards in retail sales forecasting.
Table 4 provides a simple comparison of Croston's method (CRO), widely used for forecasting intermittent demand data, with the sNaive, SES, ES bu, ES td, and ESX benchmarks (for more information about the benchmarks, please see Appendices A and B). As seen, sNaive (a naive method that accounts for seasonality) is on average 11.5% more accurate than CRO, but at the same time its improvements are extremely uneven across the various cross-sectional levels. The improvements start at 37.8% at the highest
aggregation level, drop to 9.7% at level 9, and become negative at levels 10, 11, and 12, with CRO being
more accurate than sNaive by 13.0%, 20.2%, and 27.0%, respectively. These results prove the value of CRO
in forecasting intermittent demand data, while also highlighting its limitations when it comes to continuous
series characterized by seasonality. A similar comparison of CRO with SES (a simple exponential smoothing
method that does not account for seasonality) further proves the value added by CRO as, in this case, SES
is on average 1.3% less accurate than CRO, providing slightly better forecasts only at level 10. Finally,
by comparing the accuracy of CRO with that of the three top-performing exponential smoothing benchmarks of
the competition (ES bu, ES td, and ESX), all capable of accounting for seasonality, we observe an average
improvement of about 28%, starting at 54% at the top level and dropping to 1.1% at the lowest, product-
store level. Note that the improvements are consistent for the three exponential smoothing methods across
all aggregation levels.
Table 4: Percentage improvements (according to WRMSSE) reported between indicative benchmarks of the competition,
namely CRO, sNaive, SES, ES bu, ES td, and ESX.
Methods compared        Aggregation level:  1      2      3      4      5      6      7      8      9      10     11     12     Average
CRO vs. sNaive 37.8% 26.4% 22.2% 31.5% 27.0% 19.2% 16.0% 14.8% 9.7% -13.0% -20.2% -27.0% 11.5%
CRO vs. SES -2.3% -2.6% -2.3% -2.0% -1.3% -2.0% -1.5% -1.7% -1.1% 1.1% 0.0% -0.6% -1.3%
CRO vs. ES bu 52.7% 43.8% 37.1% 47.4% 42.7% 38.7% 33.8% 31.6% 25.9% 6.5% 3.2% 1.2% 29.9%
CRO vs. ES td 47.8% 39.9% 28.0% 41.7% 34.1% 33.1% 26.4% 23.7% 18.5% 5.0% 2.8% 1.2% 24.7%
CRO vs. ESX 61.1% 46.0% 32.0% 51.8% 41.5% 37.3% 29.9% 26.4% 20.7% 5.2% 2.7% 1.0% 29.0%
ES td vs. ESX 25.5% 10.1% 5.6% 17.3% 11.3% 6.2% 4.7% 3.4% 2.7% 0.2% -0.1% -0.2% 5.7%
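For reference, Croston's method itself is straightforward to implement. The sketch below is a minimal version of the algorithm (our own, with an assumed smoothing parameter); the CRO benchmark of the competition may differ in its exact initialization and parameter settings.

```python
import numpy as np

def croston(y, alpha=0.1, horizon=28):
    """Croston's method (Croston, 1972): exponentially smooth the non-zero
    demand sizes and the intervals between them separately; the flat forecast
    is the ratio of the two smoothed values."""
    y = np.asarray(y, dtype=float)
    size, interval, q = None, None, 1   # q counts periods since the last demand
    for value in y:
        if value > 0:
            if size is None:            # initialize on the first observed demand
                size, interval = value, q
            else:
                size += alpha * (value - size)
                interval += alpha * (q - interval)
            q = 1
        else:
            q += 1
    if size is None:                    # series with no demand at all
        return np.zeros(horizon)
    return np.full(horizon, size / interval)
```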
It becomes evident that the majority of the improvements reported between CRO and the three top-
performing exponential smoothing benchmarks came from the ability of the latter to adequately capture
seasonality, as well as their capacity to exploit explanatory/exogenous variables. In order to separate the
effect of these two influencing factors, we compare ES td with ESX, as both of these methods employ the
same exponential smoothing models but the latter also considers some indicative explanatory/exogenous
variables. As seen, the average improvement of ESX over ES td is 5.7%, starting at 25.5% at the top level, dropping to 2.7% at level 9, and becoming negligible or marginally negative at levels 10, 11, and 12. Thus, we find that, although
external information can improve forecasting accuracy, seasonality, observed mainly at higher aggregation
levels, is the most critical factor for improving overall forecasting performance.
The improvements become much more substantial when CRO is compared to the top-performing method of the competition, as shown in Table 5. The improvements start at 77.9% at the top level, dropping to 10.8%, 7.3%, and 4.5% at levels 10, 11, and 12, respectively. However, the average improvement overall is 45.6%, with above-average improvements up to the eighth level. With such numbers, the superiority of the winning method is not likely to be disputed, particularly up to level 8, at least until new, more accurate ML methods for handling similar forecasting tasks are developed. At the same time, CRO and other standards used for forecasting intermittent demand, like SBA, TSB, ADIDA, and iMAPA (for more details about these methods, please see Appendices A and B), seem to hold some value for the low cross-sectional levels, especially at the product-store level.

Table 5: WRMSSE of the winning submission (YJ STU) and of Croston's method (CRO) per aggregation level, along with the respective percentage improvements.

Methods compared        Aggregation level:  1      2      3      4      5      6      7      8      9      10     11     12     Average
Winning team            0.199  0.310  0.400  0.277  0.365  0.390  0.474  0.480  0.573  0.966  0.929  0.884  0.520
CRO                     0.900  0.915  0.923  0.909  0.971  0.941  0.987  0.940  0.983  1.083  1.002  0.926  0.957
Improvement             77.9%  66.1%  56.7%  69.6%  62.4%  58.6%  52.0%  49.0%  41.7%  10.8%  7.3%   4.5%   45.6%
Tables 4 and 5 provide us with useful information. The ML method employed by the winning team of the competition, which was based on the LightGBM algorithm, reported substantial improvements over Croston's method, providing significantly better forecasts at the higher aggregation levels but similarly accurate ones at the product-store level, which is the hardest one to predict due to the randomness involved. A notable part of these improvements stems from the proper modeling of seasonality and the exploitation of useful explanatory/exogenous variables. The rest of the improvements can probably be attributed to the “cross-learning” approach adopted, as well as to the non-linear nature of the models exploited by the winning team. Two interesting questions that need to be answered are whether LightGBM is truly the most suitable ML method for applying “cross-learning” in such forecasting applications, and how ML methods could possibly be adjusted so that they provide significantly better forecasts than the existing approaches at both high and low aggregation levels.
The biggest advantage of the ML methods used by the top 50 performing teams was probably their versatility in accurately predicting all 30,490 product-store series of the competition concurrently using “cross-learning”, and their flexibility to be fine-tuned to the idiosyncrasies of the dataset. At the same time, few teams tried to identify and use a single, “best” model to predict the series. Instead, many alternative models were built and subsequently averaged to produce the forecasts, with the winner developing
220 different models and using six models to predict each series. Furthermore, the approaches used by the
top-performing teams to determine the most accurate forecasting method were unstructured, data-driven
ones that were based on CV strategies, thus requiring little knowledge about forecasting and statistics.
This is in contrast to the structured approaches widely used before the M4 competition, which focused
on identifying a statistical method or a combination of statistical methods that could provide the most
accurate forecasts for each series, as well as the hybrid approaches employed by the winning submissions of
M4, which mixed elements of both statistical and ML methods. The unstructured, agnostic approaches used in the M5 Accuracy competition required less experience with modeling and even less knowledge about
the forecasting application considered, mostly depending on experimentation and data mining, without the
need to understand the data itself and its characteristics. This is illustrated by the fact that the winning team was a student with little forecasting knowledge and little experience in building sales forecasting models. Yet, he managed to win the competition and outperform thousands of competitors, including experienced Kaggle grandmasters. As ML and computer science in general gain more acceptance in the field of forecasting, it would not be surprising to see domain knowledge and forecasting experience become less important in identifying and extrapolating data patterns.
In the M4 competition, the five methods that achieved the most accurate results were also among the
top five in terms of average ranks, as determined by the MCB test (Makridakis et al., 2020e). This indicated
that there was a high degree of correspondence between these two evaluation measures; i.e., the methods
that provided the most accurate forecasts on average were also the ones that most of the time provided the most accurate forecasts separately for each series. However, this has not been the case with the M5 Accuracy competition. As can be seen in Figure 3, the winning method was ranked 13th according to the MCB test, the runner-up was ranked 47th, while the third, fourth, and fifth were ranked 6th, 17th, and 29th, respectively. It seems that the top five winning methods of M5 achieved their objective by minimizing the overall WRMSSE, weighting the more expensive and fast-moving products more heavily, rather than trying to provide accurate forecasts for every single series of the competition.
As noted earlier, it remains to be seen whether these methods can be effectively adjusted to accurately
predict all the series of the competition equally, especially the ones at the most disaggregated level.
Regarding the applicability of the competition’s results, it would be interesting to know how long it will
take until LightGBM and other ML methods are accepted by academics and widely utilized in practice by
retail sales firms. In the academic world, exploring and adopting new methods does not usually take long as
information is disseminated fast through journals and conferences. Of course, the results of relevant studies
will have to be replicated by other researchers and unless they disagree with those of the M5, they will
hopefully be accepted with little delay. However, in the business world things move more slowly. First, it will take some time for practitioners to learn about the results of the M5 Accuracy competition, and they will then need to be persuaded of their superior value. Second, a software program will have to be developed, either in-house or by some consulting firm, to implement the competition's winning methods or their variants. Third, it is necessary that the software not require any special fine-tuning to produce forecasts of the same accuracy as that reported in the competition. Fourth, the computational cost of producing hundreds of thousands or even millions of forecasts on a weekly basis should not be prohibitive (Seaman, 2018). Finally, the software should be easy to integrate with the rest of the firm's systems in order to retrieve the raw data and provide the corresponding forecasts (Petropoulos, 2015). If these requirements cannot be met, then the most accurate benchmarks of the competition, which are relatively simple to implement and computationally cheap, could be used instead.
5.2. Limitations
The M5 Accuracy competition, like other empirical studies conducted in the past, provides valuable information about the accuracy of various forecasting approaches to guide academic research, as well as useful advice to practitioners on what methods to use to improve their forecasting performance and make better decisions. The value of such information, however, strongly depends on the extent to which the data used for conducting the empirical comparison is representative of reality. It is hard to argue otherwise when 100,000 time series that cover most data frequencies and various domains are utilized (Spiliotis et al., 2020a), but it is still possible even for such large datasets to differ on average from those used in particular forecasting applications, for instance those involving high-frequency data or cross-correlated series. At the same
time, when certain datasets cover a specific aspect of reality, like daily Wikipedia page visits or daily retail
sales, their findings are likely more representative of the application examined but cannot be generalized
beyond the specific area covered by the data, except to draw some general conclusions about the methods
used or how they were selected or evaluated. Therefore, there is a fundamental difference between the data
of M4, covering six different data frequencies and six different domains, and those of M5, referring to retail
sales data, structured across twelve cross-sectional levels. Although the M5 dataset refers to such a specific
forecasting application, we still believe that it will be of great interest to a large number of retail firms that
are specifically concerned with how to best forecast their daily sales and determine their inventory levels
accordingly (Seaman, 2018). This has also been the case for past Kaggle competitions that involved daily
and weekly product sales of large retail firms (Bojer & Meldgaard, 2020).
Another limitation of the M5 Accuracy competition is that it focused on the point forecast accuracy of the submitted methods, which was not directly linked to Walmart's underlying operational costs. Although
it has been empirically found that minor improvements in accuracy can lead to substantial reductions in stock
holding and higher service levels (Syntetos et al., 2010; Ghobbar & Friend, 2003; Pooya et al., 2019), making
such a connection and translating forecasting error reduction into cost savings is far from trivial. This is
because, depending on the supply chain of each company, its facilities, its holding costs, and replenishment
policies, different savings can be assumed for the same gains in accuracy. Moreover, many assumptions must
be made, particularly about the backlog costs and lost sales. Unfortunately, this detailed information was not
made available to the M5 participants, making it impossible for the organizers to evaluate the implications
of the accuracy improvements reported for the winning submissions in monetary terms. However, we hope
that the results of the competition will inspire such studies and motivate relevant research in the field.
As noted, the train and test data of the competition was made publicly available at the end of the competition, ensuring openness and objectivity by allowing everyone who wants to replicate the results of the competition, test alternative forecasting methods, and propose new, more accurate ones, to do so. This type of openness and objectivity is not possible in Kaggle competitions, where the test data is not made available
after their competition has ended and the participants are not required to reveal the methods developed, with
more than half not doing so, nor sharing their code for others to use (Bojer & Meldgaard, 2020). Contrary
to the first four M competitions, this has also been a serious problem with the M5 Accuracy competition as
only 17 of the 50 top-performing methods shared information about their forecasting approaches, even after
the organizers had sent several emails requesting this information. On the positive side, the competition's rules stated that, in order for the winners to receive their prizes, they had to reveal the method they used and make their code available so that the organizers could reproduce their results and compare the accuracy achieved to that of the originally submitted forecasts. This has been done and the forecasts have been successfully reproduced, allowing us to provide the detailed descriptions of the five top-performing methods presented in section 4.2 and to ensure that the respective code of those participants
that do not belong to a business firm will be uploaded to GitHub, from where it can be downloaded and
used for free. In addition, just as was done with the M4 competition, a special issue of the IJF, exclusively
devoted to the M5 competition, will be published presenting and discussing all its findings in detail while
also including discussion papers about all aspects of the competition.
5.3. Directions for future research
The findings of the M4 and M5 Accuracy competitions, as well as those of the latest two Kaggle sales
forecasting competitions (Bojer & Meldgaard, 2020), indicate that ML methods are becoming more accurate
than statistical ones, and therefore require a reassessment of their theoretical value and potential usage by
organizations. On the academic side, more research is needed to verify that these accuracy gains apply to other areas beyond hierarchical, retail sales forecasting, and to examine whether other ML methods are more accurate than the winning methods of these competitions.
On the practical side there is a need to determine the extra cost of running such ML methods, versus
the standard, statistical ones, and whether their accuracy improvements would justify such higher cost
(Nikolopoulos & Petropoulos, 2018; Fry & Brundage, 2020). If both concerns can be addressed, two additional ones would require further investigation. The first relates to understanding how ML methods create their forecasts and account for factors like price, promotions, and special events. Managers are typically unwilling to make decisions when they cannot understand the logic of the methods they are going to use. This is a big problem affecting all ML models that will eventually need to be solved. Until then, some interim solution must be found by comparing the forecasts of ML methods to those of known benchmarks, as shown in Tables 4 and 5, allowing managers to indirectly determine the contribution of each factor. The second concern relates to which and how many ML models will have to be combined to achieve such improvements in accuracy. Perhaps, instead of developing ensembles of hundreds of models, eliminating the worst ones, or those less likely to improve forecasting performance, could improve overall accuracy without excessive computational cost.
Another alternative to be further explored is the concept of “horses for courses” (Petropoulos et al.,
2014), i.e., the idea that different methods could potentially be used to forecast the various aggregation
levels separately based on their corresponding performances. If Daniela’s (ranked 28th) method was used
to forecast the top level, followed by Matthias (ranked 2nd) for levels 2 to 6, YJ STU for levels 7 to
9 (ranked 1st), mf for levels 10 and 11 (ranked 3rd) and wyzJack STU for level 12 (ranked 6th), the
overall accuracy of the winning submission would have been improved by an additional 2.3%. Such a
selective approach, however, would have required that the best performing method at each level be effectively
identified beforehand. Moreover, it would have also required the forecasts produced by these methods to
be reconciled so that they become coherent across the various aggregation levels. This task proved much
more challenging than expected in the M5 Accuracy competition, with many teams trying to apply well-
known reconciliation methods (Hyndman et al., 2011) but failing to do so, probably due to the size of the dataset, the complexity of the underlying hierarchy, and other practical limitations (e.g., the requirement for non-negative forecasts and the additional computational cost). These insights indicate that there is a lot of potential in this particular
area of forecasting and that developing new hierarchical forecasting methods, capable of reconciling forecasts
so that the forecast error is minimized, not only on average but separately at each aggregation level, could
provide substantial accuracy improvements. The potential of such methods is highlighted if we consider that
the simple, equally weighted combination of the above-mentioned five submissions, which directly provides coherent forecasts, results on average in 2% more accurate forecasts than the winning submission, even though its forecasts are not always the best ones for each individual level, as shown in Table 6.
Table 6: WRMSSE of the winning submission (YJ STU), of the best performing submission at each aggregation level (i.e., Daniela for level 1, Matthias for levels 2 to 6, YJ STU for levels 7 to 9, mf for levels 10 and 11, and wyzJack STU for level 12), and of the simple, equally weighted combination of these five methods (COMB), along with the respective percentage improvements of COMB.

Methods compared                                  Aggregation level:  1      2      3      4      5      6      7      8      9      10     11     12     Average
Winning team 0.199 0.310 0.400 0.277 0.365 0.390 0.474 0.480 0.573 0.966 0.929 0.884 0.520
Best team per level 0.162 0.294 0.400 0.246 0.349 0.381 0.474 0.480 0.573 0.950 0.917 0.875 0.508
Combination (COMB) 0.181 0.283 0.390 0.260 0.365 0.371 0.473 0.472 0.572 0.960 0.921 0.876 0.510
Improvement of COMB over winner 8.9% 8.5% 2.5% 6.1% 0.1% 4.7% 0.3% 1.7% 0.1% 0.6% 0.9% 0.9% 2.0%
Improvement of COMB over the best team per level -11.9% 3.5% 2.5% -5.7% -4.5% 2.6% 0.3% 1.7% 0.1% -1.0% -0.5% -0.1% -0.4%
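As a baseline for the reconciliation approaches mentioned above, the sketch below shows the simplest coherent scheme, bottom-up aggregation with a 0/1 summing matrix; more elaborate methods (e.g., Hyndman et al., 2011) instead combine forecasts from all levels before mapping them back through the same matrix. The function and variable names are ours, and the toy hierarchy is purely illustrative.

```python
import numpy as np

def bottom_up(bottom_forecasts, summing_matrix):
    """Aggregate bottom-level forecasts to every node of the hierarchy with a
    0/1 summing matrix S, yielding forecasts that are coherent by construction.

    bottom_forecasts: (n_bottom, horizon); summing_matrix: (n_total, n_bottom)."""
    return summing_matrix @ bottom_forecasts

# Toy hierarchy (total = store A + store B) over a 2-day horizon.
S = np.array([[1, 1],   # total
              [1, 0],   # store A
              [0, 1]])  # store B
bottom = np.array([[3.0, 4.0],
                   [1.0, 2.0]])
coherent = bottom_up(bottom, S)   # rows: total, store A, store B
```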
6. Conclusions
It has been almost 40 years since the first M forecasting competition took place, which was the first of
its kind in a scientific field still in its infancy (Makridakis et al., 1982). At that time, there were just six
contestants pitting their methods against each other and predicting up to 1,001 time series to determine the most accurate approach which, contrary to expectations, turned out to be a simple exponential smoothing method rather than the statistically sophisticated Box-Jenkins methodology for ARIMA models. In addition, the competition
established the value of combining, proving empirically that combining the forecasts of more than one
method improved accuracy and reduced uncertainty. That was an important finding at a time when it
was advocated that a single, appropriate model existed for each time series and that such a model had
to be identified judgmentally by inspecting the characteristics of the series. The M2 (Makridakis et al.,
1993) was also a small-scale competition with five participants that took place in real-time between 1987
and 1989, so that the contestants could incorporate their judgment by adjusting the statistical forecasts
using inside company information and knowledge about the current economy. The competition discovered
that, as opposed to what was expected, human judgment did not improve the accuracy of the statistical
forecasts and combining was the most accurate way to predict the 29 series of the competition. In the M3
(Makridakis & Hibon, 2000), run in 2000, the number of time series increased substantially to 3,003 and
the number of participants grew to fifteen, covering both simple and statistically sophisticated methods,
as well as rule-based and NN ones. Still, simple methods outperformed the relatively more complex ones,
with a new simple method, called Theta (Assimakopoulos & Nikolopoulos, 2000), being the most accurate
of all on average, and forecast combinations continuing to produce more accurate results than the
individual methods being combined.
The M4 (Makridakis et al., 2020e), conducted in 2018, witnessed a dramatic increase to 100,000 time
series and 63 participants and, in addition to the accuracy competition, it also included the requirement
to estimate uncertainty by asking participants to provide the 95% prediction interval around their point
forecasts for each of the 100,000 series. As was mentioned, the M4 ended a long forecasting winter by
reversing the findings of the previous three competitions and concluding that two sophisticated methods,
using a mixture of statistical and ML concepts, outperformed all others, both in terms of accuracy and
uncertainty. The forecasting spring continued with the M5, which proved the superiority of ML methods, particularly LightGBM, with the 50 top-performing submissions improving upon the most accurate statistical benchmark by more than 14% and the top five by more than 20%. What has remained
constant in all five M competitions is the finding that combining improves forecasting accuracy. What has
changed is the finding of M1, M2, and M3 that simple statistical methods were more accurate than more
complex, sophisticated ones. In M4, only two sophisticated methods were found to be more accurate than
simple, statistical ones, with the latter dominating the top positions of the competition. On the contrary,
in the M5 all 50 top-performing methods were ML. Therefore, M5 is the first M competition where all top-
performing methods were both ML ones and significantly better than all statistical benchmarks and their
combinations. LightGBM proved that it can be used to effectively process numerous, correlated series and
exogenous/explanatory variables and reduce forecast error. Moreover, deep learning methods, like DeepAR
and N-BEATS, that provide advanced, state-of-the-art ML implementations, showed potential for further improving forecasting accuracy in hierarchical retail sales applications.
In summary, the M5 Accuracy competition provided the forecasting community with the following three,
new, important findings:
•The superior accuracy of the LightGBM method for predicting hierarchical retail sales that resulted
in substantial improvements over the benchmarks considered.
•The benefits of external adjustments that some methods utilized to improve the forecasting accuracy
of the baseline forecasting models.
•The importance of exogenous/explanatory variables to improve the forecasting accuracy of time series
methods.
In addition, M5 reaffirms the value of the following three findings of the previous M competitions in
improving forecasting accuracy:
•Combining
•“Cross-learning”
•Cross-validation
The superior performance of statistical methods over ML ones found by Makridakis et al. (2018b), as well as in the early Kaggle competitions (Bojer & Meldgaard, 2020), first shifted towards a mix of ML and statistical methods in the M4 competition, and then to exclusively ML methods in the Kaggle competitions that started in 2018 and in the M5 described in this paper. It will be of great interest to see whether ML methods continue to dominate statistical ones in future competitions, particularly for other types of data that are not exclusively related to hierarchical, retail sales applications.
Finally, what is important for the forecasting field is the integration of statistics and data science into a single discipline covering all academic aspects of forecasting and uncertainty, while also determining how to increase the Usage of Forecasting in Organizations (UFO) in practice by persuading executives of the benefits of systematic forecasting for improving their bottom line (Makridakis et al., 2020a).
References
Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: a decomposition approach to forecasting. International
Journal of Forecasting,16 , 521–530.
Athanasopoulos, G., & Hyndman, R. J. (2011). The value of feedback in forecasting competitions. International Journal of
Forecasting,27 , 845–849.
Barker, J. (2020). Machine learning in M4: What makes a good unstructured model? International Journal of Forecasting,
36 , 150–155.
Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Journal of the Operational Research Society,20 ,
451–468.
Bergmeir, C., & Benítez, J. M. (2012). Neural networks in R using the Stuttgart neural network simulator: RSNNS. Journal
of Statistical Software,46 , 1–26.
Bojer, C. S., & Meldgaard, J. P. (2020). Kaggle forecasting competitions: An overlooked learning opportunity. International
Journal of Forecasting, (pp. 1–17).
Bontempi, G., Ben Taieb, S., & Le Borgne, Y.-A. (2013). Machine learning strategies for time series forecasting. In M.-A.
Aufaure, & E. Zimányi (Eds.), Business Intelligence: Second European Summer School, eBISS 2012, Brussels, Belgium,
July 15-21, 2012, Tutorial Lectures (pp. 62–77). Berlin, Heidelberg: Springer Berlin Heidelberg.
Breiman, L. (2001). Random forests. Machine Learning,45 , 5–32.
Claeskens, G., Magnus, J. R., Vasnev, A. L., & Wang, W. (2016). The forecast combination puzzle: A simple theoretical
explanation. International Journal of Forecasting,32 , 754–762.
Croston, J. D. (1972). Forecasting and Stock Control for Intermittent Demands. Journal of the Operational Research Society,
23 , 289–303.
Davydenko, A., & Fildes, R. (2013). Measuring forecasting accuracy: The case of judgmental adjustments to sku-level demand
forecasts. International Journal of Forecasting,29 , 510–522.
Fildes, R. (2020). Learning from forecasting competitions. International Journal of Forecasting,36 , 186–188.
Fry, C., & Brundage, M. (2020). The M4 forecasting competition – A practitioner’s view. International Journal of Forecasting,
36 , 156–160.
Gardner Jr., E. S. (1985). Exponential smoothing: The state of the art. Journal of Forecasting,4, 1–28.
Ghobbar, A. A., & Friend, C. H. (2003). Evaluation of forecasting methods for intermittent parts demand in the field of
aviation: a predictive model. Computers & Operations Research,30 , 2097–2114.
Gilliland, M. (2020). The value added by machine learning approaches in forecasting. International Journal of Forecasting,
36 , 161–166.
Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting,15 ,
405–408.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O’Hara-Wild, M., Petropoulos, F., Razbash, S.,
Wang, E., & Yasmeen, F. (2020). forecast: Forecasting functions for time series and linear models. R package version 8.12.
Hyndman, R., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical
Software,26 , 1–22.
Hyndman, R. J. (2020). A brief history of forecasting competitions. International Journal of Forecasting,36 , 7–14.
Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination forecasts for hierarchical
time series. Computational Statistics & Data Analysis,55 , 2579–2589.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting,
22 , 679–688.
Hyndman, R. J., Koehler, A. B., Snyder, R. D., & Grose, S. (2002). A state space framework for automatic forecasting using
exponential smoothing methods. International Journal of Forecasting,18 , 439–454.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., & Callot, L. (2020). Criteria for
classifying forecasting methods. International Journal of Forecasting,36 , 167–177.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). Lightgbm: A highly efficient gradient
boosting decision tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett
(Eds.), Advances in Neural Information Processing Systems 30 (pp. 3146–3154). Curran Associates, Inc.
Kolassa, S. (2016). Evaluating predictive count data distributions in retail sales forecasting. International Journal of Forecasting, 32, 788–803.
Kolassa, S. (2020). Why the “best” point forecast depends on the error or accuracy measure. International Journal of
Forecasting,36 , 208–211.
Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. (2005). The M3 competition: Statistical tests of the results. International
Journal of Forecasting,21 , 397–409.
Kourentzes, N., Barrow, D. K., & Crone, S. F. (2014a). Neural network ensemble operators for time series forecasting. Expert
Systems with Applications,41 , 4235–4244.
Kourentzes, N., Petropoulos, F., & Trapero, J. R. (2014b). Improving forecasting by estimating time series structural compo-
nents across multiple frequencies. International Journal of Forecasting,30 , 291–302.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomforest. R News,2, 18–22.
Ma, S., Fildes, R., & Huang, T. (2016). Demand forecasting with high dimensional data: The case of sku retail sales forecasting
with intra- and inter-category promotional information. European Journal of Operational Research,249 , 245–257.
Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting,9, 527–529.
Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E., & Winkler, R.
(1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting,
1, 111–153.
Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018a). Objectivity, reproducibility and replicability in forecasting research.
International Journal of Forecasting,34 , 835–838.
Makridakis, S., Bonnell, E., Clarke, S., Fildes, R., Gilliland, M., Hoover, J., & Tashman, L. (2020a). The benefits of systematic
forecasting for organizations: The UFO project. Foresight: The International Journal of Applied Forecasting, (pp. 45–56).
Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T., Ord, K., & Simmons, L. F. (1993). The M2-competition: A
real-time judgmentally based forecasting study. International Journal of Forecasting,9, 5–22.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of
Forecasting,16 , 451–476.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018b). Statistical and machine learning forecasting methods: Concerns
and ways forward. PLOS ONE ,13 , 1–26.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020b). Predicting/hypothesizing the findings of the M4 Competition.
International Journal of Forecasting,36 , 29–36.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020c). Predicting/hypothesizing the findings of the M5 competition.
International Journal of Forecasting, (pp. 1–8). Working paper.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020d). Responses to discussions and commentaries. International Journal
of Forecasting,36 , 217–223.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020e). The M4 Competition: 100,000 time series and 61 forecasting
methods. International Journal of Forecasting,36 , 54–74.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020f). The M5 Uncertainty competition: Results, findings and conclusions.
International Journal of Forecasting, (pp. 1–24). Working paper.
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks,6, 525–533.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J., & Talagala, T. S. (2020). Fforma: Feature-based forecast model
averaging. International Journal of Forecasting,36 , 86–92.
Nikolopoulos, K., & Petropoulos, F. (2018). Forecasting for big data: Does suboptimality matter? Computers & Operations
Research,98 , 322–329.
Nikolopoulos, K., Syntetos, A. A., Boylan, J. E., Petropoulos, F., & Assimakopoulos, V. (2011). An aggregate–disaggregate
intermittent demand approach (ADIDA) to forecasting: an empirical proposition and analysis. Journal of the Operational
Research Society,62 , 544–554.
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: neural basis expansion analysis for interpretable
time series forecasting. CoRR,abs/1905.10437.
Petropoulos, F. (2015). Forecasting support systems: Ways forward. Foresight: The International Journal of Applied Fore-
casting, (pp. 5–11).
Petropoulos, F., Hyndman, R. J., & Bergmeir, C. (2018). Exploring the sources of uncertainty: Why does bagging for time
series forecasting work? European Journal of Operational Research,268 , 545–554.
Petropoulos, F., & Kourentzes, N. (2015). Forecast combinations for intermittent demand. Journal of the Operational Research
Society,66 , 914–924.
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014). ‘horses for courses’ in demand forecasting.
European Journal of Operational Research,237 , 152–163.
Petropoulos, F., & Svetunkov, I. (2020). A simple combination of univariate models. International Journal of Forecasting,36 ,
110–115.
Pooya, A., Pakdaman, M., & Tadj, L. (2019). Exact and approximate solution for optimal inventory control of two-stock with
reworking and forecasting of demand. Operational Research: An International Journal,19 , 333–346.
Sakamoto, Y., Ishiguro, M., & Kitagawa, G. (1986). Akaike Information Criterion Statistics. D. Reidel Publishing Company.
Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive
recurrent networks. International Journal of Forecasting,36 , 1181–1191.
Schwertman, N. C., Gilks, A. J., & Cameron, J. (1990). A simple noncalculus proof that the median minimizes the sum of the
absolute deviations. The American Statistician,44 , 38–39.
Seaman, B. (2018). Considerations of a retail forecasting practitioner. International Journal of Forecasting,34 , 822 – 829.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Inter-
national Journal of Forecasting,36 , 75–85.
Spiliotis, E., Kouloumos, A., Assimakopoulos, V., & Makridakis, S. (2020a). Are forecasting competitions data representative
of the reality? International Journal of Forecasting,36 , 37–53.
Spiliotis, E., Makridakis, S., Semenoglou, A.-A., & Assimakopoulos, V. (2020b). Comparison of statistical and machine learning
methods for daily SKU demand forecasting. Operational Research: An International Journal, (pp. 1–25).
Spiliotis, E., Nikolopoulos, K., & Assimakopoulos, V. (2019a). Tales from tails: On the empirical distributions of forecasting
errors and their implication to risk. International Journal of Forecasting,35 , 687–698.
Spiliotis, E., Petropoulos, F., & Assimakopoulos, V. (2019b). Improving the forecasting performance of temporal hierarchies.
PLOS ONE,14 , 1–21.
Spiliotis, E., Petropoulos, F., Kourentzes, N., & Assimakopoulos, V. (2020c). Cross-temporal aggregation: Improving the
forecast accuracy of hierarchical electricity consumption. Applied Energy,261 , 114339.
Svetunkov, I. (2020). smooth: Forecasting Using State Space Models. R package version 2.5.6.
Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting,
21 , 303–314.
Syntetos, A. A., Boylan, J. E., & Croston, J. D. (2005). On the categorization of demand patterns. Journal of the Operational
Research Society,56 , 495–503.
Syntetos, A. A., Nikolopoulos, K., & Boylan, J. E. (2010). Judging the judges through accuracy-implication metrics: The case
of inventory forecasting. International Journal of Forecasting,26 , 134–143.
Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecast-
ing,16 , 437 – 450.
Teunter, R. H., & Duncan, L. (2009). Forecasting intermittent demand: a comparative study. Journal of the Operational
37
Research Society,60 , 321–329.
Teunter, R. H., Syntetos, A. A., & Babai, M. Z. (2011). Intermittent demand: Linking forecasting to inventory obsolescence.
European Journal of Operational Research,214 , 606–615.
Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks:: The state of the art. International
Journal of Forecasting,14 , 35–62.
Zhou, H., Qian, W., & Yang, Y. (2020). Tweedie gradient boosting for extremely unbalanced zero-inflated data. Communica-
tions in Statistics - Simulation and Computation,0, 1–23.
¨
Onkal, D. (2020). M4 competition: What’s next? International Journal of Forecasting,36 , 206–207.
Appendix A: Description of benchmarks
This appendix provides information about the 24 forecasting methods selected to serve as benchmarks to
compare the accuracy of the methods submitted by the participating teams. In summary, the benchmarks
include 16 statistical (benchmarks 1 to 16), 4 ML (benchmarks 17 to 20), and 4 combination (benchmarks
21 to 24) methods.
Unless specified otherwise, the benchmarks are used to predict the product-store series (level 12) and the bottom-up method is then used to forecast the rest of the series, ensuring that the forecasts derived across the various aggregation levels are coherent. In this respect, benchmark methods 1-10, 12, 15, 17-21, and 24 utilize the bottom-up approach. In contrast, benchmark methods 11, 13, 14, 16, and 22 are used to predict total sales (level 1) and the top-down method is then used to obtain forecasts for the rest of the series (levels 2-12), using historical proportions estimated over the last 28 days of the training set. Finally, benchmark 23 is a mixture of the bottom-up and top-down approaches.
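To make the two reconciliation approaches concrete, the following is a minimal sketch in base R (the language used for the official benchmark code, although the code below is illustrative rather than the organisers' implementation). The function and argument names are hypothetical; the top-down proportions are estimated over the last 28 days of the training data, as described above.

```r
# Bottom-up: aggregate forecasts are the sums of their children's forecasts.
# `bottom_forecasts` is a matrix with one row per product-store series and one
# column per forecasted day; `groups` maps each row to an aggregate series id.
bottom_up <- function(bottom_forecasts, groups) {
  rowsum(bottom_forecasts, group = groups)
}

# Top-down: the level-1 (total sales) forecast is disaggregated to the lower
# levels using historical proportions estimated over the last 28 days of the
# training data. `total_forecast` is a vector of length h (here, h = 28) and
# `train_sales` a matrix of historical bottom-level sales (rows = series).
top_down <- function(total_forecast, train_sales, last_days = 28) {
  recent <- train_sales[, (ncol(train_sales) - last_days + 1):ncol(train_sales)]
  proportions <- rowSums(recent) / sum(recent)
  outer(proportions, total_forecast)  # one row of disaggregated forecasts per series
}
```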
1. Naive: The forecasts at time $t$, $\hat{y}_t$, are equal to the last known observation of the time series, as follows:
$\hat{y}_t = y_{t-1}$.
2. Seasonal Naive (sNaive): The forecasts at time $t$ are equal to the last known observation of the same period, $t-m$, as follows:
$\hat{y}_t = y_{t-m}$,
where $m$ is the frequency of the series. In M5, $m$ is set equal to 7 since the series are daily. Contrary to the Naive method, sNaive can capture possible seasonal variations. Although sales do not usually display strong seasonality at low cross-sectional levels, seasonality is very likely present at higher aggregation levels.
3. Simple Exponential Smoothing (SES): The simplest exponential smoothing model, aimed at predicting series without a trend (Gardner Jr., 1985). Forecasts are calculated using weighted averages that decrease exponentially across time, specified through the smoothing parameter $a$, as follows:
$\hat{y}_t = a y_t + (1 - a)\hat{y}_{t-1}$.
In an intermittent demand context, low smoothing parameter values are typically recommended in the literature (Syntetos & Boylan, 2005; Teunter & Duncan, 2009), with $a$ ranging from 0.1 to 0.3. The optimal value within this range is therefore selected by minimizing the in-sample mean squared error (MSE) of the model, which is initialized using the first observation of the series.
4. Moving Averages (MA): Moving averages are often used in practice to forecast sales (Syntetos & Boylan, 2005). Forecasts are computed by averaging the last $k$ observations of the series as follows:
$\hat{y}_t = \frac{1}{k}\sum_{i=1}^{k} y_{t-i}$.
The order $k$ of the MA ranges between 2 and 14 and is specified by minimizing the in-sample MSE of the method.
5. Croston’s method (CRO): Croston (1972) proposed forecasting intermittent demand time series by separating them into two components that are extrapolated individually: the non-zero demand sizes, $z_t$, and the inter-demand intervals, $p_t$. The forecasts are given by
$\hat{y}_t = \hat{z}_t / \hat{p}_t$
and are only updated when demand occurs. Both $z_t$ and $p_t$ are forecast by SES, originally using a smoothing parameter of 0.1 and an initial value equal to the first observation of each series. Croston’s method is regarded as the standard method for forecasting intermittent demand (a minimal sketch of this method and its variants is provided after this list of benchmarks).
6. Optimized Croston’s method (optCRO): Like CRO, but the smoothing parameter is selected from the range 0.1 to 0.3, as is done for SES, in order to allow for more flexibility. The non-zero demand sizes and the inter-demand intervals are smoothed separately using (potentially) different $a$ parameters.
7. Syntetos-Boylan Approximation (SBA): Syntetos & Boylan (2005) showed that Croston’s method is biased, with the bias depending on the value of the parameter $a$ used for smoothing the inter-demand intervals. They therefore proposed applying Croston’s method along with a debiasing factor, as follows:
$\hat{y}_t = \left(1 - \frac{a}{2}\right)\frac{\hat{z}_t}{\hat{p}_t}$.
As is done for CRO, $a$ is set equal to 0.1 and the first observations of $z_t$ and $p_t$ are used for initialization.
8. Teunter-Syntetos-Babai method (TSB): Teunter et al. (2011) showed that Croston’s method is inappropriate for dealing with obsolescence issues, since its estimates are only updated in periods of non-zero demand. They therefore proposed replacing the inter-demand interval component of Croston’s method with the demand probability, $d_t$, which is 1 if demand occurs at time $t$ and 0 otherwise. Similar to CRO, $d_t$ is forecast using SES. The forecasts are given as follows:
$\hat{y}_t = \hat{d}_t \hat{z}_t$.
9. Aggregate-Disaggregate Intermittent Demand Approach (ADIDA): Nikolopoulos et al. (2011) proposed using temporal aggregation to reduce the presence of zero observations and mitigate the undesirable effect of the variance observed in the inter-demand intervals. ADIDA uses equally sized time buckets to perform non-overlapping temporal aggregation and predict the demand over a pre-specified lead time. The time bucket is set equal to the mean inter-demand interval (Petropoulos & Kourentzes, 2015) and SES is used to obtain the forecasts (a minimal sketch of this approach is provided at the end of this appendix).
10. Intermittent Multiple Aggregation Prediction Algorithm (iMAPA): Petropoulos & Kourentzes (2015) suggested another approach for implementing temporal aggregation in demand forecasting. In contrast to ADIDA, which considers a single aggregation level, iMAPA considers multiple levels, aiming to capture different dynamics of the data. iMAPA then averages the point forecasts derived at each temporal level, in this implementation generated by SES (iMAPA originally involves a selection between Croston’s method, SBA, and SES). The maximum aggregation level is set equal to the maximum inter-demand interval.
11. Exponential Smoothing with top-down reconciliation (ES td): An algorithm is used to auto-
matically select the most appropriate exponential smoothing model for predicting total sales (level 1),
indicated through information criteria (Hyndman et al., 2002). The top-down method is then used to
forecast the rest of the series (levels 2-12).
12. Exponential Smoothing with bottom-up reconciliation (ES bu): The same algorithm used in
ES td is employed for forecasting the product-store series of the dataset (level 12). Then, the rest of
the series (levels 1-11) are predicted using the bottom-up method.
13. Exponential Smoothing with eXogenous/eXplanatory variables (ESX): Similar to ES td, but
this time two exogenous/explanatory variables are used as regressors in addition to historical data to
improve forecasting accuracy by providing additional information about the future. The first variable
is discrete and takes values 0, 1, 2 or 3, based on the number of States that allow SNAP purchases
on the examined date. The second variable is binary and indicates whether or not the examined date
includes a special event.
14. AutoRegressive Integrated Moving Average with top-down reconciliation (ARIMA td): An algorithm is used to automatically select the most appropriate ARIMA model for predicting total sales (level 1), indicated through information criteria (Hyndman & Khandakar, 2008). Then, the rest of the series (levels 2-12) are predicted using the top-down method.
15. AutoRegressive Integrated Moving Average with bottom-up reconciliation (ARIMA bu):
The same algorithm used in ARIMA td is employed for forecasting the product-store series of the
dataset (level 12). Then, the rest of the series (levels 1-11) are predicted using the bottom-up method.
16. AutoRegressive Integrated Moving Average with eXogenous/eXplanatory variables (ARIMAX): Similar to ARIMA td, but two exogenous/explanatory variables are used as regressors to improve forecasting accuracy by providing additional information about the future, exactly as done for ESX. The top-down method is used for reconciliation.
17. Multi-Layer Perceptron (MLP): A single hidden layer NN with 14 input nodes, 28 hidden nodes, and one output node, inspired by the work of Makridakis et al. (2018b) and Spiliotis et al. (2020b). The networks are trained using the standard approach of constant-size, rolling input and output windows (Smyl, 2020), and the produced one-step-ahead forecasts are used to predict all 28 periods. Scaled conjugate gradient backpropagation with weight decay (Møller, 1993) is used for estimating the weights of the network, which are randomly initialized, with a maximum of 500 iterations. The activation functions of the hidden and output layers are the logistic and linear functions, respectively. In total, 10 networks are trained for each series and the median operator is used to combine the individual forecasts in order to mitigate possible variations due to poor weight initializations (Kourentzes et al., 2014a). Due to the nonlinear activation functions used, the data are scaled to the interval [0, 1] before training to avoid computational problems, meet algorithm requirements, and facilitate faster learning (Zhang et al., 1998).
18. Random Forest (RF): This is a combination of multiple regression trees, each one depending on
the values of a random vector sampled independently and with the same distribution (Breiman, 2001).
Given that RF averages the predictions of multiple trees, it is more robust to noise and less likely to
over-fit the training data. We consider a total of 500 non-pruned trees and four randomly sampled
variables at each split. Bootstrap sampling is done with replacement. As in MLP, the last 14 observations of the series are used as inputs for training the model, with constant-size, rolling input and output windows, while the produced one-step-ahead forecasts are used to predict all 28 periods.
19. Global Multi-Layer Perceptron (gMLP): Like MLP, but instead of training multiple models, one for each series, a single model that learns across all series is constructed, given that M4 indicated the beneficial effect of “cross-learning”. Three 14-observation windows are randomly sampled from each of the product-store series of the dataset and used as inputs, along with information about the coefficient of variation of the non-zero demands and the average number of time periods between two successive non-zero demands. This additional information is used to facilitate learning across series of different characteristics.
20. Global Random Forest (gRF): Like gMLP, but instead of an MLP, an RF is used to obtain the forecasts.
21. Average of ES and ARIMA, as computed using the bottom-up approach (Com b): The
simple arithmetic mean of ES bu and ARIMA bu.
22. Average of ES and ARIMA, as computed using the top-down approach (Com t): The
simple arithmetic mean of ES td and ARIMA td.
23. Average of the two ES methods, the first computed using the top-down approach and
the second using the bottom-up approach (Com tb): The simple arithmetic mean of ES td
and ES bu.
24. Average of the global and local MLPs (Com lg): The simple arithmetic mean of MLP (the “local” method) and gMLP (the “global” method).
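The sketch referenced in the description of Croston’s method above illustrates, in base R, how CRO, SBA, and TSB (benchmarks 5, 7, and 8) can be implemented under the settings described in this appendix (SES initialised with the first observation, $a = 0.1$, flat forecasts over the 28-day horizon). It is an illustrative re-implementation with hypothetical function names, not the organisers' code.

```r
# Simple exponential smoothing: returns the final smoothed level, which also
# serves as the flat forecast; initialised with the first observation.
ses_level <- function(x, a) {
  level <- x[1]
  for (v in x[-1]) level <- a * v + (1 - a) * level
  level
}

# Croston's method (CRO) and, if sba = TRUE, the Syntetos-Boylan
# Approximation (SBA), which multiplies by the debiasing factor (1 - a/2).
croston <- function(y, a = 0.1, h = 28, sba = FALSE) {
  nz <- which(y > 0)
  if (length(nz) == 0) return(rep(0, h))   # series with no recorded demand
  z <- y[nz]                               # non-zero demand sizes
  p <- c(nz[1], diff(nz))                  # inter-demand intervals
  f <- ses_level(z, a) / ses_level(p, a)
  if (sba) f <- (1 - a / 2) * f
  rep(f, h)
}

# TSB: replaces the interval component with the demand probability, which is
# updated every period (in this sketch both components share the same a).
tsb <- function(y, a = 0.1, h = 28) {
  d <- as.numeric(y > 0)                   # demand occurrence indicator
  z <- y[y > 0]                            # non-zero demand sizes
  if (length(z) == 0) return(rep(0, h))
  rep(ses_level(d, a) * ses_level(z, a), h)
}
```

The optimized variants (SES, MA, and optCRO) would additionally select the smoothing parameter (or the moving average order) by minimizing the in-sample MSE over the ranges stated above.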
The code for implementing the benchmarks is publicly available in the M5 GitHub repository (https://github.com/Mcompetitions/M5-methods). The code developed by the organisers of the competition utilises the randomForest (Liaw & Wiener, 2002), RSNNS (Bergmeir & Benítez, 2012), smooth (Svetunkov, 2020), and forecast (Hyndman et al., 2020) packages for R.
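Similarly, the temporal aggregation underlying ADIDA (benchmark 9, referenced above) can be sketched in base R as follows: the series is aggregated into non-overlapping buckets whose size equals the (rounded) mean inter-demand interval, SES is applied to the aggregated series, and the bucket-level forecast is disaggregated equally across the original periods. Again, this is an illustrative sketch with a hypothetical function name rather than the official implementation; iMAPA (benchmark 10) repeats the same idea over multiple aggregation levels, up to the maximum inter-demand interval, and averages the resulting forecasts.

```r
adida <- function(y, a = 0.1, h = 28) {
  nz <- which(y > 0)
  if (length(nz) < 2) return(rep(mean(y), h))   # too little demand history to form buckets
  m <- max(1, round(mean(diff(nz))))            # bucket size = mean inter-demand interval
  n <- length(y)
  y <- y[(n - m * (n %/% m) + 1):n]             # trim the oldest observations so buckets align
  buckets <- colSums(matrix(y, nrow = m))       # non-overlapping temporal aggregation
  level <- buckets[1]                           # SES at the aggregate level,
  for (v in buckets[-1]) level <- a * v + (1 - a) * level  # initialised with the first bucket
  rep(level / m, h)                             # equal-weight disaggregation to daily forecasts
}
```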
Appendix B: Forecasting accuracy of benchmarks
This appendix presents the accuracy (WRMSSE) achieved by the 24 benchmarks of the M5 Accuracy
competition, as described in Appendix A. Table B presents the results both per aggregation level and
overall. Moreover, Figure B displays the results of the multiple comparisons with the best (MCB) test, which evaluates whether the differences observed between the ranks of the methods, computed using RMSSE, are statistically significant.
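To make the ranking measure concrete, the following base-R sketch computes the RMSSE of a single series, assuming the standard definition used in M5 and given in the main text: the mean squared forecast error over the 28-day horizon, scaled by the in-sample mean squared error of the one-step-ahead naive forecast. The function name is illustrative, and the dollar-sales weighting that turns RMSSE into WRMSSE is omitted.

```r
# RMSSE for one series: out-of-sample MSE of the forecasts divided by the
# in-sample MSE of the one-step naive forecast, square-rooted.
rmsse <- function(train, actual, forecast) {
  scale <- mean(diff(train)^2)              # squared one-step naive errors in-sample
  sqrt(mean((actual - forecast)^2) / scale)
}

# WRMSSE would then be a weighted average of the per-series RMSSE values,
# with weights based on cumulative dollar sales, as described in the main text.
```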
By observing Table B and Figure B we find that:
• Naive methods are significantly less accurate than the rest of the benchmarks considered for sales forecasting.
• Seasonality is critical for producing more accurate forecasts, especially at higher aggregation levels. For example, the improvements of sNaive over the Naive benchmark are about 6%, 11%, and 17% for levels 12, 11, and 10, respectively, and reach 72% for level 1 (average improvement of 52%). Similarly, ES bu, which captures possible seasonality, is on average 31% more accurate than SES.
• Methods that are typically considered superior for forecasting sales, and especially intermittent demand, do not perform significantly better than theoretically less appropriate methods.
Table B: The accuracy of the 24 benchmarks in terms of WRMSSE. The results are presented per aggregation level (columns 1-12) and overall (Average).
Method 1 2 3 4 5 6 7 8 9 10 11 12 Average
Naive 1.967 1.904 1.880 1.947 1.914 1.881 1.878 1.798 1.764 1.479 1.360 1.253 1.752
sNaive 0.560 0.673 0.718 0.623 0.708 0.760 0.829 0.801 0.888 1.223 1.205 1.176 0.847
SES 0.921 0.938 0.944 0.927 0.983 0.959 1.002 0.956 0.994 1.071 1.002 0.932 0.969
MA 0.891 0.918 0.931 0.900 0.960 0.940 0.986 0.944 0.988 1.070 1.006 0.938 0.956
CRO 0.900 0.915 0.923 0.909 0.971 0.941 0.987 0.940 0.983 1.083 1.002 0.926 0.957
optCRO 0.902 0.916 0.926 0.910 0.970 0.940 0.987 0.942 0.985 1.084 1.004 0.928 0.958
SBA 0.902 0.917 0.926 0.914 0.983 0.943 0.993 0.940 0.983 1.073 0.994 0.919 0.957
TSB 0.911 0.926 0.935 0.918 0.975 0.949 0.994 0.948 0.988 1.068 0.997 0.928 0.961
ADIDA 0.902 0.917 0.924 0.912 0.969 0.943 0.987 0.941 0.982 1.063 0.993 0.922 0.955
iMAPA 0.909 0.925 0.932 0.917 0.973 0.948 0.992 0.946 0.986 1.065 0.996 0.925 0.960
ES td 0.470 0.550 0.664 0.530 0.640 0.629 0.727 0.717 0.801 1.029 0.973 0.915 0.720
ES bu 0.426 0.514 0.580 0.478 0.557 0.577 0.654 0.643 0.728 1.012 0.969 0.915 0.671
ESX 0.350 0.494 0.627 0.438 0.567 0.590 0.692 0.692 0.779 1.026 0.974 0.917 0.679
ARIMA td 0.615 0.673 0.753 0.656 0.768 0.725 0.810 0.785 0.856 1.027 0.969 0.910 0.796
ARIMA bu 0.829 0.850 0.870 0.844 0.905 0.882 0.932 0.893 0.938 1.048 0.981 0.917 0.908
ARIMAX 0.374 0.514 0.638 0.459 0.606 0.604 0.707 0.700 0.787 1.019 0.968 0.912 0.691
MLP 0.892 0.942 0.974 0.910 0.972 0.965 1.016 0.984 1.026 1.084 1.014 0.943 0.977
RF 0.960 0.989 1.026 0.962 1.012 1.003 1.047 1.023 1.057 1.085 1.010 0.940 1.010
gMLP 0.882 0.914 0.919 0.923 0.996 0.967 1.013 0.953 0.997 1.063 0.991 0.920 0.961
gRF 1.062 1.073 1.081 1.071 1.108 1.096 1.116 1.075 1.089 1.078 1.001 0.932 1.065
Com b 0.522 0.591 0.644 0.561 0.641 0.647 0.718 0.696 0.771 1.012 0.963 0.907 0.723
Com t 0.517 0.592 0.693 0.571 0.688 0.661 0.755 0.738 0.819 1.026 0.970 0.912 0.745
Com tb 0.444 0.520 0.598 0.496 0.587 0.588 0.673 0.658 0.743 1.008 0.960 0.905 0.682
Com lg 0.886 0.922 0.936 0.898 0.959 0.948 0.989 0.948 0.986 1.058 0.989 0.921 0.953
For example, TSB, SBA, and optCRO display performance similar to that of CRO, which in turn performs on par with SES and MA.
• Combinations perform better than, or at least as well as, the individual methods they consist of.
• Combinations provide significantly better forecasts in terms of ranks, especially when the top-down and bottom-up reconciliation approaches are mixed (Com tb) or when different base forecasting models (ES and ARIMA) are combined along with the bottom-up method (Com b).
• Utilising exogenous/explanatory variables significantly improves the performance of methods that depend only on historical time series data. For example, ESX is on average 6% more accurate than ES td, while ARIMAX is 13% more accurate than ARIMA td.
[Figure B plots the mean ranks (spanning roughly 10 to 22) and 95% confidence intervals of the 24 benchmarks. The methods appear on the rank axis in the following order: Com_b, Com_tb, ES_bu, ARIMAX, ESX, ADIDA, Com_t, ES_td, SBA, ARIMA_td, CRO, ARIMA_bu, optCRO, iMAPA, Com_lg, TSB, MLP, SES, gMLP, MA, gRF, RF, Naive, sNaive.]
Figure B: Average ranks and 95% confidence intervals of the 24 benchmarks of the M5 Accuracy competition over all M5 series: multiple comparisons with the best (RMSSE used for ranking the methods), as proposed by Koning et al. (2005).
• Temporal aggregation generally provides more accurate forecasts than traditional methods used in sales forecasting, such as SES, CRO, SBA, and TSB. However, the improvements are minor in terms of WRMSSE, with ADIDA and iMAPA being at most about 1.5% more accurate on average than SES.