International Journal of Forecasting 29 (2013) 510–522
Measuring forecasting accuracy: The case of judgmental
adjustments to SKU-level demand forecasts
Andrey Davydenko, Robert Fildes
Department of Management Science, Lancaster University, Lancaster, Lancashire, LA1 4YX, United Kingdom
article info
Keywords:
Judgmental adjustments
Forecasting support systems
Forecast accuracy
Forecast evaluation
Forecast error measures
abstract
Forecast adjustment commonly occurs when organizational forecasters adjust a statistical
forecast of demand to take into account factors which are excluded from the statistical
calculation. This paper addresses the question of how to measure the accuracy of such
adjustments. We show that many existing error measures are generally not suited to the
task, due to specific features of the demand data. Alongside the well-known weaknesses
of existing measures, a number of additional effects are demonstrated that complicate the
interpretation of measurement results and can even lead to false conclusions being drawn.
In order to ensure an interpretable and unambiguous evaluation, we recommend the use
of a metric based on aggregating performance ratios across time series using the weighted
geometric mean. We illustrate that this measure has the advantage of treating over- and
under-forecasting even-handedly, has a more symmetric distribution, and is robust.
Empirical analysis using the recommended metric showed that, on average, adjust-
ments yielded improvements under symmetric linear loss, while harming accuracy in
terms of some traditional measures. This provides further support to the critical impor-
tance of selecting appropriate error measures when evaluating the forecasting accuracy.
©2012 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
1. Introduction
The most well-established approach to forecasting
within supply chain companies starts with a statistical
time series forecast, which is then adjusted by managers in
the company based on their expert knowledge. This pro-
cess is usually carried out at a highly disaggregated level
of SKUs (stock-keeping units), where there are often hun-
dreds if not thousands of series to consider (Fildes & Goodwin, 2007; Sanders & Ritzman, 2004). At the same time, the
empirical evidence suggests that judgments under uncer-
tainty are affected by various types of cognitive biases and
are inherently non-optimal (Tversky & Kahneman, 1974).
Such biases and inefficiencies have been shown to apply
specifically to judgmental adjustments (Fildes, Goodwin,
Lawrence, & Nikolopoulos, 2009). Therefore, it is impor-
tant to monitor the accuracy of judgmental adjustments in
order to ensure the rational use of the organisation’s re-
sources which are invested in the forecasting process.
The task of measuring the accuracy of judgmental
adjustments is inseparably linked with the need to choose
an appropriate error measure. In fact, the choice of an
error measure for assessing the accuracy of forecasts across
time series is itself an important topic for forecasting
research. It has theoretical implications for the comparison
of forecasting methods and is of wide practical importance,
since the forecasting function is often evaluated using
inappropriate measures (see, for example, Armstrong &
Collopy, 1992; Armstrong & Fildes, 1995), and therefore
the link to economic performance may well be distorted.
Despite the continuing interest in the topic, the choice of
the most suitable error measure for evaluating companies’
forecasts still remains controversial. Due to their statistical
properties, popular error measures do not always ensure
easily interpretable results when applied to real-world
data (Hyndman & Koehler, 2006; Kolassa & Schutz,
2007). In practice, the proportion of firms which track
the aggregated accuracy is surprisingly small, and one
apparent reason for this is the inability to agree on
appropriate accuracy metrics (Hoover, 2006). As McCarthy,
Davis, Golicic, and Mentzer (2006) reported, only 55% of
the companies surveyed believed that their forecasting
performance was being formally evaluated.
The key issue when evaluating a forecasting process is
the improvements achieved in supply chain performance.
While this has only an indirect link to the forecasting
accuracy, organisations rely on accuracy improvements as
a suitable proxy measure, not least because of their ease of
calculation. This paper examines the behaviours of various
well-known error measures in the particular context of
demand forecasting in the supply chain. We show that,
due to the features of SKU demand data, well-known error
measures are generally not advisable for the evaluation
of judgmental adjustments, and can even give misleading
results. To be useful in supply chain applications, an error
measure usually needs to have the following properties:
(i) scale independence—though it is sometimes desirable
to weight measures according to some characteristic
such as their profitability; (ii) robustness to outliers; and
(iii) interpretability (though the focus might occasionally
shift to extremes, e.g., where ensuring a minimum level of
supply is important).
The most popular measure used in practice is the mean
absolute percentage error, MAPE (Fildes & Goodwin, 2007),
which has long been criticised (see, for example,
Fildes, 1992, Hyndman & Koehler, 2006, Kolassa & Schutz,
2007). In particular, the use of percentage errors is often
inadvisable, due to the large number of extremely high
percentages which arise from relatively low actual demand
values.
To overcome the disadvantages of percentage mea-
sures, the MASE (mean absolute scaled error) measure was
proposed by Hyndman and Koehler (2006). The MASE is a
relative error measure which uses the MAE (mean abso-
lute error) of a benchmark forecast (specifically, the ran-
dom walk) as its denominator. In this paper we analyse the
MASE and show that, like the MAPE, it also has a number of
disadvantages. Most importantly: (i) it introduces a bias to-
wards overrating the performance of a benchmark forecast
as a result of arithmetic averaging; and (ii) it is vulnerable
to outliers, as a result of dividing by small benchmark MAE
values.
To ensure a more reliable evaluation of the effectiveness
of adjustments, this paper proposes the use of an enhanced
measure that shows the average relative improvement in
MAE. In contrast to MASE, it is proposed that the weighted
geometric average be used to find the average relative
MAE. By taking the statistical forecast as a benchmark,
it becomes possible to evaluate the relative change in
forecasting accuracy yielded by the use of judgmental
adjustments, without experiencing the limitations of other
standard measures. Therefore, the proposed statistic can
be used to provide a more robust and easily interpretable
indicator of changes in accuracy, meeting the criteria laid
down earlier.
The importance of the choice of an appropriate error
measure can be seen from the fact that previous studies
of the gains in accuracy from the judgmental adjustment
process have produced conflicting results (e.g., Fildes et al.,
2009, Franses & Legerstee, 2010). In these studies, different
measures were applied to different datasets and arrived at
different conclusions. Some studies where a set of mea-
sures was employed reported an interesting picture, where
adjustments improved the accuracy in certain settings ac-
cording to MdAPE (median absolute percentage error),
while harming the accuracy in the same settings accord-
ing to MAPE (Fildes et al., 2009; Trapero, Pedregal, Fildes,
& Weller, 2011). In practice, such results may be damaging
for forecasters and forecast users, since they do not give a
clear indication of the changes in accuracy that correspond
to some well-known loss function. Using real-world data,
this paper considers the appropriateness of various previ-
ously used measures, and demonstrates the use of the pro-
posed enhanced accuracy measurement scheme.
The next section describes the data employed for the
analysis in this paper. Section 3 illustrates the disadvan-
tages and limitations of various well-known error mea-
sures when they are applied to SKU-level data. In Section 4,
the proposed accuracy measure is introduced. Section 5
contains the results from measuring the accuracy of judg-
mental adjustments with real-world data using the alter-
native measures and explains the differences in the results,
demonstrating the benefits of the proposed enhanced ac-
curacy measure. The concluding section summarises the
results of the empirical evaluation and offers practical rec-
ommendations as to which of the different error measures
can be employed safely.
2. Descriptive analysis of the source data
The current research employed data collected from a
company specialising in the manufacture of fast-moving
consumer goods (FMCG). This is an extended data set
from one of the companies considered by Fildes et al.
(2009). The company concerned is a leading European
provider of household and personal care products to a wide
range of major retailers. Table 1 summarises the data set
and indicates the number of cases used for the analysis.
Each case includes (i) the one-step-ahead monthly forecast
prepared using some statistical method (this will be called
the system forecast); (ii) the corresponding judgmentally
adjusted forecast (this will be called the final forecast);
and (iii) the corresponding actual demand value. The
system forecast was obtained using an enterprise software
package, and the final forecast was obtained as a result of
a revision of the statistical forecast by experts (Fildes et al.,
2009). The two forecasts coincide when the experts had
no extra information to add. The data set is representative
of most FMCG manufacturing or distribution companies
which deal with large numbers of time series of different
lengths relating to different products, and is similar to
the other manufacturing data sets considered by Fildes
et al. (2009), in terms of the total number of time series,
the proportion of judgmentally adjusted forecasts and the
frequencies of occurrence of zero errors and zero actuals.
Since the data relate to FMCG, the numbers of cases
of zero demand periods and zero errors are not large
(see Table 1). However, the further investigation of the
properties of error measures presented in Section 3 will
512 A. Davydenko, R. Fildes / International Journal of Forecasting 29 (2013) 510–522
Table 1
Source data summary.

Total number of cases: 6882
Total number of time series (SKUs): 412
Period of observations: March 2004–July 2007
Total number of adjusted statistical forecasts (% of total number of cases): 4779 (69%)
Number of zero actual demand periods (% of total number of cases): 271 (4%)
Number of zero-error statistical forecasts (% of total number of cases): 47 (<1%)
Number of zero-error judgmentally adjusted forecasts (% of total number of adjusted forecasts): 61 (1%)
Number of positive adjustments (% of total number of adjusted forecasts): 3394 (71%)
Number of negative adjustments (% of total number of adjusted forecasts): 1385 (29%)
also consider possible situations when the data involve
small counts, and zero observations occur more frequently
(as is common with intermittent demand data).
As Table 1 shows, for this particular data set, ad-
justments of positive sign occur more frequently than
adjustments of negative sign. However, in order to
characterise the average magnitude of the adjustments,
an additional analysis is required. In their study of judg-
mental adjustments, Fildes et al. (2009) analysed the size
of judgmental adjustments using the measure of relative
adjustment that is defined as 100 × (Final forecast −
System forecast)/System forecast.
Since the values of the relative adjustments are scale-
independent, they can be compared across time series.
However, the above measure is asymmetrical. For exam-
ple, if an expert doubles a statistical forecast (say from 10
units to 20 units), he/she increases it by 100%, but if he/she
halves a statistical forecast (say from 20 units to 10 units),
he/she decreases it by 50% (not 100%). The sampling distri-
bution of the relative adjustment is bounded by 100% on
the left side and unbounded on the right side (see Fig. 1).
Generally, these effects mean that the distribution of the
relative adjustment may become non-informative about
the size of the adjustment as measured on the original
scale. When defining a ‘symmetric measure’, Mathews and
Diamantopoulos (1987) argued for a measure where the
adjustment size is measured relative to an average of the
system and final forecasts. The same principle is used in the
symmetric MAPE (sMAPE) measure proposed by Makri-
dakis (1993). However, Goodwin and Lawton (1999) later
showed that such approaches still do not lead to the desir-
able property of symmetry.
In this paper, in order to avoid the problem of the non-
symmetrical scale of the relative adjustment, we carry out
the analysis of the magnitude of adjustments using the
natural logarithm of the (Final forecast/System forecast)
ratio. From Fig. 2, it can be seen that the log-transformed
relative adjustment follows a leptokurtic distribution. As is
well known, the sample mean is not an efficient measure of
location under departures from normality (Wilcox, 2005).
We therefore used the trimmed mean as a more robust
summary measure of location. The optimal trim level
that corresponds to the lowest variance of the trimmed
mean depends on the distribution, which is unknown
in the current case. Some studies have shown that, for
symmetrical distributions, a 5% trim generally ensures a
high efficiency with a useful degree of robustness (e.g., Hill
& Dixon, 1982). However, it is also known that the trimmed
mean gives a biased estimate if the distribution is skewed
(Marques, Neves, & Sarmento, 2000). We used a 2% trim
in order to eliminate the influence of outliers while at the
same time avoiding introducing a substantial bias.
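As an illustration, the following sketch shows how such a trimmed mean of the log relative adjustment might be computed; the array names are hypothetical, and SciPy's trim_mean removes the given proportion from each tail, which is how the paper's "2% trim" is interpreted here.

```python
# A minimal sketch (hypothetical array names): trimmed mean of the log relative
# adjustment ln(Final forecast / System forecast), as summarised in Table 2.
import numpy as np
from scipy import stats

def trimmed_log_adjustment(final_forecast, system_forecast, trim=0.02):
    """Trimmed mean of ln(final/system) and its exponent (cf. Table 2)."""
    log_ratio = np.log(np.asarray(final_forecast, dtype=float) /
                       np.asarray(system_forecast, dtype=float))
    # Assumption: "2% trim" means trimming 2% of observations in each tail.
    m = stats.trim_mean(log_ratio, proportiontocut=trim)
    return m, np.exp(m)
```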
The results presented in Table 2 suggest that, on aver-
age, for the dataset under consideration, the magnitude of
positive adjustments is higher than the magnitude of nega-
tive adjustments, measured relative to the system forecast.
The average magnitude of a positive relative adjustment is
about twice as large as the average magnitude of a nega-
tive adjustment. Also, adjustments with positive signs have
much higher ranges than negative ones.
3. Appropriateness of existing measures for SKU-level
demand data
3.1. Percentage errors
Let the forecasting error for a given time period $t$ and SKU $i$ be

$$e_{i,t} = Y_{i,t} - F_{i,t},$$

where $Y_{i,t}$ is a demand value for SKU $i$ observed at time $t$, and $F_{i,t}$ is the forecast of $Y_{i,t}$.

A traditional way to compare the accuracy of forecasts across multiple time series is based on using absolute percentage errors (Hyndman & Koehler, 2006). Let us define the percentage error (PE) as $p_{i,t} = 100 \times e_{i,t}/Y_{i,t}$. Hence, the absolute percentage error (APE) is $|p_{i,t}|$. The most popular PE-based measures are MAPE and MdAPE, which are defined as follows:

$$\text{MAPE} = \operatorname{mean}\left(|p_{i,t}|\right), \qquad \text{MdAPE} = \operatorname{median}\left(|p_{i,t}|\right),$$

where $\operatorname{mean}(|p_{i,t}|)$ denotes the sample mean of $|p_{i,t}|$ over all available values of $i$ and $t$, and $\operatorname{median}(|p_{i,t}|)$ is the sample median.
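For concreteness, a minimal sketch of how these pooled measures might be computed (illustrative Python; array names are not from the original study):

```python
import numpy as np

def mape_mdape(actuals, forecasts):
    """MAPE and MdAPE over pooled observations; zero actuals are excluded."""
    y = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    keep = y != 0                                  # p_{i,t} is undefined when Y_{i,t} = 0
    ape = 100.0 * np.abs((y[keep] - f[keep]) / y[keep])
    return np.mean(ape), np.median(ape)
```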
In the study by Fildes et al. (2009), these measures
served as the main tool for the analysis of the accuracy
of judgmental adjustments. In order to determine the
change in forecasting accuracy, MAPE and MdAPE values
of the statistical baseline forecasts and final judgmentally
adjusted forecasts were calculated and compared. The
significance of the change in accuracy was assessed based
on the distribution of the differences between the absolute
percentage errors (APEs) of forecasts. The difference
between APEs is defined as

$$d^{\mathrm{APE}}_{i,t} = p^{f}_{i,t} - p^{s}_{i,t},$$

where $p^{f}_{i,t}$ and $p^{s}_{i,t}$ denote the APEs of the final and baseline statistical forecasts, respectively, for a given SKU $i$ and period $t$. It can be tested whether the final forecast APE differs statistically from the statistical forecast APE.
Fig. 1. Histogram of the relative adjustment, measured in percentages.
Fig. 2. Histogram of ln(Final forecast/System forecast).
Table 2
Summary statistics for the magnitude of adjustment, ln(Final forecast/System forecast).

Sign of adjustment   1st quartile   Median   3rd quartile   Mean (2% trim)   exp[Mean (2% trim)]
Positive              0.123          0.273    0.592          0.412            1.510
Negative             −0.339         −0.153   −0.071         −0.290           0.749
Both                  0.043          0.144    0.425          0.218            1.243
Because of the non-normal distribution of the differences, Fildes et al. (2009) tested whether the median of $d^{\mathrm{APE}}_{i,t}$ differs significantly from zero using a paired Wilcoxon signed rank test.
The sample mean of $d^{\mathrm{APE}}_{i,t}$ is the difference between the MAPE values corresponding to the statistical and final forecasts:

$$\operatorname{mean}\left(d^{\mathrm{APE}}_{i,t}\right) = \operatorname{mean}\left(p^{f}_{i,t}\right) - \operatorname{mean}\left(p^{s}_{i,t}\right) = \text{MAPE}^{f} - \text{MAPE}^{s}. \tag{1}$$

Therefore, testing the mean or median (in cases where the underlying distribution is symmetric) of $d^{\mathrm{APE}}_{i,t}$ against zero using the above-mentioned test leads to establishing whether $\text{MAPE}^{f}$ differs significantly from $\text{MAPE}^{s}$.
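A sketch of this testing procedure (not the authors' original code; names are illustrative) using SciPy's Wilcoxon signed-rank test on the paired APE differences:

```python
import numpy as np
from scipy import stats

def test_ape_difference(actuals, final_fc, system_fc):
    """Paired Wilcoxon signed-rank test on d_APE = APE(final) - APE(statistical)."""
    y = np.asarray(actuals, dtype=float)
    keep = y != 0                                  # APEs undefined for zero actuals
    ape_final = 100 * np.abs((y - np.asarray(final_fc, dtype=float)) / y)[keep]
    ape_system = 100 * np.abs((y - np.asarray(system_fc, dtype=float)) / y)[keep]
    d = ape_final - ape_system
    return stats.wilcoxon(d)                       # H0: differences symmetric about zero
```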
The results reported suggest that, overall, the value
of MAPE was improved by the use of adjustments, but
the accuracy of positive and negative adjustments differed
substantially. Based on the MAPE measure, it was found
that positive adjustments did not change the forecasting
accuracy significantly, while negative adjustments led
to significant improvements. However, percentage error
measures have a number of disadvantages when applied
to the adjustments data, as we explain below.
One well-known disadvantage of percentage errors is
that when the actual value $Y_{i,t}$ in the denominator is
relatively small compared to the forecast error $e_{i,t}$, the
resulting percentage error $p_{i,t}$ becomes extremely large,
which distorts the results of further analyses (Hyndman
& Koehler, 2006). Such high values can be treated as
outliers, since they often do not allow for a meaningful
interpretation (large percentage errors are not necessarily
harmful or damaging, as they can arise merely from
relatively low actual values). However, identifying outliers
in a skewed distribution is a non-trivial problem, where
it is necessary to determine an appropriate trimming
level in order to find robust estimates, while at the
same time avoiding losing too much information. Usually
authors choose the trimming level for MAPE based on
their experience after experimentation (for example,
Fildes et al., 2009, used a 2% trim), but this decision
still remains subjective. Moreover, the trimmed mean
gives a biased estimate of location for highly skewed
distributions (Marques et al., 2000), which complicates the
interpretation of the trimmed MAPE. In particular, for a
random variable that follows a highly skewed distribution,
the expected value of the trimmed mean differs from
the expected value of the random variable itself. This
bias depends on both the trim level and the number
of observations used to calculate the trimmed mean.
Therefore, it is difficult to compare the measurement
results based on the trimmed means for samples that
contain different numbers of observations, even when the
trim level remains the same.
SKU-level demand time series typically exhibit a high
degree of variation in actual values, due to seasonal
effects and the changing stages of a product’s life
cycle. Therefore, data on adjustments can contain a high
proportion of low demand values, which makes PE-
based measures particularly inadvisable in this context.
Considering extremes, a common occurrence in the
situation of intermittent demand is for many observations
(and forecasts) to be zero (see the discussion by Syntetos
& Boylan, 2005). All cases with zero actual values must
be excluded from the analysis, since the percentage error
cannot be computed when $Y_{i,t} = 0$, due to its definition.
The extreme percentage errors that can be obtained
can be shown using scaled values of errors and actual
demand values (Fig. 3). The variables shown were scaled
by the standard deviation of actual values in each series
in order to eliminate the differences between time series.
It can be seen that the final forecast errors have a
skewed distribution and are correlated with both the
actual values and the signs of adjustments; it is also clear
that a substantial number of the errors are comparable
to the actual demand values. Excluding observations with
relatively low values on the original scale (here, all
observations less than 10 were excluded from the analysis,
as was done by Fildes et al., 2009) still cannot improve the
properties of percentage errors sufficiently, since a large
number of observations still remain in the area where
the actual demand value is less than the absolute error.
This results in extremely high APEs (>100%), which are
all too easy to misinterpret (since very large APEs do not
necessarily correspond to very damaging errors, and arise
primarily because of low actual demand values). In Fig. 3,
the area below the dashed line shows cases in which
the errors were higher than the actual demand values.
These cases result in extreme percentage errors, as shown
in Fig. 4. Due to the presence of extreme percentages,
the distribution of APEs becomes highly skewed and
heavy-tailed, which makes MAPE-based estimates highly
unstable.
A widely used robust alternative to MAPE is MdAPE.
However, MdAPE is neither easily interpretable nor suffi-
ciently indicative of changes in accuracy when forecast-
ing methods have different shaped error distributions.
The sample median of the APEs is resistant to the influ-
ence of extreme cases, but is also insensitive to large er-
rors, even if they are not outliers or extreme percent-
ages. Comparing the accuracy using the MdAPE shows the
changes in accuracy that relate to the lowest 50% of APEs.
However, an improvement in the MdAPE can be accompanied by more damaging errors remaining above the median if the shapes of the error distributions differ. In
Section 5, it will be shown that, while the MdAPE indicates
that judgmental adjustments improve the accuracy for a
given dataset, the trimmed MAPE suggests the opposite to
be the case. Therefore, additional indicators are required in
order to be able to draw better-substantiated conclusions
with regard to the forecasting accuracy.
Apart from the presence of extreme APEs, another
problem with using PE-based measures is that they can
bias the comparison in favour of methods that issue low
forecasts (Armstrong, 1985; Armstrong & Collopy, 1992;
Kolassa & Schutz, 2007). This happens because, under
certain conditions, percentage errors put a heavier penalty
on positive errors than on negative errors. In particular,
we can observe this when the forecast is taken as fixed.
To illustrate this phenomenon, Kolassa and Schutz (2007)
provide the following example. Assume that we have
a time series that contains values distributed uniformly
between 10 and 50. If we are using a symmetrical loss
function, the best forecast for this time series would be
30. However, a forecast of 22 produces a better accuracy
in terms of MAPE. As a result, if the aim is to choose a
method that is better in terms of a linear loss, then the
values of PE-based measures can be misleading. The way
in which the use of MAPE can bias the comparison of the
performances of judgmental adjustments of different signs
will be illustrated below.
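Before that, the Kolassa and Schutz (2007) example just described is easy to reproduce numerically; the following sketch (illustrative Python, not from the cited paper) shows that the forecast of 22 wins under MAPE even though 30 is better under symmetric absolute loss.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(10, 50, size=100_000)             # demand uniform between 10 and 50

def mape(y, f):
    return 100 * np.mean(np.abs(y - f) / y)

print(mape(y, 30), mape(y, 22))                   # MAPE favours the lower forecast (22)
print(np.mean(np.abs(y - 30)), np.mean(np.abs(y - 22)))   # MAE favours 30
```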
One important effect which arises from the presence of
cognitive biases and the non-negative nature of demand
values is the fact that the most damaging positive
adjustments (producing the largest absolute errors)
typically correspond to relatively low actuals (left corner
of Fig. 3(a)), while the worst negative adjustments (pro-
ducing the largest absolute errors) correspond to higher
actuals (centre section, Fig. 3(b)). More specifically, the following general dependency can be found within most time series. The difference between the absolute final forecast error $|e^{f}_{i,t}|$ and the absolute statistical forecast error $|e^{s}_{i,t}|$ is positively correlated with the actual value $Y_{i,t}$ for positive adjustments, while there is a negative correlation for negative adjustments. To reveal this effect, distribution-free measures of the association between variables were used. For each SKU $i$, Spearman's $\rho$ coefficients were calculated, representing the correlation between the improvement in terms of absolute errors, $|e^{f}_{i,t}| - |e^{s}_{i,t}|$, and the actual value $Y_{i,t}$. Fig. 5 shows the distributions of the coefficients $\rho^{+}_{i}$, calculated for positive adjustments, and $\rho^{-}_{i}$, corresponding to negative adjustments (the coefficients can take the values $-1$ and $1$ when only a few observations are present in a series). For the given dataset, $\operatorname{mean}(\rho^{+}_{i}) \approx 0.47$ and $\operatorname{mean}(\rho^{-}_{i}) \approx -0.44$, indicating that the improvement in forecasting is markedly correlated with the actual demand values. This illustrates the fact that positive adjustments are most effective for larger values of demand, and least effective (or even damaging) for smaller values of demand. Actually, efficient averaging of correlation coefficients requires applying Fisher's $z$ transformation to them and then transforming back the result (see, e.g., Mudholkar, 1983). But here we used raw coefficients because we only wanted to show that the $\rho$ value clearly correlates with the adjustment sign.
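The per-series coefficient just described might be computed as in the following sketch (hypothetical function, called once per SKU):

```python
import numpy as np
from scipy import stats

def improvement_rho(actuals, final_fc, system_fc):
    """Spearman's rho between (|e_final| - |e_system|) and the actual values."""
    y = np.asarray(actuals, dtype=float)
    improvement = np.abs(y - np.asarray(final_fc, dtype=float)) - \
                  np.abs(y - np.asarray(system_fc, dtype=float))
    rho, _ = stats.spearmanr(improvement, y)
    return rho
```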
Because of the division by the scale factor that is correlated with the numerator, the difference of APEs (which is calculated as $d^{\mathrm{APE}}_{i,t} = 100 \times (|e^{f}_{i,t}| - |e^{s}_{i,t}|)/Y_{i,t}$) will not reflect changes in forecasting accuracy in terms of a symmetric loss function. More specifically, for positive adjustments, $d^{\mathrm{APE}}_{i,t}$ will systematically downgrade improvements in accuracy and exaggerate degradations of accuracy (on the percentage scale). In contrast, for negative adjustments, the improvements will be exaggerated, while the errors from harmful forecasts will receive smaller weights.
Fig. 3. Dependencies between forecast error, actual value, and the sign of adjustment (based on scaled data). (a) Positive adjustments; (b) negative adjustments. Axes: scaled actual demand value (horizontal) vs. scaled forecast error (vertical).
Fig. 4. Percentage errors, depending on the actual demand value and adjustment sign. (a) Positive adjustments; (b) negative adjustments. Axes: scaled actual demand value (horizontal) vs. percentage error (vertical).
Fig. 5. Spearman's ρ coefficients showing the correlation between the improvement in accuracy and the actual demand value for each time series (relative frequency histograms). (a) Positive adjustments; (b) negative adjustments.
Since the difference in MAPEs is calculated as the sample mean of $d^{\mathrm{APE}}_{i,t}$ (in accordance with Eq. (1)), the comparison of forecasts using MAPE will also give a result which is biased towards underrating positive adjustments and overrating negative adjustments. Consequently, since the forecast errors arising from adjustments of different signs are penalised differently, the MAPE measure is flawed when comparing the performances of adjustments of different signs. One of the aims of the present research has therefore been to reinterpret the results of previous studies through the use of alternative measures.
A second measure based on percentage errors was also
used by Franses and Legerstee (2010). In order to evaluate
the accuracy of improvements, the RMSPE (root mean
square percentage error) was calculated for the statistical
and judgmentally adjusted forecasts, and the resulting
values were then compared. Based on this measure, it
was concluded that the expert adjusted forecasts were no
better than the model forecasts. However, the RMSPE is
also based on percentage errors, and is affected by the
outliers and biases described above even more strongly.
3.2. Relative errors
Another approach to obtaining scale-independent mea-
sures is based on using relative errors. The relative error
(RE) is defined as

$$\mathrm{RE}_{i,t} = e_{i,t} / e^{b}_{i,t},$$

where $e^{b}_{i,t}$ is the forecast error obtained from a benchmark method. Usually a naïve forecast is taken as the benchmark method.
Well-known measures based on relative errors include the Mean Relative Absolute Error (MRAE), Median Relative Absolute Error (MdRAE), and Geometric Mean Relative Absolute Error (GMRAE):

$$\text{MRAE} = \operatorname{mean}\left(\left|\mathrm{RE}_{i,t}\right|\right), \quad \text{MdRAE} = \operatorname{median}\left(\left|\mathrm{RE}_{i,t}\right|\right), \quad \text{GMRAE} = \operatorname{gmean}\left(\left|\mathrm{RE}_{i,t}\right|\right).$$
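A minimal sketch of these three measures over pooled observations (illustrative names; observations with zero benchmark errors, and zero forecast errors in the GMRAE case, are excluded, as discussed below):

```python
import numpy as np

def mrae_mdrae_gmrae(errors, benchmark_errors):
    """MRAE, MdRAE and GMRAE over pooled observations."""
    e = np.asarray(errors, dtype=float)
    eb = np.asarray(benchmark_errors, dtype=float)
    keep = (eb != 0) & (e != 0)                   # relative errors undefined otherwise
    r = np.abs(e[keep] / eb[keep])
    return np.mean(r), np.median(r), np.exp(np.mean(np.log(r)))
```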
Averaging the ratios of absolute errors across individual
observations overcomes the problems related to dividing
by actual values. In particular, the RE-based measures are
not affected by the presence of low actual values, or by the
correlation between errors and actual outcomes. However,
REs also have a number of limitations.
The calculation of $\mathrm{RE}_{i,t}$ requires division by the non-zero error of the benchmark forecast $e^{b}_{i,t}$. In the case of calculating GMRAE, it is also required that $e_{i,t} \neq 0$. The actual and forecasted demands are usually count data, which means that the forecasting errors are count data as well. With count data, the probability of a zero error of the benchmark forecast can be non-zero. Such cases must be excluded from the analysis when using relative errors. When using intermittent demand data, the use of relative errors becomes impossible due to the frequent occurrences of zero errors (Hyndman, 2006; Syntetos & Boylan, 2005).
As was pointed out by Hyndman and Koehler (2006), in the case of continuous distributions, the benchmark forecast error $e^{b}_{i,t}$ can have a positive probability density at zero, and therefore the use of MRAE can be problematic. In particular, $\mathrm{RE}_{i,t}$ can follow a heavy-tailed distribution for which the sample mean becomes a highly inefficient estimate that is vulnerable to outliers. In addition, the distribution of $\left|\mathrm{RE}_{i,t}\right|$ is highly skewed. At the same time,
while MdRAE is highly robust, it cannot be sufficiently
informative, as it is insensitive to large REs which lie in
the tails of the distribution. Thus, even if the large REs
are not outliers which arise from the division by relatively
small benchmark errors, they still will not be taken into
account when using MdRAE. Averaging the absolute REs
using GMRAE is preferable to using either MRAE or MdRAE,
as it provides a reliable and robust estimate, and at the
same time takes into account the values of REs which lie
in the tails of the distribution. Also, when averaging the
benchmark ratios, the geometric mean has the advantage
that it produces rankings which are invariant to the choice
of the benchmark (see Fleming & Wallace, 1986).
Fildes (1992) recommends the use of the Relative Geometric Root Mean Square Error (RelGRMSE). The RelGRMSE for a particular time series $i$ is defined as

$$\mathrm{RelGRMSE}_i = \left(\frac{\prod_{t \in T_i} e_{i,t}^{2}}{\prod_{t \in T_i} \left(e^{b}_{i,t}\right)^{2}}\right)^{\frac{1}{2 n_i}},$$

where $T_i$ is a set containing the time periods for which non-zero errors $e_{i,t}$ and $e^{b}_{i,t}$ are available, and $n_i$ is the number of elements in $T_i$.
After obtaining the RelGRMSE for each series, Fildes (1992) recommends finding the geometric mean of the RelGRMSEs across all time series, thus obtaining $\operatorname{gmean}(\mathrm{RelGRMSE}_i)$. As Hyndman (2006) pointed out, the Geometric Root Mean Square Error (GRMSE) and the Geometric Mean Absolute Error (GMAE) are identical, because the square roots cancel each other in a geometric mean. Similarly, it can be shown that

$$\operatorname{gmean}(\mathrm{RelGRMSE}_i) = \text{GMRAE}.$$

An alternative representation of GMRAE is

$$\text{GMRAE} = \exp\left(\frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \ln\left|\mathrm{RE}_{i,t}\right|\right),$$

where $m$ is the total number of time series, and the other variables retain their previous meanings.
For the adjustments data set under consideration, only a small proportion of observations contain zero errors (about 1%). It has been found empirically that, for the given data set, the log-transformed absolute REs, $\ln\left|\mathrm{RE}_{i,t}\right|$, can be approximated adequately using a distribution which has a finite variance. In fact, even if a heavy-tailed distribution of $\ln\left|\mathrm{RE}_{i,t}\right|$ arises, the influence of extreme cases can be eliminated based on various robustifying schemes such as trimming or Winsorizing. In contrast to APEs, the use of such schemes for $\ln\left|\mathrm{RE}_{i,t}\right|$ is unlikely to lead to biased estimates, since the distribution of $\ln\left|\mathrm{RE}_{i,t}\right|$ is not highly skewed.
Though GMRAE (or, equivalently, $\operatorname{gmean}(\mathrm{RelGRMSE}_i)$)
has some desirable statistical properties and can give a
reliable aggregated indication of changes in accuracy, its
use can be complicated for the following two reasons.
Firstly, as was mentioned previously, zero-error forecasts
cannot be taken into account directly. Secondly, in a
similar way to the median, the geometric mean of absolute
errors generally does not reflect changes in accuracy under
standard loss functions. For instance, for a particular time
series, GMAE (and, hence, GMRAE) favours methods which
produce errors with heavier tailed-distributions, while for
the same series RMSE (root mean square error) can suggest
the opposite ranking.
The latter aspect of using GMRAE can be illustrated using the following example. Suppose that for a particular time series, method A produces errors $e^{A}_{t}$ that are independent and identically distributed variables following a heavy-tailed distribution. More specifically, let $e^{A}_{t}$ follow the t-distribution with $\nu = 3$ degrees of freedom: $e^{A}_{t} \sim t_{\nu}$. Also, let method B produce independent errors that follow the normal distribution: $e^{B}_{t} \sim N(0, 3)$. Let method B be the benchmark method. It can be shown analytically that the variances of $e^{A}_{t}$ and $e^{B}_{t}$ are equal: $\operatorname{Var}(e^{A}_{t}) = \operatorname{Var}(e^{B}_{t}) = 3$. Thus, the relative RMSE (RelRMSE, the ratio of the two RMSEs) for this series is 1. However, the Relative Geometric RMSE (or, equivalently, GMRAE) will show that method A is better than method B: $\text{GMRAE} \approx 0.69$ (based on $10^{6}$ simulated pairs of $e^{A}_{t}$ and $e^{B}_{t}$). Now if, for example, $e^{B}_{t} \sim N(0, 2.5)$, then the RelRMSE and GMRAE will be 1.10 and 0.76, respectively. This means that method B is now preferable in terms of the variance of errors, while method A is still (substantially) better in terms of the GMRAE. However, the geometric mean absolute error is rarely used when optimising predictions with the use of mathematical models. Some authors claim that the comparison based on RelRMSE can be more desirable, as in this case the criterion used for the optimisation of predictions corresponds to the evaluation criteria (Diebold, 1993; Zellner, 1986).
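A short simulation along the lines of this example (a sketch, not the authors' code) reproduces the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
e_a = rng.standard_t(df=3, size=n)               # method A: t(3) errors, variance 3
e_b = rng.normal(0.0, np.sqrt(3), size=n)        # benchmark B: N(0, 3) errors

rel_rmse = np.sqrt(np.mean(e_a**2) / np.mean(e_b**2))
gmrae = np.exp(np.mean(np.log(np.abs(e_a) / np.abs(e_b))))
print(rel_rmse, gmrae)                           # approximately 1.0 and 0.69
```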
Thus, analogously to what was said with regard to PE-
based measures, if the aim of the comparison is to choose
a method that is better in terms of a linear or a quadratic
loss, then GMRAE may not be sufficiently informative, or
may even lead to counterintuitive conclusions.
3.3. Scaled errors
In order to overcome the imperfections of PE-based
measures, Hyndman and Koehler (2006) proposed the use
of the MASE (mean absolute scaled error). For the scenario when forecasts are produced from varying origins but with a constant horizon, the MASE is calculated as follows (see Appendix):

$$q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}^{b}_{i}}, \qquad \text{MASE} = \operatorname{mean}\left(\left|q_{i,t}\right|\right),$$

where $q_{i,t}$ is the scaled error and $\mathrm{MAE}^{b}_{i}$ is the mean absolute error (MAE) of the naïve (benchmark) forecast for series $i$.

Though this was not specified by Hyndman and Koehler (2006), it is possible to show (see Appendix) that in the given scenario, the MASE is equivalent to the weighted arithmetic mean of relative MAEs, where the number of available values of $e_{i,t}$ is used as the weight:

$$\text{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}^{b}_{i}}, \tag{2}$$

where $m$ is the total number of series, $n_i$ is the number of values of $e_{i,t}$ for series $i$, $\mathrm{MAE}^{b}_{i}$ is the MAE of the benchmark forecast for series $i$, and $\mathrm{MAE}_i$ is the MAE of the forecast being evaluated against the benchmark.
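A sketch of Eq. (2) with illustrative inputs; the final line reproduces the two-series example discussed later in this section.

```python
import numpy as np

def mase_from_ratios(mae_forecast, mae_benchmark, n_errors):
    """Eq. (2): weighted arithmetic mean of r_i = MAE_i / MAE^b_i with weights n_i."""
    r = np.asarray(mae_forecast, dtype=float) / np.asarray(mae_benchmark, dtype=float)
    n = np.asarray(n_errors, dtype=float)
    return float(np.sum(n * r) / np.sum(n))

# Two series with reciprocal MAE ratios (1/2 and 2) and equal weights:
print(mase_from_ratios([1.0, 2.0], [2.0, 1.0], [1, 1]))   # 1.25, favouring the benchmark
```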
It is known that the arithmetic mean is not strictly
appropriate for averaging observations representing rela-
tive quantities, and in such situations the geometric mean
should be used instead (Spizman & Weinstein, 2008).
As a result of using the arithmetic mean of MAE ratios,
Eq. (2) introduces a bias towards overrating the accuracy
of a benchmark forecasting method. In other words, the
penalty for bad forecasting becomes larger than the reward
for good forecasting.
To show how the MASE rewards and penalises forecasts, it can be represented as

$$\text{MASE} = 1 + \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i (r_i - 1).$$

The reward for improving the benchmark MAE from $A$ to $B$ ($A > B$) in a series $i$ is $R_i = n_i(1 - B/A)$, while the penalty for harming the MAE by changing it from $B$ to $A$ is $P_i = n_i(A/B - 1)$. Since $R_i < P_i$, the reward given for improving on the benchmark MAE cannot balance the penalty given for worsening the MAE relative to the benchmark by the same quantity. As a result, obtaining $\text{MASE} > 1$ does not necessarily indicate that the accuracy of the benchmark method was better on average. This leads to ambiguity in the comparison of the accuracy of forecasts.
For example, suppose that the performance of some forecasting method is compared with the performance of the naïve method across two series ($m = 2$) which contain equal numbers of forecasts and observations. For the first series, the MAE ratio is $r_1 = 1/2$, and for the second series, the MAE ratio is the opposite: $r_2 = 2/1$. The improvement in accuracy for the first series obtained using the forecasting method is the same as the reduction for the second series. However, averaging the ratios gives $\text{MASE} = \frac{1}{2}(r_1 + r_2) = 1.25$, which indicates that the benchmark method is better. While this is a well-
known point, its implications for error measures, with the
potential for misleading conclusions, are widely ignored.
In addition to the above effect, the use of MASE (as for
MAPE) may result in unstable estimates, as the arithmetic
mean is severely influenced by extreme cases which arise
from dividing by relatively small values. In this case,
outliers occur when dividing by the relatively small MAEs
of benchmark forecasts which can appear in short series.
Some authors (e.g., Hoover, 2006) recommend the use
of the MAD/MEAN ratio. In contrast to the MASE, the
MAD/MEAN ratio approach assumes that the forecasting
errors are scaled by the mean of time series elements,
instead of by the in-sample MAE of the naïve forecast.
The advantage of this scheme is that it reduces the
risk of dividing by a small denominator (see Kolassa &
Schutz, 2007). However, Hyndman (2006) notes that the
MAD/MEAN ratio assumes that the mean is stable over
time, which may make it unreliable when the data exhibit
trends or seasonal patterns. In Section 5, we show that
both the MASE and the MAD/MEAN are prone to outliers
for the data set we consider in this paper. Generally, the
use of these schemes has the risk of producing unreliable
estimates that are based on highly skewed left-bounded
distributions.
Thus, while the use of the standard MAPE has long been
known to be flawed, the newly proposed MASE also suffers
from some of the same limitations, and may also lead to
an unreliable interpretation of the empirical results. We
therefore need a measure that does not suffer from these
problems. The next section presents an improved statistic
which is more suitable for comparing the accuracies of
SKU-level forecasts.
4. Recommended accuracy evaluation scheme
The recommended forecast evaluation scheme is based
on averaging the relative efficiencies of adjustments
across time series. The geometric mean is the correct
average to use for averaging benchmark ratio results,
since it gives equal weight to reciprocal relative changes
(Fleming & Wallace, 1986). Using the geometric mean
of MAE ratios, it is possible to define an appropriate
measure of the average relative MAE (AvgRelMAE). If the
baseline statistical forecast is taken as the benchmark, then
the AvgRelMAE showing how the judgmentally adjusted
forecasts improve/reduce the accuracy can be found as
$$\text{AvgRelMAE} = \left(\prod_{i=1}^{m} r_i^{\,n_i}\right)^{1 / \sum_{i=1}^{m} n_i}, \qquad r_i = \frac{\mathrm{MAE}^{f}_{i}}{\mathrm{MAE}^{s}_{i}}, \tag{3}$$

where $\mathrm{MAE}^{s}_{i}$ is the MAE of the baseline statistical forecast for series $i$, $\mathrm{MAE}^{f}_{i}$ is the MAE of the judgmentally adjusted forecast for series $i$, $n_i$ is the number of available errors of judgmentally adjusted forecasts for series $i$, and $m$ is the total number of time series. This differs from the proposals of Fildes (1992), who examined the behaviour of the GRMSEs of the individual relative errors.

The MAEs in Eq. (3) are found as

$$\mathrm{MAE}^{f}_{i} = \frac{1}{n_i} \sum_{t \in T_i} \left|e^{f}_{i,t}\right|, \qquad \mathrm{MAE}^{s}_{i} = \frac{1}{n_i} \sum_{t \in T_i} \left|e^{s}_{i,t}\right|,$$

where $e^{f}_{i,t}$ is the error of the judgmentally adjusted forecast for period $t$ and series $i$, $T_i$ is a set containing the time periods for which $e^{f}_{i,t}$ are available, and $e^{s}_{i,t}$ is the error of the baseline statistical forecast for period $t$ and series $i$.
AvgRelMAE is immediately interpretable, as it represents the average relative value of MAE adequately, and directly shows how the adjustments improve/reduce the MAE compared to the baseline statistical forecast. Obtaining $\text{AvgRelMAE} < 1$ means that on average $\mathrm{MAE}^{f}_{i} < \mathrm{MAE}^{s}_{i}$, and therefore adjustments improve the accuracy, while $\text{AvgRelMAE} > 1$ indicates the opposite. The average percentage improvement in the MAE of forecasts is found as $(1 - \text{AvgRelMAE}) \times 100$. If required, Eq. (3) can also be extended to other measures of dispersion or loss functions. For example, instead of MAE one might use the MSE (mean square error), interquartile range, or mean prediction interval length. The choice of the measure depends on the purposes of analysis. In this study, we use MAE, assuming that the penalty is proportional to the absolute error.
Equivalently, the geometric mean of MAE ratios can be found as

$$\text{AvgRelMAE} = \exp\left(\frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i \ln r_i\right).$$

Therefore, obtaining $\sum_{i=1}^{m} n_i \ln r_i < 0$ means an average improvement of accuracy, and $\sum_{i=1}^{m} n_i \ln r_i > 0$ means the opposite.
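A minimal sketch of Eq. (3) under an assumed data structure (one array of errors per series for each forecast; names are illustrative):

```python
import numpy as np

def avg_rel_mae(errors_final, errors_system):
    """Weighted geometric mean of r_i = MAE^f_i / MAE^s_i with weights n_i (Eq. (3))."""
    log_r, n = [], []
    for ef, es in zip(errors_final, errors_system):
        ef = np.asarray(ef, dtype=float)
        es = np.asarray(es, dtype=float)
        log_r.append(np.log(np.mean(np.abs(ef)) / np.mean(np.abs(es))))  # ln r_i
        n.append(len(ef))                                                # n_i
    log_r, n = np.array(log_r), np.array(n, dtype=float)
    return float(np.exp(np.sum(n * log_r) / np.sum(n)))
```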
In theory, the following effect may complicate the interpretation of the AvgRelMAE value. If the distributions of the errors $e^{f}_{i,t}$ and $e^{s}_{i,t}$ within a given series $i$ have different levels of kurtosis, then $\ln r_i$ is a biased estimate of $\ln\left(\mathrm{E}\left|e^{f}_{i,t}\right| / \mathrm{E}\left|e^{s}_{i,t}\right|\right)$. Thus, the indication of an improvement under linear loss given by the AvgRelMAE may be biased. In fact, if $n_i = 1$ for each $i$, then the AvgRelMAE becomes equivalent to the GMRAE, which has the limitations described in Section 3.2. However, our experiments have shown that the bias of $\ln r_i$ diminishes rapidly as $n_i$ increases, becoming negligible for $n_i > 4$.
To eliminate the influence of outliers and extreme cases, the trimmed mean can be used in order to define a measure of location for the relative MAE. The trimmed AvgRelMAE for a given threshold $t$ ($0 \le t \le 0.5$) is calculated by excluding the $[tm]$ lowest and $[tm]$ highest values of $n_i \ln r_i$ from the calculations (square brackets indicate the integer part of $tm$). As was
mentioned in Section 2, the optimal trim level depends
on the distribution. In practice, the choice of the trim
level usually remains subjective, since the distribution
is unknown. Wilcox (1996) wrote that ‘Currently there
is no way of being certain how much trimming should
be done in a given situation, but the important point
is that some trimming often gives substantially better
results, compared to no trimming’ (p. 16). Our experiments
show that a 5% level can be recommended for the
AvgRelMAE measure. This level ensures high efficiency,
because the underlying distribution usually does not
exhibit large departures from the normal distribution. A
manual screening for outliers could also be performed in
order to exclude time series with non-typical properties
from the analysis.
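The trimming step might be implemented as in the following sketch; note that restricting the denominator to the weights of the retained series is an assumption on our part, since this detail is not spelled out above.

```python
import numpy as np

def trimmed_avg_rel_mae(log_ratios, weights, trim=0.05):
    """AvgRelMAE after dropping the [t*m] lowest and highest values of n_i * ln(r_i)."""
    w = np.asarray(weights, dtype=float)          # n_i per series
    w_log_r = w * np.asarray(log_ratios, dtype=float)   # n_i * ln(r_i) per series
    k = int(trim * len(w_log_r))                  # [t*m] values removed in each tail
    order = np.argsort(w_log_r)
    keep = order[k:len(order) - k] if k > 0 else order
    # Assumption: the denominator uses only the weights of the retained series.
    return float(np.exp(np.sum(w_log_r[keep]) / np.sum(w[keep])))
```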
The results described in the next section show that the
robust estimates obtained using a 5% trimming level are
very close to the estimates based on the whole sample.
The distribution of $n_i \ln r_i$ is more symmetrical than the
distribution of either the APEs or absolute scaled errors.
Therefore, the analysis of the outliers in relative MAEs can
be performed more efficiently than the analysis of outliers
when using the measures considered previously.
Since the AvgRelMAE does not require scaling by actual
values, it can be used in cases of low or zero actuals, as
well as in cases of zero forecasting errors. Consequently,
it is suitable for intermittent demand forecasts. The only
limitation is that the MAEs in Eq. (3) should be greater than
zero for all series.
Thus, the advantages of the recommended accuracy
evaluation scheme are that it (i) can be interpreted
easily, (ii) represents the performance of the adjustments
objectively (without the introduction of substantial biases
or outliers), (iii) is informative and uses all available
information efficiently, and (iv) is applicable in a wide
range of settings, with minimal assumptions about the
features of the data.
5. Results of empirical evaluation
The results of applying the measures described above
are shown in Table 3.
For the given dataset, a large number of APEs have
extreme values (>100%) which arise from low actual
demand values (Fig. 6). Following Fildes et al. (2009), we
Table 3
Accuracy of adjustments according to different error measures. Each cell shows the value for the statistical forecast / the judgmentally adjusted forecast.

Error measure                            Positive adjustments   Negative adjustments   All nonzero adjustments
MAPE, %                                  38.85 / 61.54          70.45 / 45.13          47.88 / 56.85
MAPE, % (2% trimmed)                     30.98 / 40.56          48.71 / 30.12          34.51 / 37.22
MdAPE, %                                 25.48 / 20.65          23.90 / 17.27          24.98 / 19.98
GMRAE                                    1.00 / 0.93            1.00 / 0.70            1.00 / 0.86
GMRAE (5% trimmed)                       1.00 / 0.94            1.00 / 0.71            1.00 / 0.87
MASE                                     0.97 / 0.97            0.95 / 0.70            0.96 / 0.90
Mean (MAD/Mean)                          0.37 / 0.42            0.33 / 0.24            0.36 / 0.37
Mean (MAD/Mean) (5% trimmed)             0.34 / 0.35            0.29 / 0.21            0.33 / 0.31
AvgRelMAE                                1.00 / 0.96            1.00 / 0.71            1.00 / 0.90
AvgRelMAE (5% trimmed)                   1.00 / 0.96            1.00 / 0.73            1.00 / 0.89
Avg. improvement based on AvgRelMAE      0.00 / 0.04            0.00 / 0.29            0.00 / 0.10
Fig. 6. Box-and-whisker plot for absolute percentage errors (log scale, zero-error forecasts excluded).
used a 2% trim level for MAPE values. However, as noted,
it is difficult to determine an appropriate trim level. As
a result, the difference in APEs between the system and
final forecasts has a very high dispersion and cannot be
used efficiently to assess improvements in accuracy. It
can also be seen that the distribution of APEs is highly
skewed, which means that the trimmed means cannot be
considered as unbiased estimates of the location. Although
the distribution of the APEs has a very high kurtosis, our
experiments show that increasing the trim level (say from
2% to 5%) would substantially bias the estimates of the
location of the APEs due to the extremely high skewness of
the distribution. We therefore use the 2% trimmed MAPE
in this study. Also, the use of this trim level makes the
measurement results comparable to the results of Fildes
et al. (2009).
Table 3 shows that the rankings based on the trimmed
MAPE and MdAPE differ, suggesting different conclusions
about the effectiveness of adjustments. As was explained
in Section 3.1, the interpretation of PE-based measures is
not straightforward. While MdAPE is resistant to outliers,
it is not sufficiently informative, as it is insensitive to APEs
which lie above the median. Also, PE-measures produce
a biased comparison, since the improvement on the real
scale within each series is correlated markedly with the
actual value. Therefore, applying percentage errors in
the current setting leads to ambiguous results and to
confusion in their interpretation. For example, for positive
adjustments, the trimmed MAPE and MdAPE suggest the
opposite rankings: while the trimmed MAPE shows a
substantial worsening of the final forecast due to the
judgmental adjustments, the MdAPE value points in the
opposite direction.
The absolute scaled errors found using the MASE
scheme (as described in Section 3.3) also follow a non-
symmetrical distribution and can take extremely large
values (Fig. 7) in short series where the MAE of the naïve
forecast is smaller than the error of the judgmental forecast.
For the adjustments data, the lengths of the series vary
substantially, so the MASE is affected seriously by outliers.
Fig. 8 shows that the use of the MAD/MEAN scheme
instead of the MASE does not improve the properties of
the distribution of the scaled errors. Table 3 shows that
a trimmed version of the MAD/MEAN scheme gives the
opposite rankings with regard to the overall accuracy of
adjustments, which indicates that this scheme is highly
unstable. Moreover, with such distributions, the use of
trimming for either MASE or MAD/MEAN leads to biased
estimates, as was the case with MAPE.
Fig. 9 shows that the log-transformed relative absolute
errors follow a symmetric distribution and contain outliers
that are easier to detect and to eliminate. Based on the
shape of the underlying distribution, it seems that using
a 5% trimmed GMRAE would give a location estimate
with a reasonable level of efficiency. Although the GMRAE
measure is not vulnerable to outliers, its interpretation
can present difficulties, for the reasons explained in
Section 3.2.
Compared to the APEs and the absolute scaled er-
rors, the log-transformed relative MAEs are not affected
severely by outliers and have a more symmetrical distri-
bution (Fig. 10). The AvgRelMAE can therefore serve as a
more reliable indicator of changes in accuracy. At the same
time, in terms of a linear loss function, the AvgRelMAE
scheme represents the effectiveness of adjustments ade-
quately and gives a directly interpretable meaning.
The AvgRelMAE result shows improvements from both
positive and negative adjustments, whereas according to
MAPE and MASE, only negative adjustments improve the
accuracy. For the whole sample, adjustments improve the
MAE of statistical forecasts by 10%, on average. Positive
adjustments are less accurate than negative adjustments
and provide only minor improvements.
Fig. 7. Box-and-whisker plot for the absolute scaled errors found by the MASE scheme (log scale, zero-error forecasts excluded).
Fig. 8. Box-and-whisker plot for absolute scaled errors found by the MAD/MEAN scheme (log scale, zero-error forecasts excluded).
Fig. 9. Box-and-whisker plot for the log-transformed relative absolute errors (using the statistical forecast as the benchmark).
Fig. 10. Box-and-whisker plot for the weighted log-transformed relative MAEs ($n_i \ln r_i$).
Table 4
Results of using the binomial test to analyse the frequency of a successful adjustment.

Adjustment sign   Total number of adjustments   Number of adjustments that improved the forecast   p-value   Probability of a successful adjustment   95% confidence interval for the probability of a successful adjustment
Positive          3394                          1815                                               <0.001    0.535                                    0.518–0.552
Negative          1385                          915                                                <0.001    0.661                                    0.635–0.686
Both              4779                          2730                                               <0.001    0.571                                    0.557–0.585
To determine whether the probability of a successful
adjustment is higher than 0.5, a two-sided binomial test
was applied. The results are shown in Table 4.
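As a sketch of this test (using the counts from Table 4 for all nonzero adjustments), SciPy's binomtest yields the p-value and an exact confidence interval; the exact CI method is an assumption and may differ slightly from the interval reported in Table 4.

```python
from scipy.stats import binomtest

res = binomtest(k=2730, n=4779, p=0.5)           # all nonzero adjustments (Table 4)
print(res.pvalue)                                # far below 0.001
print(2730 / 4779)                               # estimated success probability, about 0.571
print(res.proportion_ci(confidence_level=0.95))  # roughly (0.557, 0.585)
```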
Based on the p-values obtained for each sample, it can
be concluded that adjustments improved the accuracy of
forecasts more frequently than they reduced it. However,
the probability of a successful intervention was rather low
for positive adjustments.
6. Conclusions
The appropriate measurement of accuracy is important
in many organizational settings, and is not of merely
academic interest. Due to the specific features of SKU-
level demand data, many well-known error measures are
not appropriate for use in evaluating the effectiveness of
adjustments. In particular, the use of percentage errors is
not advisable because of the considerable proportion of
low actual values, which lead to high percentage errors
with no direct interpretation for practical use. Moreover,
the errors corresponding to adjustments of different signs
are penalised differently when using percentage errors,
because the forecasting errors are correlated with both
the actual demand values and the adjustment sign. As a
result, measures such as MAPE and MdAPE do not provide
sufficient indication of the effectiveness of adjustments, in
terms of a linear loss function. Similar arguments were also
found to apply to the calculation of MASE, which can also
induce biases and outliers as a result of using the arithmetic
mean to average relative quantities. Thus, an organization
which determines its forecast improvement strategy based
on an inadequate measure will misallocate its resources,
and will therefore fail in its objective of improving the
accuracy at the SKU level.
In order to overcome the disadvantages of existing
measures, it is recommended that an average relative MAE
be used which is calculated as the geometric mean of
relative MAE values. This scheme allows for the objective
comparison of forecasts, and is more reliable for the
analysis of adjustments.
For the empirical dataset, the analysis has shown
that adjustments improved accuracy in terms of the
average relative MAE (AvgRelMAE) by approximately
10%. For the same dataset, a range of well-known error
measures, including MAPE, MdAPE, GMRAE, MASE, and the
MAD/MEAN ratio, indicated conflicting results. The MAPE-
based results suggested that, on the whole, adjustments
did not improve the accuracy, while the MdAPE results
showed a substantial improvement (dropping from 25%
to 20%, approximately). The analysis using MASE and
the MAD/MEAN ratio was complicated, due to a highly
skewed underlying distribution, and did not allow any
firm conclusions to be reached. The GMRAE showed that
adjustments improved the accuracy by 13%, a result that
is close to that obtained using the AvgRelMAE. Since
analyses based on different measures can lead to different
conclusions, it is important to have a clear understanding
of the statistical properties of any error measure used. We
have described various undesirable effects that complicate
the interpretation of the well-known error measures. As
an improved scheme which is appropriate for evaluating
changes in accuracy under linear loss, we recommend
using the AvgRelMAE. The generalisation of this scheme
can be obtained straightforwardly for other loss functions
as well.
The process by which a new error measure is developed
and accepted by an organisation has not received any
research attention. A case in point is intermittent demand,
where service improvements can be achieved, but only by
abandoning the standard error metrics and replacing them
with service-level objectives (Syntetos & Boylan, 2005).
When an organisation and those to whom the forecasting
function reports insist on retaining the MAPE or similar
(as will mostly be the case), the forecaster’s objective
must then shift to delivering to the organisation’s chosen
performance measure, whilst using a more appropriate
measure, such as the AvgRelMAE, to interpret what is really
going on with the data. In essence, the forecaster cannot reasonably rely on the organisation's chosen measure alone and expect to achieve a cost-effective result.
Appendix. Alternative representation of MASE
According to Hyndman and Koehler (2006), for the
scenario when forecasts are made from varying origins but
with a constant horizon (here taken as 1), the scaled error
is defined as

$$q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}^{b}_{i}}, \qquad \mathrm{MAE}^{b}_{i} = \frac{1}{l_i - 1}\sum_{j=2}^{l_i}\left|Y_{i,j} - Y_{i,j-1}\right|,$$

where $\mathrm{MAE}^{b}_{i}$ is the MAE of the benchmark (naïve) method for series $i$, $e_{i,t}$ is the error of the forecast being evaluated against the benchmark for series $i$ and period $t$, $l_i$ is the number of elements in series $i$, and $Y_{i,j}$ is the actual value observed at time $j$ for series $i$. (The formula corresponds to the software implementation described by Hyndman and Khandakar, 2008.)
Let the mean absolute scaled error (MASE) be calculated
by averaging the absolute scaled errors across time periods
and time series:
$$\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i}\sum_{i=1}^{m}\sum_{t\in T_i}\left|\frac{e_{i,t}}{\mathrm{MAE}^{b}_{i}}\right|,$$

where $n_i$ is the number of available values of $e_{i,t}$ for series $i$, $m$ is the total number of series, and $T_i$ is the set of time periods for which the errors $e_{i,t}$ are available for series $i$.
Then,

$$\begin{aligned}
\mathrm{MASE} &= \frac{1}{\sum_{i=1}^{m} n_i}\sum_{i=1}^{m}\sum_{t\in T_i}\left|\frac{e_{i,t}}{\mathrm{MAE}^{b}_{i}}\right|
= \frac{1}{\sum_{i=1}^{m} n_i}\sum_{i=1}^{m}\frac{\sum_{t\in T_i}\left|e_{i,t}\right|}{\mathrm{MAE}^{b}_{i}} \\
&= \frac{1}{\sum_{i=1}^{m} n_i}\sum_{i=1}^{m} n_i\,\frac{\frac{1}{n_i}\sum_{t\in T_i}\left|e_{i,t}\right|}{\mathrm{MAE}^{b}_{i}}
= \frac{1}{\sum_{i=1}^{m} n_i}\sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}^{b}_{i}},
\end{aligned}$$

where $\mathrm{MAE}_i$ is the MAE for series $i$ of the forecast being evaluated against the benchmark.
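The last equality shows that MASE can be viewed as the $n_i$-weighted arithmetic mean of the per-series relative MAEs $r_i$. The short sketch below (our own numerical check; the array names and values are hypothetical) verifies this equivalence by computing MASE both ways, once by pooling the absolute scaled errors across series and once as the weighted mean of the $r_i$:

```python
import numpy as np

# in-sample actuals used to scale the errors (one array per series; toy values)
actuals = [np.array([10., 12., 9., 14., 11.]), np.array([5., 4., 6., 5.])]
# out-of-sample errors of the forecast being evaluated (one array per series)
errors  = [np.array([1.5, -2.0, 0.5]),          np.array([-1.0, 0.5])]

# MAE of the one-step naive benchmark on the in-sample data for each series
mae_b = [np.mean(np.abs(np.diff(y))) for y in actuals]

# (1) MASE as the average of absolute scaled errors pooled across all series
pooled = np.concatenate([np.abs(e) / s for e, s in zip(errors, mae_b)])
mase_direct = pooled.mean()

# (2) MASE as the n_i-weighted arithmetic mean of the relative MAEs r_i
n = [len(e) for e in errors]
r = [np.mean(np.abs(e)) / s for e, s in zip(errors, mae_b)]
mase_weighted = np.average(r, weights=n)

print(mase_direct, mase_weighted)  # identical up to floating-point error
```

Because the $r_i$ enter this average arithmetically, a few series with unusually large relative errors can dominate the result, which is the source of the biases and outliers mentioned in the conclusions; the AvgRelMAE instead averages the same ratios geometrically.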
References

Armstrong, J. S. (1985). Long-range forecasting: from crystal ball to computer. New York: John Wiley.

Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: empirical comparisons. International Journal of Forecasting, 8, 69–80.

Armstrong, J. S., & Fildes, R. (1995). Correspondence on the selection of error measures for comparisons among forecasting methods. Journal of Forecasting, 14(1), 67–71.

Diebold, F. X. (1993). On the limitations of comparing mean square forecast errors: comment. Journal of Forecasting, 12, 641–642.

Fildes, R. (1992). The evaluation of extrapolative forecasting methods. International Journal of Forecasting, 8(1), 81–98.

Fildes, R., & Goodwin, P. (2007). Against your better judgment? How organizations can improve their use of management judgment in forecasting. Interfaces, 37, 570–576.

Fildes, R., Goodwin, P., Lawrence, M., & Nikolopoulos, K. (2009). Effective forecasting and judgmental adjustments: an empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1), 3–23.

Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3), 218–221.

Franses, P. H., & Legerstee, R. (2010). Do experts’ adjustments on model-based SKU-level forecasts improve forecast quality? Journal of Forecasting, 29, 331–340.

Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 4, 405–408.

Hill, M., & Dixon, W. J. (1982). Robustness in real life: a study of clinical laboratory data. Biometrics, 38, 377–396.
Hoover, J. (2006). Measuring forecast accuracy: omissions in today’s forecasting engines and demand-planning software. Foresight: The International Journal of Applied Forecasting, 4, 32–35.

Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4(4), 43–46.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3).

Hyndman, R., & Koehler, A. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688.

Kolassa, S., & Schutz, W. (2007). Advantages of the MAD/MEAN ratio over the MAPE. Foresight: The International Journal of Applied Forecasting, 6, 40–43.

Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9, 527–529.

Marques, C. R., Neves, P. D., & Sarmento, L. M. (2000). Evaluating core inflation indicators. Working paper 3-00. Economics Research Department, Banco de Portugal.

Mathews, B., & Diamantopoulos, A. (1987). Alternative indicators of forecast revision and improvement. Marketing Intelligence, 5(2), 20–23.

McCarthy, T. M., Davis, D. F., Golicic, S. L., & Mentzer, J. T. (2006). The evolution of sales forecasting management: a 20-year longitudinal study of forecasting practice. Journal of Forecasting, 25, 303–324.

Mudholkar, G. S. (1983). Fisher’s z-transformation. Encyclopedia of Statistical Sciences, 3, 130–135.

Sanders, N., & Ritzman, L. (2004). Integrating judgmental and quantitative forecasts: methodologies for pooling marketing and operations information. International Journal of Operations and Production Management, 24, 514–529.

Spizman, L., & Weinstein, M. (2008). A note on utilizing the geometric mean: when, why and how the forensic economist should employ the geometric mean. Journal of Legal Economics, 15(1), 43–55.

Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting, 21(2), 303–314.

Trapero, J. R., Pedregal, D. J., Fildes, R., & Weller, M. (2011). Analysis of judgmental adjustments in presence of promotions. Paper presented at the 31st International Symposium on Forecasting (ISF2011), Prague.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: heuristics and biases. Science, 185, 1124–1130.

Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.

Wilcox, R. R. (2005). Trimmed means. Encyclopedia of Statistics in Behavioral Science, 4, 2066–2067.

Zellner, A. (1986). A tale of forecasting 1001 series: the Bayesian knight strikes again. International Journal of Forecasting, 2, 491–494.
Andrey Davydenko works on the development and software implementation of statistical methods for business forecasting.
He has a Ph.D. from Lancaster University. He holds a candidate of science
degree in mathematical methods in economics. His current research
focuses on the composite use of judgmental and statistical information
in forecasting support systems.
Robert Fildes is Professor of Management Science in the School of
Management, Lancaster University, and Director of the Lancaster Centre
for Forecasting. He has a mathematics degree from Oxford and a Ph.D.
in statistics from the University of California. He was co-founder of
the Journal of Forecasting in 1981 and of the International Journal of
Forecasting in 1985. For ten years from 1988 he was Editor-in-Chief of
the IJF. He was president of the International Institute of Forecasters
between 2000 and 2004. His current research interests are concerned
with the comparative evaluation of different forecasting methods, the
implementation of improved forecasting procedures in organizations and
the design of forecasting systems.