Assessing Forecast Accuracy Measures
Zhuo Chen
Department of Economics
Heady Hall 260
Iowa State University
Ames, Iowa, 50011
Phone: 515-294-5607
Yuhong Yang
Department of Statistics
Snedecor Hall
Iowa State University
Ames, IA 50011-1210
Phone: 515-294-2089
Fax: 515-294-4040
March 14, 2004
This paper looks into the issue of evaluating forecast accuracy measures. In the theoretical
direction, for comparing two forecasters, only when the errors are stochastically ordered, the ranking
of the forecasts is basically independent of the form of the chosen measure. We propose well-
motivated Kullback-Leibler Divergence based accuracy measures. In the empirical direction, we
study the performance of several familiar accuracy measures and some new ones in two important
aspects: in terms of selecting the known-to-be-better forecaster and the robustness when subject to
random disturbance. In addition, our study suggests that, for cross-series comparison of forecasts,
individually tailored measures may improve the performance of differentiating between good and
poor forecasters.
Keywords: accuracy measure, forecasting competition
Zhuo Chen is a Ph.D. candidate in the Department of Economics at Iowa State University. He received his BS and MS degrees in Management Science from the University of Science and Technology of China in 1996 and 1999 respectively. He graduated from the Department of Statistics at Iowa State University with an MS degree in May 2002.
Yuhong Yang (corresponding author) received his Ph.D. in Statistics from Yale University in 1996. He then joined the Department of Statistics at Iowa State University as assistant professor and became associate professor in 2001. His research interests include nonparametric curve estimation, pattern recognition, and combining procedures. He has published papers in statistics and related journals including Annals of Statistics, Journal of the American Statistical Association, Bernoulli, Statistica Sinica, Journal of Multivariate Analysis, IEEE Transactions on Information Theory, International Journal of Forecasting, and Econometric Theory.
1 Introduction

Needless to say, forecasting is an important task in modern life. With many different methods in forecasting, understanding their relative performance is critical for more accurate prediction of the quantities of interest. Various accuracy measures have been used in the literature and their properties
have been discussed to some extent. A fundamental question is: are some accuracy measures better
than others? If so, in which sense? Addressing such questions is not only intellectually interesting,
but also highly relevant to the application of forecasting. Not surprisingly, it is a commonly accepted
wisdom that there cannot be any single best forecasting method or any single best accuracy measure,
and that assessing the forecasts and the accuracy measures is necessarily subjective. However, can there
be a certain degree of objectivity? Obviously, it is one thing that no accuracy measure dominates the
others and it is another that all reasonable accuracy measures are equally fine.
A difficulty in assessing forecast accuracy is that when different forecasts and different forecast accuracy measures are involved, the comparison of forecasts and the comparison of accuracy measures are very much entangled. Is it possible to separate these two issues?
In this work, having the above questions in mind, we intend to go one step further both theoretically and empirically on assessing forecast accuracy measures. In the theoretical direction, when two forecasts have error distributions that are stochastically ordered, the two forecasts can be compared basically regardless of the choice of the accuracy measure; on the other hand, when the forecast errors are not stochastically ordered (as is much more often the case in application), which forecast is better depends on the choice of the accuracy measure, and then in general the comparison of different forecasts cannot be totally objective. As will be seen, the first part of this fact can be used to objectively compare different accuracy measures from a certain appropriate angle. If one has a good understanding of the distribution of the future uncertainty, we advocate the use of the Kullback-Leibler divergence based measures. For cross-series comparison, we argue that there can be an advantage in using different accuracy measures for different series. We demonstrate this advantage with several examples. In the empirical direction, we compare the popular accuracy measures and some new ones in terms of their ability to select the better forecast as well as in terms of the stability of the measures under slight perturbation of the original series. As will be seen, such forecast comparisons provide us very useful information about the behaviors of the different measures.
In the rest of the introduction we briefly review some previous related works in the literature. More
details of the existing accuracy measures will be given in Sections 3 and 4.
Econometricians and statisticians have constructed various accuracy measures to evaluate and rank forecasting methods. Diebold & Mariano (1995) proposed tests of the null hypothesis that there is no difference in accuracy between two competing forecasts. Christoffersen & Diebold (1998) suggested a forecast accuracy measure that can value the maintenance of cointegration relationships among variables.
It is generally agreed that the mean squared error (henceforth MSE) and MSE-based accuracy measures are not good choices for cross-series comparison since they are typically not invariant to scale changes.
Armstrong & Fildes (1995) suggested that no single accuracy measure would be the best in the sense of capturing the necessary complexity of real data. This, of course, does not mean that one can arbitrarily choose a performance measure that meets a basic requirement (e.g., scale invariance). It is desirable to compare different accuracy measures to find out which measures perform better in what situations and which ones have very serious flaws and thus should be avoided in practice. We notice that only a handful of studies compared multiple forecast accuracy measures (e.g., Tashman, 2000; Makridakis, 1993; Yokum and Armstrong, 1995). Tashman (2000) and Koehler (2001) discussed the results of the latest M-Competition (Makridakis & Hibon, 2000) focusing on forecast accuracy measures.
The comparison of different performance measures is a very challenging task since there is no obvious way to do it objectively. To our knowledge, there has not been any systematic empirical investigation
in this direction in the literature. In this work, we approach the problem from two angles: the ability of
a measure to distinguish between good and bad forecasts and the stability of the measure when there is
a small perturbation of the data.
Section 2 of this paper studies the theoretical comparability of different forecasts for one series and provides the theoretical motivation for the new accuracy measures. Section 3 reviews accuracy measures for cross-series comparison and we show an advantage of the use of individually tailored accuracy measures. In Section 4 we give details of the accuracy measures investigated in our empirical study. The comparison results are given in Section 5. Conclusions are in Section 6.
2 Theoretical comparability of different forecasts for a single series
Suppose that we have a time series $\{Y_i\}$ to be forecasted and there are two forecasters (or two methods) with forecasts $\hat{Y}_{i,1}$ and $\hat{Y}_{i,2}$ of $Y_i$ made at time $i-1$ based on the series itself up to $Y_{i-1}$ and possibly with outside information available to the forecasters (such as exogenous variables). The forecast errors are $e_{i,1} = \hat{Y}_{i,1} - Y_i$ and $e_{i,2} = \hat{Y}_{i,2} - Y_i$ for the two forecasters respectively.
A fundamental question is how should the two forecasters be compared? Can we have any objective
statement on which forecaster is doing a better job?
There are two types of comparisons of dierent forecasts. One is theoretical and the other is empirical.
For a theoretical comparison, assumptions on the nature of the data (i.e., data generating process) must
be made. But such assumptions are not needed for empirical comparisons, which draw conclusions based
on data.
In this section, we consider the issue of whether two forecasters can be compared fairly. We realize
the complexity of this issue and will focus our attention on a very simple setting where some theoretical
understanding is possible. Basically, under a simplifying assumption on the forecast errors, we show that
sometimes the two forecasts can be ordered consistently in terms of prediction risk under any reasonable
loss function; for other cases, the conclusion regarding which forecaster is better depends subjectively
on the loss function chosen (i.e., it can happen that forecaster one is better under one loss function but
forecaster two is better under another loss function). For the latter case, clearly, unless one can justify
a particular loss function (or certain type of losses) as the only appropriate one for the problem, there
is no completely objective ordering of the two forecasters.
Let the cumulative distribution functions of $|\hat{Y}_{i,1} - Y_i|$ and $|\hat{Y}_{i,2} - Y_i|$ be $F_1$ and $F_2$ respectively. Obviously, the supports of $F_1$ and $F_2$ are contained in $[0, \infty)$.
Following the statistical decision theory framework, we usually use a loss function for comparing estimators or predictions. Let $L(Y, \hat{Y})$ be a chosen loss function. Here we only consider loss functions of the type $L(Y, \hat{Y}) = g(|Y - \hat{Y}|)$ for a nonnegative function $g$ defined on $[0, \infty)$. This class contains the familiar losses such as absolute error loss and squared error loss.
Given a loss function $g(|Y - \hat{Y}|)$, we say that forecaster 1 is (theoretically) better than (equal to, or worse than) forecaster 2 if $Eg(|e_{i,1}|) < Eg(|e_{i,2}|)$ ($Eg(|e_{i,1}|) = Eg(|e_{i,2}|)$ or $Eg(|e_{i,1}|) > Eg(|e_{i,2}|)$, respectively), where the expectation is with respect to the true data generating process (assumed for the theoretical investigation). Note that, given a loss function, two forecasts $\hat{Y}_{i,1}$ and $\hat{Y}_{i,2}$ can always be compared by the above definition at each time $i$.
Clearly, when multiple periods are involved, to compare two forecasters in an overall sense, assumptions on the errors are necessary. One simple assumption is that for each forecaster, the errors at different times are independent and identically distributed. Then the theoretical comparison of the forecasters is simplified to the comparison at any given time $i$.
In reality, however, the forecast errors are typically not iid and the comparison between the forecasters becomes theoretically intractable. Indeed, it is quite possible that forecaster 1 is better than forecaster 2 for some sample sizes but worse for other sample sizes. Even though the results in this section do not address such cases, we hope that the insight gained under the simple assumption can be helpful more generally.
2.1 When the forecasting error distributions are stochastically ordered
Can two forecasters be theoretically compared independently of the loss function chosen? We give a
result more or less in that direction.
Here we assume that $F_1$ is stochastically smaller than $F_2$, i.e., for any $x \ge 0$, $F_1(x) \ge F_2(x)$. This means that the absolute errors of the forecasters are ordered in a probabilistic sense. It is then not surprising that the loss function does not play any important role in the theoretical comparison of the two forecasters.
Definition: A loss function $L(Y, \hat{Y}) = g(|Y - \hat{Y}|)$ is said to be monotone if $g$ is a non-decreasing function.
Proposition 1: If the error distributions satisfy that $F_1$ is stochastically smaller than $F_2$, then for any monotone loss function $L(Y, \hat{Y}) = g(|Y - \hat{Y}|)$, forecaster 1 is (theoretically) no worse than forecaster 2.
The proof of Proposition 1 is not difficult and thus omitted.
From the proposition, when the error distributions are stochastically ordered, regardless of the loss function (as long as it is monotone), the forecasters are consistently ordered. Therefore there is an objective ordering of the two forecasters.
Let us comment briefly on the stochastic ordering assumption. For example, if the forecast errors of forecasters 1 and 2 are both normally distributed with mean zero but with different variances, then the assumption is met. More generally, if the distributions of $|\hat{Y}_{i,1} - Y_i|$ and $|\hat{Y}_{i,2} - Y_i|$ both fall in a scale family, then they are stochastically ordered, and thus the forecasters are comparable naturally without the need of specifying a loss function.
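Proposition 1 can also be checked numerically. The sketch below (an illustration we add here, not one of the paper's simulations) draws absolute errors from two normal distributions with scales 1 and 1.5, so that $F_1$ is stochastically smaller than $F_2$, and verifies that forecaster 1 has the smaller estimated risk under several monotone losses $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
# Absolute errors from a scale family: F1 is stochastically smaller than F2
abs_e1 = np.abs(rng.normal(0.0, 1.0, n))
abs_e2 = np.abs(rng.normal(0.0, 1.5, n))

# Several monotone (non-decreasing) losses g applied to the absolute error
losses = {
    "absolute": lambda x: x,
    "squared": lambda x: x ** 2,
    "square root": lambda x: np.sqrt(x),
    "capped": lambda x: np.minimum(x, 2.0),
}
for name, g in losses.items():
    r1, r2 = g(abs_e1).mean(), g(abs_e2).mean()
    print(f"{name:>11}: Eg(|e1|) = {r1:.4f} < Eg(|e2|) = {r2:.4f}")
    assert r1 < r2  # forecaster 1 is no worse under every monotone loss
```

Any other non-decreasing $g$ could be swapped in; the ordering of the two estimated risks does not change.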
However, the situation is quite different when the forecasting error distributions are not stochastically ordered, as we will see next.
2.2 When the forecasting error distributions are not stochastically ordered
Suppose that $F_1$ and $F_2$ are not stochastically ordered, i.e., there exist $0 < x_1 < x_2$ such that $F_1(x_1) > F_2(x_1)$ and $F_1(x_2) < F_2(x_2)$.
Proposition 2: When $F_1$ and $F_2$ are not stochastically ordered, we can find two monotone loss functions $L_1(Y, \hat{Y}) = g_1(|Y - \hat{Y}|)$ and $L_2(Y, \hat{Y}) = g_2(|Y - \hat{Y}|)$ such that forecaster 1 is better than forecaster 2 under loss function $g_1$ and forecaster 1 is worse than forecaster 2 under loss function $g_2$.
Thus, from the Proposition, in general, there is no hope to order the forecasters objectively. The relative performance of the forecasts depends heavily on the loss function chosen. The proof of Proposition 2 is left to the reader.
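A concrete instance (our own illustrative numbers, not from the paper): let $|e_1|$ equal 0.5 with probability 0.9 and 10 with probability 0.1, while $|e_2|$ equals 2 with probability 1. The two distribution functions cross, and absolute loss and squared loss rank the forecasters oppositely:

```python
# |e1| = 0.5 w.p. 0.9, = 10 w.p. 0.1;  |e2| = 2 w.p. 1.
# F1(1) = 0.9 > F2(1) = 0 while F1(5) = 0.9 < F2(5) = 1,
# so F1 and F2 are not stochastically ordered.
vals1, probs1 = (0.5, 10.0), (0.9, 0.1)
vals2, probs2 = (2.0,), (1.0,)

def risk(g, vals, probs):
    """Expected loss E g(|e|) for a discrete distribution of |e|."""
    return sum(p * g(v) for v, p in zip(vals, probs))

abs1 = risk(abs, vals1, probs1)              # 0.9*0.5  + 0.1*10  = 1.45
abs2 = risk(abs, vals2, probs2)              # 2.0
sq1 = risk(lambda x: x * x, vals1, probs1)   # 0.9*0.25 + 0.1*100 = 10.225
sq2 = risk(lambda x: x * x, vals2, probs2)   # 4.0

assert abs1 < abs2  # forecaster 1 wins under absolute loss
assert sq1 > sq2    # forecaster 2 wins under squared loss
```

The crossing of $F_1$ and $F_2$ is exactly what lets a heavier-tail-penalizing loss reverse the ranking.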
2.3 Comparing forecast accuracy measures based on stochastically ordered errors
An important implication of Proposition 1 is that it can be used to objectively compare two accuracy
measures from an appropriate angle. The idea is that when the errors from two forecasts are stochastically
ordered, then one forecast is better than another, independently of the loss function. Consequently, we
can compare the accuracy measures through their ability to pick the better forecast. This is a basis for
the empirical comparison in Section 5.1.
2.4 How should the loss function be chosen for comparing forecasts for one series?
From Section 2.2, we know that generally, in theory, we cannot avoid the use of a loss function to compare forecasts. In this subsection, we briefly discuss the issue of choosing a loss function for comparing forecasts for one series. The issue of cross-series comparison will be addressed in Section 3.
There are different approaches. One is to use a familiar and/or mathematically convenient loss function such as squared error loss and absolute error loss. Squared error loss seems to be the most popular in statistics for mathematical convenience. Another approach is to use an intrinsic measure which does not depend on transformations of the data. For this approach, one must make assumptions on the data generating process so that transformation-invariant measures can be derived, as will be seen soon. The third approach is to choose a loss function that seems most natural for the problem at hand based on non-statistical considerations (e.g., how the accuracy of the forecast may be related to the ultimate good of interest). Except perhaps in a few cases, there may be different views regarding the most natural loss function for a particular problem.
2.5 Some intrinsic measures
Here we derive some new, intrinsic measures to compare different forecasts. They are obtained under strong assumptions on the data generating process. In a certain sense, these measures can pay a heavy price when the assumed data generating process does not describe the data well, but they do have the advantage of a substantial gain in differentiating different forecasts when the assumed data generating process reasonably captures the nature of the data. In addition, even if the assumption on the data generating process is wrong, these measures are still sensible and better than MSE and absolute error because they are invariant under location-scale transformations.
2.5.1 The K-L based measure is optimal in a certain sense
We assume that conditional on the previous observations of $Y$ prior to time $i$ and the outside information available, $Y_i$ has conditional probability density of the form $\frac{1}{\sigma_i} f\left(\frac{y - m_i}{\sigma_i}\right)$, where $f$ is a probability density function (pdf) with mean zero and variance 1. Let $\hat{Y}_i$ be a forecast of $Y_i$. We will consider an intrinsic distance to measure the performance of $\hat{Y}_i$.
Kullback-Leibler divergence (information, distance) is a fundamentally important quantity in statistics and information theory. Let $p$ and $q$ be two probability densities with respect to a dominating measure $\mu$. Then the K-L divergence between $p$ and $q$ is defined as $D(p \| q) = \int p \log \frac{p}{q} \, d\mu$. Let $X$ be a random variable with pdf $p$ with respect to $\mu$. Then $D(p \| q) = E \log \frac{p(X)}{q(X)}$. It is well known that $D(p \| q) \ge 0$ (though it does not satisfy the triangle inequality and is asymmetric). Let $X' = h(X)$, where $h$ is a one-to-one transformation. Let $p'$ denote the pdf of $X'$ and let $q'$ denote the pdf of $h(\tilde{X})$, where $\tilde{X}$ has pdf $q$. An important property of the K-L divergence is its invariance under a one-to-one transformation; that is, $D(p \| q) = D(p' \| q')$. K-L divergence plays crucial roles in statistics, for example, in hypothesis testing (Cover and Thomas 1991), in minimax function estimation in deriving upper and lower bounds (e.g., Yang and Barron (1999) and earlier references thereof), and in adaptive estimation (Barron (1987) and Yang (2000)).
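The invariance property can be checked in a small numerical example of ours, using the closed-form K-L divergence between two normal densities with a common variance: applying a one-to-one affine map to both densities leaves the divergence unchanged.

```python
import math

def kl_normal_same_var(m1, m2, var):
    """Closed-form D(N(m1, var) || N(m2, var)) = (m1 - m2)^2 / (2 var)."""
    return (m1 - m2) ** 2 / (2.0 * var)

# p = N(0, 1), q = N(0.7, 1)
d = kl_normal_same_var(0.0, 0.7, 1.0)

# A one-to-one map h(x) = c x + b sends N(m, var) to N(c m + b, c^2 var)
c, b = 3.0, -5.0
d_after = kl_normal_same_var(c * 0.0 + b, c * 0.7 + b, c ** 2 * 1.0)

print(d, d_after)  # equal up to floating-point error: D(p||q) = D(p'||q')
assert math.isclose(d, d_after)
```

The shift is scaled by $c$ while the variance is scaled by $c^2$, so the ratio inside the divergence is unchanged, which is the affine special case of the general invariance.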
We first assume that $\sigma_i$ is known. Then with the forecast $\hat{Y}_i$ replacing $m_i$, we have an estimated conditional pdf of $Y_i$: $\frac{1}{\sigma_i} f\left(\frac{y - \hat{Y}_i}{\sigma_i}\right)$. The K-L divergence between $p(y) = \frac{1}{\sigma_i} f\left(\frac{y - m_i}{\sigma_i}\right)$ and $q(y) = \frac{1}{\sigma_i} f\left(\frac{y - \hat{Y}_i}{\sigma_i}\right)$ is $D(p \| q) = \int f(x) \log \frac{f(x)}{f\left(x - \frac{m_i - \hat{Y}_i}{\sigma_i}\right)} \, d\mu$. Let $J(a) = \int f(x) \log \frac{f(x)}{f(x - a)} \, d\mu$. Note that the function $J$ is well defined once we specify the pdf $f$. Now, $D(p \| q) = J\left(\frac{m_i - \hat{Y}_i}{\sigma_i}\right)$. From the invariance property of the K-L divergence, for linear transformations of the series, as long as the forecasting methods are equivariant under linear transformations, the K-L divergence stays unchanged.
From the above points, it makes sense that, if computable, $J\left(\frac{m_i - \hat{Y}_i}{\sigma_i}\right)$ would be a good measure of performance. Since $m_i$ and $\sigma_i$ are unknown, one can naturally replace $\sigma_i$ by an estimate $\hat{\sigma}_i$ based on earlier data and replace $m_i$ by the observed value $Y_i$. This motivates the loss function $J\left(\frac{Y_i - \hat{Y}_i}{\hat{\sigma}_i}\right)$. Note that if $\hat{\sigma}_i$ is location-scale invariant and the forecasting method is location-scale invariant, then $J\left(\frac{Y_i - \hat{Y}_i}{\hat{\sigma}_i}\right)$ is location-scale invariant.
Now let us consider two special cases. One is Gaussian and the other is double exponential (Laplace). For normal, $f(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} y^2\right)$ and consequently $J(a) = \frac{a^2}{2}$.
It perhaps is worth pointing out that $E_i J\left(\frac{Y_i - \hat{Y}_i}{\sigma_i}\right) = \frac{1}{2\sigma_i^2} E_i (Y_i - \hat{Y}_i)^2$, where $E_i$ denotes expectation conditional on the information prior to observing $Y_i$. Since $\sigma_i^2$ does not depend on any forecasting method, for this case, $J\left(\frac{Y_i - \hat{Y}_i}{\hat{\sigma}_i}\right)$ is essentially equivalent to $J\left(\frac{m_i - \hat{Y}_i}{\sigma_i}\right)$.
For double exponential, $f(y) = \frac{1}{2} \exp(-|y|)$ and $J(a) = \exp(-|a|) + |a| - 1$.
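Both closed forms, $J(a) = a^2/2$ for the standard normal and $J(a) = e^{-|a|} + |a| - 1$ for the standard double exponential, can be verified by numerical integration. The sketch below is our own check; it works with log-densities to avoid underflow in the far tails.

```python
import numpy as np

def J(a, logf, grid):
    """J(a) = integral of f(x) [log f(x) - log f(x - a)] dx, trapezoid rule."""
    lf, lfa = logf(grid), logf(grid - a)
    y = np.exp(lf) * (lf - lfa)
    h = grid[1] - grid[0]
    return float(np.sum(y[1:] + y[:-1]) * h / 2)

x = np.linspace(-30.0, 30.0, 600_001)
log_normal = lambda t: -0.5 * t ** 2 - 0.5 * np.log(2 * np.pi)
log_laplace = lambda t: np.log(0.5) - np.abs(t)

for a in (0.5, 1.0, 2.0):
    # Normal f: J(a) = a^2 / 2, i.e., a scaled squared error
    assert np.isclose(J(a, log_normal, x), a ** 2 / 2, atol=1e-6)
    # Double exponential f: J(a) = exp(-|a|) + |a| - 1
    assert np.isclose(J(a, log_laplace, x), np.exp(-abs(a)) + abs(a) - 1, atol=1e-6)
```

Working in log space matters here: `np.exp` of a large negative log-density underflows cleanly to zero, so the integrand contributes nothing in the tails instead of producing NaNs from `0/0` ratios.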
For our approach, the main difficulty is the estimation of $\sigma_i$, especially when dealing with nonstationary series.
2.5.2 The best choice of loss depends on the nature of the data
Here we show that the MSE is the optimal choice of performance measure in a certain appropriate sense
when the errors have normal distributions and absolute error (ABE) is the optimal choice when the
errors have double exponential distributions.
Consider two forecasts $\hat{Y}_{i,1}$ and $\hat{Y}_{i,2}$ of $Y_i$ with the forecast errors $e_{i,1} = \hat{Y}_{i,1} - Y_i$ and $e_{i,2} = \hat{Y}_{i,2} - Y_i$ respectively. We assume that the errors are iid with mean zero for both forecasters.
MSE or ABE? First we assume that $e_{i,1} \sim N(0, \sigma_1^2)$ and $e_{i,2} \sim N(0, \sigma_2^2)$. From Proposition 1, we know that forecaster 1 is theoretically better than forecaster 2 under any monotone loss function if $\sigma_1^2 < \sigma_2^2$. In reality, of course, one does not know the variances and needs to compare the forecasters empirically by looking at the history of their forecasting errors. In this last task, the choice of a loss function becomes important.
Under our assumptions, $(e_{1,1}, \ldots, e_{n,1})$ has joint pdf
$$\frac{1}{(2\pi\sigma_1^2)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n e_{i,1}^2}{2\sigma_1^2}\right)$$
and $(e_{1,2}, \ldots, e_{n,2})$ has joint pdf
$$\frac{1}{(2\pi\sigma_2^2)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n e_{i,2}^2}{2\sigma_2^2}\right).$$
Thus for each of the two forecasters, $\sum_{i=1}^n e_{i,j}^2$ is a sufficient statistic for $\sigma_j^2$, $j = 1, 2$. In contrast, $\sum_{i=1}^n |e_{i,j}|$ is not sufficient. This suggests that for each forecaster, when the errors are normally distributed, the use of MSE ($\sum_{i=1}^n e_{i,j}^2$) better captures the information in the errors than other choices including ABE ($\sum_{i=1}^n |e_{i,j}|$). On the other hand, when the errors have double exponential distribution, $\sum_{i=1}^n |e_{i,j}|$ is sufficient but $\sum_{i=1}^n e_{i,j}^2$ is not, and thus the choice of ABE is better than MSE. Note also that when the errors of the two forecasts are all independent and normally distributed, for testing $H_0: \sigma_1^2 = \sigma_2^2$, there is a uniformly most powerful unbiased test based on $\sum_{i=1}^n e_{i,1}^2 / \sum_{i=1}^n e_{i,2}^2$, which again is in the form of MSE.
A simulation. Here we study the two types of errors mentioned above.
Case 1 (normal). $e_{i,1} \sim N(0, 1)$ and $e_{i,2} \sim N(0, 1.5)$ for $i = 1, \cdots, 100$. Replicate 1000 times and record the proportion of times $\sum_{i=1}^n e_{i,1}^2 > \sum_{i=1}^n e_{i,2}^2$ and $\sum_{i=1}^n |e_{i,1}| > \sum_{i=1}^n |e_{i,2}|$, respectively.
Case 2 (double exponential). $e_{i,1} \sim DE(0, 1)$ and $e_{i,2} \sim DE(0, 1.5)$ for $i = 1, \cdots, 100$. Replicate 1000 times and record the proportion of times $\sum_{i=1}^n e_{i,1}^2 > \sum_{i=1}^n e_{i,2}^2$ and $\sum_{i=1}^n |e_{i,1}| > \sum_{i=1}^n |e_{i,2}|$, respectively.
The proportions of replications in which the above inequalities hold are presented in Table 1.
Squared Error Absolute Error
Normal 0.311 0.327
DE 0.270 0.247
Table 1: Comparing MSE and ABE
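The simulation can be reproduced along the following lines. This is our sketch: the text does not state whether 1.5 is a standard deviation, a variance, or a scale parameter, so the rates below need not match Table 1 exactly; the point is the qualitative comparison of the two criteria.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 1000

def wrong_pick_rates(draw1, draw2):
    """Fraction of replications in which the truly worse forecaster 2 looks
    better, under the squared-error and absolute-error criteria."""
    wrong_sq = wrong_ab = 0
    for _ in range(reps):
        e1, e2 = draw1(), draw2()
        wrong_sq += np.sum(e1 ** 2) > np.sum(e2 ** 2)
        wrong_ab += np.sum(np.abs(e1)) > np.sum(np.abs(e2))
    return wrong_sq / reps, wrong_ab / reps

# Case 1: normal errors with scales 1 and 1.5 (reading 1.5 as a scale)
normal_rates = wrong_pick_rates(lambda: rng.normal(0, 1.0, n),
                                lambda: rng.normal(0, 1.5, n))
# Case 2: double exponential (Laplace) errors with scales 1 and 1.5
laplace_rates = wrong_pick_rates(lambda: rng.laplace(0, 1.0, n),
                                 lambda: rng.laplace(0, 1.5, n))
print("normal  (squared, absolute):", normal_rates)
print("laplace (squared, absolute):", laplace_rates)
```

The sufficiency argument above predicts the pattern in Table 1: the squared-error criterion should misrank less often for normal errors, and the absolute-error criterion less often for double exponential errors.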
From the above simulations, we clearly see that for differentiating the competing forecasters, the choice of loss function does matter. When the errors are normally distributed, MSE is better, and when the errors are double exponentially distributed, ABE is better. A sensible recommendation for application is that when the errors look normally distributed (e.g., by examining the Q-Q plot), MSE is a good choice; and when the errors seem to have a distribution with heavier tails, ABE is a better choice.
3 Accuracy measures for cross-series comparison
Forecast accuracy measures have been used in empirical evaluation of forecasting methods, e.g., in the M-Competitions (Makridakis, Hibon & Moser 1979; Makridakis & Hibon 2000). Measures used in the M1 Competition are: MSE (mean squared error), MAPE (mean absolute percentage error), and Theil's U2 statistic. More measures are used in the M3 Competition, i.e., symmetric mean absolute percentage error (sMAPE), average ranking, percentage better, median symmetric absolute percentage error (msAPE), and median relative absolute error (mRAE).
Here we classify the forecast accuracy measures into two types. The first category is stand-alone measures, i.e., measures that can be determined by the forecast under evaluation alone. The second type is the relative measures, which compare the forecasts to a baseline/naive forecast (e.g., a random walk) or to a (weighted) average of available forecasts.
3.1 Stand-Alone Accuracy Measures
Stand-alone accuracy measures are those that can be obtained without additional reference forecasts.
They are usually associated with a certain loss function though there are a few exceptions (e.g., Granger
& Jeon (2003a,b) proposed a time-distance criterion for evaluating forecasting models). In our study,
we include several accuracy measures that are based on quadratic and absolute loss functions.
Accuracy measures based on the mean squared error criterion, especially MSE itself, have been used widely for a long time in evaluating forecasts for a single series. Indeed, Carbone and Armstrong (1982) found that Root Mean Squared Error (RMSE) had been the most preferred measure of forecast accuracy. However, for cross-series comparison, it is well known that MSE and the like are not appropriate since they are not unit-free. Newbold (1983) criticized the use of MSE in the first M-Competition (Makridakis et al., 1982). Clements & Hendry (1993) proposed the Generalized Forecast Error Second Moment (GFESM) as an improvement to the MSE. Armstrong & Fildes (1995) again suggested that the empirical evidence showed that the mean squared error is inappropriate to serve as a basis for comparison.
Ahlburg (1992) found that out of seventeen population research papers he surveyed, ten used Mean
Absolute Percentage Error (MAPE). However, MAPE was criticized for the problem of asymmetry and
instability when the original value is small (Koehler, 2001; Goodwin & Lawton, 1999).
In addition, Makridakis (1993) pointed out that MAPE may not be appropriate in certain situations, such as budgeting, where the average percentage errors may not properly summarize accounting results and profits. MAPE as an accuracy measure is affected by four problems: (1) equal errors above the actual value result in a greater APE; (2) large percentage errors occur when the value of the original series is small; (3) outliers may distort the comparisons in forecasting competitions or empirical studies; (4) MAPEs cannot be compared directly with naïve models such as the random walk (Makridakis 1993). Makridakis (2000) proposed a modified MAPE measure (symmetric median absolute percent error) and used it in the M2 and M3 competitions. However, Koehler (2001) found that sMAPE penalizes low forecasts more than high forecasts and thus favors large predictions over smaller ones.
3.2 Relative Measures
The idea of relative measures is to evaluate the performance of a forecast relative to that of a benchmark (sometimes just a "naive") forecast. Measures may produce very big numbers due to outliers and/or inappropriate modeling, which in turn makes the comparison of different forecasts infeasible or unreliable. A shock may make all forecasts perform very poorly, and stand-alone measures may put excessive weight on this period and choose a method that is less effective in most other periods. Relative measures may eliminate the bias introduced by potential trends, seasonal components and outliers, provided that the benchmark forecast handles these issues appropriately. However, we need to note that choosing the benchmark forecast is subjective and not necessarily easy. The earliest relative forecast accuracy measure seems to be Theil's U2 statistic, for which the benchmark forecast is the value of the last observation.
Collopy and Armstrong (1992a) suggested that Theil's U2 had not gained more popularity because it was less easy to communicate. Collopy and Armstrong (1992b, p. 71) proposed a similar measure (RAE). Thompson (1990) proposed an MSE-based statistic, the log mean squared error ratio, as an improvement of MSE to evaluate forecasting performance across different series.
3.3 The same measure across series or individually tailored measures?
As far as we know, in cross-series comparison of different forecasters, each measure under investigation is applied to all the series. A disadvantage of this approach is that a fixed measure may be well suited for some series but inappropriate for others (e.g., due to a lack of power to distinguish different forecasts or too strong an influence by a few points). For such cases, using individually tailored measures may improve the comparison of the forecasters.
Example 1: Suppose that the data set has 100 series. The sample size for each series is 50. The first 75 series are generated as $y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + e$, where $\alpha_0, \alpha_1, \alpha_2, \alpha_3$ are generated as random draws from the uniform distribution $\mathrm{unif}(-1, 1)$, $x_1, x_2, x_3$ are exogenous variables independently distributed as $N(0, 1)$, and $e$ is independently normally distributed as $N(0, 5)$. The remaining 25 series are generated as $y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + e$ as above except that $e$ is distributed as double exponential $DE(0, \sqrt{10}/2)$.
We compare the two forecasts $y^1$ and $y^2$: $y^1_t = \hat{\alpha}_{0t} + \hat{\alpha}_{1t} x_{1t} + \hat{\alpha}_{2t} x_{2t} + \hat{\alpha}_{3t} x_{3t}$, where $\hat{\alpha}_{0t}, \hat{\alpha}_{1t}, \hat{\alpha}_{2t}, \hat{\alpha}_{3t}$ are estimated adaptively by regressing $y$ on $x_1$, $x_2$ and $x_3$ (with a constant term) using the previous data, and $y^2_t$, the "ideal" forecast with the parameters known.
We consider three measures to compare the two forecasts. One is the KL-N, another is the KL-DE2 (please refer to the next section for the details of KL-N and KL-DE2), and the third is an adaptive measure that uses KL-N for the first 75 series and KL-DE2 for the remaining 25 series. The two forecasts are evaluated based on their forecasts of the last 10 periods. We make 2000 replications and record the percentage of choosing the better forecast (i.e., $y^1_t$) by the three measures. (We understand that there might be concerns over whether the conditional mean is ideal or not, but it is definitely free of estimation error. Furthermore, since we vary the coefficients for each series and average the percentage over the 100 series, we pretty much eliminate the possibility that $y^2_t$ "happens" to be superior to the conditional mean.) We report the means and their corresponding standard errors of the difference between the percentage of choosing the better forecast by the individually tailored measure and the other two measures in Table
2. The table shows that the individually tailored measure improves the ability to distinguish between forecasts with different accuracy. The improvement in the percentage of choosing the better forecast is about 0.19% relative to KL-DE2 and 0.58% relative to KL-N. Even though these numbers seem small, they are statistically significant and not practically insignificant (note that Makridakis & Hibon (2000) showed that the percentage better of sixteen forecasting procedures with respect to a baseline method ranged from -1.7% to 0.8%).
                                       KL-N     KL-DE2   Adaptive Measure
Example 1
Percent                                71.60%   71.99%   72.18%
Difference with the Adaptive measure   0.58%    0.19%
Standard error of the difference       0.03%    0.05%
Example 2
Percent                                65.00%   72.90%   73.00%
Difference with the Adaptive measure   1.30%    0.14%
Standard error of the difference       0.04%    0.04%
Example 3
Percent                                81.11%   81.18%   81.68%
Difference with the Adaptive measure   0.57%    0.50%
Standard error of the difference       0.04%    0.03%
Table 2: Percentage of Choosing the Better Forecast
Example 2: Example 2 has the same setting as Example 1 except that we change the ratio of series with normal errors to series with double exponential errors to 1:1. The new measure is still better than the two original measures, but the extent of the improvement varies, which gives further evidence that the performance of accuracy measures may be influenced by the error structure.
Example 3: To address the concern that the conditional mean may not necessarily be better than the other forecast, we generate $y$ as a series randomly drawn from the uniform distribution $\mathrm{unif}(0, 1)$, and the two forecasts are $y^1 = y + e_1$ and $y^2 = y + e_2$, where $e_{1t}$ is distributed as iid $N(0, 1)$ and $e_{2t}$ as iid $N(0, 2)$ for the first 50 series, and $e_{1t}$ is distributed as iid $DE(0, \sqrt{2}/2)$ and $e_{2t}$ as iid $DE(0, 2)$ for the remaining 50 series. We replicate this 2000 times and report the quantities in the lower part of Table 2. In this case, it is obvious that $y^2$ is stochastically dominated by $y^1$ in forecast accuracy, and thus we know for sure that $y^1$ is the better forecast. The results are similar to those in Examples 1 and 2.
The examples show that it is potentially better to use adaptive measures (as opposed to a fixed measure) when comparing forecasts. The adaptive measure (or individually tailored measures) can better distinguish the candidate forecasts using the individual characteristics of the series. It should be mentioned that in these examples, KL-N and KL-DE2 are applied with knowledge of the nature of the series. In a real application, of course, one is typically not told whether the forecast errors are normally distributed or double-exponentially distributed. One then needs to analyze this aspect using, e.g., Q-Q plots or formal tests. We leave this for future work.
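One way such an adaptive choice could be automated is sketched below. This is a hypothetical heuristic of ours, not the procedure used in the examples (where the error type was known): a crude tail diagnostic, the sample excess kurtosis, decides per series whether to score with the normal-based or the double-exponential-based K-L loss.

```python
import numpy as np

def kl_n_loss(z):
    """K-L based loss under a normal working model (z: standardized errors)."""
    return z ** 2 / 2

def kl_de_loss(z):
    """K-L based loss under a double exponential working model."""
    return np.exp(-np.abs(z)) + np.abs(z) - 1

def adaptive_score(errors, sigma_hat):
    """Per-series score: pick the working model by a tail diagnostic.
    The kurtosis threshold of 1.5 is ad hoc and purely illustrative."""
    z = errors / sigma_hat
    excess_kurtosis = np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3
    loss = kl_de_loss if excess_kurtosis > 1.5 else kl_n_loss
    return float(np.mean(loss(z)))

rng = np.random.default_rng(7)
light_tail = rng.normal(0, 1, 200)   # likely scored with the KL-N loss
heavy_tail = rng.laplace(0, 1, 200)  # likely scored with the KL-DE loss
print(adaptive_score(light_tail, light_tail.std()),
      adaptive_score(heavy_tail, heavy_tail.std()))
```

A Q-Q plot inspection or a formal goodness-of-fit test, as suggested in the text, would play the role of the kurtosis check here.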
4 Measures in Use in our Empirical Study
In the empirical study of this paper, we assess eighteen accuracy measures, including a few new ones motivated by the K-L divergence.
4.1 Stand-Alone Accuracy Measures
We consider eleven stand-alone accuracy measures. MAPE, sMAPE and RMSE are familiar in the literature. We propose several new measures based on the Kullback-Leibler divergence, i.e., KL-N, KL-N1, KL-N2, KL-DE1, and KL-DE2. We also suggest several variations of MSE- and APE-based measures, i.e., msMAPE and NMSE. IQR is a new measure based on MSE and adjusted by the interquartile range. Let m be the number of observations we use in the evaluation of forecasts. Below we give the details of the aforementioned measures.
The commonly used MAPE (mean absolute percentage error) has the form:

$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=n-m+1}^{n}\frac{|y_i-\hat y_i|}{|y_i|}.$$
Makridakis & Hibon (2000) used sMAPE (symmetric mean absolute percentage error):

$$\mathrm{sMAPE} = \frac{1}{m}\sum_{i=n-m+1}^{n}\frac{|y_i-\hat y_i|}{(|y_i|+|\hat y_i|)/2}.$$

The measure reaches its maximum value of two when either $|y_i|$ or $|\hat y_i|$ equals zero (and is undefined when both are zero).
To avoid the possibility of an inflation of sMAPE caused by zero values in the series, we add a component to the denominator of the symmetric MAPE and denote the result msMAPE (modified sMAPE):

$$\mathrm{msMAPE} = \frac{1}{m}\sum_{i=n-m+1}^{n}\frac{|y_i-\hat y_i|}{(|y_i|+|\hat y_i|)/2 + S_i},$$

where $S_i = \frac{1}{i-1}\sum_{k=1}^{i-1}|y_k-\bar y_{i-1}|$ and $\bar y_{i-1} = \frac{1}{i-1}\sum_{k=1}^{i-1}y_k$.
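A minimal sketch of the three percentage-error measures just defined, matching the verbal definitions above; the argument `s` passes in the precomputed trailing mean absolute deviations S_i:

```python
import numpy as np

def mape(y, yhat):
    # mean absolute percentage error; breaks down when some y_i is zero
    return np.mean(np.abs(y - yhat) / np.abs(y))

def smape(y, yhat):
    # symmetric MAPE; hits its maximum of 2 when y_i or yhat_i equals zero
    return np.mean(np.abs(y - yhat) / ((np.abs(y) + np.abs(yhat)) / 2))

def msmape(y, yhat, s):
    # modified sMAPE: the extra S_i term keeps the denominator away from zero
    return np.mean(np.abs(y - yhat) / ((np.abs(y) + np.abs(yhat)) / 2 + s))
```

For example, a zero forecast of nonzero values is maximally penalized by sMAPE no matter how small the values are, while msMAPE is dampened by the S_i component.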
RMSE is the usual root mean square error measure:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=n-m+1}^{n}(y_i-\hat y_i)^2}.$$
NMSE (normalized MSE) normalizes the squared forecast errors by the sample variance of the series, where $\bar y = \frac{1}{n}\sum_{k=1}^{n}y_k$.
KL-N is proposed based on the Kullback-Leibler (KL) divergence. The measure corresponds to the quadratic loss function (normal error) scaled with an (adaptively moving) variance estimate $S_i^2 = \frac{1}{i-1}\sum_{k=1}^{i-1}(y_k-\bar y_{i-1})^2$, where $\bar y_{i-1} = \frac{1}{i-1}\sum_{k=1}^{i-1}y_k$. We discussed the theoretical motivation of K-L divergence based measures in Section 2.5.1.
KL-N1 is a modified version of KL-N. We use a different variance estimate that considers only the last 5 periods. The reason for considering only a few recent periods is to allow the variance estimator to perform well when $S_i^2$ does not converge properly due to, e.g., un-removed trends. Its formula uses $S_{i,5}^2 = \frac{1}{5}\sum_{k=i-5}^{i-1}(y_k-\bar y_{i-1,5})^2$, where $\bar y_{i-1,5} = \frac{1}{5}\sum_{k=i-5}^{i-1}y_k$.
KL-N2 uses a variance estimate that considers the last 10 periods. Its formula uses $S_{i,10}^2 = \frac{1}{10}\sum_{k=i-10}^{i-1}(y_k-\bar y_{i-1,10})^2$, where $\bar y_{i-1,10} = \frac{1}{10}\sum_{k=i-10}^{i-1}y_k$.
KL-DE1 is an accuracy measure we propose based on the K-L divergence and the assumption of double exponential error; its formula uses a scale estimate $\hat\sigma_i$.
KL-DE2 is an accuracy measure similar to KL-DE1 but with a different estimator of the scale parameter from the one used in KL-DE1. Its formula is the same as KL-DE1's but with $\hat\sigma_i = \frac{1}{i-1}\sum_{j=1}^{i-1}|y_j-\bar y_{i-1}|$.
IQR is an accuracy measure based on the interquartile range. It normalizes the absolute error in terms of Iqr, the interquartile range of $Y_1,\dots,Y_m$, defined as the difference between the third quartile and the first quartile of the data. Note that this measure is local-scale transformation invariant.
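A sketch of an IQR-normalized error measure along these lines (the exact averaging used by the authors is an assumption here); the invariance property is easy to verify:

```python
import numpy as np

def iqr_measure(y, yhat):
    """Mean absolute error normalized by the interquartile range of
    Y_1, ..., Y_m (assumed form of the IQR measure)."""
    q3, q1 = np.percentile(y, [75, 25])
    return np.mean(np.abs(y - yhat)) / (q3 - q1)

# a local-scale transformation a*y + b (a > 0) leaves the value unchanged,
# since both the errors and the interquartile range scale by a
```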
4.2 Relative Accuracy Measures
We will use seven relative forecast accuracy measures.
RSE (Relative Squared Error) is the square root of the mean of the ratios of squared error relative to that of the random walk forecast, $\hat y_{i,rw} = y_{i-1}$, over the evaluated time periods. It is motivated by the RAE (relative absolute error) proposed by Collopy and Armstrong (1992b).
We propose mRSE (modified RSE) to improve RSE in the case when the series remains unchanged for one or more time periods. To achieve this, we add a variance-estimate component to the denominator, which becomes $(y_i-\hat y_{i,rw})^2 + S_{i-1}^2$, where $\hat y_{i,rw} = y_{i-1}$, $S_{i-1}^2 = \frac{1}{i-1}\sum_{k=1}^{i-1}(y_k-\bar y_{i-1})^2$, and $\bar y_{i-1} = \frac{1}{i-1}\sum_{k=1}^{i-1}y_k$ (an alternative is to replace $S_{i-1}^2$ by the average of the $(y_i-\hat y_{i,rw})^2$).
Theil's U2 is:

$$U2 = \sqrt{\frac{\sum(\hat y_i - y_i)^2}{\sum(\hat y_{i,rw} - y_i)^2}}.$$
RAE (Collopy and Armstrong, 1992b) is:

$$\mathrm{RAE} = \frac{\sum|y_i - \hat y_i|}{\sum|y_i - \hat y_{i,rw}|}.$$
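Under the standard random-walk-benchmark definitions (consistent with Table 6, where the random walk forecast itself scores exactly 1 on both), the two measures can be sketched as:

```python
import numpy as np

def theils_u2(y, yhat):
    """Theil's U2: forecast RMSE relative to the random-walk forecast.
    y holds the observed series; yhat[i] forecasts y[i]; the first
    observation is only used as the random-walk predictor of the second."""
    rw = y[:-1]                      # random walk: previous observation
    actual, fc = y[1:], yhat[1:]
    return np.sqrt(np.sum((fc - actual) ** 2) / np.sum((rw - actual) ** 2))

def rae(y, yhat):
    """Relative absolute error of Collopy and Armstrong (1992b)."""
    rw = y[:-1]
    actual, fc = y[1:], yhat[1:]
    return np.sum(np.abs(actual - fc)) / np.sum(np.abs(actual - rw))
```

Feeding the random-walk forecast itself into either function returns exactly 1, which is the benchmark interpretation: values below 1 beat the random walk.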
It should be pointed out that the relative measures are not without problems. For example, if for one series a forecasting method is much worse than the random walk, then the measure can be arbitrarily large, which can be overly influential when multiple series are compared. Another weakness is that when the random walk forecast is very poor, the measures take very small values, and consequently these series play a less important role compared to series where the random walk forecast is comparable to the other forecasts.
MSEr1 (MSE relative 1) is the square root of the mean of the ratios of squared error relative to the variance of the available forecasts at the current time, where $\hat y_{ji}$ is the $j$th forecast of the $i$th observation.
MSEr2 (MSE relative 2) is the square root of the mean of the ratios of squared error relative to the sample variance of the difference between $Y$ and the mean of the competing forecasts, $\bar{\hat y}_l$, the average of the competing forecasts $\hat y_{jl}$.
MSEr3 (MSE relative 3) is the square root of the mean of the ratios of squared error relative to the average of the candidate forecasts' mean squared errors $\frac{1}{m}\sum_j(y_j-\hat y_{lj})^2$.
5 Evaluating the Accuracy Measures
Armstrong & Fildes (1995) pointed out that the purpose of an error measure is to provide an informative and clear summary of the error distribution. They suggested that an error measure should use a well-specified loss function, be reliable, be resistant to outliers, be comprehensible to decision makers, and provide a summary of the forecast error distribution for different lead times. Clements and Hendry (1993) emphasized that the robustness of an error measure to linear transformations of the original series is an important factor to consider.
In this section we evaluate the performance of the forecast accuracy measures from two angles: we investigate the ability of the measures to pick out the "better" forecast, and we study the stability of the measures under small disturbances of the original series and under linear transformations of the series.
5.1 Ability to select the better forecast
Naturally, we hope that a forecast accuracy measure can differentiate between good and poor forecasts. For real data sets, we cannot decide which forecast is really the best if different measures disagree and there is no dominant forecast. Part of the reason is that we have no definite information on the real data generating process (DGP).
When selecting the "better" (or "best") forecast is the criterion, of course, defining "better" (or "best") appropriately is crucial. However, this becomes somewhat circular because an accuracy measure is typically needed to quantify the comparison between the forecasts. To overcome the difficulty, our strategies are as follows.
Suppose a forecaster is given the information of the DGP with known structural parameters. Then the conditional mean can naturally be used as a good forecast. A forecaster who is given the form of the DGP but with unknown structural parameters needs to estimate the parameters for forecasting, which clearly introduces additional variability into the forecast. Since the first forecaster should be at an advantage compared to the second, we can evaluate an accuracy measure in terms of its tendency to prefer the first one. Moving further in this direction, we can work with two forecasters that have stochastically ordered error distributions and assess the goodness of an accuracy measure by the frequency with which it yields a better value for the better forecaster.
We agree with Armstrong & Fildes (1995) that simulated data series might not be a good representation of real data. Given a forecast accuracy measure, data sets can be used to evaluate the competing forecasts objectively. For assessing an accuracy measure, however, the effects of the forecasts and the accuracy measure are entangled, so remaining objective and informative is much more challenging. The use of simulated data then becomes important for a rigorous comparison of accuracy measures.
We consider nine cases in this subsection.
5.1.1 Cases 1-7
The seven cases in this subsection represent various scenarios we may encounter in real applications (though by no means do they give a complete representation), and they can give us some useful information regarding the performance of the accuracy measures. We replicate each simulation 20,000 times. The numbers reported in Table 3 are the percentages with which each measure chooses the better forecast over all the replications.
1. The data generating process is AR(1) with auto-regressive coefficient 0.75, and the series length is 50. The random disturbance is distributed as N(0,1). Using the eighteen measures, we compare the forecasts generated by the true model, in which we know the true model structure but not the structural parameters, to the better forecast available, which is the conditional mean of the series (i.e., when the auto-regression coefficient is known).
Figure 1 presents the boxplot of the values measured for the forecasts produced by the conditional mean when m = 20. Values greater than 20 are clipped. From the figure, clearly for some of the measures the distributions are highly asymmetric.
Figure 1: Boxplot of the Values of the Accuracy Measures (M1-M18)
2. The data generating process is white noise distributed as N(0,1), and the series length is 50. We compare forecasts generated by a white noise, also distributed as N(0,1) (the true model), to the better forecast, which is the conditional mean of the series, zero.
3. The data generating process is AR(1) with auto-regressive coefficient 0.75, and the series length is 50. The random disturbance is distributed as N(0,1). We compare the forecast generated by a white noise process distributed as N(0,1) to the better forecast available, i.e., the conditional mean of the series.
4. The data generating process is $y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + e$, where $\alpha_0$ is generated as a random draw from the uniform distribution unif(0,1) and $\alpha_1, \alpha_2, \alpha_3$ are generated as random draws from the uniform distribution unif(−1,1). The sample size is n = 50; $x_{1t}, x_{2t}, x_{3t}$ are exogenous variables independently generated from N(0,1), and $e_t$ is the random disturbance distributed as iid N(0,1), t = 1,...,n. We compare the two forecasts y1 and y2, where y1 is generated by assuming we know the true parameters and y2 is generated using coefficients estimated from the available data, for t = n − m + 1,...,n, where $\hat\alpha_{0t}, \hat\alpha_{1t}, \hat\alpha_{2t}, \hat\alpha_{3t}$ are estimated by regressing y on $x_1$, $x_2$, and $x_3$ with a constant term using the previous data.
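A sketch of one Case 4 series follows (an illustration under stated assumptions, not the authors' code: ordinary least squares re-estimated at each evaluation point plays the role of the second forecaster).

```python
import numpy as np

rng = np.random.default_rng(1)

def case4_series(n=50, m=5):
    """One Case 4 series; returns summed squared errors over the last m
    points for the known-parameter and estimated-parameter forecasts."""
    a0 = rng.uniform(0, 1)
    a = rng.uniform(-1, 1, size=3)
    X = rng.normal(0, 1, size=(n, 3))
    y = a0 + X @ a + rng.normal(0, 1, size=n)
    sse1 = sse2 = 0.0
    for t in range(n - m, n):
        f1 = a0 + X[t] @ a                        # true conditional mean
        Z = np.column_stack([np.ones(t), X[:t]])  # regressors up to t-1
        beta, *_ = np.linalg.lstsq(Z, y[:t], rcond=None)
        f2 = beta[0] + X[t] @ beta[1:]            # estimated-coefficient forecast
        sse1 += (y[t] - f1) ** 2
        sse2 += (y[t] - f2) ** 2
    return sse1, sse2
```

Averaging over replications, a measure that prefers the first forecast more than half the time is behaving as desired; squared error does so here.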
5. The setting of Case 5 is the same as Case 4 except that $e_t$ is distributed as double exponential DE(0,1) for t = 1,...,n.
6. The data generating process is $y = x_1$, where $x_1$ is an exogenous variable independently distributed as unif(0,1). The sample size is n = 50. We compare the two forecasts y1 and y2, generated for t = n − m + 1,...,n with $e_{1t}$ distributed as iid N(0,1) and $e_{2t}$ distributed as iid N(0,2). Note that here $y_{1t}$ dominates $y_{2t}$ independently of the loss function.
7. The setting of Case 7 is the same as Case 6 except that $e_{1t}$ is distributed as iid DE(0,1) and $e_{2t}$ as iid DE(0,2). As in Case 6, $y_{1t}$ beats $y_{2t}$.
The results in Table 3 reveal the following. First, sMAPE performs very poorly when the true value is close to zero. A forecast of zero will be deemed the worst (maximum in value) of the measured performances, no matter what values the other forecasts take. If the true value is zero, the measure will also give the maximum error value of 2 for any forecast not equal to zero. After adding a non-negative component to the denominator, msMAPE is superior to sMAPE and MAPE (except in Case 2, when compared to MAPE). Second, measures with different error-structure motivations seem to perform better when they correspond to the true error structure. Third, Theil's U2, RSE, IQR and the KL-divergence based measures perform relatively well. Lastly, the table shows that the measures choose the better forecaster more often when more observations are used to evaluate the forecasts.
5.1.2 Case 8
We consider another case in which the original DGP is white noise and the series length is 30. We compare two forecasts, both generated by independent white noise with the same noise level. Our interest is to see whether the measures wrongly claim one forecast is better than the other even though they are actually the same. In each replication we generate 40 series and evaluate the two forecasts with the eighteen measures. Thus for each replication we produce two series of values of measured performance.
We test the null hypothesis that the two forecasts perform equally well (or poorly) by a paired t-test with significance level 0.05. The empirical size is recorded as the proportion of rejections of the null based on each accuracy measure. We make 10,000 replications and present the mean of the empirical sizes of the test for the measures in Table 4 with different numbers of evaluation periods (m = 2, 5, 10). Note that Armstrong and Fildes (1995) suggested that the geometric mean might be better than the arithmetic mean when evaluating forecasts over multiple series. We introduce the geometric means of NMSE, Theil's U2 and RAE as GmNMSE, GmTheil'sU2 and GmRAE. We have not observed consistent improvements over the arithmetic mean in our simulation.
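The Case 8 experiment can be sketched as follows; the per-series measured value here is RMSE over the last m points, and the two-sided 5% critical value for 39 degrees of freedom (about 2.023) is hard-coded, both being assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_size(reps=1000, n_series=40, length=30, m=5, crit=2.023):
    """Fraction of replications in which a paired t-test at the 5% level
    declares two equally good white-noise forecasts different."""
    rejections = 0
    for _ in range(reps):
        d = np.empty(n_series)
        for s in range(n_series):
            y = rng.normal(0, 1, size=length)
            f1 = rng.normal(0, 1, size=length)  # forecast 1: independent noise
            f2 = rng.normal(0, 1, size=length)  # forecast 2: same noise level
            d[s] = (np.sqrt(np.mean((y[-m:] - f1[-m:]) ** 2))
                    - np.sqrt(np.mean((y[-m:] - f2[-m:]) ** 2)))
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n_series))
        rejections += abs(t) > crit
    return rejections / reps
```

An empirical size near 0.05 means the measure does not manufacture spurious differences; values far below it, as for MAPE or RSE in Table 4, indicate over-conservatism.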
From the table, clearly, in general, larger m yields sizes closer to 0.05 for the measures. MAPE, RSE, and MSEr3 are too conservative. The other measures are satisfactory in this respect.
5.1.3 Case 9
We construct another setting to study the performance of the accuracy measures when dealing with series of different natures.
For each replication, we have k series with series length n = 50. The data generating process is $y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + e$, where for 50 percent of the replications $\alpha_0, \alpha_1, \alpha_2, \alpha_3$ are generated as random draws from the uniform distribution unif(−1,1), and from unif(−10,10) for the other half. Here $x_1, x_2, x_3$ are exogenous variables independently distributed as N(0,1), and $e$ is independent normal distributed as N(0,σ²) (or double exponential DE(0, √2σ/2))³, with σ = 1 for 10% of the series and σ = 0.05 for the remaining 90%. This way, the different series are not homogeneous. We compare the two forecasts y1 and y2, generated for t = n − m + 1,...,n, where $\hat\alpha_{0t}, \hat\alpha_{1t}, \hat\alpha_{2t}, \hat\alpha_{3t}$ are estimated by regressing y on $x_1$, $x_2$, $x_3$, and a constant term using the previous data.
For each replication, we sum the numbers produced by the accuracy measures across the 100 series. We declare that a measure chooses the right forecast if the sum of the measured values for y1 is less than that for y2.
We repeat the replication 10,000 times and record the percentages of choosing the better forecast by the accuracy measures over the replications in Table 5. We also evaluate the three geometric-mean methods along with the others.
³We multiply by √2/2 to make the variance of the double exponential component equal to that of the normal error component. This makes the simulation "fair".
The table suggests the following: first, it is better when the number of series used in each replication is larger, which supports the idea of the M3-competition that including more series can reduce the influence of dominating series; second, evaluating with five periods is better than evaluating with just two periods; third, the geometric mean slightly improves matters for NMSE but not exactly so for Theil's U2 and RAE. MAPE, sMAPE, RSE, and MSEr3 perform poorly relative to the others.
5.2 Stability of the Accuracy Measures
Stability of accuracy measures is another issue worthy of serious consideration. Since the observations are typically subject to errors, measures that are robust to minor contaminations have the advantage of reliably capturing the performance of the forecasts. With a minor contamination at a sensible level, the more a measure changes, the less credible it is. Obviously, being stable does not by itself qualify an accuracy measure as a good one, but being unstable under a minor contamination at a level typically seen in applications is definitely a serious weakness.
5.2.1 Stability to Linear Transformation
As Clements and Hendry (1993) suggested, stability of accuracy measures with respect to linear transformations of the original series is an important factor. Here we use a series of monthly Austria/U.S. foreign exchange rates from January 1998 to December 2001. The original series is measured as how many Austrian Schillings are equivalent to one U.S. Dollar. The data were obtained from the web page of the Federal Reserve Bank of St. Louis. The series is calculated as the average of daily noon buying rates in New York City for cable transfers payable in foreign currencies. We round it to the first digit after the decimal point and perform a linear transformation of the original series by subtracting the mean of the series and multiplying by 10, i.e.,

y_new = 10 · y_original − 10 · mean(y_original)
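The effect of this transformation is easy to probe numerically. In the sketch below (synthetic stand-ins for the exchange-rate series, and an assumed NMSE-style ratio that normalizes squared errors by the series' own variation), RMSE scales with the transformation while the normalized measure does not:

```python
import numpy as np

rng = np.random.default_rng(3)

y = rng.uniform(30, 40, size=24)        # level series, e.g. an exchange rate
yhat = y + rng.normal(0, 0.5, size=24)  # a forecast with small errors

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def nmse_like(y, f):
    # squared errors normalized by the series' variation about its mean
    return np.sqrt(np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2))

# the transformation used in the text: y_new = 10*y - 10*mean(y)
y_new = 10 * (y - y.mean())
yhat_new = 10 * (yhat - y.mean())
```

rmse(y_new, yhat_new) is exactly 10 times rmse(y, yhat), so RMSE-ranked comparisons across differently scaled series are not meaningful, whereas nmse_like is unchanged by the transformation.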
We have four forecasts generated by a random walk, ARIMA(1,1,0), ARIMA(0,1,1), and a model selected by the BIC criterion from ARIMA models with AR, MA and difference orders from zero to one. Table 6 presents the change in the values produced by the accuracy measures using the last 20 points. We note that the first five accuracy measures produced very different values after the transformation since they are not location-scale transformation invariant. Note also that the last three accuracy measures had some minor changes. This suggests that the first five measures are generally not good for cross-series comparison of forecasting procedures, since a linear transformation of the original series may change the ranking of the forecasts.
5.2.2 Stability to Perturbation
In addition to robustness to linear transformations, a good accuracy measure should be robust to measurement error. It is common that available quantities are subject to some disturbances, e.g., due to rounding, truncation or measurement errors. When a disturbance term simulating the rounded digit is added to the original series (F), the accuracy measures may produce a different ranking of the forecasts. A change of the best-ranked forecast indicates instability of an accuracy measure with respect to such a disturbance. In addition, we can add a small normally distributed disturbance to the original series.
The data set used is the Earnings Yield of All Common Stocks on the New York Stock Exchange from 1871 to 1918. The series was obtained from the NBER (National Bureau of Economic Research) website. The unit is percent and the numbers are rounded to two decimals. We have two forecasts: one generated by a random walk and the other from an ARIMA model with AR, MA and difference orders selected by BIC over the choices of zero and one. The forecasts are ranked using the accuracy measures. We perturb the data by adding a small disturbance.
(1) Rounding: F′ = F + u, where u is generated from the uniform distribution Unif(−0.005, 0.005). This addition is used to simulate the actual numbers, which were rounded to two decimals (as given in the data).
(2) Truncation: F′ = F + u, where u is generated from the uniform distribution Unif(0, 0.01). This is used to simulate the actual numbers assuming that the numbers in the data were truncated to two decimals.
(3) Normal 1: F′ = F + e, where e is a random draw from the normal distribution N(0, (0.1σ_F)²), where σ_F is the sample standard deviation of the original series.
(4) Normal 2: F′ = F + e, where e is a random draw from N(0, (0.12σ_F)²).
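Definitions (1)-(4) can be sketched directly (the `kind` labels are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def perturb(f, kind):
    """Apply one of the four disturbances from the stability study to F."""
    sigma_f = f.std(ddof=1)      # sample standard deviation of the series
    if kind == "rounding":       # (1) simulates rounding to two decimals
        return f + rng.uniform(-0.005, 0.005, size=f.shape)
    if kind == "truncation":     # (2) simulates truncation to two decimals
        return f + rng.uniform(0.0, 0.01, size=f.shape)
    if kind == "normal1":        # (3) N(0, (0.1*sigma_F)^2) noise
        return f + rng.normal(0, 0.1 * sigma_f, size=f.shape)
    if kind == "normal2":        # (4) N(0, (0.12*sigma_F)^2) noise
        return f + rng.normal(0, 0.12 * sigma_f, size=f.shape)
    raise ValueError(kind)
```

Refitting the two models on perturb(F, kind) over many replications and counting how often the measure-based ranking flips reproduces the design behind Table 7.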
The perturbation is replicated 5000 times. We then make forecasts based on the perturbed data sets and obtain the new rankings of the two forecasting methods. Table 7 shows the percentage of ranking changes for the earnings yield data set. Note that KL-N1, KL-N2, MSEr1, MSEr2, and MSEr3 are relatively unstable when subject to rounding, truncation, or normal perturbation. The poor performance of these measures is probably due to the poor variance estimation in their denominators. It is rather surprising that RSE performs so well in this example, but we suspect that this does not hold generally. Note that RSE faces a problem when its denominator happens to be close to zero, which is reflected in its poor performance in the earlier tables. Its modification mRSE addresses this difficulty and has a good overall performance. Not surprisingly, the measures are less stable when the variance of the normal perturbation is greater. Even though MAPE performs well under rounding and normal perturbations, it is highly unstable when truncation is involved.
5.3 Evaluating at one point vs. evaluating at multiple points
As to how many points one should use to compare different forecasts under MSE based on a single series, Ashley (2003) presented an analysis from the statistical-significance point of view. For cross-series comparison, our earlier experiments suggest that the ability to choose the better forecast improves significantly when more points are used for the evaluation, as found in Tables 3, 4 and 5. Another observation is that when m is small, accuracy measures with different error-structure motivations perform more similarly than when m is large. An extreme example is that the linear loss function and the absolute value loss function are equivalent when m = 1.
6 Conclusions
In this paper, we studied various forecast accuracy measures. Theoretically speaking, for comparing two forecasters, only when the errors are stochastically ordered is the ranking of the forecasts basically independent of the form of the chosen accuracy measure. Otherwise, the ranking depends on the specification of the accuracy measure. Under some conditions on the conditional distribution of Y, K-L divergence based accuracy measures are well motivated and have certain nice invariance properties.
In the empirical direction, we studied the performance of the familiar accuracy measures and some new ones. They were compared in two important aspects: their ability to select the known-to-be-better forecaster, and their robustness when subject to random disturbance, e.g., measurement error.
The results suggest the following:
(1) For cross-series comparison of forecasts, individually tailored measures may improve the performance in differentiating between good and poor forecasters. More work needs to be done on how to select a measure based on the characteristics of each individual series. For example, we may use a Q-Q plot and/or other means to get a good sense of the shape of the error distribution and then apply the corresponding accuracy measure.
(2) Stochastically ordered forecast errors provide a tool for objectively comparing different forecast accuracy measures by assessing their ability to choose the better or best forecast.
(3) In addition to the known facts that MAPE and sMAPE are not location invariant and that they have a major flaw when the true value is close to zero, we obtained new information on MAPE and related measures: their ability to pick out the better forecast is substantially worse than that of the other accuracy measures. The proposed msMAPE showed a significant improvement over MAPE and sMAPE in this respect. The MSE-based relative measures are generally better than MAPE and sMAPE, but not as good as the K-L divergence based measures.
(4) We proposed the well-motivated KL-divergence and IQR based measures, which were shown to have relatively good performance in the simulations.
7 Acknowledgments
The work of the second author was supported by the United States National Science Foundation CAREER Award Grant DMS-00-94323.
References
[1] Ahlburg, A. (1992) "A commentary on error measures: Error measures and the choice of a forecast method", International Journal of Forecasting, Vol. 8, pp. 99-100.
[2] Armstrong, S. & Fildes, R. (1995) "On the Selection of Error Measures for Comparisons Among Forecasting Methods", Journal of Forecasting, Vol. 14, pp. 67-71.
[3] Ashley, R. (2003) "Statistically significant forecasting improvements: how much out-of-sample data is likely necessary?", International Journal of Forecasting, Vol. 19, Issue 2, pp. 229-239.
[4] Barron, A.R. (1987) "Are Bayes rules consistent in information?", in Open Problems in Communication and Computation, pp. 85-91, T.M. Cover and B. Gopinath (eds.), Springer, NY.
[5] Carbone, R. & Armstrong, J.S. (1982) "Evaluation of extrapolative forecasting methods: Results of a survey of academicians and practitioners", Journal of Forecasting, Vol. 1, pp. 215-217.
[6] Christoffersen, P.F. & Diebold, F.X. (1998) "Cointegration and Long Horizon Forecasting", Journal of Business and Economic Statistics, Vol. 15.
[7] Clements, M.P. & Hendry, D.F. (1993) "On the limitations of comparing mean squared forecast errors", Journal of Forecasting, Vol. 12, pp. 617-637 (with discussion).
[8] Collopy, F. & Armstrong, J.S. (1992a) "Rule-based forecasting", Management Science, Vol. 38, pp. 1394-1414.
[9] Collopy, F. & Armstrong, J.S. (1992b) "Error Measures For Generalizing About Forecasting Methods: Empirical Comparisons", International Journal of Forecasting, Vol. 8, pp. 69-80.
[10] Cover, T.M. & Thomas, J.A. (1991) Elements of Information Theory, John Wiley and Sons.
[11] Diebold, F.X. & Mariano, R. (1995) "Comparing forecast accuracy", Journal of Business and Economic Statistics, Vol. 13, pp. 253-265.
[12] Goodwin, P. & Lawton, R. (1999) "On the asymmetry of the symmetric MAPE", International Journal of Forecasting, Vol. 15, No. 4, pp. 405-408.
[13] Granger, C.W.J. & Jeon, Y. (2003) "A Time-Distance Criterion for Evaluating Forecasting Models", International Journal of Forecasting, Vol. 19, pp. 199-215.
[14] Granger, C.W.J. & Jeon, Y. (2003) "Comparing Forecasts of Inflation Using Time Distance", International Journal of Forecasting, Vol. 19, pp. 339-349.
[15] Granger, C.W.J. & Pesaran, M.H. (2000) "Economic and Statistical Measures of Forecast Accuracy", Journal of Forecasting, Vol. 19, pp. 537-560.
[16] Koehler, A.B. (2001) "The asymmetry of the sAPE measure and other comments on the M3-competition", International Journal of Forecasting, Vol. 17, pp. 570-574.
[17] Makridakis, S. (1993) "Accuracy measures: theoretical and practical concerns", International Journal of Forecasting, Vol. 9, pp. 527-529.
[18] Makridakis, S., Hibon, M. & Moser, C. (1979) "Accuracy of Forecasting: An Empirical Investigation", Journal of the Royal Statistical Society, Series A (General), Vol. 142, Issue 2, pp. 97-145.
[19] Makridakis, S. & Hibon, M. (2000) "The M3-Competition: results, conclusions and implications", International Journal of Forecasting, Vol. 16, pp. 451-476.
[20] Newbold, P. (1983) "The competition to end all competitions", Journal of Forecasting, Vol. 2, pp. 276-279.
[21] Tashman, L.J. (2000) "Out-of-sample tests of forecasting accuracy: an analysis and review", International Journal of Forecasting, Vol. 16, pp. 437-450.
[22] Thompson, P.A. (1990) "An MSE statistic for comparing forecast accuracy across series", International Journal of Forecasting, Vol. 6, pp. 219-227.
[23] Yang, Y. & Barron, A.R. (1999) "Information-theoretic determination of minimax rates of convergence", Annals of Statistics, Vol. 27, pp. 1564-1599.
[24] Yang, Y. (2000) "Mixing Strategies for Density Estimation", Annals of Statistics, Vol. 28, pp. 75-87.
[25] Yokum, J.T. & Armstrong, J.S. (1995) "Beyond Accuracy: Comparison of Criteria Used to Select Forecasting Methods", International Journal of Forecasting, Vol. 11, pp. 591-597.
            Case 1       Case 2       Case 3       Case 4       Case 5       Case 6       Case 7
m           20    2      20    2      20    2      20    2      20    2      20    2      20    2
MAPE        0.596 0.546  0.999 0.764  0.830 0.750  0.639 0.568  0.700 0.594  0.786 0.648  0.736 0.617
sMAPE       0.623 0.506  0.000 0.000  0.963 0.729  0.675 0.560  0.760 0.588  0.751 0.578  0.726 0.572
msMAPE      0.688 0.530  0.575 0.537  0.986 0.771  0.739 0.574  0.823 0.600  0.815 0.614  0.780 0.599
RMSE        0.757 0.544  0.979 0.721  0.996 0.791  0.825 0.576  0.824 0.592  0.938 0.662  0.841 0.627
NMSE        0.757 0.544  0.979 0.721  0.996 0.791  0.825 0.576  0.824 0.592  0.938 0.662  0.841 0.627
KL-N        0.760 0.544  0.977 0.720  0.996 0.791  0.824 0.576  0.821 0.592  0.937 0.662  0.840 0.627
KL-N1       0.709 0.546  0.946 0.711  0.979 0.777  0.778 0.576  0.770 0.593  0.910 0.661  0.823 0.626
KL-N2       0.752 0.547  0.966 0.717  0.992 0.788  0.807 0.577  0.799 0.591  0.931 0.662  0.836 0.625
KL-DE1      0.758 0.543  0.976 0.721  0.996 0.790  0.819 0.580  0.840 0.596  0.929 0.661  0.860 0.626
KL-DE2      0.757 0.543  0.975 0.720  0.996 0.789  0.817 0.579  0.844 0.596  0.928 0.661  0.860 0.625
IQR         0.758 0.544  0.977 0.720  0.996 0.791  0.822 0.577  0.820 0.593  0.934 0.663  0.841 0.627
RSE         0.633 0.541  0.836 0.701  0.992 0.851  0.600 0.564  0.642 0.581  0.699 0.639  0.671 0.613
mRSE        0.781 0.549  0.986 0.721  1.000 0.824  0.763 0.574  0.792 0.589  0.911 0.656  0.817 0.623
Theil's U2  0.757 0.544  0.979 0.721  0.996 0.791  0.825 0.576  0.824 0.592  0.938 0.662  0.841 0.627
RAE         0.724 0.544  0.970 0.716  0.995 0.788  0.784 0.577  0.854 0.602  0.925 0.659  0.861 0.624
MSEr1       0.610 0.521  0.916 0.674  0.975 0.747  0.658 0.557  0.776 0.591  0.864 0.636  0.810 0.615
MSEr2       0.691 0.529  0.953 0.647  0.984 0.708  0.759 0.550  0.740 0.571  0.902 0.610  0.796 0.589
MSEr3       0.686 0.529  0.926 0.647  0.961 0.708  0.749 0.550  0.730 0.571  0.865 0.610  0.778 0.589
Table 3: Percentage of Choosing the Best Model
             m=2    m=5    m=10
MAPE         0.022  0.020  0.021
sMAPE        0.057  0.045  0.051
msMAPE       0.057  0.047  0.052
RMSE         0.053  0.052  0.050
NMSE         0.044  0.048  0.049
KL-N         0.051  0.051  0.051
KL-N1        0.051  0.051  0.048
KL-N2        0.053  0.050  0.049
KL-DE1       0.050  0.049  0.051
KL-DE2       0.051  0.049  0.050
IQR          0.052  0.051  0.051
RSE          0.018  0.021  0.018
mRSE         0.051  0.050  0.054
Theil's U2   0.039  0.050  0.050
RAE          0.040  0.047  0.048
GmNMSE       0.055  0.050  0.049
GmTheil'sU2  0.055  0.050  0.049
GmRAE        0.055  0.051  0.050
MSEr1        0.053  0.050  0.050
MSEr2        0.045  0.041  0.041
MSEr3        0.024  0.020  0.018
Table 4: Empirical Size of the Paired t-test
# of series  60                            20
Error        Normal       Double Exp.      Normal       Double Exp.
m            5     2      5     2         5     2      5     2
MAPE         0.750 0.736  0.837 0.792     0.735 0.663  0.792 0.794
sMAPE        0.776 0.749  0.863 0.814     0.751 0.678  0.802 0.793
msMAPE       0.997 0.939  0.999 0.978     0.925 0.818  0.973 0.902
RMSE         1.000 0.981  1.000 0.993     0.979 0.871  0.990 0.928
NMSE         0.997 0.892  0.998 0.932     0.950 0.786  0.966 0.853
KL-N         1.000 0.964  0.999 0.986     0.970 0.863  0.980 0.906
KL-N1        0.998 0.954  0.998 0.987     0.961 0.847  0.972 0.890
KL-N2        1.000 0.956  0.998 0.982     0.972 0.862  0.981 0.904
KL-DE1       0.973 0.906  0.975 0.908     0.941 0.838  0.939 0.842
KL-DE2       0.973 0.909  0.971 0.909     0.939 0.836  0.941 0.847
IQR          1.000 0.963  0.999 0.986     0.967 0.849  0.980 0.903
RSE          0.701 0.707  0.772 0.771     0.687 0.661  0.721 0.730
mRSE         0.997 0.948  0.997 0.980     0.954 0.850  0.968 0.887
Theil's U2   0.999 0.913  0.998 0.941     0.961 0.815  0.973 0.860
RAE          0.993 0.891  0.998 0.952     0.937 0.797  0.974 0.879
GmNMSE       0.999 0.907  1.000 0.979     0.965 0.791  0.977 0.899
GmTheil'sU2  0.999 0.907  1.000 0.979     0.965 0.791  0.977 0.899
GmRAE        0.998 0.898  1.000 0.986     0.950 0.788  0.987 0.909
MSEr1        0.972 0.862  0.999 0.987     0.884 0.762  0.968 0.899
MSEr2        0.936 0.827  0.949 0.879     0.858 0.715  0.875 0.791
MSEr3        0.782 0.720  0.823 0.757     0.762 0.665  0.796 0.710
Table 5: Percentage of Choosing the Better Forecaster
forecast    Random Walk     ARIMA(1,1,0)    ARIMA(0,1,1)    ARIMA (BIC)
series      original new    original new    original new    original new
MAPE        0.024  0.302    0.023  0.306    0.022  0.296    0.030  0.381
sMAPE       0.024  0.280    0.023  0.264    0.022  0.256    0.030  0.371
msMAPE      0.023  0.207    0.022  0.196    0.021  0.190    0.029  0.263
RMSE        0.428  4.278    0.431  4.305    0.421  4.207    0.569  5.489
NMSE        0.426  0.135    0.428  0.135    0.423  0.134    0.492  0.153
KL-N        0.410  0.410    0.415  0.415    0.409  0.409    0.590  0.574
KL-N1       0.869  0.869    0.822  0.822    0.805  0.805    1.147  1.089
KL-N2       0.677  0.677    0.666  0.666    0.649  0.649    0.853  0.830
KL-DE1      0.071  0.071    0.071  0.071    0.069  0.069    0.132  0.122
KL-DE2      0.109  0.109    0.109  0.109    0.106  0.106    0.202  0.188
IQR         0.398  0.398    0.419  0.419    0.414  0.414    0.626  0.611
RSE         0.975  0.975    1.158  1.158    1.113  1.113    1.612  1.470
mRSE        0.348  0.348    0.347  0.347    0.341  0.341    0.501  0.478
Theil's U2  1.000  1.000    1.006  1.006    0.983  0.983    1.331  1.283
RAE         1.000  1.000    0.965  0.965    0.934  0.934    1.247  1.156
MSEr1       1.078  1.100    0.929  0.934    0.870  0.878    1.104  1.071
MSEr2       0.703  0.710    0.726  0.733    0.705  0.711    0.880  0.864
MSEr3       0.729  0.739    0.752  0.762    0.731  0.740    0.912  0.899
Table 6: Stability of Accuracy Measures to Linear Transformation
            Rounding  Truncation  Normal 1  Normal 2
MAPE        0.068     0.959       0.213     0.520
sMAPE       0.092     0.024       0.362     0.704
msMAPE      0.089     0.025       0.431     0.728
RMSE        0.056     0.043       0.620     0.877
NMSE        0.056     0.043       0.619     0.877
KL-N        0.089     0.070       0.612     0.867
KL-N1       0.278     0.896       0.368     0.134
KL-N2       0.148     0.062       0.598     0.861
KL-DE1      0.061     0.030       0.605     0.831
KL-DE2      0.047     0.024       0.602     0.828
IQR         0.071     0.053       0.611     0.874
RSE         0.004     0.001       0.017     0.067
mRSE        0.018     0.010       0.249     0.356
Theil's U2  0.056     0.043       0.620     0.877
RAE         0.044     0.031       0.541     0.873
MSEr1       0.169     0.859       0.604     0.170
MSEr2       0.201     0.077       0.571     0.634
MSEr3       0.243     0.086       0.562     0.627
Table 7: Rate of Ranking Change
... uses RMSE as the evaluation method for selecting a suitable model [6]. The formula for the training RMSE (RMSEtraining) is given in equation (9) [21], and that for the testing RMSE (RMSEtesting) in equation (10) [22]. ...
... sMAPE is more stable than MAPE [23]. The formula for the training sMAPE (sMAPEtraining) is given in equation (11) [21], and that for the testing sMAPE (sMAPEtesting) in equation (12) [22]. ...
... This study is limited to selecting the optimal window size for predicting inflation data based on the root mean square error (RMSE). RMSE is a parameter used to gauge the accuracy of a forecasting method [22]. The smaller the RMSE, the more accurate the prediction method. ...
... The second segment starts from the 2nd through the 6th data point, which are used to predict the 7th. This process continues until all of the observed data have been segmented [22]. The purpose of the sliding window is to reduce the approximation error (for example, the Euclidean distance or the vertical distance between the actual approximation and the time series). ...
Abstract: Inflation is an important indicator in setting government policy. Inflation data are released by Statistics Indonesia (Badan Pusat Statistik, BPS) at the beginning of every month. If inflation can be predicted early, the government can apply the appropriate policy. The backpropagation neural network is a commonly used prediction method. Using data from previous months, inflation can be predicted with a neural network via the sliding-window technique, also called the windowing method. Windowing restructures time-series data into cross-sectional data. The window size affects the accuracy of the prediction results. In this study, the authors experimented with three window sizes (6, 12, and 18) to see whether accuracy differs among them. The experiments show that a window size of 6 gives the best accuracy for predicting inflation, with an RMSE of 0.435. Keywords: backpropagation, prediction, sliding window
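The windowing technique described above can be sketched in a few lines. This is a minimal illustration, not code from the cited work: the name `make_windows` is our own, and the inflation figures below are made-up sample values, not BPS data.

```python
# Windowing restructures a 1-D time series into cross-sectional
# (window, target) pairs: each window of `size` consecutive
# observations is used to predict the next observation.

def make_windows(series, size):
    """Turn a 1-D series into (window, next_value) training pairs."""
    pairs = []
    for i in range(len(series) - size):
        window = series[i:i + size]   # e.g. observations 2..6
        target = series[i + size]     # predicts observation 7
        pairs.append((window, target))
    return pairs

inflation = [0.5, 0.3, 0.4, 0.6, 0.2, 0.1, 0.7, 0.4]  # made-up values
pairs = make_windows(inflation, size=5)
print(pairs[1])  # ([0.3, 0.4, 0.6, 0.2, 0.1], 0.7)
```

Each pair then serves as one cross-sectional training row for the neural network, with the window as inputs and the next value as the target.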
... The forecasting performance of PV power prediction models is evaluated using three statistical indicators, which are the mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) [24]. Their corresponding formulas are given by Equations (7)-(9): ...
Solar photovoltaic (PV) power generation is prone to drastic changes due to cloud cover. The power is easily affected within a very short period of time. Thus, the accuracy of grasping cloud distribution is important for PV power forecasting. This study proposes a novel sky image method to obtain the cloud coverage rate used for short-term PV power forecasting. The authors developed an image analysis algorithm from the sky images obtained by an on-site whole sky imager (WSI). To verify the effectiveness of cloud coverage rate as the parameter for PV power forecast, four different combinations of weather features were used to compare the accuracy of short-term PV power forecasting. In addition to the artificial neural network (ANN) model, long short-term memory (LSTM) and the gated recurrent unit (GRU) were also introduced to compare their applicability conditions. After a comprehensive analysis, the coverage rate is the key weather feature, which can improve the accuracy by about 2% compared to the case without coverage feature. It also indicates that the LSTM and GRU models revealed better forecast results under different weather conditions, meaning that the cloud coverage rate proposed in this study has a significant benefit for short-term PV power forecasting.
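The three indicators named above (MAE, RMSE, MAPE) have standard definitions; the sketch below is written from those definitions, not from the cited paper's Equations (7)-(9), and the sample values are made up.

```python
import math

def mae(actual, forecast):
    # mean absolute error
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    # root mean squared error
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    # mean absolute percentage error; undefined when any actual value is zero
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

y  = [100.0, 110.0, 120.0]   # made-up actuals
yh = [ 98.0, 112.0, 115.0]   # made-up forecasts
print(round(mae(y, yh), 3))   # 3.0
print(round(rmse(y, yh), 3))  # 3.317
print(round(mape(y, yh), 4))  # 0.0266
```

MAE and RMSE are in the units of the series, while MAPE is unit-free, which is why percentage measures are popular for cross-series comparison.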
... After obtaining every hourly forecast value of one-day-ahead PV power with the proposed model, it was necessary to validate the applicability of the model through certain indices. In machine learning, the most common indices used to evaluate the performance of models are MAE, RMSE, and MAPE [40]. ...
An increase in renewable energy injected into the power system will directly cause a fluctuation in the overall voltage and frequency of the power system. Thus, renewable energy prediction accuracy becomes vital to maintaining good power dispatch efficiency and power grid operation security. This article compares the one-day-ahead PV power forecasting results of three models paired with three groups of weather data. Since the number, loss, and matching problem of weather data will all influence the training results of the model, a pre-processing data framework is proposed to solve the problem in this study. The models used are a deep learning algorithm-based artificial neural network (ANN), long short-term memory (LSTM), and gated recurrent unit (GRU). The weather data groups are Central Weather Bureau (CWB), local weather station (LWS), and hybrid data (the combination of CWB and LWS data). Compared to the other two groups, hybrid data showed a 5–8% improvement in measurements. In addition, when it comes to different weather conditions, the advantages of the LSTM model were highlighted. After further analysis, the LSTM model combined with hybrid data showed the most accurate measurements, which was proved through forecasting results for one month. Finally, the results indicate that when the amount of data is limited, using hybrid data and the five weather features is helpful for training the model. Accordingly, the proposed model shows better one-day-ahead PV forecasting.
... In the literature, there are multiple definitions of the sMAPE. We choose the one introduced in [69] because it is bounded between 0 and 2; specifically, it reaches its maximum value of 2 when either y or ŷ is zero, and it is zero when the two values are identical. The sMAPE has two important drawbacks: it is undefined when both y and ŷ are zero, and it can be numerically unstable when the denominator in Eq. 4.1 is close to zero. ...
In the airline industry, price prediction plays a significant role for both customers and travel companies. The former are interested in knowing the price evolution to get the cheapest ticket; the latter want to offer attractive tour packages and maximize their revenue margin. In this work we introduce some practical approaches to help travelers deal with uncertainty in ticket price evolution, and we propose a data-driven framework to monitor the performance of time-series forecasting models. Stochastic Gradient Descent (SGD) is the workhorse optimization method in the field of machine learning, and this is true also for distributed systems, which in recent years have been increasingly used for complex models trained on massive datasets. In asynchronous systems workers can use stale versions of the parameters, which slows SGD convergence. In this thesis we fill the gap in the literature and study sparsification methods in asynchronous settings. We provide a concise convergence rate analysis when the joint effects of sparsification and asynchrony are taken into account, and show that sparsified SGD converges at the same rate as standard SGD. Recently, SGD has also played an important role as a way to perform approximate Bayesian inference. Stochastic gradient MCMC algorithms indeed use SGD with a constant learning rate to obtain samples from the posterior distribution. Despite some promising results restricted to simple models, most existing works fall short of easily dealing with the complexity of the loss landscape of deep models. In this thesis we introduce a practical approach to posterior sampling, which requires weaker assumptions than existing algorithms.
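The bounded sMAPE variant quoted above can be sketched as follows, assuming the common definition |y − ŷ| / ((|y| + |ŷ|)/2) averaged over the horizon (the exact Eq. 4.1 of the cited work is not reproduced here):

```python
# sMAPE variant bounded between 0 and 2: it is 0 when forecast and
# actual are identical, reaches its maximum of 2 when exactly one of
# them is zero, and is undefined when both are zero.

def smape(actual, forecast):
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        total += abs(a - f) / denom  # ZeroDivisionError if a == f == 0
    return total / len(actual)

print(smape([100.0], [100.0]))  # 0.0  (identical values)
print(smape([100.0], [0.0]))    # 2.0  (one value is zero)
```

The second drawback noted above is visible in the code: when `denom` is near zero, a tiny absolute error can still produce a term near the maximum of 2.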
... We adopted the 5 most commonly used accuracy measures to compare the models' forecasting results with the actual daily new confirmed COVID-19 case counts. The accuracy measures included the RMSE, mean absolute error (MAE), mean absolute percentage error (MAPE), correlation with the forecasting target, and correlation of increments with the forecasting target (the formulas for the accuracy indexes are presented in Multimedia Appendix 4) [14,49]. We conducted the analyses with R version 4.0.2. ...
Background: The SARS-CoV-2 virus and its variants pose extraordinary challenges for public health worldwide. Timely and accurate forecasting of the COVID-19 epidemic is key to sustaining interventions and policies and to efficient resource allocation. Internet-based data sources have shown great potential to supplement traditional infectious disease surveillance, and combining different Internet-based data sources has shown greater power to enhance epidemic forecasting accuracy than using a single source. However, existing methods incorporating multiple Internet-based data sources only used real-time data from these sources as exogenous inputs but did not take all the historical data into account. Moreover, the predictive power of different Internet-based data sources in providing early warning for COVID-19 outbreaks has not been fully explored. Objective: The main aim of our study is to explore whether combining real-time and historical data from multiple Internet-based sources could improve COVID-19 forecasting accuracy over existing baseline models. A secondary aim is to explore COVID-19 forecasting timeliness based on different Internet-based data sources. Methods: We first used core-term and symptom-related keyword-based methods to extract COVID-19-related Internet-based data from December 21, 2019, to February 29, 2020. The Internet-based data we explored included 90,493,912 online news articles, 37,401,900 microblogs, and all Baidu search query data during that period. We then proposed an autoregressive model with exogenous inputs, incorporating real-time and historical data from multiple Internet-based sources. Our proposed model was compared with baseline models, and all models were tested during the first wave of COVID-19 epidemics in Hubei province and the rest of mainland China separately. We also used lagged Pearson correlations for the COVID-19 forecasting timeliness analysis.
Results: Our proposed model achieved the highest accuracy on all five accuracy measures, compared with all baseline models, for both Hubei province and the rest of mainland China. In mainland China excluding Hubei, the differences in COVID-19 epidemic forecasting accuracy between our proposed model (model i) and all the other baseline models were statistically significant (model 1, t=-8.722, P<.001; model 2, t=-5.000, P<.001; model 3, t=-1.882, P=.063; model 4, t=-4.644, P<.001; model 5, t=-4.488, P<.001). In Hubei province, our proposed model's forecasting accuracy improved significantly compared with the baseline model using historical COVID-19 new confirmed case counts only (model 1, t=-1.732, P=.086). Our results also showed that Internet-based sources could provide a 2-6 day earlier warning for COVID-19 outbreaks. Conclusions: Our approach incorporating real-time and historical data from multiple Internet-based sources could improve forecasting accuracy for COVID-19 and its variants, which may help improve public health agencies' interventions and resource allocation in mitigating and controlling new waves of COVID-19 or other relevant epidemics.
... A prediction model for forecasting late blight of potato was developed by generating multiple linear regression equations [45]. To assess model accuracy, the root mean squared error [46] and the mean absolute percentage error [47] were used. Different tests were performed to check the assumptions of multiple linear regression, viz. gvlma [48], an outlier test (Bonferroni test), a non-constant variance test (Breusch-Pagan test), and a normality test (Shapiro-Wilk test), using R version 4.0.2 ...
Epidemiological studies conducted during rabi 2018-20 revealed that the weather parameters maximum temperature, minimum temperature, maximum relative humidity, minimum relative humidity, and rainfall were responsible for the development of late blight of potato. Primary infection initiated at the 50th Standard Meteorological Week (SMW), with a maximum infection rate of 0.108 at the 3rd SMW. Correlation analysis showed that maximum temperature had a significantly positive correlation with the intensity of late blight of potato, whereas minimum relative humidity had a significantly negative correlation. Stepwise multiple linear regression equations for the two years (2018-20) revealed that maximum temperature, maximum and minimum relative humidity, and rainfall together contributed 95.93% to the disease.
... Following the generation of forecasts, the next step is measuring forecasting accuracy relative to the random walk as a benchmark. Chen and Yang (2004) classify measures of forecasting accuracy into two types: (i) stand-alone measures and (ii) relative measures. The stand-alone accuracy measures are calculated without the need to compare the forecasts to a given reference or a benchmark such as mean absolute error, mean square error, and root mean square error. ...
This research presents a macroeconometric model describing interactions between real and financial variables in the economy of Kuwait, which has distinctive characteristics. While Kuwait is considered a developing country, it enjoys the benefits linked with developed economies. The objective of developing this model is to trace the relationship between financial, monetary, and real variables in the economy. The model provides an analytical tool to determine how the monetary and real sectors affect each other, making it possible to quantify the connection between prices, income, and money in a macroeconomic framework. The model is a recursive system of equations that is estimated in three forms: autoregressive-distributed lag (ARDL), the static long-run relation, and the error correction model (ECM). The predictive power of the model is examined by generating out-of-sample forecasts using the recursive (expanding-window) approach. The accuracy of the forecasts is assessed by estimating several forecasting accuracy measures based on the magnitude of the error, such as the root mean square error (RMSE) and Theil's inequality coefficient, in addition to measures of direction accuracy. Furthermore, to measure the profitability of trading based on the forecasts, several forecasting-based trading strategies are applied to stock prices and interest rates. Subsequently, the profitability of the trading strategies is measured by estimating the average annual compound rate of return (AACRR) and the cumulative return on the portfolios. The empirical analysis is performed using quarterly time series data covering the period from 1995 to 2017. The estimation results reveal that the model is well specified and has high explanatory power. While several equations pass all of the diagnostic tests, some do not pass the normality test, which is attributed to the presence of outliers.
Moreover, cointegration tests reveal the presence of cointegration between the variables in all of the equations, indicating that there is a stable long-run relation between the variables. The main conclusion to be drawn from the forecasting accuracy measures is that the random walk cannot be outperformed in terms of the measures based on the magnitude of the error, which is in line with the findings of Meese and Rogoff (1983). Furthermore, the findings indicate that most of the equations have a direction accuracy of more than 50%, which means the model’s predictive power for directional changes is by far better than that of the random walk, which always predicts no change. The trading results indicate that, when the appropriate trading strategy is applied, the model is capable of generating profits. In terms of profitability, trading based on the interest rate forecasts yields better cumulative returns than trading based on stock price forecasts. Nevertheless, political instability in the region and the global financial crisis negatively affected the results of trading based on the forecasts of the stock prices. It must be stated at this early stage that this is a finance rather than economics thesis, in which case emphasis is placed on the use of predictions generated by the model to trade on the basis of variations in stock prices and interest rates. This procedure allows us to judge the predictive power of the model in terms of profitability, which is more appropriate than judging it by the statistical measures that depend on the magnitude of the error. The model is not built for its own sake or to conduct policy analysis, as in economics, but rather to trade, as in finance. Yet, the model can be used to derive some policy implications, particularly the estimated elasticities. 
The use of a multi-equation model that incorporates real and financial variables reflects the belief that financial markets do not operate in a vacuum and that financial variables affect and are affected by the real economy. This is not typically emphasised in the finance models of stock prices and interest rates.
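The distinction drawn above between stand-alone and relative accuracy measures can be made concrete: RMSE is computed from the forecast errors alone, while a relative measure such as Theil's U2 scales the errors by those of a benchmark, here the naive random walk. The sketch below uses made-up sample values and the common ratio-of-RMSEs form of U2.

```python
import math

def rmse(actual, forecast):
    # stand-alone measure: uses only the forecast errors
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def theils_u2(actual, forecast):
    # relative measure: compares against the naive random-walk forecast,
    # which predicts each value by the previous observation
    naive = actual[:-1]
    return rmse(actual[1:], forecast[1:]) / rmse(actual[1:], naive)

y  = [10.0, 12.0, 11.0, 13.0, 14.0]   # made-up actuals
yh = [10.5, 11.5, 11.5, 12.5, 13.5]   # made-up forecasts
print(round(rmse(y, yh), 3))          # 0.5
print(round(theils_u2(y, yh), 3))     # 0.316; values below 1 beat the random walk
```

A U2 below 1 means the forecast outperforms the random walk, which is exactly the benchmark the thesis above reports as hard to beat on magnitude-of-error measures.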
Pressure ulcers are a common and costly health problem in clinical medicine, caused primarily by continuous vertical pressure on local tissues of the body. Recently, flexible devices have been used to warn of pressure ulcers by continuously monitoring the pressure applied to the human skin. Although many pressure sensors provide excellent sensing performance, they still rely on uncomfortable, unstretchable systems with rigid electronics. These devices are mounted on the human body, which hinders continuous pressure monitoring. Herein, we propose a fully integrated flexible electronic system (FIFES) with highly sensitive multi-walled carbon nanotube (MWNT) piezoresistive array sensors for pressure monitoring, which accurately acquires external pressure signals and independently performs processing and wireless transmission. We develop a heat-seal connecting technology for integrating the piezoresistive array sensors and their flexible processing circuit, which overcomes the compatibility defect between a rigid circuit board and the flexible sensors. Furthermore, Cu-Ge metal sputtering and transferring technologies are employed to achieve an extensible circuit. The as-fabricated fully integrated electronic system has excellent flexibility, a rapid response time (response/recovery times of 0.11 s and 0.08 s, respectively), a wide dynamic range (6-50 kPa), and an excellent sensitivity (1.61 kPa⁻¹ below 18 kPa and 0.47 kPa⁻¹ over the range of 18-50 kPa), attributed to the optimized number of conical microstructures. Owing to its excellent performance, the FIFES has extensive applications for pressure monitoring on different parts of the human epidermis, such as elbow and knee bending. It also has immense potential in sleep posture monitoring and pressure ulcer warning.
Machine learning methods are increasingly used in analyzing remotely sensed data and studying different aspects of agricultural production. In particular, several of these flexible models are widely adopted to predict regional crop yield during or after the growing season. However, most existing models cannot be applied when dealing with functional covariates. In this paper, an approach based on multidimensional scaling is proposed to generate a set of artificial covariates from empirical density functions of different phenomena captured within specific administrative boundaries through satellites. In contrast to traditional aggregation methods, this approach is designed to reduce the loss of information associated with the use of summary statistics as covariates. The proposed methodology is applied to NASA remote sensing data, combined with information from surveys and USDA’s end-of-season county estimates, to study the prediction accuracy of different crop-yield models for three major crops in North Dakota.
In this study, the authors used 111 time series to examine the accuracy of various forecasting methods, particularly time-series methods. The study shows, at least for time series, why some methods achieve greater accuracy than others for different types of data. The authors offer some explanation of the seemingly conflicting conclusions of past empirical research on the accuracy of forecasting. One novel contribution of the paper is the development of regression equations expressing accuracy as a function of factors such as randomness, seasonality, trend-cycle and the number of data points describing the series. Surprisingly, the study shows that for these 111 series simpler methods perform well in comparison to the more complex and statistically sophisticated ARMA models.
Linear models are invariant under non-singular, scale-preserving linear transformations, whereas mean square forecast errors (MSFEs) are not. Different rankings may result across models or methods from choosing alternative yet isomorphic representations of a process. One approach can dominate others for comparisons in levels, yet lose to another for differences, to a second for cointegrating vectors and to a third for combinations of variables. The potential for switches in ranking is related to criticisms of the inadequacy of MSFE against encompassing criteria, which are invariant under linear transforms and entail MSFE dominance. An invariant evaluation criterion which avoids misleading outcomes is examined in a Monte Carlo study of forecasting methods.
In evaluations of forecasting accuracy, including forecasting competitions, researchers have paid attention to the selection of time series and to the appropriateness of forecast-error measures. However, they have not formally analyzed choices in the implementation of out-of-sample tests, making it difficult to replicate and compare forecasting accuracy studies. In this paper, I (1) explain the structure of out-of-sample tests, (2) provide guidelines for implementing these tests, and (3) evaluate the adequacy of out-of-sample tests in forecasting software. The issues examined include series-splitting rules, fixed versus rolling origins, updating versus recalibration of model coefficients, fixed versus rolling windows, single versus multiple test periods, diversification through multiple time series, and design characteristics of forecasting competitions. For individual time series, the efficiency and reliability of out-of-sample tests can be improved by employing rolling-origin evaluations, recalibrating coefficients, and using multiple test periods. The results of forecasting competitions would be more generalizable if based upon precisely described groups of time series, in which the series are homogeneous within group and heterogeneous between groups. Few forecasting software programs adequately implement out-of-sample evaluations, especially general statistical packages and spreadsheet add-ins.
When considering the relative quality of forecasts, the method of comparison is relevant: should we use vertical measures, such as the mean square forecasting error, or the recently developed horizontal measure, time distance? Four models for inflation in the US are considered, based on a univariate time series, a leading indicator, a univariate model combining the specifications of the two, and a bivariate model. According to the mean squared forecast errors an AR(1) model is superior, but it performs much less well than models using a leading indicator when considered in terms of time distance. These results hold both for standard procedures and for the bootstrap reality check. © 2002 Published by Elsevier B.V. on behalf of the International Institute of Forecasters.
This paper argues in favour of a closer link between decision and forecast evaluation problems. Although the idea of using decision theory for forecast evaluation appears early in the dynamic stochastic programming literature, and has continued to be used in meteorological forecasts, it is hardly mentioned in standard academic textbooks on economic forecasting. Some of the main issues involved are illustrated in the context of a two-state, two-action decision problem as well as in a more general setting. Relationships between statistical and economic methods of forecast evaluation are discussed and useful links between Kuipers score, used as a measure of forecast accuracy in the meteorology literature, and the market timing tests used in finance, are established. An empirical application to the problem of stock market predictability is also provided, and the conditions under which such predictability could be exploited in the presence of transaction costs are discussed.
Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.
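The Kullback-Leibler divergence, central both to the information theory summarized above and to the KL-based accuracy measures this paper proposes, has a simple discrete form, D(p‖q) = Σᵢ pᵢ log(pᵢ/qᵢ). A minimal sketch with made-up distributions:

```python
import math

# Discrete Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i).
# Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute zero.
# It is non-negative and equals zero exactly when p == q.

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # made-up "true" distribution
q = [0.4, 0.4, 0.2]   # made-up approximating distribution
print(round(kl_divergence(p, q), 4))  # 0.0253
print(kl_divergence(p, p))            # 0.0
```

Note that D(p‖q) is not symmetric in p and q, which is one reason a KL-based accuracy measure must fix which distribution plays the role of reference.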
This paper describes the M3-Competition, the latest of the M-Competitions. It explains the reasons for conducting the competition and summarizes its results and conclusions. In addition, the paper compares such results/conclusions with those of the previous two M-Competitions as well as with those of other major empirical studies. Finally, the implications of these results and conclusions are considered, their consequences for both the theory and practice of forecasting are explored and directions for future research are contemplated.