Data Models for Forecasting:
No Reason to Expect Improved Accuracy
J. Scott Armstrong
jscott@upenn.edu
The Wharton School, University of Pennsylvania
and Ehrenberg-Bass Institute, University of South Australia
Kesten C. Green
kesten.green@unisa.edu.au
University of South Australia Business School
and Ehrenberg-Bass Institute, University of South Australia
January 26, 2019 - Version 22
In the mid-1900s, there were two streams of thought about forecasting methods. One stream, led by
econometricians, was concerned with developing causal models by using prior knowledge and evidence
from experiments. The other was led by statisticians, who were concerned with identifying idealized “data
generating processes” and with developing models from statistical relationships in data, both in the
expectation that the resulting models would provide accurate forecasts.
At that time, regression analysis was a costly process. In more recent times, regression analysis and
related techniques have become simple and inexpensive to use. That development led to automated
procedures such as stepwise regression, which selects “predictor variables” on the basis of statistical
significance. An early response to the development was titled, “Alchemy in the behavioral sciences”
(Einhorn, 1972). We refer to the product of data-driven approaches to forecasting as “data models.”
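To make the target of that criticism concrete, the following sketch shows a bare-bones forward stepwise regression that selects predictors solely by statistical significance. It is our own illustration, not Einhorn’s analysis: the data (pure noise), the 0.05 threshold, and the use of the statsmodels library are arbitrary choices made for the example.

```python
# A bare-bones forward stepwise regression: predictors are added one at a time
# solely because their coefficient's p-value is the smallest one below a
# threshold. The data (pure noise), the 0.05 threshold, and the use of the
# statsmodels library are illustrative choices, not part of the cited work.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 200, 50
X = rng.normal(size=(n, k))   # 50 candidate "predictor variables", all noise
y = rng.normal(size=n)        # an outcome unrelated to any of them

selected, remaining = [], list(range(k))
while remaining:
    pvals = {}
    for j in remaining:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        pvals[j] = fit.pvalues[-1]          # p-value of the candidate variable
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:                 # stop when nothing is "significant"
        break
    selected.append(best)
    remaining.remove(best)

print("Variables selected from pure noise:", selected)
```

No prior knowledge about causal relationships enters the procedure at any point; the selection is driven entirely by the sample statistics.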
The M4-Competition (Makridakis, Spiliotis, & Assimakopoulos, 2018) has provided extensive tests of
whether data models (which they refer to as “ML methods”) can provide accurate extrapolation
forecasts of time series. The Competition findings revealed that data models failed to beat naïve models
and established simple methods with sufficient reliability to be of any practical interest to forecasters. In
particular, the authors concluded from their analysis, “The six pure ML methods that were submitted in
the M4 all performed poorly, with none of them being more accurate than Comb and only one being more
accurate than Naïve2” (p. 803).
Over the past half-century, much has been learned about how to improve forecasting by conducting
experiments to compare the performance of reasonable alternative methods. On the other hand, despite
billions of dollars of expenditure, the various data modeling methods have not contributed to improving
forecast accuracy. Nor can they do so, as we explain below.
1. Data Models Violate the Golden Rule of Forecasting
The Golden Rule of forecasting is to “Be conservative by using prior knowledge about the situation
and about forecasting methods.” It was developed and tested by Armstrong, Green and Graefe (2015).
The Golden Rule paper provided 28 evidence-based guidelines, which were tested by reviewing papers
with relevant evidence. The review identified 105 papers with 150 experimental comparisons. All
comparisons supported the guidelines. On average, ignoring a single guideline increased forecast error by
more than 40%.
One way to incorporate prior knowledge is to use the findings of experiments on which methods work
best for the type of situation being forecast. In addition, in practical forecasting situations one can use
experts’ domain knowledge about the expected directions of trends. In an earlier M-Competition, these
two sources of knowledge were implemented by Rule-based forecasting (Collopy and Armstrong, 1992).
Rule-based forecasting uses domain knowledge to select extrapolation models based on 28 conditions
of the data in order to produce combined forecasts. It then uses 99 simple rules to weight each of the
forecasting methods. For six-year-ahead forecasts, the ex ante rule-based forecasts reduced error by 42%
compared to forecasts from equal-weights combinations. Other sources of prior knowledge that can be used
include decomposition of time series by level and change (Armstrong and Tessier, 2015) and by causal
forces (Armstrong and Collopy, 1993).
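To illustrate how such rules can be applied, the sketch below combines forecasts from three simple extrapolation methods and adjusts equal weights with two toy rules based on an expert’s expected direction of change. The function, the smoothing constant, and the rules are our own simplifications for illustration; they are not the 99-rule base of Collopy and Armstrong (1992).

```python
# Illustrative sketch only: a toy version of rule-based combination of
# extrapolation forecasts. The weights and the two "rules" below are
# hypothetical simplifications of the approach described in the text.

import numpy as np


def toy_rule_based_forecast(series, expected_direction, horizon=1):
    """Combine three simple extrapolations using toy domain-knowledge rules.

    series: 1-D sequence of historical observations (annual data assumed).
    expected_direction: +1 if domain experts expect growth, -1 if decline,
        0 if they have no expectation.
    """
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))

    # Three simple extrapolation forecasts for the next `horizon` periods.
    random_walk = np.full(horizon, y[-1])
    slope, intercept = np.polyfit(t, y, 1)
    linear_trend = intercept + slope * (len(y) + np.arange(horizon))
    alpha = 0.3  # smoothing constant, arbitrary for this sketch
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    smoothed = np.full(horizon, level)

    forecasts = np.vstack([random_walk, linear_trend, smoothed])
    weights = np.ones(3)  # start from equal weights (a conservative default)

    # Toy rule 1: if the expert's expected direction agrees with the
    # statistical trend, give more weight to the trending method.
    if expected_direction != 0 and np.sign(slope) == expected_direction:
        weights[1] *= 2.0
    # Toy rule 2: if they conflict, rely more on the random walk.
    elif expected_direction != 0:
        weights[0] *= 2.0

    weights /= weights.sum()
    return weights @ forecasts


print(toy_rule_based_forecast([100, 104, 109, 112, 118],
                              expected_direction=+1, horizon=3))
```

The design point is that domain knowledge enters through explicit, auditable rules about conditions of the data, rather than through coefficients fitted to the data alone.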
2. Data Models Violate Occam’s Razor
Occam’s Razor, a principle that was described by Aristotle, states that one should prefer the simplest
hypothesis or model that does the job. A review of 32 studies found 97 comparisons between simple and
complex methods (Green and Armstrong, 2015). None found that complexity improved forecast accuracy.
On the contrary, complexity increased errors by an average of 27% in the 25 papers with quantitative
comparisons.
Unsurprisingly, then, all of the validated methods for forecasting are simple (see the Methods
checklist at ForecastingPrinciples.com).
3. Data Models Enable Advocacy, Leading to
Unscientific and, Potentially, Unethical Practices
The nature of data modeling procedures is such that researchers can develop data models to provide
the forecasts that they know their clients or sponsors would prefer. Doing so helps them to get grants and
promotions. They can also use them to support their own preferred hypotheses.
It is not difficult to obtain statistically significant findings. And, of course, all tested relationships
become statistically significant if the sample sizes are large enough.
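The point can be demonstrated with a few lines of simulation. The sketch below is our own illustration: the effect size and sample sizes are arbitrary. It generates a relationship far too weak to be of practical use and shows that the p-value nevertheless falls below the conventional threshold once the sample is large enough.

```python
# A minimal simulation: a correlation too small to matter in practice
# (r of roughly 0.02) becomes "statistically significant" once the sample
# is large enough. Effect size and sample sizes are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = 0.02 * x + rng.normal(size=n)   # a trivially weak relationship
    r, p = stats.pearsonr(x, y)
    print(f"n={n:>9,}  r={r:+.3f}  p={p:.4f}")
```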
Despite the widespread use of statistical significance testing, the technique has never been validated.
On the contrary, decades of research have found that statistical significance testing harms scientific
advances. For example, Koning et al. (2005) concluded that statistical tests showed that combining
forecasts did not reduce forecasting errors in the M3-Competition. Those findings were refuted in
Armstrong (2007). Research since that time has continued to find gains in accuracy from combining (see
Armstrong and Green, 2018, for a recent summary of the research). As Makridakis et al. (2018, p. 803)
concluded from their analysis of the M4-Competition forecasts, “The combination of methods was the
king of the M4. Of the 17 most accurate methods, 12 were ‘combinations’ of mostly statistical
approaches.” In that case, the combinations were done within methods. Combinations across validated
methods have been found to be even more effective at reducing forecast errors due to the different
knowledge, information, and biases that different methods bring (Armstrong and Green, 2018).
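The arithmetic behind combining can be shown with a toy simulation, which is our own illustration with invented numbers: when two methods bring different biases, their errors partly offset, so the equal-weight average is typically closer to the truth than either method alone.

```python
# A toy demonstration of why combining helps: two forecasting "methods" with
# opposite biases and independent noise bracket the truth, so their simple
# average has a smaller error than the typical individual forecast.
# All numbers are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)
truth = 100.0
n_trials = 10_000

method_a = truth + 5 + rng.normal(0, 10, n_trials)   # optimistic method
method_b = truth - 5 + rng.normal(0, 10, n_trials)   # pessimistic method
combined = (method_a + method_b) / 2                  # equal-weight combination


def mae(forecasts):
    """Mean absolute error against the known truth."""
    return float(np.mean(np.abs(forecasts - truth)))


print(f"MAE method A: {mae(method_a):.2f}")
print(f"MAE method B: {mae(method_b):.2f}")
print(f"MAE combined: {mae(combined):.2f}")   # noticeably smaller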
4. Data Models Violate the Guidelines for Regression Analysis
We rated data modeling procedures against the Checklist for forecasting using regression analysis (at
ForecastingPrinciples.com). By our ratings, typical procedures for estimating data models violate at least
15 of the 18 guidelines in the checklist. In particular, they violate the guidance to limit models to three or
fewer causal variables when using non-experimental data. A summary of evidence on the limitations of
regression analysis is provided in Armstrong (2012).
5. Data Models Failed when Previously Tested
Before the M4 Competition, an extensive evaluation of published time-series data mining methods
using diverse data sets and performance measures found insufficient evidence to conclude that the
methods could be useful in practice (Keogh and Kasetty, 2003). That paper has been cited over 1,700
times to date. Those who contemplate using data models might want to familiarize themselves with these
findings before proceeding.
6. “Knowledge Models” for Forecasting When Data are Plentiful
Benjamin Franklin proposed a simple method that we now refer to as “knowledge models” and
previously referred to as “index models.” The procedure is as follows (a code sketch illustrating the
scoring steps appears after the list):
a. Use domain knowledge to specify:
   - all important causal variables,
   - the directions of their effects, and
   - if possible, the magnitudes of their relationships.
   Define variables so that they are positively related to the thing being forecast. There is no limit on
   the number of causal variables that can be used; variables may be as simple as binary (“dummy”)
   variables.
b. Use equal weights and standardized variables in the model unless there is strong evidence of
differences in relative effect sizes. Equal weights are often more accurate than regression
weights, especially when there are many variables and where prior knowledge about relative
effect sizes is poor.
c. Forecast values of the causal variables in the model.
d. Apply the model weights to the forecast values of the causal variables and sum to calculate a score.
e. The score is a forecast:
   - A higher score means that the thing being forecast is likely to be better, greater, or more
     likely than would be the case for a lower score.
   - If there are sufficient data to do so, estimate a single simple regression that relates past scores
     from the model to the actual values of the thing being forecast.
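The sketch below illustrates the scoring steps for a hypothetical example; the function name, variable names, data, and standardization histories are our own inventions for illustration and are not taken from any of the cited studies.

```python
# A minimal sketch of an equal-weights knowledge model. It assumes the
# forecaster has already (a) listed the important causal variables,
# (b) defined each so that larger values favor the outcome, and
# (c) forecast the variables' values. All names and data are hypothetical.

import numpy as np


def knowledge_model_score(forecast_values, history=None):
    """Standardize each causal variable and sum with equal weights.

    forecast_values: dict of {variable_name: forecast value}.
    history: dict of {variable_name: past values} used for standardization;
        if omitted, raw values are summed.
    """
    score = 0.0
    for name, value in forecast_values.items():
        if history is not None and name in history:
            past = np.asarray(history[name], dtype=float)
            value = (value - past.mean()) / past.std()  # z-score
        score += value  # equal weights: each variable contributes once
    return score


# Hypothetical example: two election candidates rated on positively framed variables.
history = {
    "economy_growing": [0, 1, 1, 0, 1],
    "incumbent_approval": [41, 55, 48, 39, 60],
    "candidate_experience_years": [4, 8, 12, 0, 6],
}
candidate_a = {"economy_growing": 1, "incumbent_approval": 52,
               "candidate_experience_years": 10}
candidate_b = {"economy_growing": 0, "incumbent_approval": 44,
               "candidate_experience_years": 2}

print(knowledge_model_score(candidate_a, history))  # higher score: more likely to win
print(knowledge_model_score(candidate_b, history))
```

Where sufficient past outcomes are available, the optional calibration in step e could be done with a single simple regression of past actual values on past scores (for example, using numpy.polyfit), converting scores into forecasts in the units of the thing being forecast.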
Evidence to date suggests that knowledge models are likely to produce forecasts that are more
accurate than those from data models in situations where many causal variables are important. One study
found error reductions of 10% to 43% compared to established regression models for forecasting elections
in the US and Australia (Graefe, Green, and Armstrong, 2019).
Conclusion: Data Models Should Never Be Used for Forecasting, and a Suggestion
Forecasters have used data modeling methods of various kinds since the 1960s. Despite the resources
that have been devoted to finding support for statistical methods that use big data, we have been unable to
find scientific evidence to support their use under any conditions.
Science advances not by looking for evidence to support a favorite hypothesis, but by using prior
knowledge and experiments to test alternative hypotheses in order to discover useful principles and
methods, as the M4-Competition has done.
Our suggestion for future competitions is that when competitors submit their models and forecasts,
they should use prior research to explain the principles and methods that they used, their prior hypotheses
on relative accuracy under different conditions, and which category of method their model belongs to.
That would allow the competition organizers, commentators, and other researchers to compare the results
by category of method, taking into account prior evidence, rather than resorting to ex post speculation on
what can be learned from the results.
References
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of
Forecasting, 23, 321-336.
Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 28,
689-694.
Armstrong, J. S. & Collopy, F. (1993). Causal forces: Structuring knowledge for time-series
extrapolation. Journal of Forecasting, 12, 103-115.
Armstrong, J. S., Collopy, F., & Yokum, J. T. (2005). Decomposition by causal forces: A procedure for
forecasting complex time series. International Journal of Forecasting, 21, 25-36.
Armstrong, J. S. & Green, K. C. (2018). Forecasting methods and principles: Evidence-based checklists.
Journal of Global Scholars of Marketing Science, 28, 103-159.
Armstrong, J. S., Green, K. C., & Graefe, A. (2015). Golden rule of forecasting: Be conservative. Journal
of Business Research, 68, 1717-1731.
Armstrong, J. S. & Tessier, T. (2015). Decomposition of time-series by level and change. Journal of
Business Research, 68, 1755-1758.
Collopy, F. & Armstrong, J. S. (1992). Rule-based forecasting: Development and validation of an
expert systems approach to combining time series extrapolations. Management Science, 38,
1394-1414.
Einhorn, H. (1972). Alchemy in the behavioral sciences. Public Opinion Quarterly, 36, 367-378.
Graefe, A., Green, K. C., & Armstrong, J. S. (2019). Accuracy gains from conservative forecasting: Tests
using variations of 19 econometric models to predict 154 elections in 10 countries. PLoS ONE, 14(1),
e0209850.
Green, K. C. & Armstrong, J. S. (2015). Simple versus complex forecasting: The evidence. Journal of
Business Research, 68, 1678-1685.
Keogh, E. & Kasetty, S. (2003). On the need for time series data mining benchmarks: A survey and
empirical demonstration. Data Mining and Knowledge Discovery, 7, 349-371.
Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. O. (2005). The M3-Competition: Statistical tests of
the results. International Journal of Forecasting, 21, 397-409.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: Results, findings,
conclusion and way forward. International Journal of Forecasting, 34, 802-808.
Acknowledgements: We thank Robert Fildes, Andreas Graefe, and Eamonn Keogh for reviewing
this paper.