Small Steps to Accuracy: Incremental Belief Updaters Are Better Forecasters
PAVEL ATANASOV, Pytho LLC, USA
JENS WITKOWSKI, Frankfurt School of Finance & Management, Germany
LYLE UNGAR, University of Pennsylvania, USA
BARBARA MELLERS, University of Pennsylvania, USA
PHILIP TETLOCK, University of Pennsylvania, USA
Laboratory research has shown that both underreaction and overreaction to new information pose threats to
forecasting accuracy. This article explores how real-world forecasters who vary in skill attempt to balance
these threats. We distinguish among three aspects of updating: frequency, magnitude, and confirmation propensity. Drawing on data from a four-year forecasting tournament that elicited over 400,000 probabilistic predictions on almost 500 geopolitical questions, we found that the most accurate forecasters made frequent, small updates, while low-skill forecasters were prone to confirm initial judgments or make infrequent, large revisions. High-frequency updaters scored higher on crystallized intelligence and open-mindedness, accessed more information, and improved over time. Small-increment updaters had higher fluid intelligence scores and derived their advantage from initial forecasts. Update magnitude mediated the causal effect of training on
accuracy. Frequent, small revisions provided reliable and valid signals of skill. These updating patterns can
help organizations identify talent for managing uncertain prospects.
Note: This is a preprint that may slightly differ from the version to be published in Organizational Behavior and
Human Decision Processes, 160:19–35, September 2020.
1 INTRODUCTION
Should organizations be more concerned that forecasters, analysts, and executive decision makers
will underreact or overreact to the daily flow of new information across their desks? The literature on judgment and choice provides conflicting answers. The theoretical and empirical grounds for
expecting underreaction to be more detrimental to judgment quality include Bayesian conservatism
(Edwards 1968) and the anchoring heuristic (Tversky and Kahneman 1974). They describe a common
tendency to update estimates in the right direction, but not far enough. Reasons for thinking
overreaction poses the more serious problem include the availability (Tversky and Kahneman 1973)
and the representativeness heuristics (Kahneman and Tversky 1972). These heuristics focus our
attention on recent, memorable, and distinctive features (Kahneman and Tversky 1973, Bar-Hillel
1980), and have been cited as reasons behind suboptimal real-world behavior, such as excessive
volatility in financial markets (De Bondt and Thaler 1985, Shiller 1981, Arrow 1982). Yet we know relatively little about how real-world forecasters balance the threats of updating too little versus too much in naturalistic settings.
Belief updating is an integral part of decision making in organizations, in contexts such as
auditing (Ashton and Ashton 1998, Hogarth 1991), organizational expectation setting (Ehrig 2015),
and strategic decision making. Entrepreneurs face choices that rely on implicit predictions, such as
choosing if, when, and how to pivot from one strategy to another (Ries 2011). The propensity of
leaders to update their beliefs and change their minds is an oft-debated aspect of business acumen
(Pittampalli 2016). Amazon founder Jeff Bezos has spoken publicly about the benefits of frequent
belief revisions, noting that “people who are right a lot of the time are those who often change
their minds” (Fried 2012).
Authors’ addresses: Pavel Atanasov, Pytho LLC, USA, pavel@pytho.io; Jens Witkowski, Frankfurt School of Finance &
Management, Germany; Lyle Ungar, University of Pennsylvania, USA; Barbara Mellers, University of Pennsylvania, USA;
Philip Tetlock, University of Pennsylvania, USA.
Does Bezos’ heuristic help us identify those who are right about the future a lot of the time? To
answer this question, we must first consider the extent to which real-world forecasting performance
is a matter of skill rather than luck. Mellers et al. (2015) demonstrated that prediction accuracy was
due in part to skill; past performance reliably predicted future performance. Furthermore, individuals
with certain cognitive and personality profiles tended to possess more skill. Atanasov et al. (2017)
found that weighting predictions based on forecasters' past performance improved accuracy of prediction polls,¹ which feature direct probability elicitation and algorithmic aggregation, and enabled polls to outperform prediction markets.² Small crowds selected based on past performance have been shown to outperform larger, less selective crowds (Mannes et al. 2014, Goldstein et al. 2014).
Thus, past accuracy is useful in choosing how to weight opinions. Historical accuracy data, however, is not always available (Witkowski et al. 2017). For example, newly hired political or financial
analysts need time to build a track record. In the meantime, managers may have little information
on which to judge the quality of their forecasts. Furthermore, some forecasting questions, such
as those regarding the impact of climate change or the emergence of disruptive technologies, are
scheduled to resolve far in the future. Alternative markers of forecasting skill, available earlier than
historical accuracy, would thus be especially helpful in these settings.
Identifying those with better foresight is also useful to advice seekers. The extensive literature on
advice taking and advice giving (Bonaccio and Dalal 2006, Yaniv 2004) has investigated the number
of advisors one should ask and how to weight their inputs versus one’s own prior judgments
(Larrick and Soll 2006). Advice seekers use the quality of past recommendations (Redd 2002) and
information about advisors’ prediction strategies when choosing how much to weight their advice
(Yates et al. 1996). The updating patterns of potential advisors may serve as useful cues for judging
the advisors’ predictive skill.
We propose that belief updating patterns can be useful for identifying skilled forecasters, espe-
cially when there is little information about historical accuracy. We had the opportunity to examine
the relationship between belief updating and prediction skill in a naturalistic environment with
thousands of probability forecasts about real-world political events over four years. We characterize
belief updating as the process of adjusting probabilistic forecasts by incorporating new information
over time. Instead of taking a one-dimensional view of updating and assessing whether the best
forecasters update too little or too much, we distinguish among three aspects of belief updating.
First, forecasters can differ in the frequency with which they submit new probability estimates on a question. Second, they can vary in the absolute magnitude of their updates on the probability scale. Third, they can differ in their propensity to change their beliefs versus actively confirming their prior forecasts by simply restating their preceding forecasts. We find that all three aspects of updating are predictive of forecaster accuracy.
Mellers et al. (2015) and Chang et al. (2016) previously examined the relationship between
accuracy and frequency of updates in forecasting tournaments. Frequent belief updaters, i.e.,
forecasters who revised their estimates more often, tended to be more accurate. Tetlock and
Gardner (2016) discussed anecdotal examples in which incremental updating was associated with
¹ In contrast, Chen et al. (2005) found no benefit to weighting individuals based on prior accuracy. Goel et al. (2010) employed forecaster selection based on self-reported confidence. This filtered poll method performed on par with comparison conditions.
² Prediction markets rely on a built-in weighting mechanism to identify skill: in the long run, forecasters who make correct bets grow richer and gain influence over market prices (Wolfers and Zitzewitz 2004). This is especially true in play-money markets, where traders are not allowed to add outside funds. By contrast, in prediction polls, where forecasters provide direct probability estimates, platform designers must decide how to weight these estimates (Clemen and Winkler 1999).
higher accuracy. The current work is the first to systematically examine the relationship between a forecaster's magnitude of updates, her confirmation propensity, and her forecasting skill.
2 HYPOTHESES
We present three hypotheses about how three aspects of belief updating—frequency, magnitude and
conrmation propensity—relate to forecasting skill. These hypotheses are tested simultaneously,
accounting for the eects of other predictors. We also pose three additional research questions
focused on the reliability, predictability and malleability of updating behavior.
Hypothesis 1. Forecasters who make frequent updates tend to be more accurate.
We build on prior work by Mellers et al. (2015) by asking whether frequent updaters generate
relatively accurate initial judgments or outperform only by improving their forecasts over time.
Hypothesis 2. Forecasters who make smaller belief updates tend to be more accurate.
There are compelling reasons to expect that small-increment (i.e., incremental) updates predict
accuracy, but there are also good reasons to expect the opposite. To test this hypothesis, we examine
how observed update magnitude related to accuracy, and perform counterfactual simulations. That
is, we examine whether forecasters who make smaller revisions are more accurate than those who
make larger ones. Using simulation, we further investigate whether forecasters would have done
better had they made smaller or larger updates.
Let us rst examine the hypothesis that larger updates predict greater accuracy. Research suggests
that people generally update too little in the face of new evidence. Edwards (1968) describes belief
updaters as conservative Bayesians, who update in the correct direction, but do so insuciently. In
this account, forecasters making larger updates would be seen as better Bayesians, which may help
them achieve accuracy advantages over time.
The alternative is that smaller updates are associated with better accuracy. This pattern may hold because people are often bombarded by information. Overreaction to new data could result in excessive volatility and degrade accuracy. One such example is the dilution effect, or the tendency to discount valid cues as more and more non-diagnostic information appears (Tetlock and Boettger 1989). Market overreaction³ is a common occurrence (De Bondt and Thaler 1985) and could be due to the discounting of stable cues (e.g., base rates) in favor of noisy inside-view cues (e.g., case-specific information), especially when the inside cues are extreme (Griffin and Tversky 1992).⁴ Institutional practices may contribute as well. For example, US intelligence training emphasizes the need to avoid underreaction to new evidence, which could increase the risks of overreaction (Chang and Tetlock 2016), and bring about advantages for incremental updaters, i.e., forecasters who tend to make smaller revisions and may be less prone to overreact to new information.
Another reason to expect an association between smaller updates and higher accuracy is that incremental updaters might be more accurate from the start (Massey and Wu 2005). Forecast updates provide signals about forecasters' metabeliefs: a small update represents a vote of confidence in one's previous forecast. If forecasters believed their prior forecasts were already accurate, they would see less need for large revisions. The question is whether such confidence, or lack thereof, is justified. Research on metacognition suggests that people have better-than-chance estimates about the accuracy of their forecasts (Harvey 1988), despite substantial individual differences (Tetlock
³ Information cascades may produce aggregate-level overreaction even without individual-level overreaction.
⁴ Koehler (1996) notes that base-rate neglect depends on the structure and representation of the task, and argues in favor of ignoring base rates that are ambiguous, unreliable or unstable. The validity of base rate cues is an open question in naturalistic tasks such as real-world forecasting.
2005). Thus, smaller belief revisions may indicate higher initial accuracy. We address this possibility
by separately assessing the accuracy of early versus late forecasts on each question.
Of course, it is possible that both small and large updaters may benefit from larger updates. Small updaters could be more accurate from the start, but all forecasters could improve from their starting points. To explore whether large-increment updaters are overreacting, we examine counterfactual reruns of history; specifically, reruns of the time series of forecasts for each individual,
which let us gauge whether accuracy would have increased or decreased if we systematically
dialed the magnitude of their updates up or down. We provide a supplementary measure of
underreaction versus overreaction, by assessing volatility relative to the Bayesian benchmark
proposed by Augenblick and Rabin (2018). This is described in Appendix Section A5.
Hypothesis 3. Forecasters who confirm their forecasts more often tend to be less accurate.
Psychologists have suggested that the most consequential belief updating error may be failing to update at all (Nisbett and Ross 1980). They describe a common tendency to form initial impressions of the causal propensities at play, with these causal schemata then biasing the interpretation of new data. The strong version of this bias is belief perseverance, a failure to adjust one's initial estimate even when faced with evidence that it was wrong. Confirmation bias, the tendency to seek information consistent with one's initial beliefs (Nickerson 1998), may also produce a pattern of confirming one's previous judgments.⁵ A failure to update may also result from satisficing (Simon 1956)—a forecaster may consider her probability estimates to be good enough.⁶ In the current analysis, we operationalize forecast confirmations as commissions, i.e., cases in which forecasters actively re-enter their most recent forecast on a given question.
Additional Analyses
In addition to these hypotheses, we explore three further questions. First, are updating frequency,
magnitude and conrmation propensity unique, stable individual characteristics of forecasters?
Second, what are the psychometric and behavioral predictors of update frequency, magnitude and
conrmation propensity? Third, can training inuence updating patterns, and if so, do update
measures mediate the eect of training on accuracy? Training increases update frequency and
accuracy (Mellers et al. 2014, Chang et al. 2016), but the eect of training on update magnitude is
unknown. We thus examine whether probability training reduces update magnitude, and if this
reduction is associated with better performance.
3 METHODS
3.1 Subjects and Data
To test the hypotheses, we utilized data collected by the Good Judgment Project, a research project
and team that competed in—and won—the Aggregative Contingent Estimation (ACE) geopolitical
forecasting tournament, sponsored by Intelligence Advanced Research Projects Activity (IARPA).
The tournament took place between 2011 and 2015, and consisted of four seasons, each lasting
approximately 9-10 months, featuring a total of 481 forecasting questions with precisely resolvable
answers. Forecasters were recruited for participation through mailing lists, personal connections,
⁵ Confirmation bias may also lead to belief polarization, where individuals only seek and find confirmatory evidence for their favored beliefs, making their probability estimates more extreme over time. Belief polarization is more likely when people hold strong and relevant ideological positions. The tournament's focus on non-US geopolitical questions and the wide variety of topics may limit the influence of ideological predispositions.
⁶ On the other hand, maximizers, those who strenuously seek to select optimal rather than good-enough options, are not necessarily better at forecasting (Jain et al. 2013).
blogs, and media coverage of the project. Actual or imminent completion of a bachelor's degree was a prerequisite for inclusion in the study, as was the completion of an inventory of psychometric tests administered before the start of each forecasting season (Atanasov et al. 2017).
Forecasters were experimentally assigned to conditions, including training and teaming. We
report data from independent (non-teaming) prediction polls, in which forecasters provided proba-
bility estimates without access to their peers’ predictions. Forecasters had the option to update
their estimates whenever they wished before questions resolved. Performance was assessed using
the Brier score (Brier 1950), also known as the quadratic score, a strictly proper scoring rule. The Brier score is defined in Equation 1, where f_c denotes the probability forecast placed on the correct answer of a binary question.⁷
Brier Score = 2 · (1 − f_c)²   (1)
A generalized version of this scoring rule was used for questions with three or more possible
outcomes. For questions that had outcome categories with a pre-defined order (e.g., from low
to high), we used the ordered Brier scoring rule, which assigns better scores for placing high
probabilities on categories closer to the correct one (Jose et al. 2009). Brier score decomposition
analyses do not apply to this modification. When a question closed, daily Brier scores were calculated after a participant's first forecast. Forecasts were carried forward across days until the forecaster
made an update. For days before a participant’s initial forecast on a question, we imputed Brier
scores from the average Brier score of forecasters on that question and condition. If a forecaster
skipped a question entirely, she received an imputed score for all days of the question that was
equal to the mean Brier score of those in her condition who did report forecasts on that question.
These imputation rules were intended to incentivize forecasters to attempt questions for which they
believed they were better than average and update whenever their forecasts could be improved.
Forecasters learned about these scoring procedures at the start of each season, and they had
access to scoring rule descriptions, as well as their scores and accuracy rank, throughout a season.
Imputed scores on questions that forecasters did not attempt were used only for incentive purposes;
the current analysis excludes such scores. Scores were averaged across days within a question.
Because forecasters selected their own questions, scores were standardized within question, i.e., converted to z-scores, to emphasize relative forecaster skill while accounting for question difficulty.
Finally, standardized Brier scores were averaged across questions for each forecaster. Brier score
decomposition analyses follow the formulations of Murphy and Winkler (1987).
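As a rough illustration of this scoring pipeline (not the project's actual code), the Python sketch below computes the binary Brier score of Equation 1, carries forecasts forward across days, and standardizes mean daily scores within question. The data layout and column names (forecaster, question, mean_brier) are assumptions made for the example, and the imputation rules described above are omitted.

```python
import numpy as np
import pandas as pd

def brier(prob_correct):
    """Brier score for a binary question, as in Equation 1 (range 0 to 2)."""
    return 2 * (1 - prob_correct) ** 2

def daily_scores(forecasts, open_day, close_day):
    """Carry each forecast forward until the next update, then score each day.
    `forecasts` is a sorted list of (day, probability_on_correct_answer) pairs
    for one forecaster on one question."""
    prob, scores = None, {}
    for day in range(open_day, close_day + 1):
        for d, p in forecasts:
            if d == day:
                prob = p                      # latest forecast as of this day
        if prob is not None:                  # days before the first forecast are
            scores[day] = brier(prob)         # skipped here (the project imputed them)
    return scores

def standardized_brier(question_level_scores):
    """Standardize mean daily Brier scores within question (z-scores), then
    average across questions for each forecaster. Input rows are tuples of
    (forecaster, question, mean_brier)."""
    df = pd.DataFrame(question_level_scores,
                      columns=["forecaster", "question", "mean_brier"])
    df["z"] = df.groupby("question")["mean_brier"].transform(
        lambda s: (s - s.mean()) / (s.std(ddof=0) or 1.0))
    return df.groupby("forecaster")["z"].mean()
```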
The fty most accurate forecasters in an experimental condition were featured on a leaderboard.
Forecasters received compensation in the form of electronic retailer gift certicates if they had
made at least one forecast on 25 or more questions. The value of the gift certicates was $150 in
the rst two seasons, and $250 in the last two seasons. There were no nancial incentives based on
forecast updating or accuracy. The top 2% of forecasters at the end of each season were awarded
superforecaster status and invited to participate in small teams with other superforecasters in the
following tournament season.
3.2 Belief Updating Measures
The current analysis assumes that forecasts are a reasonable proxy for beliefs. While it is possible
that forecasters may not always report their true beliefs, we note that proper scoring rules, such as
the Brier score, incentivize forecasters to state their beliefs truthfully and update them as necessary.
In our analyses, we use more literal descriptions of changes in probabilistic estimates—forecast
updates or simply updates—but posit that forecast changes correspond closely to revisions in
underlying beliefs. Imagine a forecaster faced with the question: “Will Bashar al-Assad cease to be
⁷ In this specification, Brier scores may vary between 0 and 2, and 50% forecasts result in 0.5 Brier scores.
Fig. 1. An illustrative forecast stream for a forecaster making three unique forecasts on a question. Solid gray and dashed black lines illustrate counterfactual forecast streams that would have resulted from proportionally smaller or larger belief updates.
president of Syria by May 1, 2020?” The solid black line in Figure 1 illustrates a forecast stream.
The forecaster provides an initial probability estimate of 25%. She later confirms the initial estimate, re-entering the same one, as denoted by the black dot. She then updates her forecast to 45%, and then lowers it to 9%. This forecaster has made 4 forecasts, including one forecast confirmation, so that 25% of forecasts were confirmations. Update frequency in this example is 3, including the initial forecast but excluding confirmations. The forecaster's mean absolute update magnitude is (|25% − 45%| + |45% − 9%|) / 2 = 28%.⁸ Average absolute update magnitude, hereafter referred to simply as magnitude, is first calculated across updates on a given question, then averaged across questions
for each forecaster. Let i be the index of forecasts within a question, and let I be the total number of forecasts; let q be the index of questions attempted by a forecaster, and let Q be the total number of questions that the forecaster attempted; finally, let p be the reported probability values.
Magnitude = (1/Q) · Σ_{q=1}^{Q} [ 1/(I − 1) · Σ_{i=2}^{I} |p_i − p_{i−1}| ]   (2)
If a forecaster made only one forecast, magnitude was set to missing and thus did not affect magnitude calculations, frequency was set to one, and confirmation propensity was set to zero.
Accuracy was measured for all forecasting questions, regardless of whether a forecaster updated
his or her beliefs. Our sample included those forecasters who updated forecasts on at least ten
questions.
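A minimal sketch of how these three updating measures could be computed from raw probability streams, using the Figure 1 example as a check (an illustration of the definitions above, not the authors' code):

```python
def updating_measures(streams):
    """Update frequency, mean absolute magnitude, and confirmation propensity
    for one forecaster. `streams` holds one list of reported probabilities per
    question; confirmations (re-entering the previous value) are counted
    separately and excluded from frequency and magnitude, as defined above."""
    freqs, mags, n_confirm, n_total = [], [], 0, 0
    for stream in streams:
        unique = []                              # forecasts with confirmations removed
        for p in stream:
            n_total += 1
            if unique and p == unique[-1]:
                n_confirm += 1                   # active confirmation of prior forecast
            else:
                unique.append(p)
        freqs.append(len(unique))
        if len(unique) > 1:                      # magnitude needs at least one update
            diffs = [abs(b - a) for a, b in zip(unique, unique[1:])]
            mags.append(sum(diffs) / len(diffs))
    frequency = sum(freqs) / len(freqs)
    magnitude = sum(mags) / len(mags) if mags else None
    confirmation = n_confirm / n_total
    return frequency, magnitude, confirmation

# Figure 1 example: 25%, confirm 25%, 45%, 9%
# -> frequency 3, magnitude approximately 0.28, confirmation rate 0.25
print(updating_measures([[0.25, 0.25, 0.45, 0.09]]))
```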
3.3 Updating Simulation
We also performed counterfactual simulations to determine whether modifying a forecaster’s
updating behavior would have improved accuracy. We simulated belief streams with larger or
smaller updates than the actual ones. The solid black line in Figure 1 represents the actual forecast
stream, while the solid gray line and the dashed black line illustrate two counterfactual forecast
streams. The dashed black line depicts forecasts that would have resulted if update magnitudes
⁸ We used absolute magnitude, rather than squared magnitude. In this example, the mean squared belief update magnitude was 0.085. We applied squared update magnitude in a sensitivity analysis, listed in Appendix Table A2.3. Results were similar for absolute and squared magnitude measures.
were set to 130% of original values. The solid gray line depicts the forecast stream resulting from
updates that were 70% as large as the observed values. Forecast streams were modified to fit the
probability scale, i.e., simulated values below 0% and above 100% were truncated to 0% and 100%,
respectively. The simulation had no stochastic components.
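The counterfactual streams can be sketched as follows; this is a deterministic rescaling of the observed updates under the truncation rule described above (an illustration, not the original simulation code):

```python
def scaled_stream(probs, scale):
    """Rescale every update in an observed forecast stream by `scale`
    (e.g., 0.7 or 1.3), starting from the same initial forecast and
    truncating simulated values to the [0, 1] probability scale."""
    out = [probs[0]]
    for prev, curr in zip(probs, probs[1:]):
        step = (curr - prev) * scale             # proportionally smaller or larger update
        out.append(min(1.0, max(0.0, out[-1] + step)))
    return out

# Figure 1 example (unique forecasts only) at 70% and 130% of observed magnitudes
original = [0.25, 0.45, 0.09]
print(scaled_stream(original, 0.7))   # approximately [0.25, 0.39, 0.14]
print(scaled_stream(original, 1.3))   # approximately [0.25, 0.51, 0.04]
```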
3.4 Predictors of Updating and Accuracy
We examined how behavioral and psychometric measures relate to updating patterns. Behavioral
measures included: a) the number of forecasting questions attempted by a forecaster; b) the number
of active forecasting sessions (i.e., the number of instances in which a forecaster logged into the
web platform); and c) the number of news-link clicks or instances in which forecasters clicked on
links to news sources displayed on the platform in seasons 2 and 3. These links were based on
the top results of Google News search queries featuring the keywords of forecasting questions.
Mellers et al. (2015) showed that forecasters with high fluid and crystallized intelligence, and those with actively open-minded thinking styles tended to be more accurate. We include these measures as predictors of both updating measures and accuracy. Fluid intelligence was a combination of scores from Raven's progressive matrices test (Balboni et al. 2010), Shipley's analytical reasoning test (Zachary et al. 1985), a numeracy test (Lipkus et al. 2001, Peters et al. 2007), and the cognitive reflection test (Frederick 2005).
3.5 Forecasting Training
Does forecasting training influence updating patterns? Approximately half of participants were
randomly assigned to a training condition at the start of each season of the tournament. These
forecasters were required to complete training in order to participate in the tournament. Forecasters
who received training were later assigned to training conditions in subsequent seasons. Training
content was designed to improve overall forecasting accuracy, not to produce specific updating patterns. Three topics were especially relevant: a) constructing comparison classes and calculating base rates, b) combining potentially conflicting information from multiple sources, and c) updating
forecasts in response to new information. Training took approximately one hour to complete and
resulted in approximately 8%–10% improvement in accuracy over the course of each of four seasons.
3.6 Cross-Sample Validation
To provide robust tests of our hypotheses, we employed a cross-validation procedure, in which
belief updating patterns were measured on one set of forecasting questions and then used to predict
forecasting accuracy in another set of questions. This procedure consisted of three steps. First, for
each forecaster in the sample, we independently and randomly assigned the questions attempted to
two approximately equally-sized sets, which we refer to as A and B. Second, in each set of questions,
we calculated the rate of belief confirmation (proportion of all forecasts that were confirmations), update frequency (number of forecasts per question), update magnitude (mean absolute difference
between successive unique forecasts), and standardized Brier scores. Finally, we used these updating
measures from set A to predict accuracy in set B (and, analogously, updating measures from set B
to predict accuracy in set A), in several regression models. Individual forecasters were the units of
analysis. We repeated this procedure with 50 iterations of random half-sample splits, creating new
sets A and B in each iteration. Each split yielded two sets of regression coefficients per model: one for predicting accuracy in set A based on behavioral measures in set B, and another for predicting accuracy in set B based on behavioral measures in set A. Thus, we had 100 sets of regression coefficients. We report those regression estimates that were closest to the median set of t-test values. We used a similar procedure for estimating correlations and present the median correlation coefficients across iterations.
Study sample characteristics             Mean (SD)     Median (IQR)
Number of questions with updates         43 (35)       32 (17, 59)
Update frequency per question            2.0 (1.6)     1.6 (1.3, 2.1)
Absolute update magnitude                0.20 (0.10)   0.19 (0.14, 0.24)
Forecast confirmations, % of forecasts   0.19 (0.18)   0.18 (0.06, 0.26)

Table 1. Forecasting behavior and updating characteristics of N = 515 forecasters with standard deviation (SD) and interquartile range (IQR). Note that forecast confirmations are distinct from updates, and confirmations are not included in update calculations.
3.7 Study Sample Characteristics
Forecasting accuracy was assessed based on 481 forecasting questions. The median duration of
questions was about three months (Median = 81 days, Mean = 110 days, SD = 93). These questions
were posed and resolved over the course of four forecasting seasons, each lasting approximately
9-10 months. The core study sample consists of N = 515 participants who made at least one forecast
update on at least 10 forecasting questions. This inclusion criterion is consistent with those used
in prior research on forecaster accuracy (e.g., Mellers et al. 2014). Results held when we included
the sample of all available forecasters, as long as they attempted two questions, the minimum
needed for a split-sample analysis (see Appendix Section A2). Forecasters could choose to return
from one forecasting season to another. As long as they fulfilled the inclusion criterion, we did not
distinguish between forecasters who were active in one versus multiple forecasting seasons.
Forecasters in the study sample attempted an average of one hundred and thirteen questions (M
= 113, SD = 73), and made at least one update on forty-three of those questions (M = 43, SD = 35).
(Also see Table 1.) The average forecaster submitted two forecasts per question (M = 2.0, SD = 1.6),
i.e., one initial forecast and one update. The pattern of update frequency was best approximated
by a log-normal distribution. Frequency was log-transformed before being used in the regression
models. The average absolute update magnitude was 0.20 (SD = 0.10) on the probability scale.
For the average forecaster, 19% of all forecasts were confirmations. Confirmations are treated as distinct from initial forecasts and updates. Superforecasters, as discussed by Mellers et al. (2015) and
Tetlock and Gardner (2016), are only included in our core analysis in the seasons before attaining
superforecaster status, which was granted to the top 2% of performers at the end of each season
based on Brier score. We perform a separate sensitivity analysis using data from superforecasters
after they attained superforecaster status and were assigned to work in teams (see Appendix
Table A2.2). A total of N = 175 superforecasters are included, and the sensitivity analysis includes
all forecasts submitted after they attained superforecasting status. Superforecasters submitted a
mean of 5.1 forecasts per question (vs. 2.0 in the core sample) and had an average absolute update
magnitude of 0.11 (vs. 0.20 in the core sample).
4 RESULTS
The analyses in this section are organized as follows: Section 4.1 discusses the reliability of individual
updating measures, i.e., how well does a particular measure predict the same or another measure,
on the same set of questions (in sample) or on a different set of questions. Section 4.2 presents the findings regarding the associations between update measures and forecasting skill, i.e., which
Variable       In vs. Out of Sample   Brier Score   Frequency   Magnitude   Confirmation
Brier Score           In                  1
Frequency             In                 -0.32          1
Magnitude             In                  0.49         -0.26        1
Confirmation          In                  0.03          0.51        0.02          1
Brier Score           Out                 0.75
Frequency             Out                -0.32          0.98
Magnitude             Out                 0.45         -0.25        0.79
Confirmation          Out                 0.04          0.51        0.02          0.83

Table 2. In- and out-of-sample Pearson correlation coefficient matrix across N = 515 forecasters. Median coefficients based on 100 resamples shown.
measures predict out-of-sample (standardized) Brier scores, and how well. Sections 4.3–4.7 examine the underlying reasons for those associations.
4.1 Reliability Assessment
We use Pearson correlation coefficients across question set samples (out of sample) as measures of test-retest reliability. We also use out-of-sample correlation coefficients to measure the predictiveness of different updating patterns for Brier score. Standardized Brier scores, where lower values denote better accuracy, had a test-retest reliability of r = 0.75 across question sets. All updating measures had higher reliability coefficients than Brier scores. Update frequency had the strongest reliability of r = 0.98, suggesting that the tendency to update more or less often was a highly stable individual difference. Update magnitude had a test-retest reliability of r = 0.79, denoting that forecasters who made small or large updates in one set of questions tended to do the same in the other set. Appendix Figure A1 depicts the distribution of update magnitude in our sample. The reliability of update confirmations was r = 0.83.
To study the relationship between updating measures, we use correlation coefficients within a question set sample (in sample), allowing us to estimate whether they capture different aspects of updating behavior. Magnitude and frequency were negatively correlated; forecasters who updated more often tended to take smaller steps. The relationship was weak to moderate (r = -0.26, p < .001). Update frequency correlated positively with confirmation propensity (r = 0.51, p < .001); forecasters who made more frequent updates also tended to confirm their prior forecasts. Update magnitude did not correlate with the propensity for confirming one's beliefs (r = 0.02, p > .10), as forecasters who confirmed their beliefs more often showed no tendency to make smaller updates. This provides evidence for a psychological distinction between the decisions of whether and how much to update one's beliefs. Test-retest (i.e., out-of-sample for the same variable) and cross-variable (in and out-of-sample) correlation coefficients are shown in Table 2.
DV: Brier Score           A.                   B.                    C.
(standardized)            Base Model           A. + Psychometrics    A. + Accuracy in Training Set

Intercept                 -0.050               -0.069                -0.069
                          [-0.079, 0.021]      [-0.102, -0.036]      [-0.091, -0.047]
Frequency                 -0.079               -0.060                -0.037
                          [-0.101, -0.057]     [-0.084, -0.036]      [-0.055, -0.019]
Magnitude                  0.098                0.105                 0.024
                          [0.074, 0.122]       [0.081, 0.029]        [0.006, 0.042]
Confirmation propensity    0.028                0.010                 0.020
                          [0.006, 0.050]       [-0.017, 0.037]       [0.004, 0.036]
Training                  -0.063               -0.099                -0.015
                          [-0.104, -0.022]     [-0.144, -0.054]      [-0.046, 0.016]
Fluid intelligence                             -0.064
                                               [-0.086, -0.042]
Political knowledge                            -0.001
                                               [-0.025, 0.023]
AOMT                                           -0.012
                                               [-0.034, 0.010]
Out-of-sample accuracy                                                0.173
                                                                     [0.155, 0.191]
N                          515                  382                   515
Adj. R²                    0.28                 0.38                  0.57

Table 3. Predictors of forecaster accuracy. Results are based on OLS models in which measures based on one set of questions are used to predict accuracy in a different set of questions. All continuous variables, independent and dependent, are standardized. Lower scores denote better accuracy. Bounds of 95% confidence intervals are shown in brackets.
4.2 Belief Updating and Accuracy
Tests of our hypotheses consisted of ordinary least squares (OLS) regression models, all of which
had the same dependent variable: relative accuracy of forecasters for out-of-sample questions,
as measured by mean standardized Brier scores. Update frequency, magnitude and confirmation propensity were predictors. The specifications also accounted for covariates, such as psychometric scores and out-of-sample accuracy scores. All models included training assignment as a covariate.
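As a sketch, the base specification (Table 3, column A) could be estimated as below; the DataFrame and column names (std_brier_out, log_frequency, magnitude, confirmation, training as a 0/1 indicator) are assumptions for the example, with frequency log-transformed as noted in Section 3.7.

```python
import statsmodels.formula.api as smf

def zscore(series):
    return (series - series.mean()) / series.std()

def base_model(df):
    """OLS of out-of-sample standardized Brier score on standardized updating
    measures plus training assignment (one row per forecaster)."""
    for col in ["std_brier_out", "log_frequency", "magnitude", "confirmation"]:
        df[col + "_z"] = zscore(df[col])
    formula = "std_brier_out_z ~ log_frequency_z + magnitude_z + confirmation_z + training"
    return smf.ols(formula, data=df).fit()

# fit = base_model(forecaster_df)   # forecaster_df is a hypothetical input table
# print(fit.summary())              # coefficients comparable to Table 3, column A
```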
The rst model, shown in Table 3, column A, tested the relationship of frequency of forecasts per
question, magnitude of updates, and conrmation propensity with forecasting accuracy. This model
investigated all three hypotheses. To facilitate comparison of eect sizes and regression coecients,
frequency, magnitude and conrmation measures were standardized. Results were consistent with
Hypothesis 1, which stated that update frequency would be positively related to accuracy, i.e.,
negatively related to Brier scores (b = -0.079, t = -6.86, p
<
.001). Consistent with Hypothesis 2,
Small Steps to Accuracy: Incremental Belief Updaters Are Beer Forecasters 11
larger update magnitude was associated with higher Brier scores, i.e., worse forecasting accuracy
(b = 0.098, t = 8.69, p
<
.001). And as predicted by Hypothesis 3, conrmations were associated with
worse accuracy (b = 0.028, t = 2.62, p = .009). In short, more accurate forecasters tended to make
frequent, smaller updates and rarely conrmed their previous forecasts. Training was associated
with greater accuracy after accounting for updating patterns. This specication provided the basis
of the sensitivity analyses listed in the Appendix Section A2. To provide context for the coecients
in Table 3, Column A, we map standardized to raw values, for both Brier scores and update measures.
An untrained forecaster with mean scores on all updating measures would have received a raw
Brier score of 0.36. Ceteris paribus, if that forecaster had update frequency that was one standard
deviation (1 SD) below average (1.1 forecasts per question versus mean of 1.7), her model-predicted
raw Brier scores would be 0.39, while a forecaster with 1 SD above average frequency (2.8 forests
per question) would have an estimated Brier score of 0.33. If update magnitude were one SD below
or above the mean (0.10 or 0.31 absolute magnitude versus a mean of 0.20), predicted Brier scores
would be 0.33 and 0.39, respectively. If conrmation propensity were one SD below or above the
mean (1% or 37% conrmation rate versus 19% mean), predicted Brier scores would be 0.35 or 0.37,
respectively. Undergoing training would reduce estimated Brier scores from 0.36 to 0.34.
Then we examined the predictive value of updating measures in the presence of psychometric measures. (See Table 3, column B.) Fluid intelligence was associated with accuracy (b = -0.064, t = -5.29, p < .001), while political knowledge and actively open-minded thinking (AOMT) were not (p > .10 for both). Frequency and magnitude were related to accuracy when accounting for those covariates. Of all the psychometric and behavioral predictors of accuracy in this model, update magnitude was the strongest (b = 0.105, t = 8.47, p < .001).
Finally, we assessed whether updating measures were associated with accuracy even if one's track record was included as a predictor. (See Table 3, Column C.) We used standardized Brier scores in the training set as predictors of standardized Brier scores in the validation set of questions. The relationship between accuracy in training and validation sets was strong (b = 0.173, t = 18.24, p < .001). Frequency (b = -0.037, t = 4.14, p < .001), magnitude (b = -0.024, t = -2.60, p = .010), and confirmation propensity (b = 0.020, t = 2.43, p = .015) were also associated with accuracy. Across the 100 split-sample iterations, 100% yielded negative coefficients for frequency, while 97% yielded positive coefficients for magnitude, and 96% yielded positive coefficients for confirmation propensity. These 100 iterations are not independent, so frequency counts reported above should be interpreted with caution.
Additional tests are listed in Appendix Section A2. Table A2.1 shows the results when we relax or tighten forecaster inclusion criteria regarding the number of questions with updates. Table A2.2 shows that the core results directionally replicated the base model in a sample of superforecasters working in teams. Table A2.3 provides a sensitivity analysis using an alternative measure of accuracy: mean-debiased rather than standardized Brier scores; and an alternate measure of magnitude: squared distance rather than absolute distance. Table A2.4 breaks down performance by question resolution outcome: status quo vs. change and time-sensitive vs. others. All of these analyses yield results that are consistent with the base model: magnitude was significantly associated with standardized Brier scores in all cases, frequency was significantly associated with accuracy in all cases except the superforecaster analysis, and confirmation propensity was associated with accuracy, except for the least selective independent forecaster sample, the superforecaster sample, and when questions were broken down by type and outcome.
4.3 Early versus Late Forecasts
The high accuracy of frequent, incremental updaters could be associated with highly accurate
initial forecasts or with accuracy improvements attributable to the updates. To distinguish between
Fig. 2. Standardized Brier scores by update pattern and period within question. Forecasters were placed in four categories by median splits on frequency and magnitude in a training sample of questions. Mean standardized Brier scores based on a different set of questions are shown. Scores are divided by time period within question. Calipers denote one standard error of the mean in each direction.
these possibilities, we examined the accuracy of first vs. last forecasts (see Appendix Table A3.1). Small updates were a marker of initial accuracy. A one standard deviation decrease in magnitude corresponded to a 0.12 decrease in mean standardized Brier scores (b = 0.121, t = 10.94, p < .001). Magnitude was a weaker, but still significant predictor of last-forecast accuracy (b = 0.044, t = 3.64, p < .001). Forecasters appeared to demonstrate sufficient metacognitive skill to gauge how much they needed to update their beliefs. Those with greater initial accuracy needed—and made—smaller updates.
Greater frequency of updating was the best predictor of last-forecast accuracy. An increase of one standard deviation in frequency corresponded to a 0.18 decrease in mean standardized Brier scores (b = -0.181, t = 14.72, p < .001). However, frequency was unrelated to initial accuracy (b = 0.008, t = 0.67, p > .20). The propensity to confirm prior predictions was associated with worse accuracy of both initial forecasts (b = 0.030, t = 2.80, p = .005) and final forecasts (b = 0.045, t = 3.90, p = .001). In summary, forecasters who made relatively accurate initial judgments tended to make
smaller belief updates, while frequent updaters got more accurate over time. Figure 2 illustrates the
relationship between frequency, magnitude, and accuracy. Forecasters were separated into four
categories using a median-split on frequency and magnitude of updating. The median frequency
was 1.6 forecasts per question, and the median magnitude was 19 percentage points. Magnitude and
frequency were correlated, so approximately twice as many forecasters were placed in the “large,
infrequent” (N=164) and “small, frequent” (N=164) categories as in the “large, frequent” (N=93) and
“small, infrequent” (N=94) categories. We divided forecasts based on whether they were made in
the beginning, middle, or end of the forecasting period. For example, if a question was open for 90
days, we would separately score forecasts made from days 1 to 30, 31 to 60, and 61 to 90. Mean
scores on a hold-out set of questions are presented, with no regression adjustments.
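A sketch of the categorization used for Figure 2, with illustrative helper functions (the median-split thresholds and period boundaries follow the description above; the data structures are assumptions):

```python
import numpy as np
import pandas as pd

def update_category(df):
    """Median-split forecasters on training-set frequency and magnitude into
    the four groups plotted in Figure 2."""
    freq_high = df["frequency"] > df["frequency"].median()
    mag_small = df["magnitude"] <= df["magnitude"].median()
    labels = {(True, True): "small, frequent",    (True, False): "large, frequent",
              (False, True): "small, infrequent", (False, False): "large, infrequent"}
    return [labels[(bool(f), bool(m))] for f, m in zip(freq_high, mag_small)]

def question_period(day, open_day, close_day):
    """Assign a forecast day to the early, middle, or late third of a question,
    e.g., days 1-30, 31-60, and 61-90 for a 90-day question."""
    edges = np.linspace(open_day, close_day, 4)
    idx = int(np.searchsorted(edges, day, side="right")) - 1
    return ["early", "middle", "late"][min(2, max(0, idx))]
```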
Group                Raw Brier Score   Calibration Error   Discrimination   Uncertainty
Large, infrequent         0.432              0.023              0.174           0.583
Large, frequent           0.374              0.014              0.224           0.583
Small, infrequent         0.394              0.010              0.200           0.583
Small, frequent           0.292              0.002              0.293           0.583

Table 4. Brier score decomposition for individual forecasters by updating category.
Across all periods, forecasters who made small, frequent adjustments achieved Brier scores that
were approximately 0.5 standard deviations better than the average forecaster. Forecasters who
made large, infrequent updates were 0.5 standard deviations worse than average. Differences in full-period accuracy were larger than those in any sub-period. This is partly due to the lower variance in raw Brier scores for the full period, which accentuates differences in standardized
(variance-adjusted) Brier scores.
The regression results for first versus last forecast analysis (see Appendix Table A3.1) were
directionally similar to the analysis of early, middle, and late period forecasts (see Appendix Table
A3.2), except that the association between magnitude and accuracy was approximately null in
the late period. Magnitude and frequency were also associated with accuracy in an analysis of
forecast-level Brier scores, adjusting for forecast order and timing (see Appendix Table A3.3 and
Fig. A3). In sum, magnitude was more strongly associated with accuracy early on, while frequency
was more strongly linked to accuracy in the later periods of questions.
4.4 Calibration and Discrimination
What advantages do small, frequent updaters have over other forecasters? We performed Brier
score decomposition analysis to determine how forecasters with different update patterns perform
in terms of calibration and discrimination, following the same categorization and cross-validation
strategy used to produce Figure 2. Four questions were excluded from the analysis, due to lack of
Brier score data among at least one of the four groups. Forecasters from all four categories covered
the remaining questions, so the uncertainty score is the same across categories. Results are shown in
Table 4. Large, infrequent updaters had the highest (worst) raw Brier scores, with the highest levels
of calibration error and worst (lowest) discrimination scores. Large, frequent updaters performed
similarly to small, infrequent updaters in terms of raw Brier scores, with the former outperforming
in terms of discrimination but slightly underperforming in terms of calibration. Small, frequent
updaters performed the best in the validation set of questions, registering the lowest calibration
errors and the best discrimination scores. For calibration plots, see Appendix Section A4.
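The decomposition logic can be illustrated with a binary-outcome sketch following the classic calibration-resolution-uncertainty partition; the paper's analysis follows Murphy and Winkler (1987) and uses the 0-2 Brier variant and multi-outcome questions, so this is only an approximation of the idea.

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Binary-outcome decomposition of the (0-1 scale) Brier score into
    calibration error, discrimination, and uncertainty. `probs` are forecast
    probabilities for the event, `outcomes` are 0/1 resolutions."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    brier = np.mean((probs - outcomes) ** 2)
    base_rate = outcomes.mean()
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    cal = dis = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            f_k, o_k, n_k = probs[mask].mean(), outcomes[mask].mean(), mask.sum()
            cal += n_k * (f_k - o_k) ** 2        # calibration error (reliability)
            dis += n_k * (o_k - base_rate) ** 2  # discrimination (resolution)
    n = len(probs)
    uncertainty = base_rate * (1 - base_rate)
    # brier is approximately cal/n - dis/n + uncertainty (exact when all
    # forecasts within a bin are identical)
    return brier, cal / n, dis / n, uncertainty
```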
4.5 Update Magnitude Simulation
We have shown that forecasters who updated incrementally tended to be more accurate. But
did forecasters, on average, under- or overreact to new information? Would forecasters have
benetted from debiasing their estimates after elicitation? To answer these questions, we constructed
simulated forecast streams with smaller or larger-than-actual update magnitudes. This procedure
was illustrated in Figure 1.
We produced forecast streams with 30% smaller-than-actual and 30% larger-than-actual belief updates, filling in the intermediate steps in 10 percentage-point increments. For both the factual
Fig. 3. Last-forecast accuracy of simulated forecast streams with smaller-than-actual and larger-than-actual
belief updates.
and counterfactual forecast streams, we then scored the accuracy of the last forecast made by
a forecaster on a given question, i.e., the forecast they would have produced after all forecast
revisions. Accuracy was assessed in terms of absolute Brier scores and indexed to the accuracy of
actual forecasts, which was set to 100%. Indexed scores above 100% indicated worse accuracy of
counterfactual forecasts relative to actual ones, and vice versa. For example, an indexed Brier score
of 103% denotes that the simulated forecasts yielded 3% higher (worse) Brier scores than the actual
forecasts. Scores shown in Figure 3 represent simple means of Brier scores across forecasts, and
do not account for clustering across forecasters or questions. Figure 3 shows that actual forecasts
(denoted by 100% on the horizontal axis) resulted in nearly optimal accuracy. Increasing update
size by 20%-30% produced an accuracy boost, reducing Brier score by 0.3% on average across all
forecasters, a relatively small improvement. For comparison, the mean Brier score for forecasts made
by incremental updaters (M = 0.22) was approximately 30% lower than that for large-increment
updaters (M = 0.29). Thus, selecting forecasters based on small update magnitude would have
produced an approximately 100 times stronger accuracy-boosting effect than increasing update
size ex-post (30% versus 0.3% Brier score decrease).
On the other hand, reducing update increments across the board worsened accuracy, increasing Brier scores by up to 4%. Incremental updaters would have benefited by 0.7% from larger updates. We calculated the optimal update magnitude transformation and found that 65% of participants would have received better scores on their last forecasts if they had made larger updates, 25% would have benefited from smaller updates, and 10% would not have benefited from any of the adjustment levels we tested. The median forecaster would have achieved greater accuracy from a 10% increase in update magnitude. Incremental updaters were more likely to benefit from magnitude increases, but the relationship between magnitude and optimal transformation was not strong enough to justify applying custom transformations. Thus, it appeared that forecasters showed a slight bias toward underreaction to new information, but correcting such bias would have resulted in very small improvements in accuracy.
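The grid search over scaling factors can be sketched as follows, restating the scaled_stream() function from the Section 3.3 sketch for self-containment (again an illustration rather than the original analysis code):

```python
import numpy as np

SCALES = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]

def scaled_stream(probs, scale):                 # same sketch as in Section 3.3
    out = [probs[0]]
    for prev, curr in zip(probs, probs[1:]):
        out.append(min(1.0, max(0.0, out[-1] + (curr - prev) * scale)))
    return out

def brier_last(stream, outcome, scale):
    """0-2 Brier score of the final forecast in a counterfactually scaled stream."""
    p_last = scaled_stream(stream, scale)[-1]
    f_correct = p_last if outcome == 1 else 1 - p_last
    return 2 * (1 - f_correct) ** 2

def indexed_scores(streams, outcomes, scales=SCALES):
    """Mean last-forecast Brier score per scaling factor, indexed so the
    actual (scale = 1.0) forecasts equal 100%, as in Figure 3."""
    actual = np.mean([brier_last(s, o, 1.0) for s, o in zip(streams, outcomes)])
    return {sc: 100 * np.mean([brier_last(s, o, sc)
                               for s, o in zip(streams, outcomes)]) / actual
            for sc in scales}

def best_scale(streams, outcomes, scales=SCALES):
    """The forecaster's optimal update-magnitude transformation on this grid."""
    scores = indexed_scores(streams, outcomes, scales)
    return min(scores, key=scores.get)
```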
Augenblick and Rabin (2018) provide an alternative measure of underreaction versus overreaction,
based on a comparison of the initial extremity and subsequent time-series movements of forecasts.
This comparison showed that forecast streams suffered from excess volatility on average; only incremental updaters produced forecast streams with appropriate volatility levels (see Appendix Section A5).
4.6 Belief Updating, Effort, and Psychometric Profiles
What distinguishes frequent updaters from infrequent updaters, and large-increment updaters from incremental updaters? Linking the psychometric profiles of forecasters to their updating propensities may give us insights. We conducted regressions with update measures as the dependent variable and activity and psychometric measures as predictor variables.⁹ Activity measures indicate the extent to which belief updating patterns were associated with effort.
Update frequency was strongly associated with activity measures. Frequent updaters tended
to attempt more questions, log in more often, and click on more news links. Frequent updaters
appeared to be active information gatherers based on available activity measures. Update magnitude
had negative or null associations with activity measures. Incremental updaters tended to spread their
activity across a small number of questions and log in to the forecasting platform less often. There
was no association between magnitude and news-click activity. Higher confirmation propensity
was strongly associated with more sessions (i.e., number of logins) but not with other activity
measures. Overall, apart from its association with update frequency, smaller magnitudes were not
correlated with higher levels of activity. For full regression results, see Appendix Section A6.
In the analysis of psychometric measures, higher updating frequency was associated with higher scores on political knowledge and actively open-minded thinking (AOMT), but it was unrelated to fluid intelligence. In contrast, lower updating magnitudes were associated with higher fluid intelligence, but magnitude was unrelated to political knowledge and AOMT scores. The only significant predictor of belief confirmation was fluid intelligence: forecasters with high fluid intelligence scores were less prone to confirm their forecasts.
4.7 Training Effects on Updating and Accuracy
We rst replicated the result due to Mellers et al. (2014) that training is associated with more
frequent updates and greater accuracy. Trained forecasters updated 15% more often (M=1.9, SD =
1.3 vs. M = 2.2, SD = 1.8, t = 1.79, p = 0.075). Then we asked whether update magnitude is associated
with training. Across four years of the tournament, trained forecasters made updates that were on
average 12% smaller (M = 0.22, SD = 0.11 for untrained vs. M = 0.19, SD = 0.09 for trained forecasters,
t = 2.52, p = 0.012).
We conducted mediation analyses to see whether updating patterns accounted for the effects of training on accuracy. The analysis employed a causal mediation approach, as implemented in the mediation package in the statistical software R (Tingley et al. 2014). In Figure 4, the indirect effect of frequency accounted for 17% of the total effect of training on accuracy (proportion mediated = 0.17, 95% CI (-0.02, 0.35)). The benefits of training were partially channeled through more frequent updates. The indirect effect of magnitude was stronger, accounting for 29% of the total effect of training on accuracy (proportion mediated = 0.29, 95% CI (0.09, 0.55)). Training caused forecasters to update their beliefs in smaller increments, which in turn boosted accuracy. Confirmation propensity did not mediate the training effect (proportion mediated < 0.01). See Appendix Section A7.
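For intuition, a rough product-of-coefficients sketch of the magnitude pathway is given below; note this is not the causal mediation estimator from the R mediation package that the paper actually used, and the column names (training as 0/1, magnitude, std_brier) are assumptions.

```python
import numpy as np
import statsmodels.formula.api as smf

def proportion_mediated(df, n_boot=1000, seed=0):
    """Rough product-of-coefficients mediation sketch for
    training -> update magnitude -> standardized Brier score,
    with a simple bootstrap confidence interval."""
    rng = np.random.default_rng(seed)

    def estimate(d):
        a = smf.ols("magnitude ~ training", data=d).fit().params["training"]
        fit = smf.ols("std_brier ~ training + magnitude", data=d).fit()
        b, direct = fit.params["magnitude"], fit.params["training"]
        indirect = a * b
        return indirect / (indirect + direct)    # share of total effect via magnitude

    point = estimate(df)
    boots = [estimate(df.sample(len(df), replace=True, random_state=rng))
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)
```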
⁹ Activity measures were not used as predictors in models focused on accuracy, such as those shown in Table 3, for two reasons. First, news link click data were only available for Seasons 2 and 3 of the tournament. Second, activity measures, such as news link clicks and logins, were not question specific, so they could not be meaningfully incorporated in analyses that utilize cross-validation across questions.
Fig. 4. Mediation models for the effect of forecasting training on forecasting accuracy, as assessed using standardized Brier score. Low scores denote better accuracy. Standardized regression coefficients shown, with standard errors in parentheses.
5 DISCUSSION
5.1 Two Paradoxes
In our core tests of forecasters working independently, we showed that small, frequent updates
were strongly and robustly associated with greater accuracy. These results directionally replicated
among elite forecasters working in teams. Two perhaps counterintuitive patterns accompanied
these top-line results.
First, the tendency to confirm one's prior forecasts was associated with worse performance. This might seem inconsistent with simple brain-as-computer intuitions about forecasting. For example, if a machine model provides the same probability values on two subsequent weeks, one possibility would be that positive and negative inputs cancelled each other out, yielding a forecast update that rounds to zero. A more conservative model would produce smaller updates and more frequent confirmations. A more aggressive model could produce larger updates and fewer confirmations. In contrast, we found that decisions on whether and how much to adjust forecasts were unrelated to one another. Belief confirmation propensity was qualitatively different from the tendency to make small updates. In fact, both fewer belief confirmations and smaller updates correlated with higher fluid intelligence and higher accuracy. Thus, updating may be best modeled using a mixture distribution, separately estimating the probability of a non-zero update and update magnitude.
The other counterintuitive aspect relates to the interpretation of these results: Forecasters who made small belief revisions were highly accurate, and training effectively reduced update magnitude, thereby boosting accuracy. However, these results should not be interpreted to mean that simply advising forecasters to make small updates would improve their accuracy. Forecasting training did not issue explicit, general-purpose advice favoring small updates. And our simulations showed that simply reducing update magnitude would have degraded accuracy. It appears that the most accurate forecasters did not update in small increments by mechanically throttling down update increments; instead, their revisions reflected generally accurate meta-judgments that large updates were unnecessary. In other words, forecasters generally demonstrated well-calibrated trust in their previous forecasts, a tendency that persisted across forecasting questions. A consistent pattern of small updates was the key factor that differentiated those who were initially accurate from those who were not.
We considered three ways in which training could have reduced magnitudes and improved
accuracy. First, we instructed forecasters to ground their estimates in stable historical base rates.
This advice could have diminished the relative weight forecasters placed on new, inside-view cues.10
This explanation is consistent with the observation that small updates were associated with more
accurate initial forecasts.
10 As an illustration of the way incorporating base rates dampens update size, compare FiveThirtyEight's 2016 Presidential
Election forecasting models, the polls-plus and polls-only variants (Silver 2016). The polls-plus model incorporated stable
cues such as economic indicators and produced smaller updates than the polls-only model.
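One way to see why anchoring on a base rate dampens revisions is to treat each forecast as a weighted blend of a stable outside-view base rate and a volatile inside-view signal: the more weight the base rate receives, the smaller the week-to-week movement. The snippet below is a stylized illustration, not the procedure taught in training; the weight w and the example numbers are invented.

```python
def blended_forecast(base_rate: float, signal: float, w: float) -> float:
    """Blend a stable outside-view base rate with a volatile inside-view
    signal. Larger w leans on the base rate and dampens updates."""
    return w * base_rate + (1 - w) * signal

base_rate = 0.20               # assumed historical base rate for the event
signals = [0.30, 0.60, 0.40]   # assumed week-to-week inside-view readings

for w in (0.0, 0.7):
    path = [round(blended_forecast(base_rate, s, w), 2) for s in signals]
    moves = [round(abs(b - a), 2) for a, b in zip(path, path[1:])]
    print(f"w={w}: forecasts={path}, update sizes={moves}")
```

With all the weight on the inside view (w = 0), the illustrative forecast swings by 0.30 and then 0.20; with most of the weight on the base rate (w = 0.7), the same signals produce updates of only 0.09 and 0.06.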
Another way in which training could have influenced updating and accuracy was that training
encouraged forecasters to seek more information. They may have picked up subtler signals that
less careful forecasters missed. To illustrate, let us compare two hypothetical forecasters, one who
reads a newspaper from start to finish, and another who reads only the front page. Most news
stories do not fit on the front page, so the first forecaster is likely to pick up more signals about
future developments. But some of the mid-page news may be of limited predictive value. Front-page
news on any topic is, in contrast, less frequent but usually more informative of current or future
developments. The in-depth reader is thus likely to update in frequent but smaller steps. This
depth-of-information-search link was not supported by our analysis: frequency and magnitude
were only moderately correlated, explaining less than 8% of the variance in one another. In addition,
incremental updaters were not especially active information consumers, and they derived their
accuracy advantages primarily from their superior initial estimates.
Finally, training materials instructed forecasters to combine information from multiple sources,
i.e., to average probability estimates based on different sources. This guidance might have
helped forecasters synthesize evidence, rather than attending to a single cue and ignoring all
others. More generally, and more speculatively, forecasting training may have improved forecasters'
ability to hold opposing ideas in their heads while still retaining the ability to function, a
"test of first-rate intelligence," in the words of F. Scott Fitzgerald (2009). Indeed, the effect of
forecasting training on accuracy was equivalent to that of a fluid intelligence boost of approximately
1.5 standard deviations (in Table 3, Column B, compare the coefficients for training and fluid
intelligence), and nearly one third of the training effect was mediated by smaller update magnitude.
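As a toy illustration of the averaging guidance above (invented numbers, not the training materials themselves), combining the probability estimates implied by several sources yields a more moderate forecast, and a smaller revision, than latching onto the single most salient cue.

```python
from statistics import mean

# Hypothetical probabilities implied by three different sources for the same
# question (e.g., a poll, an expert commentary, a historical base rate).
cue_estimates = [0.70, 0.45, 0.50]
previous_forecast = 0.50

single_cue_forecast = cue_estimates[0]   # attend only to the most salient cue
averaged_forecast = mean(cue_estimates)  # combine all available sources

print("single-cue update:", round(abs(single_cue_forecast - previous_forecast), 2))  # 0.2
print("averaged update:  ", round(abs(averaged_forecast - previous_forecast), 2))    # 0.05
```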
5.2 Updating and Effort
The current analysis of forecast updating and accuracy addresses an often-raised concern: that
superior forecasting performance is mostly a matter of hard work and has nothing to do with
unique skill; that accuracy scores reveal which forecasters are working hard, not necessarily which
ones are thinking in the right way. And because forecasting tournament research is often conducted
on volunteers who have leeway to choose how hard to work, results might not generalize to settings
in which effort is strongly incentivized.
The connection between belief updating and effort is non-trivial because effort and engagement
are often difficult to assess reliably. Even professional forecasters are frequently inattentive (Andrade
and Le Bihan 2013), so understanding who pays attention to the task is useful. But not all possible
measures of attention and effort are associated with accuracy: the number of questions participants
attempted did not correlate with average accuracy. Frequent belief updates indicate a specific way
in which forecasters choose to engage and invest effort, one that was highly reliable and a valid
predictor of accuracy.
The other two measures of updating, confirmation propensity and magnitude, told a different
story. Better forecasters made fewer confirmations. Although greater confirmation propensity
indicated more effort, the extra work of confirming one's beliefs was associated with less accuracy.
Incremental updating was positively correlated with some activity measures and negatively with
others. Incremental updaters outperformed without putting in more apparent effort than
large-increment updaters.
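For readers who want to compute comparable signals from their own forecast logs, the sketch below derives the three updating measures from a table of timestamped forecasts. The operationalizations (frequency as forecasts per question, magnitude as the mean absolute change between consecutive forecasts, and confirmation propensity as the share of repeated forecasts that exactly restate the previous probability) are plausible readings of the measures discussed here, not the paper's exact definitions, and the column names are assumed.

```python
import pandas as pd

def updating_measures(forecasts: pd.DataFrame) -> pd.DataFrame:
    """Per-forecaster updating measures from a long-format forecast log.

    Assumed columns: 'forecaster_id', 'question_id', 'timestamp', 'probability'.
    Illustrative operationalizations only.
    """
    df = forecasts.sort_values(["forecaster_id", "question_id", "timestamp"]).copy()
    # Absolute change from the forecaster's previous forecast on the same question.
    df["abs_change"] = (
        df.groupby(["forecaster_id", "question_id"])["probability"].diff().abs()
    )
    repeats = df.dropna(subset=["abs_change"])  # rows that follow an earlier forecast

    frequency = (
        df.groupby("forecaster_id").size()
        / df.groupby("forecaster_id")["question_id"].nunique()
    )
    magnitude = repeats.groupby("forecaster_id")["abs_change"].mean()
    confirmation = (
        repeats.assign(confirm=repeats["abs_change"].eq(0))
        .groupby("forecaster_id")["confirm"]
        .mean()
    )

    return pd.DataFrame({
        "frequency": frequency,                   # forecasts per question attempted
        "magnitude": magnitude,                   # mean absolute update size
        "confirmation_propensity": confirmation,  # share of exact repeats
    })
```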
Our results provide evidence for two distinct mechanisms of association between updating and
accuracy. Update frequency was associated with the quantity of information forecasters processed
in three ways. Frequent updaters had higher levels of activity, including observed information
consumption. They had higher starting levels of political knowledge, indicating they had processed
or retained more information about relevant political facts before their tournament season entry.
Finally, they scored relatively higher in open-mindedness, implying higher willingness to seek
and process new information. By contrast, update magnitude may have captured the quality of
information processing. Small updaters did not exhibit higher levels of activity and information
consumption than large updaters. Instead, incremental updaters were distinguished by their higher
fluid intelligence, exposure to training, and higher accuracy of initial judgments.
5.3 Situational Context
The relationship between frequent updating and higher accuracy seemingly contradicts results from
the multiple-cue probability learning literature (e.g., Castellan Jr 1973), in which frequent updating
may be driven by responses to irrelevant cues. While updating in response to pseudo-diagnostic
cues is a threat in our context as well, the average independent forecaster made only two forecasts
per question: one initial estimate, and one update. At such a low rate of updating, the benefits of
bringing in valid, new information presumably outweighed the risks of attending to irrelevant cues.
In our sensitivity analysis of high-update-frequency superforecasters, the association between
frequency and accuracy was notably weaker (see Appendix Section A2). Thus, as average update
frequency increases, the marginal benefit of additional revisions may be reduced or even reversed.
Our results showed that underreaction was a larger threat to accuracy than overreaction, consistent
with Edwards' (1968) notion of Bayesian conservatism. The forecasting tournament task allowed
forecasters to learn through practice, making it conceptually similar to the learning-from-experience
paradigm employed by Edwards. Fildes et al. (2009) found that experts making small adjustments
to model forecasts made statistical predictions worse, while larger adjustments improved accuracy.
We point out two potential explanations for the divergence between these results and ours. First,
our forecasters had the choice to update their own past estimates, not those produced by a model.
The forecasters' insight into their own past reasoning may help them develop better-calibrated
trust in past estimates than in a model's forecast, avoiding biases against trusting external advice
(e.g., Yaniv and Kleinberger 2000, Dietvorst et al. 2015). Second, while the forecasters in our study
competed purely on accuracy, the experts in Fildes et al. (2009) operated in a professional setting
where the desire to signal competence and attention to detail may have watered down the
motivation to maximize accuracy. As Fildes et al. noted, small adjustments might have served
primarily to send such signals.11
11 In a different context, politicians appear to believe that the public perceives incremental updating more favorably than
large changes, and they usually describe their views as having evolved rather than changed (Leibovitch 2015).
5.4 Beyond Forecasting Tournaments
The current effort likely represents the largest empirical study of naturalistic belief updating. As
such, the study setting was unusually rich in information: it included hundreds of forecasters who
underwent psychometric assessments and produced thousands of verifiable probabilistic forecasts
across hundreds of questions. Few real-world environments offer such rich data. We surmise that
belief updating measures will be especially useful in low-information settings.
While this study focused on the relative accuracy of individual human forecasters, belief updating
measures may be informative in assessing the validity of predictive sources more generally, be they
groups of forecasters, advisors, statistical models, or even wider fields of inquiry. As an example,
consider nutritional guidelines. Some forms of traditional medicine have produced specific, albeit
implicit, predictions about the influence of different foods on the human body. Such predictions
have not been updated for hundreds of years. This lack of belief revision raises doubts about the
resulting recommendations: mankind has surely produced new knowledge about the health effects
of food that has not been incorporated. Conversely, modern nutritional science has gone through
large swings in implicit beliefs and explicit guidelines about the relative risks of consuming saturated
fat versus sugar (DiNicolantonio et al. 2016). Such belief swings may indicate the relative paucity of
reliable and valid evidence on this matter. While philosophers of science have posited that science
advances primarily through revolutions (Kuhn 1962), frequent, incremental opinion revisions in
scientific communities may indicate a healthy combination of maturity and openness to new knowledge.
5.5 Limitations and Future Directions
A threat to the internal validity of studies in real-world settings is that the stimulus-generation
mechanism is not under experimental control and the correct answer—that is, the correct probabilistic
forecast on a given outcome at any given time—is inherently unknowable. We will never know
whether Nate Silver was right that the likelihood of Hillary Clinton winning the 2016 Presidential
Election, the day before the election, was approximately 70%. By contrast, laboratory tasks, such as
those involving card decks or urns and balls, produce perfectly knowable answers and thus allow
us to conduct more precise comparisons between normative and actual behavior. In our setting,
the relationships between belief-updating propensities and forecasting outcomes could only be
assessed in the aggregate by following hundreds of forecasters across many forecasting problems.
Despite this limitation, we believe that the gains in external validity from real-world judgment
tasks such as forecasting tournaments more than compensate for imperfect experimenter control.
Our ability to generalize the current findings is somewhat limited by the context of the task.
Forecasting questions in the tournament related to relatively complex global issues, and many of
the informational cues available to forecasters were quite subtle in nature, such as new data on
shifts in public mood or new readings of economic indicators. Thus, the current results may not
generalize to forecasting environments in which large shifts are far more common (Massey and
Wu 2005). Examples of such settings include professional sports, where a single event can change
the course of a game or even a season, or clinical drug trials, where data become publicly available
in a few, highly informative steps.
Within the political domain, it is possible that small updates were associated with forecasting
accuracy in part because the underlying events developed gradually, and that political environments
characterized by regime shifts would have produced different results. We cannot rule this out but
consider it unlikely because the forecasting tournament covered a wide range of events, including
ones in settings beset with high volatility and sudden developments, such as the Arab Spring and
the Greek financial crisis. Our analyses of question subsets with status quo versus change outcomes,
reported in Appendix Section A2, show that our results would hold even if a larger proportion of
questions had yielded non-status quo outcomes. Still, it would be useful to test whether our results
would replicate in more dynamic settings.
Separately, the scope and diversity of geopolitical forecasting questions likely limit the influence
of politically motivated reasoning in belief updating. For example, questions about a leadership
transition in Zimbabwe may fire up less political passion among our mostly US-based forecasters
than questions at the center of US political discourse. Thus, the current results should be placed
in the context of a tournament, in which forecasters were generally motivated to demonstrate and
improve their forecasting skills.
5.6 Conclusion
Our results oer evidence supporting Bezos’ statement that people who get things right tend to
change their mind often. However, accurate forecasters are not ip-oppers: their tendency to
change their mind frequently is matched by a propensity for gradual revision. The best forecasters
seemingly experience the prediction task as a long sequence of slight surprises rather than a short
string of hard collisions with reality. A pattern of frequent, small belief adjustments helps identify
forecasters and decision makers who maintain an edge in a complex, turbulent world.
ACKNOWLEDGMENTS
The authors beneted from helpful comments from conference participants at the 2016 Judgment
and Decision Making Society meeting and the Forecasting Workshop at the 18th ACM conference on
Economics and Computation as well as from helpful discussions with Michael Bishop and Stephen
Bennett. The authors also thank the Intelligence Advanced Research Projects Agency (IARPA)
for their support. This research was supported by IARPA via the Department of Interior National
Business Center contract number D11PC20061. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
REFERENCES
[1] Philippe Andrade and Hervé Le Bihan. Inattentive professional forecasters. Journal of Monetary Economics, 60(8):967–982, 2013.
[2] Kenneth J Arrow. Risk perception in psychology and economics. Economic Inquiry, 20(1):1–9, 1982.
[3] Alison Hubbard Ashton and Robert H Ashton. Sequential belief revision in auditing. Accounting Review, pages 623–641, 1988.
[4] Pavel Atanasov, Phillip Rescober, Eric Stone, Samuel A Swift, Emile Servan-Schreiber, Philip Tetlock, Lyle Ungar, and Barbara Mellers. Distilling the wisdom of crowds: Prediction markets vs. prediction polls. Management Science, 63(3):691–706, 2017.
[5] Ned Augenblick and Matthew Rabin. Belief movement, uncertainty reduction, and rational updating. UC Berkeley-Haas and Harvard University Mimeo, 2018.
[6] Giulia Balboni, Jack A Naglieri, and Roberto Cubelli. Concurrent and predictive validity of the Raven Progressive Matrices and the Naglieri Nonverbal Ability Test. Journal of Psychoeducational Assessment, 28(3):222–235, 2010.
[7] Maya Bar-Hillel. The base-rate fallacy in probability judgments. Acta Psychologica, 1980.
[8] Silvia Bonaccio and Reeshad S Dalal. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151, 2006.
[9] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
[10] N John Castellan Jr. Multiple-cue probability learning with irrelevant cues. Organizational Behavior and Human Performance, 9(1):16–29, 1973.
[11] Welton Chang and Philip E Tetlock. Rethinking the training of intelligence analysts. Intelligence and National Security, 31(6):903–920, 2016.
[12] Welton Chang, Eva Chen, Barbara Mellers, and Philip Tetlock. Developing expert political judgment: The impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments. Judgment and Decision Making, 11(5):509, 2016.
[13] Yiling Chen, Chao-Hsien Chu, Tracy Mullen, and David M Pennock. Information markets vs. opinion pools: An empirical comparison. In Proceedings of the 6th ACM Conference on Electronic Commerce, pages 58–67, 2005.
[14] Robert T Clemen and Robert L Winkler. Combining probability distributions from experts in risk analysis. Risk Analysis, 19(2):187–203, 1999.
[15] Werner FM De Bondt and Richard Thaler. Does the stock market overreact? The Journal of Finance, 40(3):793–805, 1985.
[16] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114, 2015.
[17] James J DiNicolantonio, Sean C Lucan, and James H O'Keefe. The evidence for saturated fat and for sugar related to coronary heart disease. Progress in Cardiovascular Diseases, 58(5):464–472, 2016.
[18] Ward Edwards. Conservatism in human information processing. Formal Representation of Human Judgment, 1968.
[19] Timo Ralf Ehrig. The revision of beliefs underlying organizational expectations. In Academy of Management Proceedings, page 14988, 2015.
[20] Robert Fildes, Paul Goodwin, Michael Lawrence, and Konstantinos Nikolopoulos. Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1):3–23, 2009.
[21] F Scott Fitzgerald. The Crack-Up. New Directions Publishing, 2009.
[22] Shane Frederick. Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4):25–42, 2005.
[23] Jason Fried. Some advice from Jeff Bezos. Signal v. Noise Blog, 2012.
[24] Sharad Goel, Daniel M Reeves, Duncan J Watts, and David M Pennock. Prediction without markets. In Proceedings of the 11th ACM Conference on Electronic Commerce, pages 357–366, 2010.
[25] Daniel G Goldstein, Randolph Preston McAfee, and Siddharth Suri. The wisdom of smaller, smarter crowds. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 471–488, 2014.
[26] Dale Griffin and Amos Tversky. The weighing of evidence and the determinants of confidence. Cognitive Psychology, 24(3):411–435, 1992.
[27] Nigel Harvey. Judgmental forecasting of univariate time series. Journal of Behavioral Decision Making, 1(2):95–110, 1988.
[28] Robin M Hogarth. A perspective on cognitive research in accounting. The Accounting Review, 66(2):277–290, 1991.
[29] Kriti Jain, J Neil Bearden, and Allan Filipowicz. Do maximizers predict better than satisficers? Journal of Behavioral Decision Making, 26(1):41–50, 2013.
[30] Victor Richmond R Jose, Robert F Nau, and Robert L Winkler. Sensitivity to distance and baseline distributions in forecast evaluation. Management Science, 55(4):582–590, 2009.
[31] Daniel Kahneman and Amos Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3(3):430–454, 1972.
[32] Daniel Kahneman and Amos Tversky. On the psychology of prediction. Psychological Review, 80(4):237, 1973.
[33] Jonathan J Koehler. The base rate fallacy reconsidered: Descriptive, normative, and methodological challenges. Behavioral and Brain Sciences, 19(1):1–17, 1996.
[34] Thomas S Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, 1962.
[35] Richard P Larrick and Jack B Soll. Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science, 52(1):111–127, 2006.
[36] Mark Leibovitch. You and I change our minds. Politicians evolve. New York Times Magazine, 3, 2015.
[37] Isaac M Lipkus, Greg Samsa, and Barbara K Rimer. General performance on a numeracy scale among highly educated samples. Medical Decision Making, 21(1):37–44, 2001.
[38] Albert E Mannes, Jack B Soll, and Richard P Larrick. The wisdom of select crowds. Journal of Personality and Social Psychology, 107(2):276, 2014.
[39] Cade Massey and George Wu. Detecting regime shifts: The causes of under- and overreaction. Management Science, 51(6):932–947, 2005.
[40] Barbara Mellers, Lyle Ungar, Jonathan Baron, Jaime Ramos, Burcu Gurcay, Katrina Fincher, Sydney E Scott, Don Moore, Pavel Atanasov, Samuel A Swift, et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5):1106–1115, 2014.
[41] Barbara Mellers, Eric Stone, Pavel Atanasov, Nick Rohrbaugh, S Emlen Metz, Lyle Ungar, Michael M Bishop, Michael Horowitz, Ed Merkle, and Philip Tetlock. The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1):1, 2015.
[42] Barbara Mellers, Eric Stone, Terry Murray, Angela Minster, Nick Rohrbaugh, Michael Bishop, Eva Chen, Joshua Baker, Yuan Hou, Michael Horowitz, et al. Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science, 10(3):267–281, 2015.
[43] Allan H Murphy and Robert L Winkler. A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338, 1987.
[44] Raymond S Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220, 1998.
[45] Richard E Nisbett and Lee Ross. Human Inference: Strategies and Shortcomings of Social Judgment. Prentice Hall, 1980.
[46] Ellen Peters, Nathan Dieckmann, Anna Dixon, Judith H Hibbard, and CK Mertz. Less is more in presenting quality information to consumers. Medical Care Research and Review, 64(2):169–190, 2007.
[47] Al Pittampalli. Persuadable: How Great Leaders Change Their Minds to Change the World. HarperCollins, 2016.
[48] Steven B Redd. The influence of advisers on foreign policy decision making: An experimental study. Journal of Conflict Resolution, 46(3):335–364, 2002.
[49] Eric Ries. The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Currency, 2011.
[50] Robert J Shiller. Do stock prices move too much to be justified by subsequent changes in dividends? The American Economic Review, 71(3):421–436, 1981.
[51] Nate Silver. Who will win the presidency? https://projects.fivethirtyeight.com/2016-election-forecast/, 2016. Accessed May 29, 2020.
[52] Herbert A Simon. Rational choice and the structure of the environment. Psychological Review, 63(2):129, 1956.
[53] Philip E Tetlock. Expert Political Judgment: How Good Is It? How Can We Know? New edition. Princeton University Press, 2005.
[54] Philip E Tetlock and Richard Boettger. Accountability: A social magnifier of the dilution effect. Journal of Personality and Social Psychology, 57(3):388–398, 1989.
[55] Philip E Tetlock and Dan Gardner. Superforecasting: The Art and Science of Prediction. Random House, 2016.
[56] Dustin Tingley, Teppei Yamamoto, Kentaro Hirose, Luke Keele, and Kosuke Imai. mediation: R package for causal mediation analysis. Journal of Statistical Software, 59, 2014.
[57] Amos Tversky and Daniel Kahneman. Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2):207–232, 1973.
[58] Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
[59] Jens Witkowski, Pavel Atanasov, Lyle H Ungar, and Andreas Krause. Proper proxy scoring rules. In Thirty-First AAAI Conference on Artificial Intelligence, pages 743–749, 2017.
[60] Justin Wolfers and Eric Zitzewitz. Prediction markets. Journal of Economic Perspectives, 18(2):107–126, 2004.
[61] Ilan Yaniv. The benefit of additional opinions. Current Directions in Psychological Science, 13(2):75–78, 2004.
[62] Ilan Yaniv and Eli Kleinberger. Advice taking in decision making: Egocentric discounting and reputation formation. Organizational Behavior and Human Decision Processes, 83(2):260–281, 2000.
[63] J Frank Yates, Paul C Price, Ju-Whei Lee, and James Ramirez. Good probabilistic forecasters: The 'consumer's' perspective. International Journal of Forecasting, 12(1):41–56, 1996.
[64] Robert A Zachary, Morris J Paulson, and Richard L Gorsuch. Estimating WAIS IQ from the Shipley Institute of Living Scale using continuously adjusted age norms. Journal of Clinical Psychology, 41(6):820–831, 1985.
[Appendix figure] Histogram of mean absolute update magnitude across forecasters (x-axis: mean absolute magnitude, 0.0 to 1.0; y-axis: frequency).
[Appendix table] DV: Rescaled Brier Score. Columns: forecasters with at least 2 | at least 6 | at least 10 | at least 20 questions.
Intercept: -0.024 (0.013) | -0.039 (0.015)** | -0.050 (0.015)** | -0.073 (0.016)**
Frequency: -0.068 (0.009)** | -0.079 (0.011)** | -0.079 (0.011)** | -0.075 (0.013)**
Magnitude: 0.057 (0.010)** | 0.089 (0.011)** | 0.098 (0.012)** | 0.067 (0.012)**
Confirmation propensity: 0.015 (0.019) | 0.025 (0.010)* | 0.028 (0.011)** | 0.033 (0.012)**
Training: -0.081 (0.018)** | -0.070 (0.020)** | -0.063 (0.021)** | -0.073 (0.024)**
N: 832 | 598 | 515 | 352
R2: 0.15 | 0.22 | 0.28 | 0.27
[Appendix table] DV: Standardized Brier Score. Column: A. Base Model.
Intercept: 0.079 (0.028)**
Frequency: -0.053 (0.034)
Magnitude: 0.100 (0.034)**
Confirmation propensity: 0.041 (0.027)
Training: –
N: 175
R2: 0.13
[Appendix table] Columns: A. Mean Debiased Score | B. Mean Standardized Score | C. Base Model.
Intercept: -0.017 (0.004)** | -0.044 (0.014)** | -0.050 (0.015)**
Frequency: -0.023 (0.004)** | -0.085 (0.011)** | -0.079 (0.011)**
Magnitude - Absolute: 0.029 (0.004)** | – | 0.098 (0.012)**
Magnitude - Squared: – | 0.073 (0.010)** | –
Confirmation propensity: 0.010 (0.003)* | 0.030 (0.010)** | 0.028 (0.011)**
Training: -0.015 (0.007)* | -0.072 (0.020)** | -0.063 (0.021)**
N: 515 | 349 | 515
R2: 0.25 | 0.33 | 0.28
[Appendix table] DV: Standardized Brier Score. Columns: Status Quo & Timing questions | Non-Status Quo or No Timing questions.
Intercept: -0.053 (0.020)** | -0.172 (0.029)**
Frequency: -0.053 (0.014)** | -0.045 (0.023)*
Magnitude: 0.087 (0.015)** | 0.028 (0.022)
Confirmation propensity: 0.015 (0.013) | 0.038 (0.022)
Training: -0.078 (0.026)** | -0.015 (0.042)
N: 482 | 165
R2: 0.16 | 0.03
[Appendix table] DV: Standardized Brier Score. Columns: A. First Forecast | B. Last Forecast.
Intercept: -0.020 (0.015) | -0.131 (0.016)**
Frequency: 0.007 (0.011) | -0.181 (0.012)**
Magnitude: 0.121 (0.011)** | 0.044 (0.012)**
Confirmation propensity: 0.030 (0.011)** | 0.045 (0.012)**
Training: -0.039 (0.021) | -0.042 (0.023)
N: 515 | 515
R2: 0.22 | 0.37
[Appendix table] DV: Standardized Brier Score. Columns: A. Start Period | B. Middle Period | C. End Period.
Intercept: -0.067 (0.017)** | -0.007 (0.023) | -0.050 (0.023)**
Frequency: -0.036 (0.013)** | -0.085 (0.018)** | -0.119 (0.017)**
Magnitude: 0.101 (0.013)** | 0.091 (0.017)** | -0.001 (0.017)
Confirmation propensity: 0.023 (0.012) | 0.071 (0.017)** | 0.067 (0.016)**
Training: -0.045 (0.024) | -0.023 (0.032) | -0.019 (0.032)
N: 515 | 515 | 515
R2: 0.17 | 0.14 | 0.10
[Appendix figure] Mean standardized Brier score (y-axis, 0.0 to 0.4) by forecast order (x-axis, 1 to 5), plotted separately for low- and high-frequency updaters.
[Appendix table] DV: Standardized Brier Score at the forecast level. Columns: A. Without Magnitude | B. With Magnitude.
Intercept: 0.381 (28.82) | 0.331 (19.00)
Forecaster characteristics
High-Frequency: -0.114 (-6.32) | -0.078 (-5.15)
High-Magnitude: – | 0.101 (4.29)
Forecast characteristics
Forecast Order (1 to 5): -0.012 (-4.28) | -0.011 (-4.19)
Time within Question (0 to 1): -0.726 (-77.11) | -0.726 (-77.08)
N Forecasters: 515 | 515
Forecaster Fixed Effects: Yes | Yes
AIC: 318,292.8 | 318,282.8
[Appendix figure] Calibration plots of observed frequency (y-axis) against mean forecast (x-axis) for four forecaster groups: A. Large, infrequent updaters; B. Large, frequent; C. Small, infrequent; D. Small, frequent. Percentage labels indicate the share of forecasts falling into each forecast bin.
[Appendix equation] \sum_{t=1}^{T} (\pi_t - \pi_{t-1})^2 = \pi_0 (1 - \pi_0), where \pi_t denotes the forecast at time t and \pi_0 the initial forecast.
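Assuming this expression refers to the expected total squared belief movement of a Bayesian forecaster whose question eventually resolves to 0 or 1 (cf. Augenblick and Rabin 2018), it can be checked with a short Monte Carlo sketch; the binary-signal process below is an invented illustration.

```python
import numpy as np

def avg_total_squared_movement(pi0=0.3, n_signals=10, signal_acc=0.6,
                               n_paths=200_000, seed=0):
    """Monte Carlo check: for Bayesian beliefs that start at pi0 and
    eventually resolve to 0 or 1, the expected sum of squared updates
    equals pi0 * (1 - pi0). Illustrative sketch with an assumed
    binary-signal process."""
    rng = np.random.default_rng(seed)
    theta = rng.random(n_paths) < pi0                 # latent binary outcomes
    belief = np.full(n_paths, float(pi0))
    total_sq_move = np.zeros(n_paths)

    for _ in range(n_signals):
        # Each signal matches the true outcome with probability signal_acc.
        signal = np.where(rng.random(n_paths) < signal_acc, theta, ~theta)
        like1 = np.where(signal, signal_acc, 1 - signal_acc)  # P(signal | theta = 1)
        like0 = np.where(signal, 1 - signal_acc, signal_acc)  # P(signal | theta = 0)
        updated = belief * like1 / (belief * like1 + (1 - belief) * like0)
        total_sq_move += (updated - belief) ** 2
        belief = updated

    # Final, fully informative update: the question resolves.
    total_sq_move += (theta.astype(float) - belief) ** 2
    return total_sq_move.mean(), pi0 * (1 - pi0)

print(avg_total_squared_movement())  # both numbers should be close to 0.21
```

The simulated average and the analytic value agree because the belief path is a martingale with orthogonal increments, so the expected total squared movement equals the variance of the resolved outcome given the initial forecast.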
[Appendix table] Predictor sets: Activity (columns 1-2) and Psychometrics (columns 3-4). DVs, left to right: Frequency | Magnitude | Frequency | Magnitude.
Intercept: 0.023 (0.0050) | 0.058 (0.073) | -0.084 (0.069) | -0.113 (0.072)**
Frequency: – | -0.608 (0.080)** | – | -0.366 (0.049)**
Magnitude: -0.278 (0.036)** | – | -0.356 (0.050)** | –
Training: -0.030 (0.068) | -0.098 (0.100) | 0.058 (0.097) | -0.200 (0.098)*
Number of questions: 0.160 (0.039)** | 0.384 (0.054)** | – | –
Number of sessions: 0.707 (0.041)** | 0.391 (0.083)** | – | –
Number of news link clicks: 0.113 (0.040)** | -0.004 (0.060) | – | –
Fluid intelligence: – | – | -0.010 (0.048) | -0.139 (0.048)**
Political knowledge: – | – | 0.180 (0.048)** | 0.037 (0.050)
AOMT: – | – | 0.123 (0.048)* | 0.065 (0.049)
N: 294 | 294 | 380 | 380
Adj. R2: 0.67 | 0.30 | 0.17 | 0.15
[Appendix table] Causal mediation estimates (treatment: training; DV: Standardized Brier Score). Columns: Estimate | 95% CI lower | 95% CI upper | p-value.
Mediator: update frequency
Mediation Effect (ACME): -0.017 | -0.033 | 0.001 | 0.07
Average Direct Effect: -0.082 | -0.120 | -0.042 | <0.01
Total Effect: -0.098 | -0.140 | -0.055 | <0.01
Proportion Mediated: 0.167 | -0.015 | 0.353 | 0.07
Mediator: update magnitude
Mediation Effect (ACME): -0.029 | -0.052 | -0.001 | 0.01
Average Direct Effect: -0.069 | -0.107 | -0.030 | <0.01
Total Effect: -0.098 | -0.140 | -0.054 | <0.01
Proportion Mediated: 0.293 | 0.089 | 0.548 | 0.01
Mediator: confirmation propensity
Mediation Effect (ACME): 0.001 | -0.002 | 0.004 | 0.80
Average Direct Effect: -0.098 | -0.140 | -0.055 | <0.01
Total Effect: -0.098 | -0.139 | -0.053 | <0.01
Proportion Mediated: -0.001 | -0.048 | 0.22 | 0.80