Decision Analysis
Vol. 10, No. 4, December 2013, pp. 305–326
ISSN 1545-8490 (print) ISSN 1545-8504 (online) http://dx.doi.org/10.1287/deca.2013.0279
© 2013 INFORMS
Probabilistic Coherence Weighting for
Optimizing Expert Forecasts
Christopher W. Karvetski, Kenneth C. Olson
Department of Applied Information Technology, George Mason University, Fairfax, Virginia 22030
{ckarvetski@gmail.com, kolson8@gmu.edu}
David R. Mandel
Socio-Cognitive Systems Section, DRDC Toronto; and Department of Psychology, York University,
Toronto, Ontario M3J 1P3, Canada, drmandel66@gmail.com
Charles R. Twardy
Command, Control, Communications, Computing, and Intelligence Center, George Mason University,
Fairfax, Virginia 22030, ctwardy@gmu.edu
Methods for eliciting and aggregating expert judgment are necessary when decision-relevant data are scarce.
Such methods have been used for aggregating the judgments of a large, heterogeneous group of fore-
casters, as well as the multiple judgments produced from an individual forecaster. This paper addresses how
multiple related individual forecasts can be used to improve aggregation of probabilities for a binary event
across a set of forecasters. We extend previous efforts that use probabilistic incoherence of an individual fore-
caster’s subjective probability judgments to weight and aggregate the judgments of multiple forecasters for the
goal of increasing the accuracy of forecasts. With data from two studies, we describe an approach for eliciting
extra probability judgments to (i) adjust the judgments of each individual forecaster, and (ii) assign weights
to the judgments to aggregate over the entire set of forecasters. We show improvement of up to 30% over the
established benchmark of a simple equal-weighted averaging of forecasts. We also describe how this method
can be used to remedy the “fifty–fifty blip” that occurs when forecasters use the probability value of 0.5 to
represent epistemic uncertainty.
Key words: probabilistic coherence; forecast aggregation; crowdsourcing; linear opinion pool; fifty–fifty blip;
practice
History: Received on February 21, 2013. Accepted by Rakesh Sarin on July 26, 2013, after 1 revision.
1. Introduction
1.1. Aggregating Forecaster Judgment
Decision makers often rely on the subjective judg-
ment and expertise of forecasters when little to no
decision-relevant “hard” data exist. Methods for elic-
iting and aggregating the judgments of multiple fore-
casters have proven to be valuable tools for improving
the accuracy of the judgments in a variety of engineer-
ing and other settings (Cooke and Goossens 2008).
Clemen and Winkler (1999) give an overview describ-
ing the combination of expert judgment in the context
of risk analysis.
When forecasting the occurrence of a future event,
such as the outcome of an upcoming election,
each forecaster provides a subjective probability
distribution concerning the resolution of the event.
For binary events, the distribution is typically a sin-
gle probability value, and the resolution value is one
if the event occurs or zero if the event does not occur.
Although more sophisticated Bayesian and clas-
sical methods of aggregation have been proposed
(e.g., Merrick 2008, Cooke and Goossens 2008), a sim-
ple, equal-weighted averaging of probability distri-
butions across the crowd of forecasters, known as
a linear opinion pool (LINOP), is generally consid-
ered the benchmark for aggregation (Clemen 2008,
Clemen and Winkler 1999). Consider, for example,
the Aggregative Contingent Estimation (ACE) Pro-
gram, which began in 2010 as a multiyear, six-team
forecasting challenge sponsored by the Intelligence
Advanced Research Projects Activity (IARPA). The
goal of the ACE Program is, through the development
of advanced forecasting techniques, to dramatically
enhance the accuracy of forecasts for a broad range of
event types, compared to a linear opinion pool.
Nevertheless, it is reasonable to believe that in a
large crowd, particular forecasters will produce more
accurate judgments than other forecasters, and giv-
ing more weight to these forecasters will improve
upon the LINOP (Genest and McConway 1990). A key
challenge is identifying these better forecasters before
resolution. Previous weighting schemes have used
forecaster performance data on resolved questions or
seed questions to generate forecaster weights (Cooke
and Goossens 2008, Cooke 1991). However, these
approaches require the existence of past performance
data and they assume that the data will be a good
indicator of future forecasting performance.
Recent studies have shown that weighting forecast-
ers by the degree to which their forecasts are proba-
bilistically coherent can improve upon a LINOP (Tsai
and Kirlik 2012, Wang et al. 2011). Probabilistic coher-
ence implies that the distribution over the events in
a probability space does not violate the basic axioms
of probability (De Finetti 1990, Kolmogorov 1956).
Importantly, probabilistic coherence is the only evalu-
ation method for an individual’s probabilities that can
be done before resolution (Lindley et al. 1979).
Although coherence weighting within aggregation
has been shown to improve over a LINOP, a facile
approach has never been developed and tested for
simple events. That is, given a binary event A, it
would be useful to know (a) what the best judgments
are to elicit in addition to P(A) to obtain a measure
of probabilistic coherence, (b) if the related judgments
should be elicited independently or concurrently, and
(c) how the degree of coherence should be converted
to an aggregation weight.
1.2. Overview of Paper
In this paper, using the data from two studies,
we build on past efforts and develop and test an
approach for eliciting small sets of related probabil-
ities from individual forecasters, which are used to
aggregate the probabilities over the crowd of fore-
casters. In particular, by giving greater weight to
more coherent forecasters in a weighted averaging,
we expect to significantly improve upon a LINOP
using a relatively facile and highly feasible approach.
Within the approach, we describe the importance of
generating independent or “spaced” (as opposed to
concurrent) intraforecaster judgments, as our method
is designed to capitalize on extra incoherence pro-
duced by forecasters making independent judgments.
The rest of the paper is organized as follows: In the
next section, we describe related literature for coher-
ence weighting. We review key principles of forecast
elicitation and aggregation in this second section as
well. In §3, we outline our experimental methods, and
§4 describes the results and insights from the two
studies. In §5, we discuss the results, offer conclu-
sions, and suggest future research.
2. Related Literature
2.1. Coherentization and Coherence Weighting
Methods of elicitation and aggregation of judgment,
including the LINOP, are usually applied to a set of
at least two forecasters, and are sometimes applied to
a large “crowd” of forecasters (i.e., “crowdsourcing”;
Surowiecki 2005). These methods, however, have also
been applied to an individual forecaster. In this lat-
ter case, the individual forecaster is prompted to pro-
duce multiple, independent estimates for the same
unknown parameter, and these estimates are com-
bined to yield, on average, a more accurate judgment.
Herzog and Hertwig (2009) describe “dialectical
bootstrapping” as a way to generate two estimates
from an individual, with the hope of bracketing the
true unknown value or quantity (Larrick and Soll
2006). When the unknown quantity is a probability of
a binary event A, multiple estimates cannot bracket
the true resolution of one or zero. One can elicit
P(A) ∈ [0, 1] as the first judgment and P(A^c) ∈ [0, 1]
for the second judgment, where A^c is the complement
of the event A. These judgments can be logically con-
sidered two judgments for the same quantity P(A),
because they are linked using an axiom of probability:
P(A) = 1 − P(A^c). However, in practice, the events A
and A^c are not always linked logically in forecasters’
cognition, and thus the judgments may be incoher-
ent (Mandel 2005, 2008). Probabilistic incoherence can
manifest in different ways because of refocusing or
unpacking effects that are described in support the-
ory (Tversky and Koehler 1994), but support theory
cannot explain the incoherence of probability judg-
ments of binary complements (Macchi et al. 1999).
Although the relationship between probabilistic
coherence and accuracy has been described previ-
ously in the literature (Wright et al. 1994), its mea-
surement and use within forecast aggregation has
not been given full attention until recently (Tsai and
Kirlik 2012, Wang et al. 2011). In addition to averaging
P(A) and 1 − P(A^c) to produce a single judgment, we
can also quantify the degree of incoherence as the dif-
ference between these values. Wang et al. (2011) were,
to the best of our knowledge, the first to measure
and weight individual forecasters according to their
degree of probabilistic coherence. Rather than elicit-
ing just two estimates and taking a simple average,
they elicited dozens of estimates from each forecaster
for related uncertain variables in the 2008 U.S. pres-
idential election. The survey from which they took
their data asked questions such as, “What is the prob-
ability that Obama wins Indiana?” and, “What is the
probability that Obama wins Indiana and McCain
wins Texas?”
Because a simple averaging of estimates is not well
defined for many related, but logically nonequiva-
lent events, Wang et al. (2011) applied the coher-
ent approximation principle (CAP) (Predd et al. 2008,
Osherson and Vardi 2006) to obtain a coherent set of
probabilities that best represented the elicited inco-
herent subjective probabilities across all forecasters.
The CAP was proposed to obtain a coherent set of
forecast probabilities that is least different in terms of
squared deviation from the elicited forecast probabil-
ities. This “closest” set of coherent forecast probabili-
ties is found by projecting the incoherent probabilities
onto the coherent space of forecast probabilities, thus
allowing incoherent probabilities to be approximated
by coherent probabilities that are minimally different.
The incoherence metric is then the Euclidean distance
from an incoherent set of forecast probabilities to the
“closest” coherent set of forecast probabilities.
For a simple event A, in terms of yielding an aggregated judgment for P(A), the concept of averaging P(A) and 1 − P(A^c) and the concept of the CAP are equivalent, as shown in Figure 1. For this example, an individual produces a judgment y_1 for P(A) of 0.7, and then produces a judgment y_2 for P(A^c) of 0.7. Taken together, these two judgments are
[Figure 1: The coherent approximation principle (Predd et al. 2008, Osherson and Vardi 2006) for P(A) and P(A^c). The incoherent point Y = (0.7, 0.7) lies off the coherent set C, the line joining (0, 1) and (1, 0); its projection P = (0.5, 0.5) is the closest coherent point.]
incoherent and lie above the set of coherent estimates for (P(A), P(A^c)), which is represented by the line between (0, 1) and (1, 0). We can obtain an averaged estimate as P(A) = (0.7 + (1 − 0.7))/2 = 0.5 and do the same to yield the averaged estimate of P(A^c) = 0.5.

Equivalently, with the CAP, we can think of the coherent estimate (0.5, 0.5) as the projection of (0.7, 0.7) onto the set of coherent estimates, and we can uniquely measure the degree of incoherence as the Euclidean distance between the two points to yield an incoherence metric (IM). For this particular example, the IM is

\[
\mathrm{IM} = \sqrt{(P(A) - y_1)^2 + (P(A^c) - y_2)^2} = \sqrt{(0.5 - 0.7)^2 + (0.5 - 0.7)^2} \approx 0.28.
\]
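To make the two-way case concrete, the projection and the IM can be computed in closed form. The following minimal Python sketch is ours, for illustration only, and reproduces the example above:

```python
# Two-way coherentization: project (y1, y2) = (P(A), P(A^c)) onto the
# coherent line x1 + x2 = 1 and report the incoherence metric (IM).
import numpy as np

def coherentize_two_way(y1, y2):
    """Return the closest coherent pair and the Euclidean distance to it."""
    y = np.array([y1, y2])
    residual = y1 + y2 - 1.0            # how far the pair is from additivity
    p = y - residual / 2.0              # orthogonal projection onto the line
    im = np.linalg.norm(y - p)          # IM = |residual| / sqrt(2)
    return p, im

p, im = coherentize_two_way(0.7, 0.7)
print(p, round(im, 2))                  # [0.5 0.5] 0.28
```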
Applying the CAP is a constrained optimization
problem: minimization of the Euclidean distance, sub-
ject to constraints for the new probabilities to be
coherent. Importantly, the CAP is an operational con-
cept for any set of related subjective probabilities,
given that the coherent set of probability forecasts can
be expressed mathematically.
Wang et al. (2011) used a concept of local coherence
to measure an individual judge’s degree of coherence.
After weighting each judge’s distribution according to
their local incoherence metric, giving less weight to
more incoherent forecasters, the CAP was employed
once within a complex numerical algorithm to find an
aggregate coherent probability distribution that least
modified the weighted average distribution. Concep-
tually, giving less weight to more incoherent fore-
casters should generate forecasting accuracy gains
for multiple reasons. First, when a set of forecasts
is incoherent, at least one estimate must be inac-
curate (Mandel 2005). For every incoherent subjec-
tive probability distribution, there exists a coherent
probability distribution that dominates the incoher-
ent distribution in terms of the Brier score—a scor-
ing rule described in more detail later (De Finetti
1990). Second, coherence might signal a more system-
atic accessing and consideration of relevant informa-
tion. And third, coherence can indicate the care and
effort taken by a forecaster to yield his or her estimate
(Wang et al. 2011).
Initially, the optimization required for the CAP was
equivalent to an NP-hard decision problem, but Predd
et al. (2008) created an algorithm that decomposed
the optimization into subproblems that used “local”
sets of related events. The weighting scheme devised
by Wang et al. (2011) maintained a similar run time,
which was less than a full solution to the CAP but
still substantial.
Using “big data” for the 2008 presidential election
set, Wang et al. (2011) found that the coherent adjust-
ments with more weight given to judges with greater
individual coherence produced significantly greater
stochastic accuracy—with upward of 41% improve-
ment in average Brier score—and was comparable
with other models like Intrade in predicting the out-
comes of the United States in the election. How-
ever, this aggregation procedure was complex, and
required significant computational expertise and time.
2.2. Novelty of This Paper
Coherence weighting within judgment aggregation is
useful for one-time, unique events, where judges’ pre-
vious performance is not known, or where a decision
maker does not want to rely on perceived expertise
and other demographics of judges, which have been
shown to be poor indicators of forecasting perfor-
mance (Burgman et al. 2011, Tetlock 2005). Although
we incorporate the CAP and coherence weighting
into the approach of this paper, we differ from Wang
et al. (2011) in five key ways.
First, we are interested in how coherence weight-
ing and “coherentizing” of probabilities can be used
for one binary event A, rather than a large given
collection of related probability events. Second, we
ask what the best questions are and how many we
should ask within our approach to increase the fore-
casting accuracy of the event A. Third, we consider
simpler coherentizing algorithms that can be done
without a need for significant computing time (see
Wang et al. 2011) and expertise. We formulate a sim-
ple quadratic program that is done for each user-
question pair, which implies the run time for the
approach is linear in questions and users, rather than
one that is severely increasing in these two factors.
Fourth, when eliciting probabilities for multiple, unre-
lated, simple events, we consider that a judge might
be an expert with respect to some events, and yet
be uninformed for other events. Therefore, we weight
individuals on each question rather than globally
for all questions. And, fifth, we describe a different
type of coherence-weighting function that has a nat-
ural threshold for decreasing the weight assigned to
judges with extreme epistemic uncertainty.
Congruent aims of the paper are to assess the
important elicitation-aggregation trade-off in efforts
to boost forecasting accuracy. On the one hand, there
is some evidence showing that aggregation of multi-
ple judgments (e.g., judging the number of jellybeans
in a jar) is improved when they are elicited indepen-
dently from different forecasters (Surowiecki 2005).
More recently, there is also evidence to suggest that
multiple intraforecaster elicitations can improve accu-
racy, especially when the elicitations are spaced apart
to promote independence. In one study, Vul and
Pashler (2008) elicited two quantity judgments from
individual forecasters for general knowledge ques-
tions, with the judgments elicited in immediate suc-
cession or separated by three weeks to promote
independence. Averaging judgments in both cases
improved accuracy, but the benefit was significantly
greater in the spaced condition. As with interfore-
caster aggregation (crowdsourcing), intraforecaster
aggregation seems to work best when the individual
forecasts are not correlated, and spacing apart elicita-
tions seems to promote that sort of independence.
On the other hand, some researchers proposed that
judgment quality can be improved by eliciting judg-
ments in ways that encourage a fuller consideration
of pertinent information and the relation between
alternative hypotheses (e.g., Hirt and Markman 1995,
Sieck et al. 2007). In a classic study, Lord et al. (1984)
demonstrated a correction to social judgment when
they either made opposing possibilities more salient
to people or directly instructed them to consider the
opposite of their beliefs of the moment.
One way to promote a consideration of the relations
between alternative hypotheses is to elicit estimates
of each hypothesis in close succession or concurrently.
For instance, Mandel (2005) found that probability
judgments of A and A^c were more likely to be coherent
(i.e., additive) if the judgments were elicited in imme-
diate succession rather than spaced apart with an
intervening distracter task. In a related vein, Williams
and Mandel (2007) found an improvement in the
quality of conditional probability judgments, both in
terms of coherence (i.e., additivity) and accuracy (dis-
tance from mathematical probability), when queries
were elicited with “evaluation frames” (a term coined
by Tversky and Koehler 1994) that explicate the oppo-
site possibility (e.g., “Given A, what is the probabil-
ity of X rather than not-X?”) rather than “economy
frames” that only explicate a focal hypothesis (e.g.,
“Given A, what is the probability of X?”). As with the
consecutive elicitations, evaluation frames encourage
judges to think about one hypothesis in relation to
other related hypotheses, increasing what Hsee (1996)
calls their evaluability.
These two perspectives—improving aggregation by
spacing apart elicitations and improving individual
judgments by eliciting probabilities together—present
a trade-off that has yet to be carefully examined.
Methods that encourage people to see the depen-
dence between probabilities should improve judg-
ment quality by increasing the accessibility of logical
rules of probability and by improving the weight-
ing of evidence among options. These advantages
support an argument in favor of concurrent elicita-
tion methods. However, methods that obscure the
relations between probabilities through spaced elic-
itations should decrease coherence of related judg-
ments and increase the variability of judgments
and incoherence across forecasters. Thus, judgment
aggregation procedures that capitalize on incoherence
and variability should benefit from spaced elicitation
methods. With our two studies described in the next
section, we test both independent and concurrent elic-
itations in our aggregation approaches with the goal
of obtaining the most accurate aggregate estimates.
3. Methods
3.1. Overview of the Two Studies
For a single event A, we want to determine the
best way to elicit extra information from judges to
use probabilistic incoherence to identify the “better”
judges. Our ultimate goal is to maximize the improve-
ment in judgment accuracy of P4A5, when compared
with the equal-weighted averaging of judgments,
by giving more weight to better judges. To sim-
ulate expert judgment with 60 simple A events, we constructed 60 statements, both true (e.g., A = {Michelangelo painted the Sistine Chapel}) and false (e.g., A = {Melbourne is the capital of Australia}), over general knowledge categories, and, in two studies, recruited undergraduate psychology student participants to serve as judges. Each participant provided confidence assessments rather than true forecasts. We asked the students about the veracity of each statement to generate P(A). For example, if a participant believed a statement A was true with a confidence or subjective probability of 0.75, he or she was instructed to give P(A) = 0.75. To measure probabilistic incoherence, we also asked about the veracity of the related event statements: B, with A and B mutually exclusive; A^c; and A ∪ B.
The events A, B, A^c, and A ∪ B form a probability space of related events for each general knowledge category. In this research, A is the focal event, and the remaining three events serve as auxiliary events for the purpose of improving the forecasting accuracy of P(A). In both studies, P(B), P(A^c), and P(A ∪ B) were elicited, but, recognizing that different elicitations may be more useful, the analyses within this paper were conducted using three alternative coherentization schemes: (a) the two-way (complements only) scheme relies on just P(A) and P(A^c); (b) the three-way (disjunctions only) scheme relies on P(A), P(B), and P(A ∪ B); and (c) the four-way (complements and disjunctions) scheme relies on all four probabilities.
Study 1 fostered independent intraperson judg-
ments for the related events by spacing apart the
related judgments in a manner that maximized inter-
elicitation distance for related items (i.e., B, A^c, and
A ∪ B), whereas study 2 elicited the related judg-
ments concurrently, thus minimizing inter-elicitation
distance. This allowed us to test whether concurrently
elicited probabilities improved the accuracy of the
aggregated judgments and decreased the degree of
incoherence. This also allowed comparison of the
effects of coherence weighting of independent judg-
ments versus coherence weighting of judgments that
were assessed together.
In both studies, the statements used were the
same, and a quadratic programming model was used
to coherentize the estimates for each category and
generate the incoherence metrics that were used
to weight and aggregate the judgments for P(A).
The conjectures of the paper are described in testable
hypotheses:
Hypothesis 1. An equal-weighted average of coherentized estimates of P(A) increases accuracy compared to an equal-weighted average of raw estimates of P(A) (for both studies 1 and 2).

Hypothesis 2. A coherence-weighted average of coherentized estimates of P(A) increases accuracy compared to an equal-weighted average of coherentized estimates of P(A) (for both studies 1 and 2).

Hypothesis 3. An equal-weighted average of raw estimates of P(A) is more accurate when related estimates (i.e., B, A^c, A ∪ B) are elicited concurrently rather than in a spaced manner.

Hypothesis 4. A coherence-weighted average of coherentized P(A) is more accurate when related estimates are elicited in a spaced manner rather than concurrently.
Hypotheses 1 and 2 are formulated with Wang
et al. (2011) in mind, and Hypothesis 3 is formulated
with Sieck et al. (2007), Mandel (2005), Hsee (1996),
Hirt and Markman (1995), and Lord et al. (1984) in
mind. We assume that spaced estimates will be more
incoherent than concurrent estimates. For Hypothe-
sis 4, we further assume that our method will take
full advantage of the added incoherence. We exam-
ine the initial hypotheses (Hypotheses 1 and 2) for
the two-way, three-way, and four-way coherentization
schemes, and we examine Hypothesis 4 for the opti-
mal scheme and close competitors.
3.2. Experiment Design
Both studies 1 and 2 used the same 60 general
knowledge categories, where each category contained
statements A, B, A^c, and A ∪ B. Asking for a probability
estimate concerning the veracity of each statement
generated a 240-question survey. The statements
used in the studies were designed such that freshman
undergraduate psychology students would have
familiarity with at least some of the topics. The state-
ments were therefore spread over topics in history,
geography, psychology, economics, postal abbrevi-
ations, state/country capitals, art, politics, science,
sports, and other topics.
An example A statement was “In the Earth’s solar system, Mars is the fifth planet from the Sun,” which is false. In this category, the B statement was “In the Earth’s solar system, Jupiter is the fifth planet from the Sun,” the A^c statement was “In the Earth’s solar system, Mars is NOT the fifth planet from the Sun,” and the A ∪ B statement was “In the Earth’s solar system, Mars or Jupiter is the fifth planet from the Sun.” Additional examples of statements are
found in Appendix A.
In both studies, student participants, who were
blind to the full purpose of the research, served
as judges and provided the probabilities to be
aggregated. The student participants were only
incentivized to participate with credit/no credit for
fulfilling a course requirement, and this decision was
based only on completion, and not performance. Each
student participant was presented with a statement,
and used a pull-down menu to select his or her sub-
jective probability from 0% to 100% that the statement
was true. The elicited probability was then displayed
on a ruler beneath the estimate to give the partici-
pant a visual aid. The participant then submitted an
answer and moved to the next screen with the next
statement(s).
3.3. Coherentization
Letting the vector y be the elicited vector of subjective probability estimates for a coherentization scheme, we have p as the “closest” coherent vector of probabilities. Finding p is done by coherentization using a quadratic programming model.¹ For example, for the four-way scheme, y = [y_1, y_2, y_3, y_4] is elicited, and p = [P(A), P(B), P(A^c), P(A ∪ B)] is found as

\[
\min_{p} \quad (y_1 - P(A))^2 + (y_2 - P(B))^2 + (y_3 - P(A^c))^2 + (y_4 - P(A \cup B))^2,
\]
such that
1. P(A) + P(A^c) = 1,
2. P(A) + P(B) = P(A ∪ B) (since A and B are mutually exclusive),
3. 0 ≤ P(A), P(B), P(A ∪ B), P(A^c) ≤ 1.

¹ We use the standard quadratic programming package in MATLAB, called “quadprog.”
In all elicitation cases, the objective function is the squared Euclidean distance between y and p. The square root of the objective function yields the incoherence metric for the category for each participant. In the example of the four-way scheme, the first two constraints are represented as linear equality constraints, and the last set of constraints is represented as linear inequality constraints.² Taken together, the three sets of coherence constraints form a convex set of coherent probabilities. The convexity of the set of coherent probabilities implies that any weighted average of coherent probabilities is again coherent, as a weighted average is a convex combination of points.

² If one were to ask conditional probability questions, the constraints would no longer be completely linear and would include ratio constraints.
As an example of the two-, three-, and four-way coherentization schemes, consider the incoherent estimates y = [y_1, y_2, y_3, y_4] = [0.4, 0.3, 0.5, 0.6], where y_1 is an estimate of P(A), y_2 is an estimate for P(B), y_3 is an estimate for P(A^c), and y_4 is an estimate for P(A ∪ B). Then the multiple formulations yield coherentized probabilities (reproduced by the sketch below), such as:
• Two-way: [P_c(A), P_c(A^c)] = [0.45, 0.55], IM = 0.07.
• Three-way: [P_c(A), P_c(B), P_c(A ∪ B)] = [0.37, 0.27, 0.63], IM = 0.06.
• Four-way: [P_c(A), P_c(B), P_c(A^c), P_c(A ∪ B)] = [0.42, 0.24, 0.58, 0.66], IM = 0.12.
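These projections can be reproduced with any quadratic programming routine. The sketch below is one possible Python implementation using SciPy’s SLSQP solver in place of the MATLAB quadprog package cited in the footnote; the two- and three-way schemes differ only in which variables and constraints are retained.

```python
# Four-way coherentization as a small constrained least-squares problem.
import numpy as np
from scipy.optimize import minimize

def coherentize_four_way(y):
    """Project y = [P(A), P(B), P(A^c), P(A u B)] onto the coherent set."""
    y = np.asarray(y, dtype=float)
    constraints = [
        {"type": "eq", "fun": lambda x: x[0] + x[2] - 1.0},   # P(A) + P(A^c) = 1
        {"type": "eq", "fun": lambda x: x[0] + x[1] - x[3]},  # P(A) + P(B) = P(A u B)
    ]
    res = minimize(lambda x: np.sum((x - y) ** 2), y,
                   method="SLSQP", bounds=[(0.0, 1.0)] * 4,
                   constraints=constraints)
    return res.x, np.sqrt(res.fun)      # coherent vector p and IM = ||y - p||

p, im = coherentize_four_way([0.4, 0.3, 0.5, 0.6])
print(np.round(p, 2), round(im, 2))     # [0.42 0.24 0.58 0.66] 0.12
```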
3.4. Aggregating Probabilities
For aggregating, given y_1^i as the raw subjective probability for P(A) from the ith student participant, and P_c^i(A) as the coherentized probability for P(A) from the ith participant, the simple equal-weighted average of the raw estimates for all N participants is defined as

\[
\frac{1}{N} \sum_{i=1}^{N} y_1^i;
\]

the simple, equal-weighted average of the coherentized estimates is similarly defined as

\[
\frac{1}{N} \sum_{i=1}^{N} P_c^i(A);
\]

and the coherence-weighted average of coherentized estimates is defined as

\[
\frac{1}{\alpha} \sum_{i=1}^{N} \phi(\mathrm{IM}_i) \, P_c^i(A), \quad \text{with } \alpha = \sum_{i=1}^{N} \phi(\mathrm{IM}_i),
\]

where φ(IM_i) is the weighting function evaluated at the ith participant’s incoherence metric for the category containing A.
When weighting forecasters according to IM_i, with IM_i defined as the Euclidean distance between y^i and p^i, the weighting function is defined on the nonnegative real line, and the function decreases as IM_i increases to increasingly penalize incoherent forecasts. Because the weights are normalized during the aggregation, only the ratio values of the weights are relevant (not the absolute values). Arbitrarily setting the weight for a perfectly coherent set of forecasts (IM_i = 0) to a value of 1, a set of weighting functions that satisfies these conditions is described as

\[
\phi(\mathrm{IM}_i) = \left( \frac{\mathrm{IM}_{\max} - \mathrm{IM}_i}{\mathrm{IM}_{\max}} \right)^{k},
\]

where IM_max is the largest incoherence score recorded for any category over all participants and all questions. The value of IM_max is always known before resolution, and all IM_i values are assigned nonnegative weights. The parameter k is a scale parameter; when k = 0, all IM values receive a weight of one, thus reducing to a simple equal-weighted averaging. When k = 1, the weighting function is a linear, decreasing function that weights perfect coherence as a value of one, and the largest incoherence metric as zero. As k approaches infinity, only the perfectly coherent forecasts (IM_i = 0) are assigned a nonzero weight.
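Putting the pieces together, a minimal sketch of the weighting function and the three aggregation rules might look as follows. This is our illustration with our own variable names, assuming the raw estimates, coherentized estimates, and incoherence metrics for one question have already been computed as above:

```python
# Coherence-weighted aggregation of P(A) across N forecasters for one question.
import numpy as np

def phi(im, im_max, k=15):
    """Weight 1 at IM = 0, decreasing to 0 at the largest recorded IM."""
    return ((im_max - np.asarray(im)) / im_max) ** k

def aggregate(y1, pc, im, im_max, k=15):
    """Return (LINOP, equal-weighted coherentized, coherence-weighted)."""
    y1, pc, im = (np.asarray(v, dtype=float) for v in (y1, pc, im))
    linop = y1.mean()                          # equal-weighted raw estimates
    eq_coherent = pc.mean()                    # equal-weighted coherentized
    w = phi(im, im_max, k)
    cw_coherent = np.sum(w * pc) / np.sum(w)   # normalized coherence weights
    return linop, eq_coherent, cw_coherent
```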
Given that the coherentizing and coherence-
weighting approaches are performed before the res-
olutions of the questions are known, an important
question concerns a best value for the parameter k.
Previous literature provides a good first guess.
In particular, a significant challenge when eliciting
probabilities for a diverse set of events is that judges
commonly use the probability value of 0.5 to indi-
cate a probability value of 0.5 (i.e., a point-mass value
at 0.5 for a fair coin flip) but also use 0.5 to indicate
epistemic uncertainty (i.e., a uniform 0–1 distribution).
In the first case, the forecaster is describing aleatory
uncertainty (Paté-Cornell 1996), and has assessed that
both the events A and A^c are equally likely. However,
in the second case, the forecaster might not have suffi-
cient knowledge of the event space or other informa-
tion to make an informed forecast.
Although conceptually one could argue that 0.5 is
the appropriate point estimate for epistemic uncer-
tainty, the excess of 0.5 probabilities can lessen the
influence of judges that might have important insight
when an equal-weighted average is used to pro-
duce an aggregate estimate. The epistemic use of 0.5
implies a blip or jump at 0.5 in the histogram of
elicited probabilities (Bruine de Bruin et al. 2002).
A particular challenge for forecast aggregation is then
to differentiate the forecasters that are describing
aleatory uncertainty, and those that are describing
epistemic uncertainty.
When there is a significant degree of epistemic uncertainty among judges, some will likely enter 0.5 for P(A), P(B), and P(A ∪ B) (especially if the estimates are independently elicited), which will yield incoherence because P(A) + P(B) ≠ P(A ∪ B). Thus the IM score can be advantageous for the three-way and four-way coherentization schemes because a response of 0.5 to express epistemic uncertainty will yield an IM value of approximately 0.29 for the three-way scheme and 0.32 for the four-way scheme, but a response of 0.5 to express aleatory uncertainty (i.e., P(A) = P(B) = 0.5 and P(A ∪ B) = 1) will yield an IM value of zero. We therefore want to pick a value of k appropriately to assign sufficiently small weights to these epistemic judgments and judgments that are further away from the coherent set of judgments. With this in mind, the scale parameter is fixed for the following studies at k = 15, which yields significantly small weight values of φ(0.29) = 0.021 and φ(0.32) = 0.013. In a later section, we demonstrate how this parameter is sufficient to alleviate issues with the fifty–fifty blip, and also demonstrate the robustness of our gains in forecasting accuracy using sensitivity analysis.
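As an illustrative check, continuing the Python sketches above (the value IM_max ≈ 1.28 is our back-calculated assumption chosen to reproduce the reported weights; the studies use the largest observed metric instead):

```python
# Weights for the epistemic all-0.5 response pattern, under assumed IM_max.
print(round(phi(0.29, im_max=1.28, k=15), 3))  # ~0.021 (three-way epistemic 0.5s)
print(round(phi(0.32, im_max=1.28, k=15), 3))  # ~0.013 (four-way epistemic 0.5s)

# The four-way IM of answering 0.5 everywhere, via the earlier QP sketch:
_, im = coherentize_four_way([0.5, 0.5, 0.5, 0.5])
print(round(im, 2))                            # ~0.32
```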
4. Experiments
4.1. Study 1: Spaced Judgments of Related
Probabilities
Study 1 featured 30 undergraduate George Mason
University psychology students who provided the
probability estimates to be aggregated. As noted ear-
lier, the important distinction between the first and
second studies was that the probabilities for events
in the same category were spaced as far apart as
possible in study 1 to try to foster independent intra-
participant judgments for each category, whereas the
probabilities for events in the same category were
elicited concurrently in study 2. In study 2, all state-
ments in a category received probability judgments
together, although the statement order was random-
ized. In study 1, the first randomly chosen state-
ment for a category received a probability judgment,
but the next statement in the same category did
not receive a judgment until the participant cycled
through unrelated statements of the other 59 cat-
egories. With the randomization, each participant
did not know if they were providing a probability
for A, B, A^c, or A ∪ B, and the participant was not
allowed to change previously submitted probabilities.
Instructions were to enter “a probability between
0% and 100%. If you are absolutely certain that the
statement is true, you should enter 100. Likewise, if
you are absolutely certain that the statement is false,
you should enter 0. If you are uncertain, you should
enter the probability that corresponds with what you
think are the chances that the statement is true.” Par-
ticipants took an average of 45 minutes to submit all
answers.
After all surveys were completed, the IM scores
were calculated for each judge-category pair, and
Hypotheses 1 and 2 were tested for the two-way,
three-way, and four-way coherentization schemes.
Figure 2 shows the weighting function (right axis)
overlaid on the histogram (left axis) of incoher-
ence metrics for the 1,800 participant-category pairs
for the four-way coherentization. We see from this
figure that a majority of participants were beyond
the 0.32 cutoff that resulted from answering 0.5 for
P(A), P(B), P(A^c), and P(A ∪ B). We observed similar
results with the three-way coherentization.
Figure 3 shows the histogram of all elicited prob-
abilities y_1^i for P(A) in the top panel, and shows in
the bottom panel the 861 coherentized P_c^i(A) estimates
that received IM_i scores less than or equal to 0.31 for
the four-way coherentization. We note the histogram
bar that included 0.5 decreased the most from the top
to the bottom panel. This observation supports the
[Figure 2: The weighting function with scale parameter set as k = 15 (right axis), and histogram of the incoherence metrics for study 1 (left axis); incoherence metrics span 0 to 1.4, with an annotation at IM = 0.32, the value produced by answering 50% for all questions A, B, A ∪ B, A^c in a category.]
Note. The weighting function assigns to each incoherence metric a weight
that is normalized and used for aggregating over the participants.
conjecture that the weighting approach would reduce
the fifty–fifty blip.
We used two tests for Hypotheses 1 and 2. First,
after we generated all aggregate estimates for P(A)
for all 60 categories, we looked at an average Brier
score (Brier 1950). The Brier score is a proper scor-
ing rule for judgment accuracy, which may be further
decomposed to provide measures of calibration and
[Figure 3: The histogram of 1,800 raw estimates y_1^i (top), and histogram of 861 coherentized estimates P_c^i(A) for categories with incoherence metrics less than 0.31 (bottom) for study 1; the top panel shows the fifty–fifty blip at 0.5.]
Note. Coherentization most affected the bin containing probability estimates of 50%.
discrimination (Yaniv et al. 1991, Murphy 1973), and
the Brier score is one of the most popular scoring
rules (Gneiting 2011). However, for statistical tests we
looked at the number of times (out of 60) that each
aggregation approach produced an estimate that was
closer to the correct resolution value, and we looked
at the change in average absolute distance between
estimates and resolution values. All statistical tests in
this paper are unidirectional, so p-values are reported
for one-tailed tests as consistent with our hypothe-
ses. For some hypotheses, multiple tests must be run,
requiring Šidák correction (p < 0.017 for three t-tests).
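For reference, the scoring used throughout this section can be sketched in a few lines (our illustration): the Brier score of each aggregate forecast against the 0/1 resolution, averaged over the 60 questions.

```python
# Mean Brier score of aggregate forecasts against binary resolutions.
import numpy as np

def mean_brier(forecasts, resolutions):
    """BS = (resolution - forecast)^2, averaged over questions."""
    f = np.asarray(forecasts, dtype=float)
    r = np.asarray(resolutions, dtype=float)   # 1 if true, 0 if false
    return np.mean((r - f) ** 2)

# e.g., forecasts of 0.8 on a true statement and 0.3 on a false one:
print(mean_brier([0.8, 0.3], [1, 0]))          # (0.04 + 0.09) / 2 = 0.065
```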
Table 1 displays the average Brier score over the
60 questions for the various aggregation approaches.
In these cases, the Brier score for a binary event
scores an elicited probability forecast as BS(forecast) = (resolution − forecast)^2, where resolution is the value 0 if the statement is false, and 1 if it is true. The first row of the table displays the average Brier score of the simple, equal-weighted average of the raw estimates y_1^i. The next three rows display the equal-weighted average of the coherentized judgments P_c^i(A) for the various coherentization schemes. We see
that the two-way scheme does not yield much
improvement over the LINOP. The three-way scheme
Table 1  The Average Brier Scores for the Various Aggregation Approaches for P(A), and the Percent Improvement Over the Equal-Weighted Averaging of Raw Estimates y_1^i (LINOP) for Study 1

Aggregation approach                                                                      Average BS (0–1 scale)    Improvement over LINOP (%)
Equal-weighted estimate just using raw y_1^i (LINOP)                                      0.2243                    —
Equal-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                                0.2218                    1.14
Equal-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)                  0.1831                    18.39
Equal-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)       0.1932                    13.86
Coherence-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                            0.2066                    7.88
Coherence-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)              0.1515                    32.45
Coherence-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)   0.1568                    30.10
offers about 18% improvement, which is greater than
that of the four-way scheme. In the bottom three rows, we
see that the coherence weighting of the coherentized
estimates offers an additional improvement over the
equal-weighted average of the coherentized estimates,
and the three-way and the four-way coherentization
schemes offer about 30% total improvement over the
LINOP. Using the average Brier score, we see evi-
dence that supports both Hypotheses 1 and 2 for the
three- and four-way coherentizations.
A formal test of the first two hypotheses looks
at the number of questions in which the respective
methods offered an improvement over the baseline
method. For testing Hypothesis 1, we refer to Table 2,
and see that the equal-weighted averages of coher-
entized estimates for the three-way and the four-way
schemes produced a large proportion of estimates
closer to the correct resolution when compared to the
equal-weighted averages of the raw estimates, but the
two-way scheme did not. Using the normal approxi-
mation with the sample proportion, we show that the
proportion of questions (41 out of 60) for which the
three- and four-way coherentization schemes showed
improvement was significantly greater than 0.5 (three-
way and four-way: t_59 = 3.053, p = 0.002), which is the
proportion one would expect if neither method were
superior. In Table 2, we also see that the three- and
four-way coherentization schemes move the proba-
bilities on average about 0.045 and 0.030 closer to
Table 2  The Number of Times the Equal-Weighted Averaging of Coherentized Estimates P_c^i(A) Improved Over the Equal-Weighted Averaging of Raw Estimates y_1^i (LINOP), and the Average Absolute Improvement Distance and 90% Confidence Intervals for Study 1

Aggregation approach                                                                      No. of times improved over LINOP (out of 60)    Average absolute improvement, 90% CI
Equal-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                                28                                              0.0006, [−0.0081, 0.0093]
Equal-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)                  41*                                             0.0448, [0.0177, 0.0718]
Equal-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)       41*                                             0.0296, [0.0087, 0.0506]

*Statistically greater than 30 with α = 0.05.
the correct resolution, respectively, gains that are sta-
tistically greater than zero (three-way: t_59 = 2.766,
p = 0.004; four-way: t_59 = 2.361, p = 0.011). Table 2 also
displays the respective 90% confidence intervals.
For testing Hypothesis 2, we refer to Table 3,
and see that a coherence-weighted average of coher-
entized probabilities gives better estimates when
compared to the respective methods that use an
equal-weighted average of the coherentized estimates.
In this case, all three weighting methods produced
proportions of improvement over the equal-weighted
average of the coherentized estimates that are greater
than 0.5 (two-way: t_59 = 4.884, p < 0.001; three-way: t_59 = 4.472, p < 0.001; four-way: t_59 = 6.339, p < 0.001). The average distance of improvement is also greater with coherence weighting than with coherentizing alone, and these distances are statistically greater than zero for all three cases (two-way: t_59 = 4.084, p < 0.001; three-way: t_59 = 5.699, p < 0.001; four-way: t_59 = 5.365, p < 0.001). We see that, in this case, the four-way
coherentization offers the greatest average improve-
ment distance, but the three-way scheme is close
behind. Šidák corrections do not change any of these
conclusions.
The results of this study indicate that the three-
way elicitation of event probabilities may be the best
elicitation approach for subsequently implementing
coherentization and coherence weighting. It requires
one less elicitation per question than the four-way
scheme, while yielding similar improvement. Both the
three- and four-way coherentization schemes domi-
nate the two-way scheme for improving accuracy.
Table 3  The Number of Times the Coherence-Weighted Averaging of Coherentized Estimates P_c^i(A) Improved Over the Equal-Weighted Averaging of Coherentized Estimates P_c^i(A), and the Average Absolute Improvement Distance and 90% Confidence Intervals for Study 1

Aggregation approach                                                                      No. of times improved over equal-weighted, coherent (out of 60)    Average absolute improvement, 90% CI
Coherence-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                            46*                                                                 0.0401, [0.0237, 0.0566]
Coherence-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)              45*                                                                 0.0736, [0.0520, 0.0951]
Coherence-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)   49*                                                                 0.0959, [0.0660, 0.1257]

*Statistically greater than 30 with α = 0.05.
4.2. Study 2: Concurrent Judgments of Related
Probabilities
Study 2 featured a different sample of 28 undergrad-
uate GMU psychology students who provided the
probability estimates to be aggregated. The aims of
study 2 were to measure the effect of concurrently
elicited related probability judgments for coherenti-
zation and coherence weighting for comparison with
study 1. Overall, study 2 had the same setup as
study 1, except estimates for P(A), P(B), P(A ∪ B), and
P(A^c) were elicited together on the same screen rather
than spaced apart.
We begin with an examination of the effect of
coherentization and coherence weighting on judg-
ment accuracy and turn to a test of our cross-study
comparisons pertinent to testing Hypotheses 3 and 4
in §4.3. Figure 4 is similar to Figure 2 and displays
the weighting function (right axis) overlaid on the
histogram of incoherence metrics. We note that there
are more forecasts in the left-most coherent bar for
study 2, but there is still a large degree of incoherence
over all the estimates. Figure 5 is similar to Figure 3,
and shows the histogram of all elicited probabilities y_1^i
in the top panel, and the bottom panel shows the 924
coherentized P_c^i(A) estimates that received IM scores
less than or equal to 0.31 for the four-way elicita-
tion. We note that although we elicited fewer total
estimates from participants in study 2, we had more
estimates with incoherence metrics less than or equal
to 0.31. We also note similar support for the conjecture
[Figure 4: The weighting function with scale parameter set as k = 15 (right axis), and histogram of the incoherence metrics for study 2 (left axis); an annotation marks the IM value produced by answering 50% for all questions A, B, A ∪ B, A^c in a category.]
Note. The weighting function assigns to each incoherence metric a weight
that is normalized and used for aggregating over the participants.
that the weighting approach reduced the fifty–fifty
blip.
Table 4 displays the average Brier score over the
60 questions for the various aggregation approaches
for study 2, and displays in parentheses the aver-
age Brier score for study 1. The first row displays
the average Brier score of the simple, equal-weighted
average of the raw estimates yi
1. The next three rows
display the equal-weighted average of the coheren-
tized forecasts, Pi
c4A5. As with study 1, we see that the
two-way coherentization scheme does not yield much
improvement over the equal-weighted average of the
raw estimates. The three-way scheme offers greater
improvement than the four-way scheme. We see that
the coherence weighting of the coherentized estimates
offers an improvement over the equal-weighted aver-
age of the coherentized estimates for study 2, and for
these cases, the three-way and the four-way schemes
are very close.
As with study 1, we formally test Hypotheses 1
and 2 for study 2 by looking at the number of
questions that the respective methods offered an
improvement over the compared method. For test-
ing Hypothesis 1, we refer to Table 5, which shows
that the equal-weighted averages of coherentized esti-
mates for the three- and four-way schemes offer a
proportion of improvement over the equal-weighted
average of the raw estimates that is statistically
significant (three-way: t_59 = 3.053, p = 0.002; four-
way: t_59 = 3.381, p = 0.001). We see that the three- and
[Figure 5: The histogram of 1,680 raw estimates y_1^i (top), and histogram of 924 coherentized estimates P_c^i(A) for categories with incoherence metrics less than 0.31 (bottom) for study 2; the fifty–fifty blip is visible in the top panel.]
Note. We note the reduction in estimates at the probability value 0.5.
four-way coherentization schemes move the probabil-
ities, on average, about 0.031 and 0.015 closer to the
correct resolution, respectively. Because we are mak-
ing three comparisons to test Hypothesis 1, we must
Table 4  The Average Brier Scores for the Various Aggregation Approaches for P(A), and the Percent Improvement Over the Equal-Weighted Averaging of Raw Estimates y_1^i (LINOP) for Study 2

Aggregation approach                                                                      Average BS (0–1 scale), (study 1 BS)    Percent improvement over LINOP (%)
Equal-weighted estimate just using raw y_1^i (LINOP)                                      0.2020 (0.2243)                         —
Equal-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                                0.2076 (0.2218)                         −2.77
Equal-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)                  0.1778 (0.1831)                         11.96
Equal-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)       0.1865 (0.1932)                         7.68
Coherence-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                            0.1928 (0.2066)                         4.54
Coherence-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)              0.1704 (0.1515)                         15.65
Coherence-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)   0.1708 (0.1568)                         15.45
Table 5  The Number of Times the Equal-Weighted Averaging of Coherentized Estimates P_c^i(A) Improved Over the Equal-Weighted Averaging of Raw Estimates y_1^i (LINOP), and the Average Absolute Improvement Value and 90% Confidence Intervals for Study 2

Aggregation approach                                                                      No. of times improved over LINOP (out of 60)    Average absolute improvement, 90% CI
Equal-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                                20                                              −0.0072, [−0.0138, −0.0007]
Equal-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)                  41*                                             0.0311, [0.0100, 0.0522]
Equal-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)       42*                                             0.0149, [0.0025, 0.0273]

*Statistically greater than 30 with α = 0.05.
control for family-wise alpha. Using our stricter stan-
dard (p < 0.017), we see that gains for the three-way
scheme are statistically significant, but gains for the
four-way scheme are not (three-way: t_59 = 2.462,
p = 0.008; four-way: t_59 = 2.009, p = 0.025).
For testing Hypothesis 2, we refer to Table 6, and
see that a coherence-weighted average of coherentized
Table 6  The Number of Times the Coherence-Weighted Averaging of Coherentized Estimates P_c^i(A) Improved Over the Equal-Weighted Averaging of Coherentized Estimates P_c^i(A), and the Average Absolute Improvement Value and 90% Confidence Intervals for Study 2

Aggregation approach                                                                      No. of times improved over equal-weighted, coherent (out of 60)    Average absolute improvement, 90% CI
Coherence-weighted estimate using two-way P_c^i(A), P_c^i(A^c)                            43*                                                                 0.0274, [0.0181, 0.0367]
Coherence-weighted estimate using three-way P_c^i(A), P_c^i(B), P_c^i(A ∪ B)              38*                                                                 0.0286, [0.0074, 0.0498]
Coherence-weighted estimate using four-way P_c^i(A), P_c^i(B), P_c^i(A^c), P_c^i(A ∪ B)   43*                                                                 0.0521, [0.0267, 0.0775]

*Statistically greater than 30 with α = 0.05.
probabilities gives better estimates when compared to
the respective methods that use an equal-weighted
averaging of the coherentized estimates. In this case,
all three weighting methods produce proportions of
improvement that are greater than 0.5 (two-way: t_59 = 3.724,
p < 0.001; three-way: t_59 = 2.143, p = 0.018; four-
way: t_59 = 3.724, p < 0.001). Tests were significant for
all but the three-way scheme, given the conserva-
tive family-wise alpha standard for Šidák correction
to our tests. The average distance of improvement
is also greater with coherence weighting than with
coherentizing alone, and these distances are statisti-
cally greater than zero for all three cases (two-way:
t_59 = 4.915, p < 0.001; three-way: t_59 = 2.258, p = 0.014;
four-way: t_59 = 3.431, p < 0.001). Again we see that the
the four-way scheme, and given that the three-way
requires one less elicitation, we view this as our best
method going forward.
4.3. Comparing Results of Studies 1 and 2
Given studies 1 and 2, we can explicitly test the effect of having independent intraparticipant judgments for the subjective probabilities in each category. Figures 2 and 4 allow us to compare the histograms of incoherence metrics. In general, and as predicted, there were more coherent estimates in study 2 than in study 1. The leftmost histogram bar in study 2 (Figure 4) contains almost 600 estimates, whereas the same bar in study 1 (Figure 2) contains about 450 estimates. In study 1, the 95% confidence interval for the average IM value was [0.3208, 0.3441] for the four-way scheme and [0.2294, 0.2491] for the three-way scheme, whereas in study 2 it was [0.2559, 0.2784] for the four-way scheme and [0.1773, 0.1952] for the three-way scheme.
We test Hypothesis 3 to see the improvement in the equal-weighted average of raw estimates of P(A) (LINOP) when the estimates of P(A), P(B), P(A^c), and P(A ∪ B) are provided concurrently rather than in the spaced manner. In Table 4, we see that the average Brier score of the raw estimates y^i_1 (LINOP) in study 2 is less than that of study 1. In Table 7, for each method, we compared the number of questions where the aggregate estimate of study 1 was better than the aggregate estimate of study 2, and we examined the distance between the averages. For the equal-weighted estimates using the raw y^i_1, the proportion of times that the estimate of study 1 beat the estimate of study 2 was 0.367 (22/60), which is statistically less than 0.5 (t59 = 2.143, p = 0.018), and the average distance between the question estimates is less than zero (t59 = 1.763, p = 0.042). These findings support our prediction that participants would make more accurate raw estimates when their related estimates were elicited concurrently in study 2 rather than spaced independently as in study 1. We also see some support for the equal-weighted estimate of coherentized probabilities of study 2 being slightly better when we look at the average Brier scores in Table 4.
We test Hypothesis 4 to see the improvement in the coherence-weighted average of coherentized estimates of P(A) when the estimates of P(A), P(B), P(A^c), and P(A ∪ B) are provided in the spaced manner, rather than concurrently. We test the three-way scheme, deemed the best-performing method, for Hypothesis 4, and in Table 4, we see that the average Brier score for the three-way coherence weighting scheme in study 1 is less than that in study 2. In testing Hypothesis 4 for the three-way scheme, for the coherence weighting of coherentized estimates, we see that the proportion of questions for which study 1 produces a better estimate than study 2 is greater than 0.5 (t59 = 1.858, p = 0.034). Furthermore, the average absolute improvement distance is greater than zero for the three-way elicitation (t59 = 1.771, p = 0.041).

Table 7  The Number of Times the Estimate of Study 1 Was Closer to the Resolution Than the Estimate of Study 2 for the Various Aggregation Approaches

                                                   No. of times          Average absolute
                                                   study 1 better        improvement of study 1
Aggregation approach                               than study 2          over study 2, 90% CI
Equal-weighted estimate just using
  raw y^i_1 (LINOP)                                22∗                   −0.021, [−0.0408, −0.0011]
Equal-weighted estimate using two-way
  P^i_c(A), P^i_c(A^c)                             28                    −0.0131, [−0.0334, 0.0072]
Equal-weighted estimate using three-way
  P^i_c(A), P^i_c(B), P^i_c(A ∪ B)                 28                    −0.0073, [−0.0217, 0.0071]
Equal-weighted estimate using four-way
  P^i_c(A), P^i_c(B), P^i_c(A^c), P^i_c(A ∪ B)     29                    −0.0062, [−0.023, 0.0105]
Coherence-weighted estimate using two-way
  P^i_c(A), P^i_c(A^c)                             27                    −0.0003, [−0.0335, 0.0329]
Coherence-weighted estimate using three-way
  P^i_c(A), P^i_c(B), P^i_c(A ∪ B)                 37∗∗                  0.0376, [0.0021, 0.0732]
Coherence-weighted estimate using four-way
  P^i_c(A), P^i_c(B), P^i_c(A^c), P^i_c(A ∪ B)     37∗∗                  0.0375, [−0.0096, 0.0846]

Note. The average improvement and 90% confidence intervals are also shown.
∗Statistically less than 30 with α = 0.05.
∗∗Statistically greater than 30 with α = 0.05.
In sum, although we do see a slight expected
increase in accuracy of the equal-weighted average
of the raw estimates by showing the participants
the related judgments on the same screen, we have
generated larger gains in accuracy by using coher-
ence weighting. By spacing out the related judgments
in study 1, we increased the degree of incoherence
among the estimates when compared to study 2, but
within the three-way coherence weighting scheme, we
see independent intraparticipant judgments provide
the best estimates of any of the tested approaches.
5. Discussion
5.1. Summary of Findings
The key insights from the two studies are (i) concurrent judgments improved the raw estimate of P(A) and reduced the degree of incoherence, although they did not eliminate it; (ii) coherentizing individuals' judgments improved accuracy; (iii) coherence weighting generated larger accuracy gains than coherentizing alone; and (iv) independence across related intraparticipant judgments was used to generate more accurate aggregate forecasts with coherence weighting.

Table 8  Summarizing the Effects in Study 1 of Three Aggregation Approaches, with Both Raw and Coherentized Estimates of P(A), on the Average Brier Score Using the Three-Way Coherence Scheme

                                                           Equal weighting of
                             Equal         Coherence       top n-most coherent
Probabilities                weighting     weighting       forecasters per question    n
Raw y^i_1                    0.2243        0.1539          0.1699                      3
                                                           0.1625                      5
                                                           0.1560                     10
                                                           0.1731                     15
Coherentized P^i_c(A)        0.1831        0.1515          0.1685                      3
                                                           0.1617                      5
                                                           0.1533                     10
                                                           0.1653                     15

Note. The threshold approach in the two rightmost columns of the table shows the average Brier scores when only the top n-most coherent judges (out of 30) are aggregated with equal weight.
Overall, when looking at all of the aggregation
approaches tested in terms of average Brier score, the
most accurate method was the three-way coherence
weighting of the coherentized estimates that were
elicited with maximum independence in the spaced
mode (study 1). In our two studies, separate elici-
tations of related probabilities generated less accu-
rate initial estimates of the target probability than did
joint elicitations, allowing more room for improve-
ment by coherentization. Likewise, less coherent esti-
mates allowed greater improvement by coherence
weighting.
The three-way scheme performed best, and from
it, the effects of both coherentization and coherence
weighting are shown in terms of the average Brier
score in Table 8. We see that coherentizing probabil-
ities alone provides some increase in accuracy, but
coherence weighting provides the largest gains in
accuracy for both raw and coherentized probabilities.
In coherence weighting, the least coherent probability
estimates are weighted significantly less than the most
coherent estimates. Thus, the effect of coherentization
is minimal with the coherence weighting.
In Table 8, we also examine how threshold weight-
ing (Tsai and Kirlik 2012) compares with our weight-
ing function approach. In the rightmost columns of
Table 8, we show the Brier scores for equal weighting
of the n-most coherent judges, for various values of n (out of 30). We see that the most accurate aggregate estimates are obtained with an n of about 10, and these threshold methods produce gains in accuracy similar to those of the three-way method described within this paper. However, we note that the inverse of the coherence metric employed by Tsai and Kirlik is not the same as our IM, as their measures describe the coherence of a judge's estimates with Bayes' theorem and historical data.
5.2. Sensitivity Analysis
The major efforts for sensitivity analysis concern the
weighting function employed and the pooling size.
In particular, we found that our Brier score results and
comparisons between the two studies were generally
robust to different weighting functions, provided that the weighting functions assigned very little weight (<0.05) to the critical threshold point that represented answering 0.5 for all questions in the category.
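To make this concrete, the following is a minimal sketch of coherence-weighted aggregation under the assumption of a power-law weighting function w(IM) = (1 − IM)^k; the exponent of 15 and the near-zero weight at the critical IM threshold come from the text, but the exact functional form used in the paper is not spelled out here, so treat this as illustrative rather than as the paper's implementation. The fallback to equal weighting is our own safeguard.

import numpy as np

# Assumed form: w(IM) = (1 - IM)**k. With k = 15, w(0.29) is about 0.006
# (< 0.05), and with k = 20 it is about 0.001, consistent with the weight
# being "practically zero" at the critical threshold.
def weight(im, k=15.0):
    """Map an incoherence metric IM (0 = perfectly coherent) to a weight."""
    return max(0.0, 1.0 - im) ** k

def coherence_weighted_average(estimates, ims, k=15.0):
    """Aggregate coherentized P(A) estimates, weighting each judge by w(IM)."""
    w = np.array([weight(im, k) for im in ims])
    if w.sum() == 0.0:
        return float(np.mean(estimates))  # safeguard: fall back to equal weights
    return float(np.dot(w, np.asarray(estimates, dtype=float)) / w.sum())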
Figure 6 shows in the top panel the average BS for the three-way scheme in study 1 (solid dark line) and study 2 (dashed light line) for the coherence weighting of coherentized estimates, as a function of the exponent (scale) parameter in the weighting function (recall we set this parameter to 15 in the analyses). In the bottom panel, we see the value of the weighting function evaluated at 0.29 as a function of the same parameter. We note that the minimum of the average BS occurs around a parameter value of 20, which coincides with the point where the weighting function evaluated at 0.29 is practically zero. However, the scores are similar for parameter values between 10 and 40. We also see that the average Brier score from study 1 is less than that of study 2, save the region where the parameter is close to zero. This strongly supports our analysis in §4.3, which shows that independent, coherence-weighted, intraparticipant estimates produce more accurate results.

Figure 6  The Average Brier Score for the Three-Way Coherentization Scheme vs. the Scale Parameter for Study 1 (Solid Dark Line) and Study 2 (Dashed Light Line) in the Top Panel, and the Weighting Function Evaluated at 0.29 vs. the Scale Parameter in the Bottom Panel [Figure: the scale parameter runs from 0 to 100 in both panels; the average BS ranges over roughly 0.14–0.20 in the top panel, and the weighting function over 0–1.0 in the bottom panel.]
Figure 7 shows the same analysis as Figure 6, but for the four-way scheme. The top panel shows the average BS of the four-way scheme in study 1 (solid dark line) and study 2 (dashed light line) for the coherence weighting of coherentized estimates as a function of the exponent parameter in the weighting function. The bottom panel shows the value of the weighting function evaluated at 0.32 as a function of the same parameter. We see that the average Brier score of study 1 was less than that of study 2 over a large range of parameter values.

Figure 7  The Average Brier Score for the Four-Way Coherentization Scheme vs. the Scale Parameter for Study 1 (Solid Dark Line) and Study 2 (Dashed Light Line) in the Top Panel, and the Weighting Function Evaluated at 0.32 vs. the Scale Parameter in the Bottom Panel [Figure: the scale parameter runs from 0 to 100 in both panels; the average BS ranges over roughly 0.14–0.20 in the top panel, and the weighting function over 0–1.0 in the bottom panel.]
Testing different weighting functional forms, we found that a linear function from (0, 1) to (0.32, 0) generated similar Brier score results in study 1 for the three-way elicitation (0.1609 with linear versus 0.1515 with the original power function), as well as for the four-way scheme (0.1596 with linear versus 0.1568 with the original power function), recalling that the average Brier score for the equal-weighted averaging of raw judgments was 0.2243.
We also found for study 1 that an indicator weighting function that assigned a weight of 1 to the participant for the question if his or her IM was to the left of the critical threshold, and a weight of 0 at or to the right of the critical IM threshold, still generated sizable gains for the three-way scheme (0.1726 with the indicator function versus 0.1515 with the original power function) and the four-way scheme (0.1752 with the indicator function versus 0.1568 with the original power function). This type of weighting function is similar to the approach of Tsai and Kirlik (2012), yet we allow the number of estimates that are averaged for each question to vary depending on how many judges produce estimates that are to the left of the critical threshold.
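Under the same caveat as the earlier sketch, these two alternative weighting functions can be written as follows; the endpoints (0, 1) and (0.32, 0) and the strict inequality at the threshold follow the text, while everything else is assumed.

def linear_weight(im, threshold=0.32):
    # Linear ramp from (0, 1) down to (threshold, 0); zero beyond the threshold.
    return max(0.0, 1.0 - im / threshold)

def indicator_weight(im, threshold=0.32):
    # Weight 1 strictly to the left of the critical IM threshold, else 0.
    return 1.0 if im < threshold else 0.0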
For study 2, we found that a linear function from (0, 1) to (0.32, 0) generated similar Brier scores for the three-way elicitation (0.1698 with linear versus 0.1704 with the original power function), as well as for the four-way scheme (0.1747 with linear versus 0.1708 with the original power function), recalling that the average Brier score for the equal-weighted averaging of raw judgments was 0.2020.
We also found for study 2 that an indicator weighting function that assigned a weight of 1 to the participant for the question if his or her IM was to the left of the critical threshold, and a weight of 0 at or to the right of the critical threshold, still generated sizable gains for the three-way scheme (0.1755 with the indicator function versus 0.1704 with the original power function) and the four-way scheme (0.1812 with the indicator function versus 0.1708 with the original power function). In sum, these sensitivity results for the weighting function are encouraging for future use of the approach.
Figure 8 shows how the average Brier score results vary depending on the size of the aggregation pool. These results were obtained by randomly sampling 1,000 subsets of participants for each integer size ranging from 1 to 30, and then implementing the respective methods and averaging over the samples to get an average overall Brier score. Recalling that study 1 had 30 participants, the equal-weighted averaging method (LINOP) is shown as the solid black line, and the coherence weighting of coherentized estimates for the three-way scheme is shown as the dashed line, as a function of the pooling size. The initial difference at a pool size of 1 is due to the coherentization of the estimates, and the coherence weighting of coherentized estimates method dominates the LINOP for every pooling size.

Figure 8  The Average Brier Score of the Equal-Weighted Average (LINOP; Solid Line) and the Coherence Weighting of Coherentized Estimates for the Three-Way Coherentization Scheme (Dashed Line), as a Function of the Aggregation Pool Size
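The pool-size analysis can be reproduced with a short resampling loop. A sketch follows, in which `forecasts` (a 30-by-60 array of raw estimates of P(A)) and `outcomes` (sixty 0/1 resolutions) are placeholder names of ours, and only the LINOP curve is shown; the coherence-weighted curve would substitute the weighted aggregate for the simple mean.

import numpy as np

rng = np.random.default_rng(0)

def mean_brier(aggregate, outcomes):
    """Average Brier score of an aggregate forecast vector over all questions."""
    return float(np.mean((np.asarray(aggregate) - np.asarray(outcomes)) ** 2))

def linop_pool_size_curve(forecasts, outcomes, n_samples=1000):
    """Average Brier score of the equal-weighted aggregate vs. pool size."""
    n_judges = forecasts.shape[0]
    curve = []
    for n in range(1, n_judges + 1):
        scores = []
        for _ in range(n_samples):
            pool = rng.choice(n_judges, size=n, replace=False)  # random subset
            scores.append(mean_brier(forecasts[pool].mean(axis=0), outcomes))
        curve.append(float(np.mean(scores)))
    return curve  # one value per pool size 1..n_judges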
As the pool size increases, there are diminishing
returns in the Brier score improvement for the equal-
weighted averaging method, with strong asymptotic
returns around a pool size of about 10. Typically,
diminishing returns are seen with between three and
five experts (Clemen and Winkler 1999, Winkler and
Clemen 2004), and this observation is consistent with
the equal-weighted averaging method. The coherence weighting of coherentized estimates method, however, only begins to asymptote around a pool size
of 30. This observation suggests that, depending on
the degree of incoherence in the pooling population,
the decision maker should seek a larger aggregation
pool for the coherentization method than for a LINOP
to increase the quality of the aggregate estimates.
5.3. Generalization of Research
The results of this paper extend past work that mea-
sures and uses probabilistic coherence to adjust prob-
ability judgments and weight judges in an effort to
produce more accurate aggregated forecasts. Prob-
abilistic coherence provides a logical framework
to elicit multiple, different subjective probabilities
4P 4A51 P 4B51 P 4Ac5, and P 4A B55, and use these prob-
abilities to adjust the probability of interest 4P4A55.
For two probability judgments P4A5 and P4Ac),
creating probabilistic coherence is equivalent to the
approach of averaging P4A5 and 1 P 4Ac5, but prob-
abilistic coherence can be applied to any set of vari-
ables that are logically related. Probabilistic coherence
allows for the most useful information to be elicited.
For example, we can ask about a mutually exclusive
event B, and the union of two events Aand Bto make
a judgment on A. By comparing the gains in accuracy,
we were able to prioritize the best information that
should be elicited.
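As a hypothetical numerical illustration of the three-way case: suppose a judge reports P(A) = 0.6, P(B) = 0.3, and P(A ∪ B) = 0.8 for mutually exclusive events A and B. Coherence requires P(A) + P(B) = P(A ∪ B), but these judgments overshoot by 0.6 + 0.3 − 0.8 = 0.1; distributing this difference equally among the three judgments (as described later in this section) gives the adjusted values 0.567, 0.267, and 0.833, which are additive to within rounding.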
There are key advantages of the approach of
this paper when compared to other aggregation
approaches. The approach does not require questions
to resolve or similar seed questions with known res-
olution values to be constructed. Using coherence
weighting could therefore decrease the time it takes
to prepare for the elicitation, when compared with
other performance-based weighting schemes that use
seed variables. For example, there is a significant time
burden in constructing the seed variables necessary
for Cooke’s classical weighting method (Cooke and
Goossens 2008), especially when the number of seed
variables needed is large (Clemen 2008). The seed
variables also need to closely match the theme of the
real forecasting events.
With the three-way scheme, only one extra event B is constructed, and A ∪ B then follows. The event B directly concerns the target event A. We found that constructing B is usually straightforward, and the three-way coherentization scheme would be easily translated to real forecasting questions. For example, given an upcoming election, the events A and B could describe the various candidates. Alternatively, for forecasting growth of gross domestic product, the events A and B could be different intervals of growth.
We found that eliciting P(A^c) does not generate large accuracy gains for P(A) when compared to eliciting P(B) and P(A ∪ B). There are two potential reasons for this. First, for many questions, there was no convenient way to express A^c, other than to simply say "not A." Thus, participants might have been anchoring on P(A) when providing P(A^c), even if the judgments were elicited independently (Tversky and Kahneman 1974). Eliciting P(B) perhaps allowed the participant to think about P(B) in a manner that did not anchor on P(A). Second, epistemic uncertainty or ignorance can pass as a coherent estimate, and it is not possible to establish a critical IM threshold for answering 0.5 for the probabilities P(A) and P(A^c) with the two-way coherentization scheme, as can be done with the three- and four-way coherentization schemes (IM values of 0.29 and 0.32, respectively).
Coherentizing related probability estimates for the
three- and four-way schemes always increased the
accuracy of the aggregated individual estimate, thus
providing support for “crowdsourcing” within an
individual (Herzog and Hertwig 2009, Vul and
Pashler 2008, Larrick and Soll 2006). Moreover, in
line with previous research (Hirt and Markman 1995,
Lord et al. 1984, Mandel 2005, Sieck et al. 2007,
Williams and Mandel 2007), we found that partic-
ipants’ accuracy and coherence were improved by
judging logically-related events in a concurrent as
opposed to spaced manner. Thus, concurrent judg-
ments may represent a preferable elicitation mode for
improving judgment quality in contexts where judg-
ments will be used without further aggregation or
transformation. However, with coherence weighting,
we found increasing the independence of the related
judgments was more effective in improving the accu-
racy of the aggregate judgments than eliciting them
concurrently. Our best methods produced gains that
were over 30% better than the LINOP, which is in
line with the gains seen by others (Tsai and Kirlik
2012, Wang et al. 2011). Understanding this elicitation-
aggregation trade-off can be important for reaching
decisions about the optimal means for leveraging
forecasts or other advice that comes in the form of
probabilistic judgments.
In terms of eliminating the fifty–fifty blip (Bruine
de Bruin et al. 2002), the current approach effectively
removed the 0.5 probabilities that should likely not
be interpreted as point estimates, but rather that rep-
resent epistemic uncertainty. We were also able to
justify the reduced influence of probabilities other
than 0.5 by using the incoherence metric. Reducing
the number of 0.5 estimates allowed better discrim-
ination, which is shown in Figure 9 in the form of
ROC curves of the equal-weighted averages of the
raw judgments (solid, light gray line), and the three-
way (dashed, dark gray line) and four-way (dotted,
gray line) coherentization schemes. Whereas we do
not see complete dominance over the equal-weighted
average of raw judgments, we do see a significant
advantage of the two coherence-weighting methods
where the response probabilities are between 0.4 and
0.6. (This region is not immediately discernible from
Figure 9, but corresponds approximately to the region
between 0.1 and 0.3 on the x axis, and 0.3 and 0.8 on the y axis.)
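The ROC curves themselves can be traced by sweeping a decision threshold over the aggregate probabilities. The sketch below shows the idea, with `aggregate` (the aggregate estimates of P(A) across questions) and `truth` (the 0/1 resolutions) as assumed placeholder names.

import numpy as np

def roc_points(aggregate, truth):
    """Hit rate vs. false-alarm rate as the decision threshold sweeps [0, 1]."""
    aggregate, truth = np.asarray(aggregate), np.asarray(truth)
    points = []
    for t in np.linspace(0.0, 1.0, 101):
        called_true = aggregate >= t
        hit_rate = np.mean(called_true[truth == 1])   # P(called true | true)
        fa_rate = np.mean(called_true[truth == 0])    # P(called true | false)
        points.append((float(fa_rate), float(hit_rate)))
    return points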
We also observe better discrimination and calibration when we decompose the Brier score for the three- and four-way schemes. We do so for study 1, where we observed the largest performance improvement after coherence weighting. Using a three-part decomposition of the Brier score (BS = uncertainty − discrimination + calibration) (Yaniv et al. 1991, Murphy 1973) with six equally spaced subjective probability partitions, we have for the LINOP in study 1, BS = 0.223 = 0.216 − 0.058 + 0.064. For the three-way scheme, we have BS = 0.150 = 0.216 − 0.082 + 0.016, and we have BS = 0.152 = 0.216 − 0.081 + 0.017 for the four-way scheme. Discrimination over uncertainty, η², captures the proportion of variance in the outcomes explained by the judgment categories (Sharp et al. 1988). In study 1, η² = 0.269 for the LINOP, η² = 0.375 for the three-way scheme, and η² = 0.380 for the four-way scheme. Thus, either coherence-weighting scheme yielded slightly more than a 40% increase over the LINOP in explaining outcome variance, a substantial proportional increase and over a 10% increase in explained variance in absolute terms. In terms of calibration, it is useful to consider the square root of the calibration index taken from the Brier decomposition, since it represents the average absolute deviation from perfect calibration. In study 1, the square roots are 0.253 for the LINOP, 0.126 for the three-way scheme, and 0.130 for the four-way scheme. Thus, the proportional improvement in calibration is close to 50%, an even more substantial increase than we observed for discrimination. Both calibration and discrimination are therefore substantially improved by coherence weighting when compared with the LINOP.

Figure 9  The ROC Curves for Study 1 for the Equal-Weighted Averaging of Raw Judgments (LINOP; Solid, Light Gray Line), the Coherence Weighting of Coherentized Estimates for the Three-Way Coherentization Scheme (Dashed, Dark Gray Line), and the Coherence Weighting of Coherentized Estimates for the Four-Way Coherentization Scheme (Dotted Gray Line) [Figure: x axis, P(statement A is true | statement is false); y axis, P(statement A is true | statement is true).]
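A sketch of the three-part Murphy (1973) decomposition applied above follows, binned into six equally spaced subjective-probability partitions. With binning, the identity BS = uncertainty − discrimination + calibration holds only approximately, and the binning details here are our assumption.

import numpy as np

def brier_decomposition(probs, outcomes, n_bins=6):
    """Murphy (1973) decomposition over equally spaced probability bins."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    calibration = discrimination = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        f_bar, o_bar = probs[mask].mean(), outcomes[mask].mean()
        calibration += mask.sum() * (f_bar - o_bar) ** 2        # reliability term
        discrimination += mask.sum() * (o_bar - base_rate) ** 2 # resolution term
    n = probs.size
    return uncertainty, discrimination / n, calibration / n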
Within the spectrum of technical complexity, the coherence-based weighting approach of this paper is much simpler than the approach of Wang et al. (2011), and comparable to that of simple, equal-weighted averaging. The coherentization algorithm is run for each participant-question pair, and thus the procedure is linear in each factor. The entire analysis can be closely approximated in spreadsheet software for the three-way scheme, without the need for sophisticated computational algorithms. We found the degree to which a set of judgments is incoherent can be approximated as

  √((P(A) + P(B) − P(A ∪ B))²),

and the coherentizing can be closely approximated by distributing this difference P(A) + P(B) − P(A ∪ B) equally among the three probabilities.
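In code, this closed-form approximation amounts to a few lines; the clipping to [0, 1] is our safeguard and not part of the text.

def approx_coherentize(p_a, p_b, p_union):
    """Approximate IM and coherentized values for the three-way scheme."""
    d = p_a + p_b - p_union   # departure from additivity; 0 when coherent
    im = abs(d)               # i.e., sqrt((P(A) + P(B) - P(A u B))**2)
    clip = lambda x: min(1.0, max(0.0, x))
    # Distribute the difference equally among the three judgments.
    return im, (clip(p_a - d / 3.0), clip(p_b - d / 3.0), clip(p_union + d / 3.0))

# e.g., approx_coherentize(0.6, 0.3, 0.8) returns IM = 0.1 and the
# coherentized triple (approximately 0.567, 0.267, 0.833), matching the
# worked illustration given earlier in this section.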
In general, the findings of this research are applicable in any situation where expert probability forecasts are aggregated. In particular, the findings could be used for aggregating the responses of multiple experts for use within a shared model. For example, in the Bayesian network case model of Karvetski et al. (2013), each conditional probability distribution for an arc requires elicitations over two or three mutually exclusive and exhaustive states in a probability space (that match with A, B, and A ∪ B), and, in total, 115 probability judgments are needed. The judgments could be elicited in a manner similar to study 1.
5.4. Future Work
We recognize that our findings are contingent on the three performance measures that were used (average Brier score, number of questions on which one method improved on another, and average absolute distance of improvement), and that if additional performance measures (e.g., averaged log score, correlation, slope) were used, they could change the degree (though likely not the direction) of preference among the methods.
One factor that we explicitly changed across the
two studies was the independence of the intrapartic-
ipant judgments. For the first study, we spaced the
related judgments out across the 60 questions, and for
the second study, the related judgments were elicited
consecutively. However, if a decision maker seeks a
forecast for one key event, it is not feasible to gener-
ate 59 other statements to get independent estimates.
Future work might address how far apart to tempo-
rally space the related subjective probability elicita-
tions to realize optimal forecasting gains, and how
much of a decrement in gain accrues as the window
size is reduced. This type of analysis might paral-
lel the type undertaken in studying optimal spacing
between test sessions to enhance learning (Rohrer and
Pashler 2007).
For some risk and decision analyses, experts would
have at least some training in probability biases as
well as formal elicitation methods, and best prac-
tice would include having an analyst interview them
and work with each expert one-on-one. Many expert
settings are clearly different from the circumstances
under which the undergraduates in our studies pro-
vided their probabilities. Future research could look
at how our results apply to real forecasting questions
with real experts, such as pundits’ forecasts of election
outcomes and other key events that face policymak-
ers. Part of this research would investigate if the gains
observed within our studies would still be achiev-
able, and, if so, why. For example, in our studies,
we were not incentivizing performance. Some par-
ticipants likely took the survey more seriously than
others, and it is possible that variation in incoher-
ence was associated with experimental vigilance by
the participant. Future work could investigate this
potential relationship by screening for different lev-
els of vigilance. For instance, Oppenheimer et al.
(2009) developed a simple one-response task aimed
at detecting whether a participant is an experimental
satisficer (namely, one who does not follow the task
instructions properly because of a purported lack of
cognitive effort devoted to the task). This task could
be used to create low- and high-vigilance groups in
future research. As well, even with experts, future
work could investigate the utility of the approach
when time is limited and working memory is accord-
ingly taxed (Sprenger et al. 2011).
Future work could continue to investigate how
many forecasters and what degree of coherence are
needed to efficiently generate gains. There will likely
be diminishing performance gains as these values
increase (Winkler and Clemen 2004). Alternatively,
future work could investigate the calibration of the
coherence-weighted estimates, in an effort to better
understand when to push the aggregate estimates
outside of the convex combination of forecast val-
ues, or when to weight forecasters with negative
weights. Finally, we suggest comparing coherence
weighting with Cooke’s classical weighting method
(Cooke 1991), both in terms of accuracy and ease of
implementation. This would allow us to examine the
relative utility of coherence weighting for forecasts of
continuous variables.
Acknowledgments
The authors acknowledge the helpful comments from re-
searchers within the IARPA ACE program. The U.S. and
Canadian governments are authorized to reproduce and
distribute reprints for governmental purposes notwith-
standing any copyright annotation thereon. The views
and conclusions contained herein are those of the authors
and should not be interpreted as necessarily represent-
ing the official policies or endorsements, either expressed
or implied, of IARPA, DoI/NBC, or the Canadian or
U.S. governments. This research was supported in part
by the IC Postdoctoral Research Fellowship Program, by
the Natural Sciences and Engineering Research Council of
Canada Discovery Grant [249537-2007], and by the Intel-
ligence Advanced Research Projects Activity (IARPA) via
Department of Interior National Business Center [Contract
D11PC20062].
Appendix A. Additional Statements Evaluated by
Participants
Additional examples of A statements used in the two studies, with truth values in parentheses (complete list available from the corresponding author).
• In the Earth's solar system, Mars is the fifth planet from the sun (F).
• Michelangelo painted the Sistine Chapel (T).
• Hydrogen is the first element listed in the periodic table (T).
• As of 2008, Nebraska is the top corn-producing state in the United States (F).
• In terms of 2011 population, Manhattan is the largest of the five New York City boroughs (F).
• Volvo is a Swedish car manufacturer (T).
• Massachusetts was the first state admitted to the United States (F).
• The Pacific Ocean is the largest of Earth's oceans (T).
• The United States won the most total medals in the 2008 Beijing Olympics (T).
• Richard Nixon was the 37th president of the United States (T).
• The average annual rainfall in Seattle is between 30 and 40 inches (T).
• Melbourne is the capital of Australia (F).
Appendix B. Comparing Numeracy and Coherence
Within Study 2
In study 2, we examined the role of numeracy in the accu-
racy and coherence of participants’ judgments. Numeracy is
the ability of people to use numeric information and to rea-
son with numerical concepts, and it has been shown to
vary greatly across individuals (Peters et al. 2007). We were
unsure whether more numerate people would make more
coherent or more accurate estimates for the general knowl-
edge statements we used. The most well-developed test of
numeracy is a series of mathematical problems varying in
their difficulty (Weller et al. 2012). This test produces a
roughly normal distribution of scores among a general pop-
ulation by primarily asking questions about probabilities,
and we chose to use it for measuring participants’ numer-
acy after they completed all other survey items in study 2.
Out of eight questions, the average numeracy score in study 2 was 4.32; scores ranged from one to seven among participants. Participants' probability estimates y^i_1 were weighted by their numeracy similarly to how they had otherwise been weighted by their coherence, but here all estimates from a participant received the same weight. The highest numeracy score received the highest weight. Decreasing scores received decreasing weights provided by a power function for which the best parameter value was 6.16. Numeracy weighting of individuals' responses produced an improvement in the Brier score of 3.131% over the equal-weighted average of raw estimates (0.1957 versus 0.2020). The proportion of questions for which accuracy improved (34/60) was not statistically significant (t59 = 1.042, p = 0.151). The average absolute value of improvement also did not reach statistical significance (t59 = 1.648, p = 0.052).
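As a rough sketch, numeracy weighting can be implemented like the coherence weights, with one weight per participant. The power form below, normalized so the top score of 8 receives weight 1, is our assumption, since only the best-fitting exponent (6.16) is reported.

def numeracy_weight(score, exponent=6.16, max_score=8.0):
    # Higher numeracy -> higher weight; one weight per participant,
    # applied uniformly to all of that participant's estimates.
    # Assumed functional form; only the exponent comes from the text.
    return (score / max_score) ** exponent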
The small increase in accuracy from numeracy weighting of participants compared to coherence weighting is not surprising when we consider the weak correlation between the incoherence metric and numeracy scores (r = −0.147). Perhaps because the questions on the numeracy scale are worded as math problems and our questions are not, numeracy did not relate to how well people followed the rules of probability when responding to general knowledge statements in study 2. Whatever the reason, numeracy scores did not significantly correlate with participants' incoherence of estimates (t26 = −0.758, p = 0.226).
References
Brier GW (1950) Verification of forecasts expressed in terms of prob-
ability. Monthly Weather Rev. 78(1):1–3.
Bruine de Bruin W, Fischbeck PS, Stiber NA, Fischhoff B (2002)
What number is “fifty–fifty”?: Redistributing excessive 50%
responses in elicited probabilities. Risk Anal. 22(4):713–723.
Burgman MA, McBride M, Ashton R, Speirs-Bridge A, Flander L,
Wintle B, Fidler F, Rumpff L, Twardy C (2011) Expert status
and performance. PLoS One 6(7):1–7.
Clemen RT (2008) Comment on Cooke’s classical method. Reliability
Engrg. System Safety 93:760–765.
Clemen RT, Winkler RL (1999) Combining probability distributions
from experts in risk analysis. Risk Anal. 19(2):187–203.
Cooke RM (1991) Experts in Uncertainty (Oxford University Press,
Oxford, UK).
Cooke RM, Goossens LLHJ (2008) TU Delft expert judgment data
base. Reliability Engrg. System Safety 93(5):657–674.
De Finetti B (1990) Theory of Probability: A Critical Introductory Treat-
ment (John Wiley & Sons, New York).
Genest C, McConway KJ (1990) Allocating the weights in the linear
opinion pool. J. Forecasting 9(1):53–73.
Gneiting T (2011) Making and evaluating point forecasts. J. Amer.
Statist. Assoc. 106(494):746–762.
Herzog SM, Hertwig R (2009) The wisdom of many in one mind:
Improving individual judgments with dialectical bootstrap-
ping. Psych. Sci. 20(2):231–237.
Hirt ER, Markman KD (1995) Multiple explanation: A consider-an-
alternative strategy for debiasing judgments. J. Personality Soc.
Psych. 69(6):1069–1086.
Hsee C (1996) The evaluability hypothesis: An explanation for
preference reversals between joint and separate evaluations
of alternatives. Organ. Behav. Human Decision Processes 67(1):
247–257.
Karvetski CW, Olson KC, Gantz DT, Cross GA (2013) Structur-
ing and analyzing competing hypotheses with Bayesian net-
works for intelligence analysis. EURO J. Decision Processes,
ePub ahead of print April 30, http://link.springer.com/article/
10.1007/s40070-013-0001-x.
Kolmogorov A (1956) Foundations of the Theory of Probability
(Chelsea Publishing Company, New York).
Larrick RP, Soll JB (2006) Intuitions about combining opinions:
Misappreciation of the averaging principle. Management Sci.
52(1):111–127.
Lindley DV, Tversky A, Brown RV (1979) On the reconciliation of
probability assessments. J. Roy. Statist. Soc. 142(2):146–180.
Lord CG, Lepper MR, Preston E (1984) Considering the opposite:
A corrective strategy for social judgment. J. Personality Soc.
Psych. 47(6):1231–1243.
Macchi L, Osherson D, Krantz DH (1999) A note on superadditive
probability judgment. Psych. Rev. 106(1):210–214.
Mandel DR (2005) Are risk assessments of a terrorist attack coher-
ent? J. Experiment. Psych.: Appl. 11(4):277–288.
Mandel DR (2008) Violations of coherence in subjective probability:
A representational and assessment process account. Cognition
106(1):130–156.
Merrick JRW (2008) Getting the right mix of experts. Decision Anal.
5(1):43–52.
Murphy AH (1973) A new vector partition of the probability score.
J. Appl. Meteorology 12(4):595–600.
Oppenheimer DM, Meyvis T, Davidenko N (2009) Instructional
manipulation checks: Detecting satisficing to increase statistical
power. J. Experiment. Soc. Psych. 45(4):867–872.
Osherson D, Vardi MY (2006) Aggregating disparate estimates of
chance. Games Econom. Behav. 56(1):148–173.
Pate-Cornell ME (1996) Uncertainties in risk analysis: Six levels of
treatment. Reliability Engrg. Systems Safety 54(2–3):95–111.
Peters E, Dieckmann NF, Dixon A, Hibbard JH, Mertz CK (2007)
Less is more in presenting quality information to consumers.
Medical Care Res. Rev. 64(2):169–190.
Predd JB, Osherson DN, Kulkarni SR, Poor HV (2008) Aggregating
probabilistic forecasts from incoherent and abstaining experts.
Decision Anal. 5(4):177–189.
Rohrer D, Pashler H (2007) Increasing retention without increasing
study time. Current Directions Psych. Sci. 16(4):183–186.
Sharp GL, Cutler BL, Penrod SD (1988) Performance feedback
improves the resolution of confidence judgments. Organ. Behav.
Human Decision Processes 42(3):271–283.
Sieck WR, Merkle EC, Van Zandt T (2007) Option fixation: A cogni-
tive contributor to overconfidence. Organ. Behav. Human Deci-
sion Processes 103(1):68–83.
Sprenger AM, Dougherty MR, Atkins SM, Franco-Watkins AM,
Thomas RP, Lange N, Abbs B (2011) Implications of cognitive
load for hypothesis generation and probability judgment. Fron-
tiers Psych. 2(129):1–15.
Surowiecki J (2005) The Wisdom of Crowds (Doubleday, New York).
Tetlock PE (2005) Expert Political Judgment: How Good Is It? How Can
We Know? (Princeton University Press, Princeton, NJ).
Tsai J, Kirlik A (2012) Coherence and correspondence competence:
Implications for elicitation and aggregation of probabilistic
forecasts of world events. Proc. Human Factors and Ergonomics
Soc. 56th Annual Meeting (Sage, Thousand Oaks, CA), 313–317.
Tversky A, Kahneman D (1974) Judgment under uncertainty:
Heuristics and biases. Science 185(4157):1124–1131.
Tversky A, Koehler DJ (1994) Support theory: A nonexten-
sional representation of subjective probability. Psych. Rev.
101(4):547–567.
Vul E, Pashler H (2008) Measuring the crowd within: Probabilistic
representations within individuals. Psych. Sci. 19(7):645–647.
Wang G, Kulkarni SR, Poor HV, Osherson DN (2011) Aggregat-
ing large sets of probabilistic forecasts by weighted coherent
adjustment. Decision Anal. 8(2):128–144.
Weller JA, Dieckmann NF, Tusler M, Mertz CK, Burns WJ, Peters E
(2012) Development and testing of an abbreviated numeracy
scale: A Rasch analysis approach. J. Behavioral Decision Making
26(2):198–212.
Williams JJ, Mandel DR (2007) Do evaluation frames improve the
quality of conditional probability judgment? McNamara DS,
Trafton JG, eds. Proc. 29th Annual Meeting of the Cognitive
Sci. Soc. (Lawrence Erlbaum Associates, Inc., Mahwah, NJ),
1653–1658.
Winkler RL, Clemen RT (2004) Multiple experts vs. multiple
methods: Combining correlation assessments. Decision Anal.
1(3):167–176.
Wright G, Rowe G, Bolger F, Gammack J (1994) Coherence, cal-
ibration, and expertise in judgmental probability forecasting.
Organ. Behav. Human Decision Processes 57(1):1–25.
Yaniv I, Yates JF, Smith JEK (1991) Measures of discrimination skill
in probabilistic judgment. Psych. Bull. 110(3):611–617.
Christopher W. Karvetski is a quantitative finance ana-
lyst within the Model Validation and Analytics Group,
Financial Intelligence Unit at Bank of America, where his
focus is on developing models for detecting and mitigating
financial crimes risk, including money laundering, terrorist
financing, and economic sanctions risk. Previously, he was
an Intelligence Community Research Fellow and assistant
professor in the Applied Information Technology Depart-
ment at George Mason University in Fairfax, Virginia. His
main interests are in risk and decision analysis, and predic-
tive modeling and analytics, with a focus on defense and
security applications. He holds a Ph.D. in systems engineer-
ing from the University of Virginia. He is a member of the
Society for Risk Analysis, the Decision Analysis Society, and
Tau Beta Pi.
Kenneth C. Olson is an assistant professor in the Vol-
genau School of Engineering at George Mason University.
He earned his Ph.D. in quantitative psychology from The
Ohio State University. He completed a fellowship with the
Intelligence Community Postdoctoral Research Fellowship
Program. He is most knowledgeable about social, cognitive,
and motor processes in decision making and has extensive
experience educating both the public and professionals on
common judgment errors. His teaching and research pro-
vide statistical modeling tools to non-statisticians, includ-
ing work to develop a structured analytic technique to
implement Bayesian networks for intelligence assessments.
In recent years, he has collaborated on several projects to
improve international political forecasting. He is a member
of the Society for Mathematical Psychology, the Cognitive
Science Society, the Society for Risk Analysis, and the Soci-
ety for Judgment and Decision Making.
David R. Mandel is a senior scientist in the Sensemak-
ing and Decision Group of the Socio-Cognitive Systems
Section at DRDC Toronto, which is part of the Govern-
ment of Canada’s Department of National Defence. He is
also adjunct professor of psychology at York University.
He holds a Ph.D. in psychology from the University of
British Columbia. His research focuses on basic and applied
topics in judgment and decision making, with particular
emphasis on the application of such work to the defense
and security sector. He has served as a scientific advi-
sor to such organizations as The National Academies,
The National Institutes of Health, The Office of the Director
of National Intelligence, the U.S. Department of Defense,
and NATO. His books include The Psychology of Counterfac-
tual Thinking (Routledge 2005) and Neuroscience of Decision
Making (Psychology Press 2011).
Charles R. Twardy is a research assistant professor at
George Mason University where he leads the SciCast fore-
casting project, a four-year effort to enhance the accuracy,
calibration, and timeliness of crowdsourced forecasts of
(previously) geopolitical events and (now) science and tech-
nology. His work focuses on fusing Bayesian networks with
prediction markets for crowdsourced elicitation of large
joint probability spaces. More generally he is interested
in inference and decision making with a special interest
in causal models. Previous work includes counter-IED mod-
els, credibility models, sensor selection, hierarchical fusion,
and epidemiological models. He holds a dual Ph.D. in his-
tory and philosophy of science and cognitive science from
Indiana University.
... A secondary aim of Study 1 was to explore the relationship between individuals' evaluation of alternative hypotheses and their judgmental coherence and cognitive reflection. Some have argued that intelligence organizations ought to recruit and select the 'right kind' of individuals (e.g., Dhami & Mandel, 2021;Karvetski et al., 2013;Mellers et al., 2015aMellers et al., , 2015b. In order to be coherent, judgments of the likelihood of (mutually exclusive) hypotheses ought to sum to unity, but some individuals have been shown to be nonadditive; either demonstrating superadditivity (sum to less than unity) or subadditivity (sum to more than unity; Rottenstreich & Tversky, 1997;Tversky & Koehler, 1994; for nonadditivity in analyst samples see Mandel, 2015;. ...
... Similarly, some individuals appear to have greater ability to reflect on a decision problem and refrain from providing the first response that comes to mind while others have less ability to do so (Frederick, 2005; see also Campitelli & Gerrans, 2014;Pennycook et al., 2016). Both judgmental coherence and cognitive reflection have been shown to be positively related to performance on a range of judgment and decision-making tasks (e.g., Baron et al., 2015;Campitelli & Labollita, 2010;Fan et al., 2019;Frederick, 2005;Karvetski et al., 2013;Mellers et al., 2015aMellers et al., , 2015bMoritz et al., 2013;Pajala, 2019;Toplak et al., 2011). These two individual difference measures may therefore be useful in analyst selection. ...
... Others have previously found cognitive reflection to be positively related to performance on a range of judgment and decision-making tasks (e.g., Baron et al., 2015;Campitelli & Labollita, 2010;Frederick, 2005;Mellers et al., 2015a;Moritz et al., 2013;Pajala, 2019;Toplak et al., 2011). The CRT may therefore be a useful tool for analyst selection, in addition to the other individual difference measures such as intelligence, open-minded thinking and numeracy that have been previously suggested (e.g., Karvetski et al., 2013;Mellers et al., 2015aMellers et al., , 2015b. However, it is necessary to further explore the positive relationship between CRT and strategy use identified in Study 1 for two reasons. ...
Article
Full-text available
We empirically examined the effectiveness of how the Analysis of Competing Hypotheses (ACH) technique structures task information to help reduce confirmation bias (Study 1) and the portrayal of intelligence analysts as suffering from such bias (Study 2). Study 1 ( N = 161) showed that individuals presented with hypotheses in rows and evidence items in columns were significantly less likely to demonstrate confirmation bias, whereas those presented with the ACH-style matrix (with hypotheses in columns and evidence items in rows) or a paragraph of text (listing the evidence for each hypothesis) were not less likely to demonstrate bias. The ACH-style matrix also did not confer any benefits regarding increasing sensitivity to evidence credibility. Study 2 showed that the majority of 62 Dutch military analysts did not suffer from confirmation bias and were sensitive to evidence credibility. Finally, neither judgmental coherence nor cognitive reflection differentiated between better or worse performers in the hypotheses evaluation tasks.
... While statistical models are usually limited in applicability by requiring sufficiently large and complete data sets, human forecasts can overcome this limitation taking advantage of human experience and intuition (Clemen, 1989;Clemen & Winkler, 1986;Genest & Zidek, 1986). The probability estimates can be given either as forecasts of events, e.g., rain probabilities in meteorological science or probabilities for the outcomes of geopolitical events such as elections (Graefe, 2018;Turner et al., 2014), other binary classifications, or the quantification of the experts' confidence on a prediction or the answer to a specific question (Karvetski et al., 2013;Prelec et al., 2017). ...
... The ACE-IDEA data set (Hanea et al., 2021) includes forecasts on 155 events, but on average each forecaster only replied to about 19 queries. Other data sets consist of less queries to be predicted or answered (Graefe, 2018;Hanea et al., 2021;Karvetski et al., 2013;Prelec et al., 2017), of which the highest number of queries is about 80 (Prelec et al., 2017). However, 80 answers per forecaster is still a small number for modeling the forecasters' behavior, particularly if we divide the data into training and test sets and model the answers to true/false queries separately. ...
... As performance measures, we consider Brier score, 0-1 loss, and mean absolute error. The Brier score (Brier, 1950) is a popular metric for quantifying human forecast performance, used by e.g., Karvetski et al. (2013), Turner et al. (2014), Hanea et al. (2021), and Satopää (2022). It is a strictly proper scoring rule (Murphy, 1973), meaning that it is optimized when people report their true beliefs of the probability instead of intentionally providing more or less extreme probabilities. ...
Article
Full-text available
Combining experts’ subjective probability estimates is a fundamental task with broad applicability in domains ranging from finance to public health. However, it is still an open question how to combine such estimates optimally. Since the beta distribution is a common choice for modeling uncertainty about probabilities, here we propose a family of normative Bayesian models for aggregating probability estimates based on beta distributions. We systematically derive and compare different variants, including hierarchical and non-hierarchical as well as asymmetric and symmetric beta fusion models. Using these models, we show how the beta calibration function naturally arises in this normative framework and how it is related to the widely used Linear-in-Log-Odds calibration function. For evaluation, we provide the new Knowledge Test Confidence data set consisting of subjective probability estimates of 85 forecasters on 180 queries. On this and another data set, we show that the hierarchical symmetric beta fusion model performs best of all beta fusion models and outperforms related Bayesian fusion models in terms of mean absolute error.
... Statistical procedures that modify judgments to adhere to probabilistic axioms and approximate a coherent set have also increased accuracy (Osherson & Vardi, 2006). Karvetski, Olson, Mandel, and Twardy (2013) illustrated how estimated probabilities can be made coherent by using quadratic programming to transform incoherent subjective judgments (in their case from a sample of undergraduate psychology students) to a set of coherent estimates that respect the Kolmogorov's axioms. This produced a person-and forecast event-specific measure of incoherence. ...
... Weights are also the focus of the work by Collins et al. (2024) who take a close look at coherencebased weights (Karvetski et al., 2013). Coherence describes the degree to which one's judgments follow the rules, or axioms, of probability theory: nonnegativity, unitarity, and additivity. ...
... Based on the insight that respondents report specious 50 percent responses, Bruine de Bruin et al. (2002) redistributed excess 50% responses to points in the probability distribution where responses were anomalously infrequent relative to a smooth distribution. Karvetski et al. (2013) and Fan et al. (2019) employ a coherentization procedure. If a respondent reported two complementary probabilities that sum to more than one then coherentization imposes a sum of one and rescales each individual probability accordingly. ...
Article
Full-text available
A standard way to elicit expectations asks for the percentage chance an event will occur. Previous research demonstrates noise in reported percentages. The current research models a bias; a five percentage point change in reported probabilities implies a larger change in beliefs at certain points in the probability distribution. One contribution of my model is that it can parse bias in beliefs from biases in reports. I reconsider age and gender differences in Subjective Survival Probabilities (SSPs). These are generally interpreted as differences in survival beliefs, e.g., that males are more optimistic than females and older respondents are more optimistic than younger respondents. These demographic differences (in the English Longitudinal Study of Ageing) can be entirely explained by reporting bias. Older respondents are no more optimistic than younger respondents and males are no more optimistic than females. Similarly, in forecasting, information is obscured by taking reported percentages at face value. Accounting for reporting bias thus better exploits the private information contained in reports. Relative to a face-value specification, a specification that does this delivers improved forecasts of mortality events, raising the pseudo R-squared from less than 3 percent to over 6 percent.
Article
It is intuitive and theoretically sound to combine experts’ forecasts based on their proven skills, while accounting for correlation among their forecast submissions. Simpler combination methods, however, which assume independence of forecasts or equal skill, have been found to be empirically robust, in particular, in settings in which there are few historical data available for assessing experts’ skill. One explanation for the robust performance by simple methods is that empirical estimation of skill and of correlations introduces error, leading to worse aggregated forecasts than simpler alternatives. We offer a heuristic that accounts for skill and reduces estimation error by utilizing a common correlation factor. Our theoretical results present an optimal form for this common correlation, and we offer Bayesian estimators that can be used in practice. The common correlation heuristic is shown to outperform alternative combination methods on macroeconomic and experimental forecasting where there are limited historical data. This paper was accepted by Ilia Tsetlin, behavioral economics and decision analysis. Supplemental Material: The data file is available at https://doi.org/10.1287/mnsc.2021.02009 .
Chapter
Who is good at prediction? Addressing this question is key to recruiting and cultivating accurate crowds and effectively aggregating their judgments. Recent research on superforecasting has demonstrated the importance of individual, persistent skill in crowd prediction. This chapter takes stock of skill identification measures in probability estimation tasks, and complements the review with original analyses, comparing such measures directly within the same dataset. We classify all measures in five broad categories: (1) accuracy-related measures, such as proper scores, model-based estimates of accuracy and excess volatility scores; (2) intersubjective measures, including proxy, surrogate and similarity scores; (3) forecasting behaviors, including activity, belief updating, extremity, coherence, and linguistic properties of rationales; (4) dispositional measures of fluid intelligence, cognitive reflection, numeracy, personality and thinking styles; and (5) measures of expertise, including demonstrated knowledge, confidence calibration, biographical, and self-rated expertise. Among non-accuracy-related measures, we report a median correlation coefficient with outcomes of r = 0.20. In the absence of accuracy data, we find that intersubjective and behavioral measures are most strongly correlated with forecasting accuracy. These results hold in a LASSO machine-learning model with automated variable selection. Two focal applications provide context for these assessments: long-term, existential risk prediction and corporate forecasting tournaments.KeywordsForecastingPredictionCrowdsourcingSkill assessment
Article
Full-text available
Previous research shows that variation in coherence (i.e., degrees of respect for axioms of probability calculus), when used as a basis for performance-weighted aggregation, can improve the accuracy of probability judgments. However, many aspects of coherence-weighted aggregation remain a mystery, including both prescriptive issues (e.g., how best to use coherence measures) and theoretical issues (e.g., why coherence-weighted aggregation is effective). Using data from six experiments in two earlier studies (N = 58, N = 2,858) employing either general-knowledge or statistical information integration tasks, we addressed many of these issues. Of prescriptive relevance, we examined the effectiveness of coherence-weighted aggregation as a function of judgment elicitation method, group size, weighting function, and the bias of the function’s tuning parameter. Of descriptive relevance, we propose that coherence-weighted aggregation can improve accuracy via two distinct, task-dependent routes: a causal route in which the bases for scoring accuracy depend on conformity to coherence principles (e.g., Bayesian information integration) and a diagnostic route in which coherence serves as a cue to correct knowledge. The findings provide support for the efficacy of both routes, but they also highlight why coherence weighting, especially the most biased forms, sometimes imposes costs to accuracy. We conclude by sketching a decision–theoretic approach to how aggregators can sensibly leverage the wisdom of the coherent within the crowd.
Chapter
Full-text available
The benefits of judgment aggregation are intuitive and well-documented. By combining the input of several judges, practitioners may enhance information sharing and signal strength while cancelling out biases and noise. The resulting judgment is more accurate than the average accuracy of the individual judgments, a phenomenon known as the wisdom of crowds. Although an unweighted arithmetic average is often sufficient to improve judgment accuracy, sophisticated performance-weighting methods have been developed to further improve accuracy. By weighting the judges according to: (1) past performance on similar tasks, (2) performance on closely related tasks, and/or (3) the internal consistency (or coherence) of judgments, practitioners can exploit individual differences in probabilistic judgment skill to ferret out bona fide experts within the crowd. Each method has proven useful, with associated benefits and potential drawbacks. In this chapter, we review the evidence for and against performance-weighting strategies, discussing the circumstances in which they are appropriate and beneficial to apply. We describe how to implement these methods, with a focus on mathematical functions and formulas that translate performance metrics into aggregation weights. Keywords: Judgment, Weighted aggregation, Accuracy, Correspondence
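As an example of the kind of function that translates performance metrics into aggregation weights, the sketch below exponentially downweights judges with worse past Brier scores. The exponential form and the sharpness parameter beta are one common choice, assumed here for illustration rather than taken from the chapter.

```python
import numpy as np

def performance_weights(past_briers, beta=5.0):
    """Turn past Brier scores into aggregation weights (better past
    performance -> larger weight). beta tunes how aggressive the
    weighting is; beta = 0 recovers the unweighted average."""
    past_briers = np.asarray(past_briers, float)
    w = np.exp(-beta * past_briers)
    return w / w.sum()

briers = [0.10, 0.25, 0.40]                  # judge 0 has the best track record
w = performance_weights(briers)
forecasts = np.array([0.8, 0.6, 0.3])
print(w, float(w @ forecasts))               # weighted aggregate forecast
```

The same template accommodates the chapter's three weighting bases: the score fed into the function can come from similar past tasks, closely related tasks, or a coherence metric.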
Article
In group decision-making (GDM) problems, decision makers may prefer to divide the alternatives into two preference-ordered categories, which is called a 2-rank GDM problem. In the process of 2-rank GDM, consensus is of great significance for the aggregation of individual opinions, in which the experts' weights play a key role in the consensus level among experts, and different expert weights may lead to different 2-rank consensus results. Thus, a coordinator may strategically set experts' weights to attain a desired consensus level, which we call strategic experts' weight manipulation in 2-rank consensus reaching in GDM. In this study, we first introduce the concept of a 2-rank consensus level range, then construct mixed 0–1 linear programming models to compute the strategic experts' weights that achieve the coordinator's desired 2-rank consensus level, and analyze the models' properties. Finally, we present a numerical example and a set of simulation experiments to validate the effectiveness of the proposed models, and to show how the numbers of experts and alternatives, respectively, affect the process of strategic experts' weight manipulation in 2-rank consensus reaching in GDM.
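The paper's mixed 0–1 models encode category assignments with binary variables; as a loose, continuous stand-in, the sketch below uses an ordinary linear program to show the flavor of strategic weight-setting: the coordinator picks expert weights as close to uniform as possible while forcing one alternative's weighted score above another's by a margin. All scores, the margin, and the objective are invented for illustration and are not the paper's formulation.

```python
import numpy as np
from scipy.optimize import linprog

scores_A = np.array([0.6, 0.3, 0.8])   # each expert's evaluation of alternative A
scores_B = np.array([0.5, 0.7, 0.4])   # ... and of alternative B
n, margin = len(scores_A), 0.05

# Variables: x = [w_1..w_n, t]; minimize t, the max deviation from uniform weights.
c = np.r_[np.zeros(n), 1.0]
A_ub, b_ub = [], []
for i in range(n):                      # w_i - 1/n <= t  and  1/n - w_i <= t
    row = np.zeros(n + 1); row[i], row[-1] = 1.0, -1.0
    A_ub.append(row); b_ub.append(1.0 / n)
    row = np.zeros(n + 1); row[i], row[-1] = -1.0, -1.0
    A_ub.append(row); b_ub.append(-1.0 / n)
A_ub.append(np.r_[scores_B - scores_A, 0.0])   # w.(A - B) >= margin
b_ub.append(-margin)
A_eq = [np.r_[np.ones(n), 0.0]]                # weights sum to one
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * n + [(0, None)])
print(res.x[:n])                               # manipulated expert weights
```

The optimizer shifts weight toward the expert who favors A most strongly, which is exactly the manipulation lever the paper formalizes (with binary variables handling the 2-rank category assignments).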
Article
Full-text available
Proposes that several biases in social judgment result from a failure to consider possibilities at odds with beliefs and perceptions of the moment. Individuals who are induced to consider the opposite position, therefore, should display less bias in social judgment. In two experiments with 150 undergraduates, this reasoning was applied to two domains: biased assimilation of new evidence on social issues and biased hypothesis testing of personality impressions. Subjects were induced to consider the opposite through explicit instructions to do so and through stimulus materials that made opposite possibilities more salient. In both experiments, the induction of a consider-the-opposite strategy had a greater corrective effect than more demand-laden alternative instructions to be as fair and unbiased as possible. Results are consistent with previous research on perseverance, hindsight, and logical problem solving, and they suggest an effective method of retraining social judgment.
Article
Full-text available
Expert judgements are essential when time and resources are stretched or we face novel dilemmas requiring fast solutions. Good advice can save lives and large sums of money. Typically, experts are defined by their qualifications, track record and experience [1,2]. The social expectation hypothesis argues that more highly regarded and more experienced experts will give better advice. We asked experts to predict how they will perform, and how their peers will perform, on sets of questions. The results indicate that the way experts regard each other is consistent, but unfortunately, ranks are a poor guide to actual performance. Expert advice will be more accurate if technical decisions routinely use broadly-defined expert groups, structured question protocols and feedback.
Article
Full-text available
Averaging estimates is an effective way to improve accuracy when combining expert judgments, integrating group members' judgments, or using advice to modify personal judgments. If the estimates of two judges ever fall on different sides of the truth, which we term bracketing, averaging must outperform the average judge for convex loss functions, such as mean absolute deviation (MAD). We hypothesized that people often hold incorrect beliefs about averaging, falsely concluding that the average of two judges' estimates would be no more accurate than the average judge. The experiments confirmed that this misconception was common across a range of tasks that involved reasoning from summary data (Experiment 1), from specific instances (Experiment 2), and conceptually (Experiment 3). However, this misconception decreased as observed or assumed bracketing rate increased (all three studies) and when bracketing was made more transparent (Experiment 2). Experiment 4 showed that flawed inferential rules and poor extensional reasoning abilities contributed to the misconception. We conclude by describing how people may face few opportunities to learn the benefits of averaging and how misappreciating averaging contributes to poor intuitive strategies for combining estimates.
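The bracketing argument is easy to verify with arithmetic. The sketch below (numbers invented) shows that when two estimates fall on opposite sides of the truth, the average of the estimates strictly beats the average judge in mean absolute deviation, and that without bracketing it merely ties.

```python
truth = 100.0

# Bracketing case: estimates straddle the truth.
judge_a, judge_b = 90.0, 106.0
avg_judge_mad = (abs(judge_a - truth) + abs(judge_b - truth)) / 2   # (10 + 6) / 2 = 8.0
averaging_mad = abs((judge_a + judge_b) / 2 - truth)                # |98 - 100|    = 2.0
print(avg_judge_mad, averaging_mad)   # averaging strictly wins under bracketing

# No bracketing: both estimates on the same side of the truth.
judge_a, judge_b = 90.0, 96.0
print((abs(judge_a - truth) + abs(judge_b - truth)) / 2,            # 7.0
      abs((judge_a + judge_b) / 2 - truth))                         # 7.0 (a tie)
```

This is the intuition behind the paper's claim: because averaging never does worse and sometimes does strictly better under convex loss, believing it merely matches the average judge is a misconception.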
Article
Full-text available
Intelligence analysis often tackles questions shrouded by deep uncertainty, such as those that deal with chemical and biological terrorism or nuclear weapon detection. In dealing with such questions, the task falls on intelligence analysts to assemble collected items of information and determine the consistency of the body of reporting with a set of conflicting hypotheses. One popular procedure within the Intelligence Community for distinguishing a hypothesis that is “least inconsistent” with evidence is analysis of competing hypotheses (ACH). Although ACH aims at reducing confirmation bias, as typically implemented, it can fall short in diagramming the relationships between hypotheses and items of evidence, determining where assumptions fit into the modeling framework, and providing a suitable model for “what-if” sensitivity analysis. This paper describes a facilitated process that uses Bayesian networks to (1) provide a clear probabilistic characterization of the uncertainty associated with competing hypotheses, and (2) prioritize information gathering among the remaining unknowns. We illustrate the process using the 1984 Rajneeshee bioterror attack in The Dalles, Oregon, USA.
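As a toy illustration of the Bayesian alternative to ACH's consistency counting, the sketch below scores two competing hypotheses against items of evidence with Bayes' rule, assuming conditional independence of the items. The hypotheses, priors, and likelihoods are invented for illustration and are not drawn from the paper's Rajneeshee case model.

```python
priors = {"attack": 0.05, "natural_outbreak": 0.95}

# P(evidence item | hypothesis), assuming conditional independence of items.
likelihoods = {
    "attack":           {"single_restaurant_cluster": 0.6, "common_strain": 0.7},
    "natural_outbreak": {"single_restaurant_cluster": 0.2, "common_strain": 0.4},
}

def posterior(evidence):
    """Posterior over hypotheses given a list of observed evidence items."""
    unnorm = {}
    for h, prior in priors.items():
        p = prior
        for e in evidence:
            p *= likelihoods[h][e]      # multiply in each item's likelihood
        unnorm[h] = p
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

print(posterior(["single_restaurant_cluster", "common_strain"]))
```

Unlike a least-inconsistent tally, the output is a probability over hypotheses, and sensitivity analysis amounts to rerunning the function with perturbed priors or likelihoods, which is the "what-if" capability the paper emphasizes.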
Article
Many decisions are based on beliefs concerning the likelihood of uncertain events such as the outcome of an election, the guilt of a defendant, or the future value of the dollar. Occasionally, beliefs concerning uncertain events are expressed in numerical form as odds or subjective probabilities. The subjective assessment of probability resembles the subjective assessment of physical quantities such as distance or size: these judgments are all based on data of limited validity, which are processed according to heuristic rules. For example, apparent distance is judged in part by an object's clarity, and reliance on this rule leads to systematic errors in the estimation of distance. This chapter describes three heuristics that are employed in making judgments under uncertainty. The first is representativeness, which is usually employed when people are asked to judge the probability that an object or event belongs to a class or process. The second is the availability of instances or scenarios, which is often employed when people are asked to assess the frequency of a class or the plausibility of a particular development, and the third is adjustment from an anchor, which is usually employed in numerical prediction when a relevant value is available. In general, these heuristics are quite useful, but sometimes they lead to severe and systematic errors.
Article
One potentially useful concept that arises in the elicitation and aggregation of probabilistic forecasts is Hammond's (1996) distinction between coherence and correspondence. A study was conducted to test the commonly held assumption that coherence competency, a judge's ability to reason correctly according to the prescriptions demanded by the problem, directly yields correspondence competency, a judge's ability to predict the outcome that actually happens in the external world. The role of a visualization aid in terms of moderating these effects was also examined. Participants who were knowledgeable baseball fans predicted the probability with which their favored team would win the 2011 Major League Baseball World Series, giving a prior probability shortly before the start of the Series, and then sequentially updating their answer as the individual games unfolded over time. Results show that for participants using the visualization, their ability to update probabilities according to the dictates of Bayes' Theorem was correlated with their ability to predict the winner of the 2011 MLB World Series, a desirable property that allows for estimation of judges' outcome performance based on more readily available process information.
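For context, a coherent benchmark for this updating task can be computed by recursion over the remaining games of a best-of-seven series. The sketch below assumes a fixed 55% per-game win probability for the favored team; that figure and the function name are illustrative assumptions, not values from the study.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob(wins_needed_us, wins_needed_them, p_game):
    """P(our team clinches a best-of-7 first), given per-game win prob p_game."""
    if wins_needed_us == 0:
        return 1.0
    if wins_needed_them == 0:
        return 0.0
    return (p_game * series_win_prob(wins_needed_us - 1, wins_needed_them, p_game)
            + (1 - p_game) * series_win_prob(wins_needed_us, wins_needed_them - 1, p_game))

# Coherent updating as the series unfolds:
print(series_win_prob(4, 4, 0.55))   # prior, before game 1
print(series_win_prob(3, 4, 0.55))   # after winning game 1
print(series_win_prob(4, 3, 0.55))   # after losing game 1
```

Comparing a fan's sequence of judged probabilities against this recursion is one way to quantify the coherence competency the study relates to outcome prediction.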
Book
Using structured expert judgment
Article
Research has demonstrated that individual differences in numeracy may have important consequences for decision making. In the present paper, we develop a shorter, psychometrically improved measure of numeracy—the ability to understand, manipulate, and use numerical information, including probabilities. Across two large independent samples that varied widely in age and educational level, participants completed 18 items from existing numeracy measures. In Study 1, we conducted a Rasch analysis on the item pool and created an eight-item numeracy scale that assesses a broader range of difficulty than previous scales. In Study 2, we replicated this eight-item scale in a separate Rasch analysis using data from an independent sample. We also found that the new Rasch-based numeracy scale, compared with previous measures, could predict decision-making preferences obtained in past studies, supporting its predictive validity. In Study 3, we further established the predictive validity of the Rasch-based numeracy scale. Specifically, we examined the associations between numeracy and risk judgments, compared with previous scales. Overall, we found that the Rasch-based scale was a better linear predictor of risk judgments than prior measures. Moreover, this study is the first to present the psychometric properties of several popular numeracy measures across a diverse sample of ages and educational level. We discuss the usefulness and the advantages of the new scale, which we feel can be used in a wide range of subject populations, allowing for a clearer understanding of how numeracy is associated with decision processes.
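The Rasch (one-parameter logistic) model underlying the scale has a compact form: the probability of answering an item correctly depends only on the gap between person ability and item difficulty. The sketch below illustrates why items spanning a broad difficulty range discriminate across ability levels; the ability and difficulty values are invented, not the scale's calibrated parameters.

```python
import math

def rasch_p_correct(theta, b):
    """Rasch (1PL) model: probability that a person of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Easy, medium, and hard items across low, average, and high numeracy:
for theta in (-1.0, 0.0, 1.5):
    probs = [rasch_p_correct(theta, b) for b in (-2.0, 0.0, 2.0)]
    print(theta, [round(p, 2) for p in probs])
```

A scale built only from easy items would show near-ceiling probabilities for all three ability levels, which is the measurement problem the broader-difficulty eight-item scale is designed to avoid.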