Article · Publisher preview available

The Effects of Averaging Subjective Probability Estimates Between and Within Judges

American Psychological Association
Journal of Experimental Psychology: Applied
Authors: Dan Ariely, Wing Tung Au, Randall H. Bender, David V. Budescu, Christiane B. Dietz, Hongbin Gu, Thomas S. Wallsten, and Gal Zauberman

Abstract

The average probability estimate of J > 1 judges is generally better than its components. Two studies test 3 predictions regarding averaging that follow from theorems based on a cognitive model of the judges and idealizations of the judgment situation. Prediction 1 is that the average of conditionally pairwise independent estimates will be highly diagnostic, and Prediction 2 is that the average of dependent estimates (differing only by independent error terms) may be well calibrated. Prediction 3 contrasts between- and within-subject averaging. Results demonstrate the predictions' robustness by showing the extent to which they hold as the information conditions depart from the ideal and as J increases. Practical consequences are that (a) substantial improvement can be obtained with as few as 2–6 judges and (b) the decision maker can estimate the nature of the expected improvement by considering the information conditions. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Journal of Experimental Psychology: Applied
2000, Vol. 6, No. 2, 130-147
Copyright 2000 by the American Psychological Association, Inc. 1076-898X/00/$5.00
DOI: 10.1037//1076-898X.6.2.130
The Effects of Averaging Subjective Probability Estimates Between and Within Judges

Dan Ariely, Massachusetts Institute of Technology
Wing Tung Au, The Chinese University of Hong Kong
Randall H. Bender, Research Triangle Institute
David V. Budescu, University of Illinois at Urbana-Champaign
Christiane B. Dietz, Hongbin Gu, and Thomas S. Wallsten, University of North Carolina
Gal Zauberman, Duke University
On many occasions, experts are required to provide decision makers or policymakers with subjective probability estimates of uncertain events (Morgan & Henrion, 1990). The extensive literature (e.g., Harvey, 1997; McClelland & Bolger, 1994) on the topic shows that in general, but with clear exceptions, subjective probability estimates are too extreme, implying overconfidence on the part of the judges. The theoretical challenge is to understand the conditions and the cognitive processes that lead to this overconfidence. The applied challenge is to figure out ways to obtain more realistic and useful estimates. The theoretical developments of Wallsten, Budescu, Erev, and Diederich (1997) provide one route to the applied goals, and they are the focus of this article.
Dan Ariely, School of Management, Massachusetts Institute of Technology, Boston, Massachusetts; Wing Tung Au, Department of Psychology, The Chinese University of Hong Kong, Hong Kong, China; Randall H. Bender, Statistics Research Division, Research Triangle Institute, Research Triangle Park, North Carolina; David V. Budescu, Department of Psychology, University of Illinois at Urbana-Champaign; Christiane B. Dietz, Hongbin Gu, and Thomas S. Wallsten, Department of Psychology, University of North Carolina; Gal Zauberman, Fuqua School of Business, Duke University.

The authorship is intentionally in alphabetical order; all authors contributed equally to this article. This research was supported by National Science Foundation Grants SBR-9632448 and SBR-9601281. We thank Peter Juslin and Anders Winman for generously sharing their data with us and Neil Bearden for comments on an earlier version of the article.

Correspondence concerning this article should be addressed to Thomas S. Wallsten, Department of Psychology, University of North Carolina, Chapel Hill, North Carolina 27599-3270. Electronic mail may be sent to tom.wallsten@unc.edu.
Specifically, this research tests three predictions regarding the consequences of averaging multiple estimates that an event will occur or is true. The predictions follow from two theorems proposed by Wallsten et al. (1997) and proved rigorously by Wallsten and Diederich (in press). They are based on idealizations that are unlikely to hold in the real world. If, however, the conditions are approximated, or if the predicted results are robust to departures from them, then the theorems are of considerable practical use. We next provide a brief overview of background material and then develop the predictions in more detail. We test them by reanalyzing data collected for other purposes and with an original experiment. We defer discussion of the practical and theoretical consequences to the final section.
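To make the two kinds of averaging concrete, here is a minimal sketch in Python (the data are simulated and the array shapes are illustrative assumptions, not the studies' materials): a between-judge average pools one estimate per event from each of J judges, whereas a within-judge average pools repeated estimates of the same events from a single judge.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical between-judge data: J = 4 judges each estimate the probability
# that each of 20 events is true (rows = judges, columns = events).
between = rng.uniform(0.0, 1.0, size=(4, 20))

# Hypothetical within-judge data: one judge estimates the same 20 events
# on 3 separate occasions (rows = occasions, columns = events).
within = rng.uniform(0.0, 1.0, size=(3, 20))

between_avg = between.mean(axis=0)   # one pooled estimate per event across judges
within_avg = within.mean(axis=0)     # one pooled estimate per event across occasions

print(between_avg[:5])
print(within_avg[:5])

Prediction 3 contrasts what can be expected from these two pooling schemes, because estimates from different judges and repeated estimates from the same judge share information to different degrees.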
Researchers have studied subjective probability estimation in two types of tasks. In the no-choice full-scale task, respondents provide an estimate from 0 to 1 (or from 0% to 100%) that statements or forecasts are or will be true. In the other, perhaps more common task, choice half-scale, respondents select one of two answers to a question and then give confidence estimates from 0.5 to 1.0 (or 50% to 100%) that they are correct. Instructions in both the choice and nonchoice paradigms generally limit respondents to categorical probability estimates in multiples of 0.1 (or of 10). When judges are not restricted to categorical responses, the estimates generally are gathered for purposes of analysis into categories corresponding to such multiples.

The graph of fraction correct in choice half-scale tasks, or of statements that are true in no-choice full-scale tasks, as a function of subjective probability category is called a calibration curve. The most common finding in general-knowledge or forecasting domains is that probability estimates are too extreme, which is interpreted as indicating overconfidence on the part of the judge.
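A calibration curve of this kind can be computed in a few lines. The sketch below (Python, with simulated estimates and outcomes rather than the studies' data) gathers full-scale estimates into the 0.1-wide categories mentioned above and reports the fraction of true statements in each category; a curve shallower than the diagonal, with fractions true less extreme than the category values, is the pattern usually read as overconfidence.

import numpy as np

rng = np.random.default_rng(1)

# Simulated no-choice full-scale task: 1,000 probability estimates in [0, 1]
# and binary outcomes (1 = statement true). The true probability is made
# less extreme than the stated estimate, so the judge is "too extreme."
estimates = rng.uniform(0.0, 1.0, size=1000)
outcomes = (rng.uniform(size=1000) < 0.25 + 0.5 * estimates).astype(int)

# Gather estimates into categories at multiples of 0.1 for analysis.
categories = np.round(estimates * 10) / 10

print("category    n   fraction true")
for c in np.unique(categories):
    mask = categories == c
    print(f"{c:8.1f} {mask.sum():5d}   {outcomes[mask].mean():.3f}")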
... Positive relationship between confidence and stimulus discriminability: confidence increases as stimulus discriminability increases (Baranski and Petrusic, 1998; Garrett, 1922; Vickers, 1979; Johnson, 1939; Vickers and Packer, 1982). Higher stimulus discriminability partitions the hypothesis space to more unevenly favor the correct response; thus, the hypothesis samples that are converted into evidence become more homogeneous with increasing discriminability. Resolution of confidence: there is a positive relationship between choice accuracy and confidence judgments (Baranski and Petrusic, 1998; Vickers, 1979; Garrett, 1922; Ariely et al., 2000; Johnson, 1939; Vickers and Packer, 1982). Optionally terminating an autocorrelated sampling process makes the correct responses faster and thereby yields greater confidence. ...
... This is in contrast to the SPRT: in the SPRT, confidence is unaffected by additional sampling because confidence is determined by a fixed decision threshold. The decreasing decision confidence of the ABS with an increasing number of samples allows it to capture four key empirical phenomena which are not accommodated by the SPRT described above: the positive relationship between confidence and the discriminability of the stimuli (Baranski & Petrusic, 1998; Vickers, 1979; Vickers & Packer, 1982; Figure 8A), the "resolution of confidence" effect (Ariely et al., 2000; Baranski & Petrusic, 1998; Garrett, 1922; Vickers, 1979; Vickers & Packer, 1982; Figure 8B), so-called "metacognitive inefficiency" (Shekhar & Rahnev, 2021a, 2021b; Figure 8C), and the complex relationship between RT and confidence (Baranski & Petrusic, 1998; Vickers & Packer, 1982; Figure 8D). ...
... Second, average confidence ratings tend to be higher for correct responses than for incorrect responses (e.g., Ariely et al., 2000; Baranski & Petrusic, 1998; Vickers, 1979, 2014; Vickers & Packer, 1982). This so-called "resolution-of-confidence" effect also holds true even if stimulus difficulty is held constant (Baranski & Petrusic, 1998) and even if choice and confidence are simultaneously elicited from participants (Kiani et al., 2014; Ratcliff & Starns, 2009; Van Zandt, 2000). ...
Article
Full-text available
Normative models of decision-making that optimally transform noisy (sensory) information into categorical decisions qualitatively mismatch human behavior. Indeed, leading computational models have only achieved high empirical corroboration by adding task-specific assumptions that deviate from normative principles. In response, we offer a Bayesian approach that implicitly produces a posterior distribution of possible answers (hypotheses) in response to sensory information. We assume, however, that the brain has no direct access to this posterior and can only sample hypotheses according to their posterior probabilities. Accordingly, we argue that the primary problem of normative concern in decision-making is integrating stochastic hypotheses, rather than stochastic sensory information, to make categorical decisions. This implies that human response variability arises mainly from posterior sampling rather than sensory noise. Because human hypothesis generation is serially correlated, hypothesis samples will be autocorrelated. Guided by this new problem formulation, we develop a new process, the Autocorrelated Bayesian Sampler (ABS), which grounds autocorrelated hypothesis generation in a sophisticated sampling algorithm. The ABS provides a single mechanism that qualitatively explains many empirical effects of probability judgments, estimates, confidence intervals, choice, confidence judgments, response times, and their relationships. Our analysis demonstrates the unifying power of a perspective shift in the exploration of normative models. It also exemplifies the proposal that the "Bayesian brain" operates using samples not probabilities, and that variability in human behavior may primarily reflect computational rather than sensory noise.
... To address this problem, Ranjan and Gneiting (2010) propose a method that extremizes the linear opinion pool by pushing it closer to its nearer extreme. Many others have employed schemes to extremize the average forecast (Karmarkar 1978, Erev et al. 1994, Ariely et al. 2000, Shlomi and Wallsten 2010, Turner et al. 2014, Baron et al. 2014, Satopää et al. 2014. Extremizing methods, such as the logit aggregator, are now used as benchmarks in practice (IARPA Geopolitical Forecasting Challenge 2018). ...
Preprint
Many organizations face critical decisions that rely on forecasts of binary events. In these situations, organizations often gather forecasts from multiple experts or models and average those forecasts to produce a single aggregate forecast. Because the average forecast is known to be underconfident, methods have been proposed that create an aggregate forecast more extreme than the average forecast. But is it always appropriate to extremize the average forecast? And if not, when is it appropriate to anti-extremize (i.e., to make the aggregate forecast less extreme)? To answer these questions, we introduce a class of optimal aggregators. These aggregators are Bayesian ensembles because they follow from a Bayesian model of the underlying information experts have. Each ensemble is a generalized additive model of experts' probabilities that first transforms the experts' probabilities into their corresponding information states, then linearly combines these information states, and finally transforms the combined information states back into the probability space. Analytically, we find that these optimal aggregators do not always extremize the average forecast, and when they do, they can run counter to existing methods. On two publicly available datasets, we demonstrate that these new ensembles are easily fit to real forecast data and are more accurate than existing methods.
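For readers unfamiliar with the extremizing benchmark named in the excerpt above, the following is a minimal sketch of a logit-style aggregator (Python; the forecasts and the exponent value are invented for illustration, not parameters recommended by any of the cited papers): forecasts are mapped to log-odds, averaged, scaled by an exponent a, and mapped back to a probability; a > 1 extremizes the pooled forecast and a < 1 anti-extremizes it.

import numpy as np

def logit_aggregate(probs, a=1.0):
    """Combine several experts' probabilities for one binary event via average log-odds."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    mean_log_odds = np.mean(np.log(probs / (1.0 - probs)))
    return 1.0 / (1.0 + np.exp(-a * mean_log_odds))

forecasts = [0.6, 0.7, 0.65, 0.8]
print("simple average:", np.mean(forecasts))                 # about 0.69
print("logit, a = 1:  ", logit_aggregate(forecasts))         # close to the average
print("logit, a = 2:  ", logit_aggregate(forecasts, a=2.0))  # pushed toward 1 (extremized)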
... There is evidence that, under the assumption of mutual independence of judgments, estimates made from the opinions of a crowd of N judges can be accurate (Galton 1907; Wallis 2014). This methodology has always attracted the interest of statisticians, but experimental and quasi-experimental research on the topic surged after 2000 (Ariely et al. 2000; Soll and Larrick 2009; Müller-Trede et al. 2018). However, we have to highlight major differences in core assumptions between experimental methodology and crowd rating information systems (observational methodology): ...
Article
Full-text available
Crowd rating is a continuous and public process of data gathering that allows the display of general quantitative opinions on a topic from online anonymous networks, treated as crowds. Online platforms have leveraged these technologies to improve predictive tasks in marketing. However, we argue for a different employment of crowd rating as a tool of public utility to support social contexts suffering from adverse selection, such as tourism. This aim requires dealing with issues in both the method of measurement and the analysis of data, and with common biases associated with public disclosure of rating information. We propose an evaluative method to investigate the fairness of common measures of rating procedures, with the peculiar perspective of assessing the linearity of the ranked outcomes. This is tested on a longitudinal observational case of 7 years of customer satisfaction ratings, for a total of 26,888 reviews. According to the results obtained from the sampled dataset, analysed with the proposed evaluative method, there is a trade-off between loss of (potentially) biased information on ratings and fairness of the resulting rankings. However, when an ad hoc unbiased ranking is computed, the ranking obtained through the time-weighted measure is not significantly different from the ad hoc unbiased case.
... Although unable to overcome idiosyncratic biases (since both estimates originate from the same person), aggregated responses should still be more accurate than individual responses due to the decrease in error resulting from random noise. Importantly, the noise contained in these estimates should be at least somewhat independent for the errors to cancel each other out (Ariely et al. 2000; Herzog and Hertwig 2009). ...
Article
Full-text available
Artificial intelligence can now synthesise face images which people cannot distinguish from real faces. Here, we investigated the wisdom of the (outer) crowd (averaging individuals' responses to the same trial) and inner crowd (averaging the same individual's responses to the same trial after completing the test twice) as routes to increased performance. In Experiment 1, participants viewed synthetic and real faces, and rated whether they thought each face was synthetic or real using a 1–7 scale. Each participant completed the task twice. Inner crowds showed little benefit over individual responses, and we found no associations between performance and personality factors. However, we found increases in performance with increasing sizes of outer crowd. In Experiment 2, participants judged each face only once, providing a binary ‘synthetic/real’ response, along with a confidence rating and an estimate of the percentage of other participants that they thought agreed with their answer. We compared three methods of aggregation for outer crowd decisions, finding that the majority vote provided the best performance for small crowds. However, the ‘surprisingly popular’ solution outperformed the majority vote and the confidence‐weighted approach for larger crowds. Taken together, we demonstrate the use of outer crowds as a robust method of improvement during synthetic face detection, comparable with previous approaches based on training interventions.
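The three outer-crowd aggregation rules compared in that experiment can be sketched roughly as follows (Python; the variable names, scales, and the exact way confidence and predicted agreement enter the rules are assumptions for illustration, not the authors' implementation).

import numpy as np

def majority_vote(votes):
    """votes: 0/1 responses for one face; returns the majority answer."""
    return int(np.mean(votes) >= 0.5)

def confidence_weighted(votes, confidence):
    """Weight each vote by its confidence rating and return the heavier side."""
    votes = np.asarray(votes)
    confidence = np.asarray(confidence, dtype=float)
    return int(confidence[votes == 1].sum() >= confidence[votes == 0].sum())

def surprisingly_popular(votes, predicted_agreement):
    """Pick the answer whose actual popularity most exceeds its predicted popularity.
    predicted_agreement: each respondent's estimate (0-1) of the share of others
    giving the SAME answer as they did."""
    votes = np.asarray(votes)
    pred = np.asarray(predicted_agreement, dtype=float)
    actual_1 = votes.mean()
    predicted_1 = np.where(votes == 1, pred, 1.0 - pred).mean()
    return int(actual_1 - predicted_1 >= (1.0 - actual_1) - (1.0 - predicted_1))

votes = [1, 1, 0, 0, 0]                       # 1 = "real", 0 = "synthetic"
confidence = [6, 5, 3, 2, 2]                  # 1-7 scale
predicted_agreement = [0.4, 0.5, 0.8, 0.7, 0.9]
print(majority_vote(votes),
      confidence_weighted(votes, confidence),
      surprisingly_popular(votes, predicted_agreement))

In this invented example the confidence-weighted and surprisingly popular rules overturn the simple majority, which is the kind of divergence the experiment examines as crowd size grows.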
... 1.1 Collective Intelligence within and Between Effects [12] regarded the CI obtained from repeated trials by one person as a "crowd within effect" and concluded that it was ineffective as CI. In contrast, [13] found that a "crowd within effect" can be obtained through repeated trials. ...
Article
Both group process studies and collective intelligence studies are concerned with “which of the crowds and the best members perform better.” This can be seen as a matter of democracy versus dictatorship. Having evidence of the growth potential of crowds and experts can be useful in making correct predictions and can benefit humanity. In the collective intelligence experimental paradigm, experts' or best members ability is compared with the accuracy of the crowd average. In this research (n =620), using repeated trials of simple tasks, we compare the correct answer of a class average (index of collective intelligence) and the best member (the one whose answer was closest to the correct answer). The results indicated that, for the cognition task, collective intelligence improved to the level of the best member through repeated trials without feedback; however, it depended on the ability of the best members for the prediction task. The present study suggested that best members' superiority over crowds for the prediction task on the premise of being free from social influence. However, machine learning results suggests that the best members among us cannot be easily found beforehand because they appear through repeated trials.
... However, the quantification of all the nodes has been carried out by an expert panel, making the model highly subject to the opinions and knowledge of the experts. It is argued that the Delphi method is highly useful when experts' opinion is the only available source of information [41, 42]. Critics claim that this process is too time-consuming and contains uncertainties due to expert elicitation [43]. ...
Article
Full-text available
This paper proposes a methodology to estimate the probability of basic causes of allision accidents between vessels and offshore platforms that overcomes the problem of data scarcity required for causal analysis. The approach uses information derived from incidental data and expert elicitation, processed by a multiple attribute utility method and hierarchical Bayesian analysis. First, the methodology is detailed, briefly describing the adopted approaches. A dataset of allision incidents provided mainly by the UK Health and Safety Executive and other agencies is prepared. The features of the incidents’ causation in terms of the causal factors and basic causes are presented and discussed. A novel scheme is proposed to evaluate the annual occurrence rates of basic causes of accidents from the relative importance of each basic cause derived by the Deck of Cards method. Then, a hierarchical Bayesian analysis is conducted to predict the posterior distribution of the occurrence rate of each basic cause in the time frame under analysis. The proposed holistic methodology provides transparent estimates of allision causation probabilities from limited and heterogeneous datasets.
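As a much simpler stand-in for the paper's hierarchical Bayesian step, the sketch below (Python) shows a single-cause conjugate Gamma-Poisson update of an annual occurrence rate; the count, exposure, and prior parameters are invented, and in the paper's methodology the prior would instead be informed by the expert-elicitation and Deck of Cards stage.

from scipy import stats

# Hypothetical inputs for one basic cause: k occurrences observed over t platform-years,
# plus a Gamma prior on the annual occurrence rate (all numbers invented).
k, t = 3, 120.0
alpha0, beta0 = 2.0, 50.0            # prior mean rate = alpha0 / beta0 = 0.04 per year

# Conjugate update: Gamma prior + Poisson likelihood -> Gamma posterior.
alpha_post, beta_post = alpha0 + k, beta0 + t
posterior = stats.gamma(a=alpha_post, scale=1.0 / beta_post)

print("posterior mean rate (per year):", alpha_post / beta_post)
print("95% credible interval:", posterior.interval(0.95))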
Article
Experts are usually valued for their knowledge. However, do they possess metaknowledge, that is, knowing how much they know as well as the limits of that knowledge? The current research examined expert metaknowledge by comparing experts' and nonexperts' confidence when they made correct versus incorrect choices, as well as the difference in between (e.g., Murphy's Resolution and Yates' Separation). Across three fields of expertise (climate science, psychological statistics, and investment), we found that experts tended to display better metaknowledge than nonexperts but still showed systematic and important imperfections. They were less overconfident than nonexperts in general and expressed more confidence in their correct answers. However, they tended to exhibit low Murphy's Resolution, similar to nonexperts, and endorsed wrong answers with equal or higher confidence than did their nonexpert peers. Thus, it appears that expertise is associated with knowing with more certainty what one knows, but it does not confer a corresponding awareness of what one does not know.
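The metaknowledge indices named in that abstract can be computed from confidence judgments and accuracy outcomes roughly as follows (Python; invented data, with Murphy-style resolution taken as the category-weighted variance of accuracy across confidence levels and Yates-style separation as mean confidence on correct minus incorrect answers; these are standard definitions that may differ in detail from the paper's).

import numpy as np

# Invented data: one judge's confidence (judged probability correct) and accuracy.
confidence = np.array([0.9, 0.7, 0.6, 0.8, 0.5, 1.0, 0.6, 0.9])
correct = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Overconfidence: mean confidence minus proportion correct.
overconfidence = confidence.mean() - correct.mean()

# Yates-style separation: mean confidence when right minus mean confidence when wrong.
separation = confidence[correct == 1].mean() - confidence[correct == 0].mean()

# Murphy-style resolution: category-weighted variance of accuracy across confidence levels.
base_rate = correct.mean()
resolution = sum((confidence == c).mean() * (correct[confidence == c].mean() - base_rate) ** 2
                 for c in np.unique(confidence))

print(f"overconfidence {overconfidence:.3f}  separation {separation:.3f}  resolution {resolution:.3f}")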
Article
Full-text available
We present a wisdom of crowds study where participants are asked to order a small set of images based on the number of dots they contain and then to guess the respective number of dots in each image. We test two input elicitation interfaces—one elicits the two modalities of estimates jointly and the other independently. We show that the latter interface yields higher quality estimates, even though the multimodal estimates tend to be more self-contradictory. The inputs are aggregated via optimization and voting-rule based methods to estimate the true ordering of a larger universal set of images. We demonstrate that the quality of collective estimates from the simpler yet more computationally-efficient voting methods is comparable to that achieved by the more complex optimization model. Lastly, we find that using multiple modalities of estimates from one group yields better collective estimates compared to mixing numerical estimates from one group with the ordinal estimates from a different group.
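As an illustration of the voting-rule style of aggregation mentioned in that abstract, here is a minimal Borda-count sketch (Python); whether the Borda count is among the specific rules the study used is an assumption, and the rankings below are invented.

def borda(rankings):
    """rankings: list of lists, each an ordering of item labels, best first.
    Returns items sorted by total Borda score (higher = ranked better overall)."""
    items = set(rankings[0])
    scores = {item: 0 for item in items}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position   # best item gets n-1 points, worst gets 0
    return sorted(items, key=lambda item: scores[item], reverse=True)

# Three participants order four images (A-D) by how many dots they contain.
print(borda([["A", "B", "C", "D"],
             ["B", "A", "C", "D"],
             ["A", "C", "B", "D"]]))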
Article
Full-text available
Research on people's confidence in their general knowledge has to date produced two fairly stable effects, many inconsistent results, and no comprehensive theory. We propose such a comprehensive framework, the theory of probabilistic mental models (PMM theory). The theory (a) explains both the overconfidence effect (mean confidence is higher than percentage of answers correct) and the hard–easy effect (overconfidence increases with item difficulty) reported in the literature and (b) predicts conditions under which both effects appear, disappear, or invert. In addition, (c) it predicts a new phenomenon, the confidence–frequency effect, a systematic difference between a judgment of confidence in a single event (i.e., that any given answer is correct) and a judgment of the frequency of correct answers in the long run. Two experiments are reported that support PMM theory by confirming these predictions, and several apparent anomalies reported in the literature are explained and integrated into the present framework.
Chapter
It is often assumed that n heads are better than one, that a judgment obtained from a group will be of higher quality than could be expected from an individual. This chapter considers the effectiveness of methods that have been proposed for combining individual quantitative judgments into a group judgment. For the most part, it will be found that n heads are, indeed, better than one, and at least one investigator has concluded that it does not much matter how they are combined. But the potential for improving performance is so great and the problems of achieving it so subtle that a clear understanding of the issues is essential.
Article
Schmittlein discussed the lack of universality of regression toward the mean. The present note emphasizes the universality of a similar effect, dubbed “reversion” toward the mean, defined as the shift in conditional expectation of the upper or lower portion of a distribution. Reversion toward the mean is a useful concept for statistical reasoning in applications and is more self-evidently plausible than regression toward the mean.
Article
This paper briefly describes some results of operational and experimental programmes in the United States involving subjective probability forecasts of precipitation occurrence and of maximum and minimum temperatures. These results indicate that weather forecasters can formulate such forecasts in a reliable manner.
Article
This note examines the number of experts to be included in a prediction group where the criterion of predictive ability is the correlation between the uncertain event and the mean judgment of the group members. It is shown that groups containing between 8 and 12 members have predictive ability close to the "optimum" under a wide range of circumstances, provided that (1) the mean intercorrelation of experts' opinions is not low (<.3, approximately) and/or (2) mean expert validity does not exceed mean intercorrelation. Evidence indicates these exceptions will not be common in practice. The characteristics needed by an additional expert to increase the validity of an existing group are also derived.
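The note's conclusion can be reproduced with the classical formula for the validity of the mean of J equally valid, equally intercorrelated judgments, r_group = r_xe * sqrt(J) / sqrt(1 + (J - 1) * r_xx). Whether this is exactly the expression used in the note is an assumption, but the sketch below (Python, with illustrative parameter values) shows why gains flatten out in the 8-12 judge range unless the mean intercorrelation is very low.

import math

def group_validity(j, r_xe, r_xx):
    """Correlation between the criterion and the mean of j judges' estimates,
    assuming each judge has validity r_xe and each pair intercorrelates r_xx."""
    return r_xe * math.sqrt(j) / math.sqrt(1.0 + (j - 1) * r_xx)

for j in (1, 2, 4, 8, 12, 24):
    print(j, round(group_validity(j, r_xe=0.4, r_xx=0.3), 3))   # gains flatten after ~8-12 judges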
Article
focuses on several practical issues in subjective probability for discrete events from the standpoint of decision analysis / decision analysis sets the standards for the use of subjective probability and points the way for other applications / it has a high stake in the success of subjective probability methods and a high commitment to ensuring their reliability and validity examine elicitation, calibration and combination of discrete subjective probabilities in the light of a model that explains and brings order to a considerable amount of confusing experimental data / calibration, the extent to which the observed proportions of events that occur agree with the assigned probability values, directly affects the quality of decision analysis and is the central issue / elicitation, the process by which judgments are obtained, and combination, the process by which probabilities of the same event from different judges are aggregated, are intimately related to calibration and are considered from that standpoint (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
the emergence of "overconfidence" in calibration studies is often understood as an indication of a general human bias / a cognitive approach is proposed which offers a different interpretation: miscalibration is not seen as a bias, but as a necessary consequence of task characteristics and the selection of items (PsycINFO Database Record (c) 2012 APA, all rights reserved)