APPLIED COGNITIVE PSYCHOLOGY
Appl. Cognit. Psychol. 19: 455–475 (2005)
Published online 14 March 2005 in Wiley InterScience
(www.interscience.wiley.com) DOI: 10.1002/acp.1085
When 90% Confidence Intervals are 50% Certain:
On the Credibility of Credible Intervals
KARL HALVOR TEIGEN¹* and MAGNE JØRGENSEN²
¹University of Oslo, Norway
²Simula Research Laboratory, Oslo, Norway
SUMMARY
Estimated confidence intervals for general knowledge items are usually too narrow. We report five
experiments showing that people have much less confidence in these intervals than dictated by the
assigned level of confidence. For instance, 90% intervals can be associated with an estimated
confidence of 50% or less (and still lower hit rates). Moreover, interval width appears to remain
stable over a wide range of instructions (high and low numeric and verbal confidence levels). This
leads to a high degree of overconfidence for 90% intervals, but less for 50% intervals or for free
choice intervals (without an assigned degree of confidence). To increase interval width one may have
to ask exclusion rather than inclusion questions, for instance by soliciting ‘improbable’ upper and
lower values (Experiment 4), or by asking separate ‘more than’ and ‘less than’ questions
(Experiment 5). We conclude that interval width and degree of confidence have different
determinants, and cannot be regarded as equivalent ways of expressing uncertainty. Copyright ©
2005 John Wiley & Sons, Ltd.
Laypeople and experts alike are often called upon to formulate estimates or make
predictions about imperfectly known quantities, like: How many subjects do I need to
achieve reliable results? How much will I have to pay for a decent flat? How many weeks
will it take to revise the paper? And how long will it take to receive an answer from
the journal editor?
Answers to such questions are often fraught with considerable uncertainty. This
uncertainty can be expressed in two ways: (1) By adding a probabilistic modifier to the
most likely estimate (‘it is 90% probable’, ‘it is almost certain’). (2) By using an interval
estimate, or range judgment (‘it will take 4–8 weeks’). Sometimes, intervals are indicated
by lower or upper limits only (‘at least 4 weeks’; ‘not more than eight’).
The most complete uncertainty descriptions are achieved by a combination of
probabilities and intervals. Such estimates have been labelled credible intervals, sub-
jective confidence intervals, fractile assessments, uncertainty intervals, or probabilistic
prediction intervals. For instance, project managers are encouraged to predict both the
most likely effort of a new project (in work hours) and the 90% prediction interval
*Correspondence to: Dr K. H. Teigen, Department of Psychology, University of Oslo, P.O.B. 1094, Blindern,
N-0317 Oslo, Norway. E-mail: k.h.teigen@psykologi.uio.no
Contract/grant sponsor: Research Council of Norway; contract/grant number: 135854/350.
(minimum and maximum limits that will include the correct value with 90% certainty)
(Moder, Phillips, & Davis, 1995).
From a formal point of view, probability levels and interval magnitudes are compensa-
tory, in the sense that high uncertainties can be expressed either by low probabilities or by
wide intervals. Narrow intervals can be compensated for by low probabilities, whereas
high probabilities are warranted if the interval estimates are wide enough. Formal
equivalence, however, does not necessarily mean psychological equivalence. The aim of
the present study is to investigate the determinants of interval magnitudes, and specifically
the role played by probability levels. Will an increase in probability level be associated
with a corresponding increase in interval magnitude, and vice versa?
Most studies of interval estimation have asked people to produce intervals associated
with an assigned probability level, usually probabilities approaching certainty (90%, 98%,
or 99%). The ‘credible’ intervals people produce in these studies are usually far too
narrow. Actual hit rates (the frequency of correct values falling inside the interval) are
often less than 0.50, leading to 50% or more ‘surprises’, instead of the 1%–10% that would
be expected from a well-calibrated judge (Alpert & Raiffa, 1982; Klayman, Soll,
González-Vallejo, & Barlas, 1999). Such results are typically described as evidence of
‘overconfidence’. In fact, interval estimates have become the most popular and robust way
of demonstrating overconfidence in textbook accounts (Bazerman, 1994; Russo &
Schoemaker, 1989).
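To make the calibration arithmetic concrete, here is a minimal sketch (our illustration, not part of the original paradigm descriptions); the function names and toy numbers are hypothetical.

```python
# Minimal sketch (illustrative): scoring interval estimates against an
# assigned confidence level. A 'hit' is a true value falling inside the
# stated (min, max) interval; overconfidence is the assigned confidence
# minus the observed hit rate. All values below are hypothetical.

def hit_rate(intervals, true_values):
    """Fraction of true values that fall inside their stated intervals."""
    hits = sum(lo <= truth <= hi
               for (lo, hi), truth in zip(intervals, true_values))
    return hits / len(true_values)

assigned_confidence = 0.90                # e.g. a '90% certain' instruction
intervals = [(4, 8), (10, 20), (1, 3)]    # hypothetical (min, max) estimates
true_values = [6, 25, 5]                  # hypothetical correct answers

rate = hit_rate(intervals, true_values)   # 1/3 here: only the first is a hit
print(f"hit rate = {rate:.2f}, overconfidence = {assigned_confidence - rate:.2f}")
```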
The concept of overconfidence is also used to describe the results from a different
research paradigm, where people are asked to produce subjective probability estimates for
their chosen answers to two-choice questions, or similar ‘discrete propositions’. These
confidence estimates are subsequently compared to the proportion of correct answers.
Overconfidence refers here to the fact that confidence estimates often exceed hit rates (for
reviews, see Arkes, 2001; Keren, 1991; Lichtenstein, Fischhoff, & Phillips, 1982;
McClelland & Bolger, 1994). Studies that have compared judgments of discrete events
with interval estimates have found that interval estimates produce more overconfidence
(e.g. Seaver, Winterfeldt, & Edwards, 1978), a phenomenon referred to as ‘format
dependence’ (Juslin, Wennerholm, & Olsson, 1999). It may, however, be misleading to
speak of ‘overconfidence’ as long as questions about confidence have not been directly
asked. Participants producing intervals may not feel they are expressing their confidence,
but rather their error margins.
ASSIGNED VERSUS ESTIMATED CONFIDENCE
The present experiments were designed to incorporate questions about confidence into
the traditional interval estimation paradigm. This enables us to compare prescribed
confidence levels, used for generating credible intervals, with probability estimates
generated in response to such intervals. To simplify the description of the studies, we
will adopt the following terminology. Probability values proposed by the experimenter
(as in most studies of credible intervals) will be referred to as assigned confidence,
AC. In contrast, estimated confidence, EC, refers to probability values produced by
participants (as in studies of confidence in discrete propositions). We make a similar
distinction between estimated intervals, EI (generated by participants) and assigned
intervals, AI (generated by the experimenter, or calculated by the subjects according to
specific instructions).
The standard procedure for studying ‘overconfidence’ in credible intervals has been to
ask for intervals associated with high levels of assigned confidence (AC-EI procedures). In
the present studies, results from this procedure will be compared with its mirror image,
namely the confidence associated with assigned intervals (AI-EC formats). The assumed
compensatory relationship between confidence and interval magnitude suggests that
higher levels of assigned confidence should lead to wider interval estimates, and similarly,
that wider assigned intervals should be associated with higher confidence estimates.
We first report the results from three studies (Experiments 1–3), showing that AC and
EC are not the same. Assigned confidence (AC) typically leads to the generation of EIs
that are too narrow, but when these or similar intervals are assigned, people report different,
generally lower confidence estimates (EC). Moreover, the magnitude of EI seems to be
rather constant across a wide variety of AC (Experiments 2, 3, and 5). Similar EIs will also
be reported under conditions where no confidence is assigned (Experiments 4 and 5).
As a background for understanding these effects, we will review some potential
mechanisms that may affect confidence estimates and interval estimates, respectively.
POTENTIAL DETERMINANTS OF CONFIDENCE
AND INTERVAL ESTIMATES
Confidence ratings, especially those leading to overconfidence, have been explained in a
variety of ways, ranging from a tendency to favour positive above negative evidence
(Koriat, Lichtenstein, & Fischhoff, 1980), to a lack of complete, immediate and accurate
feedback (Arkes, 2001). Overconfidence has also been explained as an artifact, due to a
biased sampling of questions (Gigerenzer, Hoffrage, & Kleinbölting, 1991), or as a
regression effect, due to random errors and unreliable measures (Erev, Wallsten, &
Budescu, 1994; Soll, 1996). A common theme going through several, otherwise different,
theoretical accounts is the idea that the rater does not have direct access to the certainty of
any particular proposition, but has to make indirect assessments based on more or less
valid probability cues, or by comparisons with a limited number of memory exemplars
(Juslin & Persson, 2002).
Following Kahneman and Tversky’s (1982) distinction between internal and external
uncertainty, we may think of confidence judgments as reflecting (1) the judge’s subjective
expertise, that is, an individual’s degree of trust, or lack of trust, in his or her own
knowledge, and (2) the degree of variability believed to be associated with the target value.
Prediction problems, as reflected in the ‘planning fallacy’ (Buehler, Griffin, & Ross,
1994), may be primarily due to an underestimation of the external uncertainties involved.
For general knowledge items, which are the subject of the present investigation,
attributions to external uncertainty are usually not applicable (there is not much variability
associated with the birth year of Mozart). In this case, degree of confidence must reflect a
balance between the individual’s ‘internal’ arguments for and against a particular piece of
knowledge, suggested by the experimenter or generated by the individual. High prob-
abilities could be due to a strong belief in the accuracy of a particular statement, but they
could also reflect an absence of counterarguments.
The magnitude of credible intervals can be dependent on similar factors, including (1) a
person’s trust in his own knowledge, and (2) the degree of variability believed to be
associated with the target value. Yet confidence estimates and interval estimates differ by
having different foci. Awareness of missing knowledge (high uncertainty) can be
expressed directly in terms of low confidence ratings. Interval estimates require, in addition,
that deviating alternative outcomes are imaginable; if they are not, intervals that are too narrow will ensue.
This has been repeatedly demonstrated for probabilistic prediction intervals for real tasks
(Connolly & Dean, 1997; Jørgensen, Teigen, & Moløkken, 2004), as well as for general
knowledge questions (Alpert & Raiffa, 1982; Juslin et al., 1999; Soll & Klayman, 2004).
Intervals may be determined by additional considerations that have even less to do with
confidence. One is an implicit demand for communicative informativeness. This is a
special case of the Gricean maxim of quantity (Grice, 1975), indicating that intervals, like
other parts of a communication, should be as informative as possible, and thus not exceed
a certain size even under conditions of relative ignorance. Yaniv and Foster (1995, 1997)
have shown that people’s preferences for ‘fine-grained’ and precise values can lead to
inaccurate estimates by a process of informativeness-accuracy trade-off.
Intervals may also be affected by two strategies that can be used in any categorization
task: An inclusion strategy, where the question is whether a target object should be
accepted as belonging to the class; and an exclusion strategy, where the task is to reject or
eliminate those items that do not belong to the class. These two strategies are not entirely
complementary. Yaniv and Schul (1997) found that inclusion instructions led to a much
smaller range of acceptable items than exclusion instructions. Respondents asked to mark
alternatives ‘that are likely to be the correct answers’ marked, on the average, 18% of the
alternatives, whereas those who were given elimination instructions (checking alter-
natives ‘that are not likely to be the correct answer’), marked 49.9% of the set, implying
that 50.1% were ‘likely’. Interval estimates, where participants are asked to identify lower
and upper limits for the category of correct answers (‘the population of London is
between ... ...and ... ...millions’), can be construed as an inclusion process, the question
being which populations, large or small, can be accepted as belonging to the set of
potential London populations. Soll and Klayman (2004) have recently suggested that
when interval estimates are formulated as range judgments, as above, they tend to be
treated as a single (fuzzy) judgment, dominated by a single search of the relevant
information available.
Thus narrow intervals can reflect various aspects of overconfidence, but they can also be
the result of other processes (demand for informativeness and inclusion strategies) that are
conceptually distinct from confidence. This may lead to a dissociation between estimated
confidence and estimated intervals.
EXPERIMENT 1
Experiment 1 was designed to compare assigned confidence (AC) of credible intervals
with estimated confidence (EC) of intervals of similar magnitude. In the first case, the
dependent variable is the width of the estimated intervals; in the second case, it is the
degree of confidence.
Method
Participants
Altogether 83 students were recruited in the reading room for social science students at the
University of Oslo. They were randomly allocated to two conditions by receiving different
questionnaires, and received an instant lottery ticket for participation.
Questionnaires
All participants were asked 10 questions about a variety of subjects, including the
population of Spain, the height of the City Hall in Oslo, and the annual number of
suicides in Norway.
After giving their most likely estimates (E), participants in the AC condition (n = 44)
were asked to produce 90% confidence intervals around each estimate, described as a
minimum and a maximum value that would contain the true value in nine out of 10 cases.
(‘The population of Spain is between ...and ...millions, with 90% certainty’.)
Participants in the EC condition (n = 39) were asked to calculate minimum values by
subtracting 50% from their most likely estimates and maximum values by adding 50%.
This corresponds to a mean relative interval (MRI) width of 1.00, where MRI = (max − min)/E. So
if the estimated value is 40, the computed interval will be between 20 and 60. Relative
intervals were preferred to absolute intervals because they allow for comparisons between
estimates of different orders of magnitude. EC participants were then asked to estimate
their degree of confidence (as a percentage) that the true answer would fall within this
range.
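For illustration, the EC condition's interval construction and the MRI measure can be sketched as follows (our code; the example value of 40 is from the text, the function names are ours).

```python
# Sketch of the EC-condition interval and the MRI measure defined above:
# MRI = (max - min) / E, where E is the most likely estimate.

def ec_interval(estimate, fraction=0.50):
    """Interval obtained by subtracting and adding a fraction of E."""
    return estimate * (1 - fraction), estimate * (1 + fraction)

def mri(lo, hi, estimate):
    """Mean relative interval width, (max - min) / E."""
    return (hi - lo) / estimate

lo, hi = ec_interval(40)   # -> (20.0, 60.0), the example given in the text
print(mri(lo, hi, 40))     # -> 1.0, i.e. an MRI width of 1.00
```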
Results
The questions proved more difficult than intended. The hit rates for 10 items in the AC
condition ranged from 5% to 55%. The average participant had 2.35 rather than nine
correct answers, amounting to a mean overconfidence of 66.5% (90% − 23.5%).
The intervals calculated by participants in the EC condition turned out to be close to the
estimated intervals of the AC condition. (Incidentally, six were wider and four were
narrower; none of the differences was significant.) In line with this, the hit rates were also
similar (M = 23.4%). Confidence estimates were, however, much below 90%, ranging
from 34% to 66%, with a grand mean of 52.5%. This amounts to an overconfidence of
29.1%.
The experiment shows that credible intervals, with an assigned confidence of 90%,
correspond to a much lower degree of estimated confidence. The intervals were in both
cases too narrow, but much less so in the EC than in the AC condition. Thus the amount of
overconfidence varies with the way questions are asked.
EXPERIMENT 2
When participants in the same experiment are asked to produce credible intervals
corresponding to more than one level of confidence, for instance 50% intervals in addition
to 98% intervals, they adjust the ranges accordingly (Alpert & Raiffa, 1982; Seaver et al.,
1978). It does not follow, however, that this pattern of response will be observed in a
between-subjects design. In a recent study, Jørgensen et al. (2004) asked four groups of
computer science students to estimate the range of work hours they thought would be
required to complete a programming task, with assigned confidence levels of 50%, 75%,
90%, and 99% in the four groups, respectively. The interval estimates were, however,
almost identical (MRIs were 0.8 in the 50% group and 0.7 in the three other groups). The
present experiment was designed to replicate this finding with a set of easier general
knowledge items. We also wanted to explore the inverse relationship, namely the effect of
different assigned intervals on confidence.
Method
Participants
Participants were 75 students attending a course in computer science at the University of
Oslo. They were randomly assigned to six different conditions (n = 11–14).
Questionnaires
All questionnaires contained questions about the traveling distance between Oslo and 10
other, generally well-known Norwegian cities and townships (the correct distances ranging
from 215 to 2104 km).
After giving their most likely estimate, E, participants in three AC conditions were
asked to give minimum-maximum estimates corresponding to assigned confidence levels
of 99%, 90%, or 75%, for Conditions 1, 2, and 3, respectively.
Participants in three AI conditions were instead asked to give confidence estimates for
the actual distance to fall within plus/minus 10%, plus/minus 25%, and plus/minus 50% of
their most likely distance estimate. For instance, with an estimate of 400 km, they would
have to evaluate the interval from 360 to 440 km in Condition 4, the 300–500 km interval
in Condition 5, and 200–600 km in Condition 6.
In Conditions 1–3, MRI widths were computed for each item as (max − min)/E. In
Conditions 4–6 MRIs were, by definition, 0.20, 0.50, and 1.00.
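As a sketch (ours, not the authors' materials), the three assigned-interval conditions can be expressed as follows; the 400 km example is the one given in the text.

```python
# Sketch of the assigned-interval (AI) conditions: plus/minus 10%, 25%, or
# 50% around the most likely estimate, giving MRIs of 0.20, 0.50, and 1.00
# by construction.

def assigned_interval(estimate, pct):
    lo, hi = estimate * (1 - pct), estimate * (1 + pct)
    return lo, hi, (hi - lo) / estimate   # (min, max, MRI)

for pct in (0.10, 0.25, 0.50):
    print(assigned_interval(400, pct))
# (360.0, 440.0, 0.2)  -> Condition 4
# (300.0, 500.0, 0.5)  -> Condition 5
# (200.0, 600.0, 1.0)  -> Condition 6
```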
Results
These estimation tasks were clearly easier than the tasks in Experiment 1, resulting in
narrower intervals in the first three conditions (MRI around 0.50) and higher hit rates
(50% or above). Yet we can observe the same pattern of results: a high degree of
overconfidence in the AC conditions, but not in the AI conditions (participants in
Condition 5 are well calibrated, whereas participants in Condition 6 are actually under-
confident).
Confidence levels, relative intervals, and hit rates for the six conditions are shown in
Table 1. A comparison of Conditions 1–3 indicates that variations in assigned confidence
level had little, if any effect on interval estimates and hit rates. In contrast, variations in
assigned intervals in Conditions 4–6 seemed to influence confidence judgments and hit
rates. Large assigned intervals (MRI ¼ 1.00) led to higher confidence than more narrow
intervals, and, as could be expected, to higher hit rates. EC values are much lower than the
assigned confidences of Conditions 1–3, even in Condition 6 where the intervals are almost
twice as wide. Assigned intervals in Condition 5 are of the same magnitude as the estimated
intervals in Condition 2, with a slightly lower hit rate. Yet the mean estimated confidence of
these intervals is around 45%, about one half of the 90% level of Condition 2.

Table 1. Confidence levels, mean relative interval (MRI) widths, and hit rates in Experiment 2

Condition              Confidence   Interval (MRI)   Hit rate    n
Assigned confidence
  1                    99%          0.60             57.1%      11
  2                    90%          0.46             51.5%      13
  3                    75%          0.62             50.2%      12
Assigned intervals
  4                    53.0%        0.20             31.5%      13
  5                    45.1%        0.50             43.6%      14
  6                    62.4%        1.00             78.8%      11

Note: The confidence levels in Conditions 1–3 and the MRI widths in Conditions 4–6 are assigned
values.
EXPERIMENT 3
The previous study indicated that the magnitude of credible intervals can be essentially
unaffected by variation in confidence levels from 99% to 75%. This called for a replication
in a different domain, and with even wider variations in confidence levels. Also, if interval
estimates vary little, or not at all, with assigned confidence, one could expect similar
intervals with an unspecified level of confidence.
Method
Participants
Participants in this experiment were 237 students following a course in introductory
psychology at the University of Oslo. They were randomly allocated to five conditions by
receiving different variants of the same basic questionnaire.
Questionnaires
The questionnaires asked for interval estimates of birth years for five famous characters
from world history (Mohammed, Newton, Mozart, Napoleon, and Einstein), and the years
of death for five other famous persons (Nero, Copernicus, Galileo, Shakespeare, and
Lincoln).
Participants in the first two (assigned confidence) conditions were asked to state intervals
that they believed would contain the true answer with 90% confidence (Condition 1) or
with 50% confidence (Condition 2).
Participants in Condition 3 were asked to state intervals of their own choice, i.e. not
linked to a specific level of confidence. They were then asked to give confidence estimates
(percentages) indicating their probability that these intervals contained the true answer.
Participants in the last two conditions were asked to suggest intervals of 50 years
(Condition 4) or 20 years (Condition 5), before giving their confidence estimates
(percentages) that these intervals contained the correct answer.
Results and discussion
Participants in Conditions 1–3 generated intervals ranging from around 50 years for
Lincoln and Einstein, to 150–200 years for Nero and Mohammed, with increasing
intervals for persons belonging to the more distant past. Average intervals were in these
three conditions of similar magnitude, as shown in Table 2. (Relative intervals do not make
sense in this experiment, as the year of birth/death scale has no natural zero point.) The
90% confidence intervals of Condition 1 were only slightly wider than 50% confidence
intervals of Condition 2 (five were wider and five were narrower; none of the differences was
significant). The hit rates were low, making participants in both interval conditions (and
especially in the 90% condition) appear highly overconfident.
Intervals in the ‘free choice’ condition were of similar magnitude to the intervals of
the first two conditions. But when these participants were asked to describe their
own confidence in these intervals, the mean estimates ranged from 32% (Nero) to 50%
(Einstein). They were, in other words, less confident than participants in Conditions 1 and
2, who should, by definition, be either 90% or 50% confident in all their estimates.
The assigned intervals in Conditions 4 and 5 were clearly narrower than the estimated
intervals. The 20-year intervals led, as expected, to lower confidence than the 50-year
intervals for all 10 birth- and death-year estimates (p < 0.01, sign test), and also to a very
low hit rate.
These results confirm the general finding from the first two experiments in yet another
domain: Estimated confidence in 90% credible intervals turns out to be much below 90%.
Interval size does not seem to be much influenced by assigned confidence level, and will
remain about the same even without instructions to match a particular level of confidence.
Confidence (as a dependent variable) is, however, influenced by interval magnitude. Most
participants in this study appeared to be highly overconfident, but again, there was no fixed
degree of overconfidence. Overconfidence is massive with a high assigned confidence
level, less so with a 50% level, and even lower when confidence is estimated rather than
assigned.
EXPERIMENT 4
Experiment 3 showed that people produce credible intervals of the same magnitude under
quite different instructions. Intervals corresponding to 90% confidence were similar to
‘free’ intervals, where level of confidence had not been specified. At the same time,
confidence in these free intervals was much lower than 90%. Experiment 4 was designed
to replicate this finding in another domain. To avoid biased sampling of items (Gigerenzer
et al., 1991), they were this time randomly drawn from a finite, well-defined universe,
namely the population of European capitals.
As suggested in the introduction, narrow intervals can be a result of an inclusion
strategy, where participants are searching for ‘likely’ estimates. In the present experiment,
participants were also asked to produce ‘unlikely’ estimates, namely values that are clearly
too low and too high to be true. This was supposed to create wider intervals almost by
force, partly because it asks for two separate, opposing values, and partly because it
encourages an exclusion strategy, the task being to name numbers outside rather than
inside the expected range. With this procedure we would also expect high confidence
estimates (it is after all very likely that the true value can be found between the two
unlikely extremes).
Table 2. Mean confidence intervals, mean confidence, and hit rates in five conditions, Experiment 3

Condition               Interval (years)   Confidence   Hit rate
1. 90% confidence        99.0              90%          22.8%
2. 50% confidence        84.9              50%          22.6%
3. Free choice           93.8              42.4%        27.2%
4. 50-year interval      50                50.6%        25.4%
5. 20-year interval      20                43.3%        10.7%

Note: The confidence levels in Conditions 1–2 and the interval widths in Conditions 4–5 are assigned
values.
Method
Participants
Participants were 94 students at the Universities of Oslo and Tromsø, who were paid NOK
100 ($12) for completing this and several other unrelated judgment tasks. They were
divided into two equal groups by receiving different variants of the same basic
questionnaire.
Questionnaires
Two sets of 10 European capitals were prepared by draws according to a table of random
numbers, from the complete list of 45 European countries and their capitals (One World–
Nations Online, n.d.).
List A contained (in alphabetical order): Andorra la Vella, Berlin, Budapest, Chisinau,
Kiev, Lisbon, Minsk, Moscow, Rome, and San Marino. The list also included the names of
the respective countries.
List B contained: Bern, Bucharest, London, Madrid, Paris, Riga, Sarajevo, Tallinn,
Tirana, and Vaduz, along with the appropriate country names.
Part 1. Participants in Group 1 (credible intervals) received one list and were asked to
give a lower and an upper population estimate for each city, with 90% confidence, expla-
ined as the interval within which the correct number would fall in nine out of 10 cases. Half
of the participants (n = 25) received List A, and the other half (n = 22) received List B.
Participants in Group 2 (free intervals) received the same two lists, and were asked to
give lower and upper estimates of their own choice along with their own confidence esti-
mates. As an example, 90% confidence was explained in the same way as to participants in
Group 1. One half of the participants (n = 22) received List A, and the other half (n = 25)
received List B.
Part 2. When this task was completed, participants in both groups received the second list,
but with a different instruction, namely to suggest two improbable population figures for
each city, one clearly too low and the other clearly too high, but without further specifications
of the degree of improbability involved. Finally, they were asked to indicate how confident
they were (on a 0–100% scale) that the actual number would fall between these two figures.
Results and discussion
Participants in Group 1 (AC = 90%) produced intervals that were too narrow for all cities on both lists,
with an average of 3.85 correct answers. The two samples of cities led to very similar hit
rates, as shown in Table 3. Those who were asked to produce free intervals (Group 2) offered
even narrower intervals, including only 2.72 correct answers. The difference between Group
1 and 2 is significant, t(95) = 2.41, p < 0.02. But Group 2 participants were at the same time
much more willing to admit their uncertainty, giving confidence estimates around 50%.
Thus overconfidence was reduced from 51.5% in Group 1 to 23.8% in Group 2.

Table 3. Mean hit rates (%) and mean confidence estimates for population intervals, Experiment 4

          Part 1                                          Part 2 (both groups)
          Assigned confidence     Free intervals          Improbable values
          Hit rate   Confidence   Hit rate   Confidence   Hit rate   Confidence
List A    37.9       90%          21.6       46%          66.7       78%
List B    39.1       90%          32.9       55%          63.9       72%
Total     38.5       90%          27.2       51%          65.3       75%
When asked to give improbably low and high population figures, participants in both
conditions generated much wider intervals, containing the correct population figures in
6.53 out of 10 cases. Despite the ‘improbability’ of the high and low values, the
participants were far from sure about their success in capturing the correct population
figures, and gave confidence estimates around 75%, not far from their actual hit
rates. These intervals were much wider than the intervals intended to correspond to a 90%
confidence level in the first part of the experiment, yet they were associated with a lower
reported self-confidence.
A closer analysis of errors (correct values outside the confidence intervals) showed both
over- and underestimations, for instance the populations of Bucharest and Moscow were
often underestimated, whereas Bern and Andorra (small capitals) were overestimated.
This can be explained partly as a regression effect, but some large, well-known capitals
(Paris, Rome) were also overestimated. Overestimations were in this experiment generally
two to three times more common than underestimations, perhaps because capital cities are
believed to have large populations (May, 1986). According to an exemplar model (Juslin &
Persson, 2002), city populations will be derived from a comparison with similar, known
cities. This may lead to an overestimation bias, as known cities generally have larger
populations than unknown cities.
EXPERIMENT 5
The purpose of Experiment 5 was (1) to study intervals produced in response to verbal
probability phrases rather than numeric probabilities; (2) to compare intervals produced by
different elicitation methods; (3) to measure overconfidence/underconfidence based on
predicted hit rates; and (4) to study the effects of feedback on performance.
(1) Experiments 2–4 showed that similar intervals were produced in response to different
assigned levels of confidence. In the present experiment, confidence is expressed by
verbal phrases instead of percent ages. Verbal phrases are in daily life more common
and perhaps more meaningful than numbers. Such phrases do not represent specific
probabilities, but can be used to characterize wide segments of the probability
dimension (Budescu & Wallsten, 1995; Teigen & Brun, 2003). Yet there will be a
general agreement at the group level that some phrases indicate higher probabilities than
others. In the present study we asked participants to estimate intervals either based on
the phrase ‘I believe that ...’, or ‘I am quite certain that ...’. ‘Quite certain’ indicates
a high probability, whereas ‘I believe that’ is less definite and can include a range of
probabilities from 0.5 and upwards.
(2) Soll and Klayman (2004) recently found that when people are asked separate
questions about the lower bound and the higher bound of the probability interval
(the two-point method), they produce wider intervals than with the more conventional
range method.
Soll and Klayman asked the same participants to estimate both lower and higher
bounds. In the present experiment, questions about lower and higher bounds were
given to different groups of participants. We did not formulate specific predictions
about the effect of this method (the study was planned and conducted before Soll and
Klayman’s research was known to us). One could argue both ways:
When participants are asked to concentrate exclusively on the lower bound, or on
the upper bound, they may recruit different magnitude information, leading in the first
case to low and in the second case to high estimates. Single boundary questions
(between-subjects design) make this procedure even more different from the range
method, and thus any difference between the range method and the two-point method
is likely to become prominent. This could lead to wider intervals (and less over-
confidence) with the single bound method.
On the other hand, the need to produce informative statements could pull in the
opposite direction. A wide range can be informative because it also gives an indication
about where the central value is expected to fall. In contrast, a single, very high upper
bound, or a single, very low lower bound gives no clue about the most likely value. To
be informative, the single bound estimate has to be as close to the most likely estimate
as possible.
(3) Calibration is in most studies of overconfidence measured by comparing confidence
estimates with hit rates. This procedure rests upon a belief that these two estimates are
(or should be) comparable. But individual confidence estimates suggest a concept of
probability as degree of belief for unique events, whereas hit rate is a frequentistic
concept. In many studies (like those reported here), this problem is circumvented by
defining confidence in terms of frequencies (for instance by saying that 90%
confidence means nine out of 10 correct answers). Yet there is some evidence that
mean confidence estimates (‘local confidence’) and estimates of the number of correct
responses (‘global confidence’) do not always agree, not even when done by the same
subjects (Liberman, 2004; Sniezek & Buckley, 1991). Global frequency estimates are
typically more realistic than average local confidence estimates. This may be due to an
explicit focus on frequencies, and also to the fact that respondents are allowed to take a more
detached ‘outside view’ on their own performance. It may be easier to admit ‘I am
often wrong’ after a set of answers than to indicate ‘I am probably wrong’ after each
individual answer. In the present experiment, participants were not asked to give local
confidence estimates, but were instead asked about their global confidence, by
estimating their most likely number of hits after they had completed a set of 10 items.
(4) The fourth issue addressed in the present study is the effects of feedback on
subsequent performance. When overconfident estimators are informed about the
correct values, they can conclude that the intervals should have been wider, or that
the confidence level should have been lower. In addition, they will have learned
something about the typical values of objects in this particular domain. All these
lessons can, in principle, be carried over to a new task within the same field, leading to
improved performance. Jørgensen and Teigen (2002) found that interval predictions of
the time taken to complete software programming projects improved with feedback,
although rather slowly. It appeared easier for participants to learn to lower their
confidence to an appropriate level than to increase their intervals.
In the present study, participants received feedback on their first set of 10 estimates,
enabling them to compare their predicted number of hits to their actual hit rates. They
were then given a second estimation task, with 10 new items drawn from the same
universe. A comparison of Task 1 and Task 2 will show whether participants modify
their performance in terms of (a) improved accuracy, (b) adjusted intervals, or (c)
adjusted confidence.
Method
Participants
Participants were 354 students (81 men, 235 women, 38 did not report sex), attending a
course in introductory psychology at the University of Oslo. They were randomly divided
into six groups, by receiving different versions of the questionnaire.
Questionnaires
The questionnaires contained the same two lists of European capitals as in Experiment 4.
Participants in Condition 1 were asked to estimate population intervals for all cities by the
range method (both lower and higher bounds). In Conditions 2 and 3, they were asked
about single bounds (either lower or upper). Level of confidence was manipulated by
asking for intervals that they either (a) believed would contain the true number (moderate
confidence), or (b) were quite certain would contain the correct number (high confidence).
Thus, questions about the population of each city were asked (to different subjects) in
3 × 2 different ways:
1a: I believe [1b: I am quite certain] that London has between ... ...and ... ...
inhabitants
2a: I believe [2b: I am quite certain] that London has more than ... ...inhabitants
3a: I believe [3b: I am quite certain] that London has less than ... ...inhabitants
After completing the first set of 10 estimates, the participants were asked to predict their
own number of hits, by completing the statement: ‘I think I have ... ...correct answers’.
They were then allowed to open a second envelope containing the true population figures,
which they were to check against their own estimates, computing their actual hit rates.
The second envelope also contained a questionnaire with a second list of capitals, to be
completed in the same way as before. Half the participants in each of the six groups
received List A as their first task, followed by List B, whereas the other half received
the two lists in opposite order. The lists proved to have the same level of difficulty, with
4.93 correct answers to List A and 4.92 correct answers to List B (averaged over all
conditions and presentation orders), so the performance on these two lists was pooled.
Manipulation check
We tested the assumption that ‘I am quite certain’ reflects a higher probability than ‘I believe’
by presenting both phrases to an independent panel consisting of 30 employees in a
government agency (ranging from secretaries to lawyers). Participants in this condition
were asked to rate ‘I am quite certain that’ and ‘I believe that’ (along with the filler item ‘I
guess that’) on 0–100% visual analogue probability scales. ‘Quite certain’ achieved a mean
rating of 85%, whereas ‘believe’ was given a mean rating of 68%, confirming that these two
phrases are associated with different levels of confidence.
Results
Confidence levels
Participants in the moderate confidence conditions (n = 171) predicted their number of
hits to be 4.59 and 5.44 on Task 1 and Task 2, respectively (averaged over all groups). The
mean predicted hits in the high confidence conditions (n = 178) were almost identical,
namely 4.65 (Task 1) and 5.43 (Task 2). On an a priori basis, one might think that being
‘quite certain’ implies an expectation of having most, if not all, answers correct, and not
around 50%, as suggested by these answers. If ‘quite certain’ implies a higher degree of
confidence than ‘believe’, one would further expect this group to propose wider intervals
to make sure that their estimates were, in fact, correct. But the groups did not differ in
average performance, mean hit rates being 0.51 (Task 1) and 0.61 (Task 2) in the high
confidence condition, versus 0.47 and 0.60 in the moderate confidence condition. This
insensitivity to variations in assigned confidence is in line with the previous findings with
numeric confidence levels. As the verbal phrases did not make a difference, results from
these two conditions were pooled.
Elicitation method
Table 4 shows mean predicted hits and mean actual hits for participants in the three
elicitation conditions. Hit rates are clearly higher in the single bounds conditions than in
the range condition, for both tasks. Single estimate participants also predicted more
correct answers.
To allow for a more precise comparison, mean upper and lower limits were calculated
for each of the 20 cities in the three conditions. Three participants with extremely high
upper boundaries (100 million or 1 billion inhabitants for all cities) were excluded from
this analysis, to prevent outliers from having a disproportionate effect on the averages. The
grand means of these calculations are presented in Table 5, showing that lower limits are
consistently lower and upper limits are consistently higher in the single limit conditions
than in the range condition, yielding 56.2% wider intervals for Task 1 and 31.2% wider
intervals for Task 2. Two-way repeated measures analyses of variance (ANOVAs), with
elicitation method (single vs. range) and task (Task 1 vs. Task 2) as the two factors, reveal
highly significant main effects of elicitation method: for lower limit estimates,
F(1, 19) = 7.15, p = 0.015; for upper limit estimates, F(1, 19) = 46.9, p < 0.001; as well
as for interval widths, F(1, 19) = 30.0, p < 0.001.

Table 4. Predicted and actual hits for Task 1 (before feedback) and Task 2 (after feedback) for
participants in the range and single limit estimate conditions, Experiment 5

                   Condition 1       Condition 2         Condition 3
                   Range estimates   Lower limits only   Upper limits only
                   n = 121           n = 98              n = 136
Task 1
  Predicted hits   3.87              5.24                4.84
  Actual hits      2.64              4.61                7.19
Task 2
  Predicted hits   4.13              5.59                6.50
  Actual hits      3.64              7.12                7.50

Table 5. Mean lower and upper population limits (in millions) and credible intervals for range
estimates and single limit estimates, averaged over 20 capitals, Experiment 5

                                             Lower limit   Upper limit   Interval width
                                                                         Absolute   Relative
Range estimates (Condition 1)
  Task 1                                     2.37          4.31          1.94       0.92
  Task 2                                     1.27          3.29          2.02       1.28
Single limit estimates (Conditions 2 + 3)
  Task 1                                     2.04          5.07          3.03       0.64
  Task 2                                     1.04          3.79          2.65       1.02

Note: Absolute interval width = upper limit − lower limit. Relative interval width = absolute
interval width/interval midpoint.
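The width definitions in the note to Table 5 can be sketched as follows (our illustration; because the tabled values are averages over the 20 cities, they need not equal ratios computed from the grand-mean limits).

```python
# Sketch of the Table 5 definitions: absolute width = upper - lower;
# relative width = absolute width / interval midpoint. Illustrative
# values only; this does not reproduce the per-city averages in the table.

def interval_widths(lower, upper):
    absolute = upper - lower
    relative = absolute / ((upper + lower) / 2)
    return absolute, relative

print(interval_widths(2.0, 4.0))   # -> (2.0, 0.6666...)
```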
Overconfidence/underconfidence
Predicted hit rates were modest and did not show a general pattern of overconfidence.
Participants in the range condition overestimated their own performance on Task 1 by
47%, which was reduced to 13% on Task 2. Participants in the upper boundary conditions
were generally underconfident, whereas participants in the lower boundary condition were
slightly overconfident on Task 1 and clearly underconfident on Task 2. This is evident from
a comparison of predicted and actual hits (Table 4), and also if we compare the number of
overconfident versus underconfident participants in the three conditions, presented in
Table 6 (for Task 1 only). Soll and Klayman (2004) found a sex difference in over-
confidence, males being more confident than females. The present sample consisted of
about 75% women. They estimated their own hit rates significantly lower than did the
men, but they also had fewer correct answers.
Effects of training
Hit rates as well as degree of calibration (the overall correspondence between hit rates
and estimated hits) improved from Task 1 to Task 2 (Table 4, upper vs. lower half). This
improvement seems chiefly due to a general reduction of population estimates, lower
limits being adjusted downwards by nearly 50%, and upper limits by about 25%. It
will be recalled that many estimates, particularly for small capitals, were originally too
high. Feedback on Task 1 informed participants, among other things, that some capitals of
small European countries have fewer than 100,000 inhabitants, whereas the capitals of large
countries rarely exceed 10 million.
Absolute interval width (in millions) remained fairly constant from Task 1 to Task 2, but
relative interval widths increased, as seen from Table 5, last column. This may be regarded
as a side effect of the general reduction of population estimates. A population estimate of
3 million, plus/minus 1 million, yields an MRI of 0.67. After a reduction to 2 million, an
uncertainty interval of the same absolute size yields an MRI of 1.00.
Feedback did have an effect on confidence, dependent upon the participants’ degree of
under- or overconfidence. Mean changes in estimated hits for underconfident, well-calibrated,
and overconfident participants are shown in Table 6. Underconfident participants adjusted
their predictions upwards from 3.90 to 5.80 hits. Accurate participants adjusted their
estimates upwards, but not so much, whereas overconfident participants adjusted their
predictions slightly downwards (from 4.30 to 4.04 estimated hits). A 3 × 3 ANOVA on the
changes reported in Table 6 reveals no effect of condition, F(2, 338) = 0.61, n.s., but a
highly significant effect of confidence, F(2, 338) = 20.38, p < 0.0001. Post hoc analyses
(Tukey) show that all three confidence groups are different at p < 0.01, and separate one-way
ANOVAs yield significant main effects in all three conditions. Thus, we can conclude
that feedback can sometimes reduce the optimism of initially overconfident subjects, but
that the encouraging effects of feedback on underconfident and even on well-calibrated
subjects are much stronger.

Table 6. Mean changes in estimated hits (confidence) from Task 1 (before feedback) to Task 2
(after feedback) for underconfident, overconfident, and well-calibrated participants, Experiment 5

                   Condition 1       Condition 2         Condition 3
                   Range estimates   Lower limits only   Upper limits only
                   M        n        M        n          M        n
Underconfident      1.16    23        1.14    32          2.36    90
Well-calibrated     1.14    21        0.39    18          0.40    25
Overconfident      −0.28    76       −0.17    46         −0.38    16
Discussion
The four main findings of the present experiment were:
(1) Verbal level of confidence has no apparent effect on the width of credible intervals and
estimated hit rates. This is in line with our general finding that estimated uncertainty
intervals remain largely the same regardless of level of probability.
(2) People can believe that they have only a modest number of correct guesses (three to
six out of 10), despite being ‘quite certain’ about each guess.
(3) Feedback made overconfident participants slightly less certain, whereas underconfi-
dent and accurate participants became much more confident. This asymmetry is in line
with a study showing that participants in a basketball game adjusted their chances
slightly downward after each miss, but much more upwards after each hit (W. Bruine
de Bruin, unpublished manuscript, 2002). Feedback may in the present experiment
have had some effect on the intervals, not in an absolute sense, but by increasing their
relative widths. However, the main effect of feedback was a shift towards lower and
more realistic population estimates. The participants evidently realized that many
capitals, especially in small countries, had fewer inhabitants than they originally
thought. Put differently: participants seem to use feedback primarily to improve their
domain knowledge, and less so to improve their own way of handling uncertainty. By
scoring their own responses to Task 1 they may have learnt more about city
populations than about judgmental strategies.
(4) Single limit estimates produce much wider intervals than the more traditional range
method. ‘London has more than 1 million inhabitants’ or ‘less than 20 million
inhabitants’ appear to be acceptable statements, although ‘between 1 and 20 millions’
is too wide.
This finding has obvious practical implications. In areas where intervals tend to be
too narrow (most domains studied so far), more realistic intervals can be obtained by
asking judges to produce separate lower and upper bounds.
This finding was not predicted, but is clearly in line with Soll and Klayman’s (2004)
results on the two-point method, where the same judges produce lower and upper
limits in response to two separate questions. In their view, this is because two
questions invite informants to sample their knowledge twice. Questions about lower
boundaries make ideas about low populations accessible, whereas questions about
upper boundaries facilitate ideas about high populations, preparing the ground for low
or high estimates through a kind of priming procedure (similar to the process believed
to account for many anchoring phenomena, according to Mussweiler & Strack, 2000).
These results are less compatible with the informativeness interpretation (Yaniv &
Foster, 1997), according to which wide intervals are avoided because of their lack of
communicative precision. But very low lower limits (or very high upper limits) may
be even less informative, because they give no hint about the most likely, middle
value. Yet a single limit estimate, even a low one, may perhaps appear less vague,
simply because it consists of one rather than two numbers.
In addition, the single limit questions (and also the two-point questions of Soll &
Klayman) are clearly formulated as tasks of exclusion. By saying that London has
more than 1 million inhabitants, I indicate that a population of 1 million is outside the
category of likely populations. If I say that London has less than 20 millions, I imply
that this value is too high.
GENERAL DISCUSSION
More than 30 years of research has shown that people tend to produce too narrow
uncertainty intervals. The present studies add to this body of research by showing (1) that
estimated confidence does not match the assigned confidence of credible intervals, and (2)
that the width of estimated intervals stays fairly constant over a wide range of assigned
confidence levels, whereas estimated confidence is more likely to vary with interval size.
Estimated confidence lower than assigned confidence
The five experiments reported here allow for several AC-EC comparisons. As an illustration,
let us look at conditions where respondents have been asked to produce 90% confidence
intervals (for Experiment 5, the closest equivalent would be Task 1 range judgments in
the ‘quite certain’ condition). This instruction resulted in MRIs ranging from 0.46 (Experi-
ment 2) to about 1.00 (Experiments 1 and 5). Next, we look for conditions where participants
were asked to estimate the confidence of intervals of roughly comparable magnitude. These
are the assigned interval conditions with an MRI of 1.00 in Experiment 1, the assigned interval
condition with MRI = 0.50 in Experiment 2, the free interval conditions in Experiments 3 and
4, and the range condition of Experiment 5. The first set of results shows that 90% intervals can
be ‘translated’ into intervals with a mean MRI of 0.50–1.0, whereas the second set indicates
that such intervals will be ‘back-translated’ into confidence estimates of 40%–50%. The
situation can be compared to an exchange bureau where one is paid one Euro per dollar, but,
returning with Euros, one will receive only half a dollar for each. Under such circumstances,
we may well ask: what is the ‘true’ exchange rate of the dollar and the Euro?
The hit rates for these intervals tell a similar story. In the AC conditions hit rates ranged
from about 23% to about 46%, which is clearly below the assigned confidence of 90%.
From these figures, respondents appear to be massively (44%–67%) overconfident. Hit
rates in the corresponding interval conditions were in the same range. But when hit rates
are compared to the estimated confidence of the assigned intervals, much of the over-
confidence appears to be gone. One condition (Experiment 2) shows evidence of under-
confidence; in the other conditions, overconfidence is down to 12%–39%.
Stable intervals
The second main conclusion to be drawn from the present set of studies is that estimated
intervals remain stable over a wide range of instructions. In Experiment 2, 99% and 75%
confidence levels yielded intervals of equal magnitudes. In Experiment 3, 90% confidence
yielded only slightly wider intervals than 50% confidence. This contrasts sharply with the
normative requirements. With a normal distribution of errors, we should expect the high
probability interval in both these cases to be more than twice as wide as the low probability
interval. In Experiment 5, with verbal probability phrases, ‘quite certain’ intervals were no
wider than ‘believed’ intervals. Furthermore, ‘free intervals’ (with no assigned level of
confidence) proved to be equal to the 90% confidence intervals in Experiment 3 (but
somewhat narrower in Experiment 4).
This confirms a previous finding on effort predictions (Jørgensen et al., 2004), where
several different confidence levels yielded almost identical min-max estimates of work
hours. It is also in line with Yaniv and Foster’s (1997) finding that participants who were
asked to generate 95% confidence intervals obtained the same number of hits as those who
were asked to provide interval estimates that they merely ‘felt comfortable communicat-
ing’. Thus we are forced to conclude that when people produce an uncertainty interval, it is
not primarily based on probability considerations. Hence it may be fairer to ask
respondents simply to produce an interval, without specifying the degree of confidence
that it is assumed to reflect.
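The normative point made above, that a high-probability interval should be more than twice as wide as a low-probability one, can be checked directly. A minimal sketch (ours), assuming normally distributed errors so that a central interval at confidence c has half-width proportional to the normal quantile z((1 + c)/2):

```python
# Sketch (our illustration): under a normal error distribution, the width
# of a central interval at confidence c is proportional to the quantile
# z((1 + c) / 2), so normative width ratios follow from quantile ratios.
from scipy.stats import norm

def width_ratio(c_high, c_low):
    return norm.ppf((1 + c_high) / 2) / norm.ppf((1 + c_low) / 2)

print(width_ratio(0.99, 0.75))   # ~2.24: 99% interval vs. 75% interval
print(width_ratio(0.90, 0.50))   # ~2.44: 90% interval vs. 50% interval
```

Both ratios exceed 2, consistent with the normative expectation stated above.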
Earlier studies using the ‘fractile’ method (asking people to produce several intervals)
indicate that people are under some circumstances able to take into account that intervals
corresponding to a confidence of 98% must be wider than intervals corresponding to 80%
or 50% (Alpert & Raiffa, 1982; Juslin et al., 1999). These studies have used a within-
subjects design, highlighting the difference between high and low probabilities. Our
studies show that this effect tends to disappear in a between-subjects design, where
intervals cannot be directly compared. Kahneman (2003) has argued that intuitive
judgments are best studied in between-subjects designs, because such designs provide
fewer cues about the target attribute that the experimenter intends to test. Thus we can
conclude that people may realize the difference between the width of a high and a low
probability interval when they are explicitly compared, but this requires analytical,
deliberate considerations, which are not so readily accessed when only one type of
interval is asked for.
Variable confidence?
There is some evidence indicating that confidence estimates are more sensitive to
variations in intervals than vice versa. In Experiment 2, wide assigned intervals led to
higher confidence than narrow intervals. In Experiment 3, assigned 50-year intervals
implied higher confidence than 20-year intervals. Finally, the very wide intervals
generated in response to the upper and lower ‘improbable’ values in Experiment 4, as
well as the upper and lower single bound values in Experiment 5, were associated with
higher confidence than those produced by the range method. Yet the variations in
confidence were in all these cases less prominent than the much wider variations in actual
hit rates. Thus, the present set of studies does not allow for any definite conclusions about
the effect of interval size on confidence judgments.
Interval estimates and confidence estimates have different determinants
In the introduction, we claimed that uncertainty about quantities can be described in two
interchangeable ways, namely in terms of wide or narrow uncertainty intervals, and/or in
terms of high and low confidence in these intervals. The results from the present
experiments suggest that these two indicators of uncertainty are, in practice, not so
readily interchangeable. Interval size may be a meaningful way of expressing external
uncertainty, where we know from experience or from theory that outcomes can vary
between certain limits. With internal uncertainty, intervals make less sense. It may still be
meaningful to prefer a coarse, but hopefully correct estimate (‘Mozart was born in the 18th
century’) to a sharp, but inaccurate one. Yaniv and Foster (1997) have argued that
‘graininess’ or precision of uncertainty judgments involves a trade-off between two
competing objectives: accuracy (which favours imprecise estimates) and informativeness
(which demands precision). These two objectives may be better served by a medium-sized
interval, accompanied by a moderate level of confidence, than by a completely
uninformative interval that is 100% certain.
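To make the trade-off concrete, the following sketch assumes a normally distributed belief about the target quantity; the mean, standard deviation, and candidate widths are hypothetical illustrations, not values from the experiments:

```python
# Hypothetical sketch of the accuracy-informativeness trade-off, assuming
# the respondent's uncertainty is Normal(mu, sigma); mu, sigma, and the
# candidate widths below are illustrative, not values from the experiments.
from scipy.stats import norm

mu, sigma = 1750, 40  # an imagined belief about Mozart's birth year

for half_width in (5, 25, 50, 100):
    # Subjective probability that a centred interval contains the truth.
    confidence = norm.cdf(half_width / sigma) - norm.cdf(-half_width / sigma)
    print(f"mu +/- {half_width:>3} years: width {2 * half_width:>3}, "
          f"confidence {confidence:.0%}")
# Confidence (accuracy) rises with width while informativeness falls, so a
# medium-sized interval with moderate confidence can serve both objectives.
```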
A second problem with internal uncertainty intervals is how to select the upper and
lower bounds. It may be reasonable to place Mozart simply within the 18th century, but
perhaps more problematic to state that he was born between 1700 and 1799, using exact
numbers to describe inexact knowledge. Lack of knowledge seems more easily and
naturally communicated by ‘meta-cognitive’ statements, referring to confidence (‘I am
just guessing’, or ‘I am 50% sure’), than by suggesting intervals with wide, but explicit
bounds.
If we grant that too narrow intervals may be a result of a wish to communicate
informative statements, why are people sometimes willing to generate much
wider intervals, as we found in Experiment 4 when asking for improbable figures, and
in the single question conditions in Experiment 5? Soll and Klayman (2004)
have suggested that the range method yields a narrow interval because it is basically
conceived as a question about the most likely event. The wide intervals in Experiments 4
and 5 differ from this in two important respects: they are given in response to two
separate questions, which are formulated as questions of exclusion rather than inclusion.
Both these features indicate that the object of communication differs from that of the range
method. It remains for further studies to investigate the relative contributions of
these two factors. Separate questions about minimum and maximum values (rather than
about ‘more than’ and ‘less than’ values) might change the two-point method from an
exclusion into an inclusion task. This might entail narrower intervals even in a two-
question format.
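To summarize the contrast, the two framings can be written as question templates; the wordings below are paraphrases for exposition, not the original experimental items:

```python
# Illustrative prompt templates contrasting the two framings; the wordings
# are paraphrases for exposition, not the original experimental items.
def range_question(quantity):
    # Inclusion framing (range method): one question about likely values.
    return (f"Give a minimum and a maximum value, so that the interval "
            f"contains {quantity}.")

def exclusion_questions(quantity):
    # Exclusion framing (cf. Experiments 4 and 5): two separate questions
    # soliciting an improbably low and an improbably high value.
    return (f"Name a value that {quantity} is almost certainly MORE than.",
            f"Name a value that {quantity} is almost certainly LESS than.")

print(range_question("the population of Norway"))
print(*exclusion_questions("the population of Norway"), sep="\n")
```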
Are people overconfident?
Discrepancies between estimated confidence and actual hit rates of discrete propositions
have been debated as revealing genuine overconfidence, methodological artifacts, or
both (Hoffrage, 2004; Klayman et al., 1999). Such discrepancies have been even more
prominent in the area of credible intervals, where the same methodological criticisms do
not apply (Soll & Klayman, 2004). But most studies of interval overconfidence appear to
have compared hit rates to assigned rather than estimated confidence. The present studies
replicate these findings, showing that 90% confidence intervals can yield hit rates of 25–
40% rather than 90%. However, when we compare hit rates to estimated confidence, the
degree of overconfidence is greatly reduced. It is also greatly reduced if we ask for 50%
intervals rather than 90% intervals. This does not imply that interval overconfidence is a
purely methodological artifact, but it makes it difficult to draw general conclusions about
its magnitude and pervasiveness.
Practical implications
Uncertainty intervals are used or recommended in a number of applied settings. Standard
texts on project management (Kerzner, 2001; Moder et al., 1995) typically require
managers to submit ‘most optimistic’ and ‘most pessimistic’ completion times. These
intervals are usually defined in terms of frequencies or probabilities, as values that will not
be exceeded more than 1% or 5% of the time. The present research should make us
suspicious of such estimates, not only because people give range estimates that do not
match their actual hit rates, but also because they may misrepresent their own confidence
in these estimates. Rather than specifying a high confidence level before asking for
interval estimates, it may be better to ask for unspecified intervals followed by confidence
estimates. To avoid overconfidence, separate assessments of lower and upper interval
bounds may be an even more promising procedure.
The present results are based upon ‘artificial’ general knowledge questions, which may
appear far removed from optimistic and pessimistic predictions of real life events. They
agree, however, with results from a parallel set of studies about effort estimates in software
development projects (Jørgensen, 2004; Jørgensen et al., 2004; Jørgensen & Teigen,
2002). In one experiment, 29 software professionals were asked to estimate completion
times of 30 software enhancement tasks, which had already been performed by a different
company. The tasks were described in detail, and feedback about actual completion time
was given after each estimate. Half of the participants were asked to produce 90%
confidence intervals around their most likely estimate. They produced too narrow
intervals, starting with a hit rate of 64% for the first set of 10 tasks, increasing to 81%
for the last 10 tasks. The other half were instead asked to estimate their confidence in an
assigned interval, with minimum and maximum values arbitrarily set at 50% and 200% of
the most likely estimate (based on recommendations in NASA, 1990, for uncertainty intervals
of new projects). This led to intervals with hit rates of 67%–73%. More important, the
estimated confidence in these intervals was quite realistic (around 72%), and thus clearly
lower than the 90% assigned confidence of the first group. Thus, we believe that the
discrepancy between uncertainty intervals and interval uncertainty, demonstrated in the
present article, reflects a general problem of uncertainty estimation, not restricted to a
specific subset of laboratory tasks.
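A minimal sketch of this assigned-interval procedure under the 50%–200% rule; the 40-hour task is hypothetical:

```python
# Sketch of the assigned-interval procedure described above, using the
# 50%-200% rule attributed to NASA (1990); the 40-hour task is hypothetical.
def assigned_interval(most_likely_hours):
    return 0.5 * most_likely_hours, 2.0 * most_likely_hours

low, high = assigned_interval(40)
print(f"Most likely: 40 h; assigned interval: {low:.0f}-{high:.0f} h")
# Respondents then rate their confidence that the actual effort falls
# inside this interval, rather than constructing the interval themselves.
```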
ACKNOWLEDGEMENTS
This research was supported by grant No. 135854/350 from the Research Council of
Norway to the first author.
Thanks are due to Siri Skåre-Botner and Hege Undem Store for valuable assistance in
conducting, scoring, and analysing Experiment 5.
REFERENCES
Alpert, M., & Raiffa, H. (1982). A progress report on the training of probability advisors. In
D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases
(pp. 294–305). Cambridge: Cambridge University Press.
Arkes, H. R. (2001). Overconfidence in judgmental forecasting. In J. S. Armstrong (Ed.), Principles
of forecasting (pp. 495–515). Boston: Kluwer Academic Publishers.
Bazerman, M. H. (1994). Judgment in managerial decision making, 3rd ed. New York: Wiley.
Budescu, D. V., & Wallsten, T. S. (1995). Processing linguistic probabilities: general principles and
empirical evidence. The Psychology of Learning and Motivation, 32, 275–318.
Buehler, R., Griffin, D., & Ross, M. (1994). Exploring the ‘planning fallacy’: why people under-
estimate their task completion time. Journal of Personality and Social Psychology, 67, 366–381.
Connolly, T., & Dean, D. (1997). Decomposed versus holistic estimates of effort required for
software writing tasks. Management Science, 43, 1029–1045.
Erev, I., Wallsten, T. S., & Budescu, D. V. (1994). Simultaneous over- and under-confidence: the role
of error in judgment processes. Psychological Review, 101, 519–527.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian
theory of confidence. Psychological Review, 98, 506–528.
Grice, H. P. (1975). Logic and conversation. In P. Cole, & J. L. Morgan (Eds.), Syntax and semantics
3: Speech acts. New York: Academic Press.
Hoffrage, U. (2004). Overconfidence. In R. F. Pohl (Ed.), Cognitive illusions: Fallacies and biases in
thinking, judgment, and memory (pp. 235–254). Hove: Psychology Press.
Jørgensen, M. (2004). Increasing realism in assessment of effort estimation uncertainty: it matters
how you ask. IEEE Transactions on Software Engineering, 30, 209–217.
Jørgensen, M., & Teigen, K. H. (2002). Uncertainty intervals versus interval uncertainty: an
alternative method for eliciting effort prediction intervals in software development projects.
Proceedings of International Conference on Project Management (pp. 343–352). Singapore:
ProMAC-2002.
Jørgensen, M., Teigen, K. H., & Moløkken, K. (2004). Better sure than safe? Overconfidence in
judgment based software development effort prediction intervals. Journal of Systems and
Software, 70, 79–93.
Juslin, P., & Persson, M. (2002). PROBabilities from Exemplars (PROBEX): a ‘lazy’ algorithm for
probabilistic inference from generic knowledge. Cognitive Science, 26, 563–607.
Juslin, P., Wennerholm, P., & Olsson, H. (1999). Format dependence in subjective probability
calibration. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1038–
1052.
Kahneman, D. (2003). A perspective on judgment and choice: mapping bounded rationality.
American Psychologist, 58, 697–720.
Kahneman, D., & Tversky, A. (1982). Variants of uncertainty. Cognition, 11, 143–157.
Keren, G. (1991). Calibration and probability judgments: conceptual and methodological issues.
Acta Psychologica, 77, 217–273.
Kerzner, H. (2001). Project management: A systems approach to planning, scheduling, and
controlling. New York: Wiley.
Klayman, J., Soll, J. B., González-Vallejo, C., & Barlas, S. (1999). Overconfidence: it depends on
how, what, and whom you ask. Organizational Behavior and Human Decision Processes, 79, 216–
247.
Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confidence. Journal of Experimental
Psychology: Human Learning and Memory, 6, 107–118.
Liberman, V. (2004). Local and global judgments of confidence. Journal of Experimental Psychol-
ogy: Learning, Memory, and Cognition, 30, 729–732.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities: The state of the
art to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty:
Heuristics and biases (pp. 306–334). Cambridge: Cambridge University Press.
May, R. S. (1986). Inferences, subjective probability and frequency of correct answers: a cognitive
approach to the overconfidence phenomenon. In B. Brehmer, H. Jungermann, P. Lourens, &
G. Sevon (Eds.), New directions in research on decision making. Amsterdam: North Holland.
McClelland, A. G. R., & Bolger, F. (1994). The calibration of subjective probabilities: theories and
models 1980–94. In G. Wright, & P. Ayton (Eds.), Subjective probability (pp. 453–482).
Chichester: John Wiley.
Moder, J. J., Phillips, C. R., & Davis, E. W. (1995). Project management with CPM, PERT and
precedence diagramming. Wisconsin: Blitz Publishing Company.
Mussweiler, T., & Strack, F. (2000). Comparing is believing: a selective accessibility model of
judgmental anchoring. In W. Stroebe, & M. Hewstone (Eds.), European Review of Social
Psychology, 10 (pp. 135–167). Chichester, UK: Wiley.
NASA. (1990). Manager’s handbook for software development. Greenbelt, MD: Goddard Space
Flight Center.
One world – Nations online. (n.d.). Capitals and states of the world – Europe. Retrieved September
9, 2002, from http://www.nationsonline.org/oneworld/capitals_europe.htp.
Russo, J. E., & Schoemaker, P. J. H. (1989). Decision traps: Ten barriers to brilliant decision making
and how to overcome them. New York: Simon and Schuster.
Seaver, D. A., Winterfeldt, D. v., & Edwards, W. (1978). Eliciting subjective probability distribu-
tions on continuous variables. Organizational Behavior and Human Performance, 21, 352–379.
Sniezek, J. A., & Buckley, T. (1991). Confidence depends on level of aggregation. Journal of
Behavioral Decision Making, 4, 263–272.
Soll, J. B. (1996). Determinants of overconfidence and miscalibration: the roles of random error and
ecological structure. Organizational Behavior and Human Decision Processes, 65, 117–137.
Soll, J. B., & Klayman, J. (2004). Overconfidence in interval estimates. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 30, 299–314.
Teigen, K. H., & Brun, W. (2003). Verbal expressions of probability and uncertainty. In D. Hardman,
& L. Macchi (Eds.), Thinking: Psychological perspectives on reasoning, judgment, and decision
making (pp. 125–145). Chichester: Wiley.
Yaniv, I., & Foster, D. P. (1995). Graininess of judgment under uncertainty: an accuracy-
informativeness trade-off. Journal of Experimental Psychology: General, 124, 424–432.
Yaniv, I., & Foster, D. P. (1997). Precision and accuracy of judgmental estimation. Journal of
Behavioral Decision Making, 10, 21–32.
Yaniv, I., & Schul, Y. (1997). Elimination and inclusion procedures in judgment. Journal of
Behavioral Decision Making, 10, 211–220.