Article

The Effects of Averaging Subjective Probability Estimates Between and Within Judges


Abstract

The average probability estimate of J > 1 judges is generally better than its components. Two studies test 3 predictions regarding averaging that follow from theorems based on a cognitive model of the judges and idealizations of the judgment situation. Prediction 1 is that the average of conditionally pairwise independent estimates will be highly diagnostic, and Prediction 2 is that the average of dependent estimates (differing only by independent error terms) may be well calibrated. Prediction 3 contrasts between- and within-subject averaging. Results demonstrate the predictions' robustness by showing the extent to which they hold as the information conditions depart from the ideal and as J increases. Practical consequences are that (a) substantial improvement can be obtained with as few as 2–6 judges and (b) the decision maker can estimate the nature of the expected improvement by considering the information conditions.
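The core claim, that the unweighted average of several judges' probability estimates outscores its components, can be illustrated with a small simulation. This is a sketch under idealized assumptions (every judge reports the true probability plus independent Gaussian error); the parameters are illustrative, not taken from the studies:

```python
import random

def brier(p, outcome):
    """Quadratic (Brier) score of a probability estimate; lower is better."""
    return (p - outcome) ** 2

def simulate(n_events=1000, n_judges=6, true_p=0.7, noise=0.2, seed=0):
    """Compare the Brier score of the averaged estimate against the
    average Brier score of the individual judges, for independent errors."""
    rng = random.Random(seed)
    group_total = individual_total = 0.0
    for _ in range(n_events):
        outcome = 1 if rng.random() < true_p else 0
        # each judge reports the true probability plus independent noise,
        # clipped to the unit interval
        ests = [min(1.0, max(0.0, rng.gauss(true_p, noise)))
                for _ in range(n_judges)]
        group = sum(ests) / n_judges      # simple unweighted average
        group_total += brier(group, outcome)
        individual_total += sum(brier(e, outcome) for e in ests) / n_judges
    return group_total / n_events, individual_total / n_events

group_score, individual_score = simulate()
print(group_score < individual_score)  # True: the average beats its components
```

Because the squared-error score is convex, Jensen's inequality guarantees that the averaged estimate's score never exceeds the judges' mean score on any single event, so the printed comparison holds for any seed.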


... Instead of crowdsourcing a task to multiple people for diverse solutions, it may be possible to self-source [19] the task to the same person in different contexts. There is some evidence this may be possible for concrete tasks [1,7,20]. For example, participants in one study were asked to answer the same question twice: "What percentage of the world's airports are in the United States?" [21]. ...
... When both answers were averaged, the result was more accurate than either answer individually, but still much less accurate than when the two answers were drawn from different participants. This is consistent with work by Ariely et al. [1], which found that averaging multiple guesses from an individual helps somewhat, but not as much as averaging multiple guesses from different people. In this paper we extend these findings from concrete estimation tasks to more complex, creative tasks. ...
... The wisdom of the crowd arises when we aggregate people's diverse, independent perspectives [1]. To create a similar phenomenon within an individual we must work to artificially create diverse, independent experiences for that individual. ...
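The contrast between within- and between-person averaging described in these excerpts can be simulated under a simple assumption: two guesses from one person share that person's systematic bias, while guesses from two people have independent biases. The bias and noise magnitudes below are illustrative:

```python
import random

def avg_abs_error(pairs, truth):
    """Mean absolute error of the two-estimate average."""
    return sum(abs((a + b) / 2 - truth) for a, b in pairs) / len(pairs)

rng = random.Random(1)
truth, n = 50.0, 2000

# within-person: both guesses share the person's systematic bias,
# so averaging cancels only the independent noise
within = []
for _ in range(n):
    bias = rng.gauss(0, 10)
    within.append((truth + bias + rng.gauss(0, 5),
                   truth + bias + rng.gauss(0, 5)))

# between-person: each guess comes from a different person, so the
# biases are independent and averaging cancels them as well
between = [(truth + rng.gauss(0, 10) + rng.gauss(0, 5),
            truth + rng.gauss(0, 10) + rng.gauss(0, 5)) for _ in range(n)]

print(avg_abs_error(between, truth) < avg_abs_error(within, truth))  # True
```

Averaging within a person leaves the shared bias intact, while averaging between people attacks both error components, which is why the inner crowd helps somewhat but the outer crowd helps more.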
Conference Paper
Groups of people tend to generate more diverse ideas than individuals because each group member brings a different perspective to the table. But while someone working alone can suffer from fixation and have difficulty thinking outside the box, in this paper we show that it is possible to help them think more like a group by asking them to approach a problem from different perspectives. We present a study of 54 crowd workers in which some individual workers were asked to assume the roles of various relevant experts while solving a problem. We find that participants who were asked to assume different roles came up with more creative ideas than those who were not. These findings suggest there is an opportunity for problem-solving tools to bring the wisdom of the crowd to individuals.
... "The mean estimate of all participants was 1197 lb; a re-analysis of Galton's notes showed that the correct weight of the ox was 1197 lb, meaning the crowd had perfectly assessed the weight (Wallis 2014). Subsequent work has extended wisdom of the crowd to geopolitical forecasts (Mellers et al. 2014, 2016, 2017; Turner et al. 2014), probability estimates (Ariely et al. 2000; Lee and Danileiko 2014), ordering problems (e.g., the order of U.S. Presidents; Steyvers et al. 2009), forced-choice questions (Bennett et al. 2018), and tasks involving the coordination of multiple pieces of information, such as picking the most efficient path through a predetermined ordering of points (Yi et al. 2012). Furthermore, crowd wisdom has been observed in populations whose cognitive abilities are more limited than those of human adults, including young adolescents (Ioannou et al. 2018) and nonhuman animals (Ioannou 2017). ...
... Remarkably, the benefits of averaging estimates hold even when those estimates come from the same person; this effect is called the wisdom of the inner crowd (see Herzog and Hertwig 2014a, for a review; see Ariely et al. 2000, for boundary conditions on the inner crowd). For example, Vul and Pashler (2008) asked participants eight general knowledge questions, all of which required an estimate of a percentage (e.g., What percentage of the world's airports are in the United States?). ...
Article
Full-text available
We investigated the effect of expertise on the wisdom of crowds. Participants completed 60 trials of a numerical estimation task, during which they saw 50–100 asterisks and were asked to estimate how many stars they had just seen. Experiment 1 established that both inner- and outer-crowd wisdom extended to our novel task: Single responses alone were less accurate than responses aggregated across a single participant (showing inner-crowd wisdom) and responses aggregated across different participants were even more accurate (showing outer-crowd wisdom). In Experiment 2, prior to beginning the critical trials, participants did 12 practice trials with feedback, which greatly increased their accuracy. There was a benefit of outer-crowd wisdom relative to a single estimate. There was no inner-crowd wisdom effect, however; with high accuracy came highly restricted variance, and aggregating insufficiently varying responses is not beneficial. Our data suggest that experts give almost the same answer every time they are asked and so they should consult the outer crowd rather than solicit multiple estimates from themselves.
... Researchers have also shown significant interest in examining how people mentally combine subjective expert uncertainty estimates from multiple sources (e.g., Wallsten et al., 1997; Clemen and Winkler, 1999; Ariely et al., 2000; Budescu, 2005) and how to computationally aggregate estimates from multiple experts (e.g., Fan et al., 2019; Han and Budescu, 2019). When people think a phenomenon has uncertainty, they will commonly look at multiple sources of information to reduce their perceived uncertainty (Greis et al., 2017). ...
... Although the research described above systematically reveals how people reason with uncertainty from multiple sources and the variability within a single source, researchers are less clear about how these findings generalize to multiple types of uncertainty from the same source (e.g., direct quantitative and indirect qualitative uncertainty from a single source). For example, a simple solution for determining the most likely forecasted nighttime low temperature using two competing forecasts is to compute the average mean temperature from both forecasts, a strategy that people commonly use (Ariely et al., 2000; Budescu, 2005). However, we have no such obvious way to mentally combine a forecasted nighttime low temperature with a subjective estimate of the forecaster's confidence, because indirect uncertainty expressed as forecaster confidence has, by definition, no defined numerical value. ...
Article
Full-text available
When forecasting events, multiple types of uncertainty are often inherently present in the modeling process. Various uncertainty typologies exist, and each type of uncertainty has different implications a scientist might want to convey. In this work, we focus on one type of distinction between direct quantitative uncertainty and indirect qualitative uncertainty. Direct quantitative uncertainty describes uncertainty about facts, numbers, and hypotheses that can be communicated in absolute quantitative forms such as probability distributions or confidence intervals. Indirect qualitative uncertainty describes the quality of knowledge concerning how effectively facts, numbers, or hypotheses represent reality, such as evidence confidence scales proposed by the Intergovernmental Panel on Climate Change. A large body of research demonstrates that both experts and novices have difficulty reasoning with quantitative uncertainty, and visualizations of uncertainty can help with such traditionally challenging concepts. However, the question of if, and how, people may reason with multiple types of uncertainty associated with a forecast remains largely unexplored. In this series of studies, we seek to understand if individuals can integrate indirect uncertainty about how “good” a model is (operationalized as a qualitative expression of forecaster confidence) with quantified uncertainty in a prediction (operationalized as a quantile dotplot visualization of a predicted distribution). Our first study results suggest that participants utilize both direct quantitative uncertainty and indirect qualitative uncertainty when conveyed as quantile dotplots and forecaster confidence. In manipulations where forecasters were less sure about their prediction, participants made more conservative judgments. 
In our second study, we varied the amount of quantified uncertainty (in the form of the SD of the visualized distributions) to examine how participants’ decisions changed under different combinations of quantified uncertainty (variance) and qualitative uncertainty (low, medium, and high forecaster confidence). The second study results suggest that participants updated their judgments in the direction predicted by both qualitative confidence information (e.g., becoming more conservative when the forecaster confidence is low) and quantitative uncertainty (e.g., becoming more conservative when the variance is increased). Based on the findings from both experiments, we recommend that forecasters present qualitative expressions of model confidence whenever possible alongside quantified uncertainty.
... One way to enhance the performance of groups is to take into account the decision confidence that accompanies each individual opinion, usually reported by the members themselves [13, 17-20]. For instance, weighting the opinion of each member by their respective confidence [19, 21] makes the group's decision more dependent on individuals who have reported high confidence, which tends to improve accuracy, particularly in the presence of tied decisions. In such cases, ties can be resolved in favour of decisions associated with greater collective confidence. ...
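A minimal sketch of the confidence-weighting idea described above, assuming binary votes coded +1/-1 and self-reported confidences in [0, 1]; the function name and the example values are hypothetical:

```python
def group_decision(votes, confidences):
    """Sign of the confidence-weighted sum of binary votes (+1 / -1).
    An even split that would tie under simple majority is resolved in
    favour of the side reporting higher collective confidence."""
    total = sum(v * c for v, c in zip(votes, confidences))
    return 1 if total > 0 else -1

# four members split 2-2, but the '+1' voters report higher confidence
votes = [1, 1, -1, -1]
confidences = [0.9, 0.8, 0.6, 0.5]
print(group_decision(votes, confidences))  # 1
```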
Preprint
Full-text available
In this paper we present and test collaborative Brain-Computer Interfaces (cBCIs) that can significantly increase both the speed and the accuracy of group decision-making in realistic situations. The key distinguishing features of this work are: (1) our cBCIs combine behavioural, physiological and neural data in such a way as to be able to provide a group decision at any time after the quickest team member casts their vote, but the quality of a cBCI-assisted decision improves monotonically the longer the group decision can wait; (2) we apply our cBCIs to two realistic scenarios of military relevance (patrolling a dark corridor and manning an outpost at night where users need to identify any unidentified characters that appear) in which decisions are based on information conveyed through video feeds; and (3) our cBCIs exploit Event-Related Potentials (ERPs) elicited in brain activity by the appearance of potential threats but, uniquely, the appearance time is estimated automatically by the system (rather than being unrealistically provided to it). As a result of these elements, groups assisted by our cBCIs make both more accurate and faster decisions than when individual decisions are integrated in more traditional manners.
... Another is whether the type of question asked influences how the information is processed, and if so, does it do so equivalently for both types of information? Papers that discuss this topic include Ariely et al. (2000); Juslin, Olsson, and Björkman (1997); Juslin, Winman, and Olsson (2003); and Wallsten (1996). ...
... Checking additivity was simple in the full-scale case but first required conversion to a full scale in the half-scale case. (See Ariely et al., 2000, for how that is done.) ...
... Aggregation can benefit from transforming the original judgements using some predefined function and aggregating the transformed estimates. Various transformation methods have been shown to enhance aggregated forecasting accuracy (Ariely et al., 2000; Baron et al., 2014; Satopää et al., 2014; Mandel, Karvetski & Dhami, 2018; Turner, Steyvers, Merkle, Budescu & Wallsten, 2014). The quantile metric can also be used to evaluate the effectiveness of these transformations. ...
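One widely used transformation of this kind is averaging in log-odds space rather than on the probability scale. The sketch below assumes that simple (non-extremized) variant and clips estimates away from 0 and 1 so the logit is defined:

```python
import math

def mean_log_odds(probs, clip=1e-6):
    """Average probability estimates in log-odds space and map the
    result back to a probability via the logistic function."""
    clipped = [min(1 - clip, max(clip, p)) for p in probs]
    logits = [math.log(p / (1 - p)) for p in clipped]
    mean = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean))

aggregate = mean_log_odds([0.6, 0.7, 0.8])  # lies between 0.6 and 0.8
```

Relative to a plain mean of probabilities, the log-odds average gives more weight to estimates near the ends of the scale, which is one reason such transformations can sharpen the aggregate.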
Article
Full-text available
We propose a new method to facilitate comparison of aggregated forecasts based on different aggregation, elicitation and calibration methods. Aggregates are evaluated by their relative position on the cumulative distribution of the corresponding individual scores. This allows one to compare methods using different measures of quality that use different scales. We illustrate the use of the method by re-analyzing various estimates from Budescu and Du (Management Science, 2007).
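The evaluation idea, locating an aggregate on the empirical cumulative distribution of the individual scores, can be sketched as follows; the helper name and the example scores are illustrative, not the authors' implementation:

```python
def score_quantile(aggregate_score, individual_scores):
    """Position of the aggregate's score on the empirical cumulative
    distribution of the individual scores: the fraction of individuals
    whose score is at or below the aggregate's. For a loss-type score
    (lower is better), a small quantile means the aggregate outperforms
    most individuals."""
    n = len(individual_scores)
    return sum(s <= aggregate_score for s in individual_scores) / n

# illustrative loss scores: the aggregate beats 4 of the 5 individuals
q = score_quantile(0.10, [0.08, 0.12, 0.15, 0.20, 0.25])  # q == 0.2
```

Because the quantile is scale-free, aggregates evaluated under different quality measures become comparable, which is the point of the method.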
... The question arises as to whether one person intentionally changing their answer can improve performance. [12] regarded the collective intelligence obtained by one person making repeated trials as a "crowd within" effect and concluded that it was ineffective as collective intelligence. In contrast, [13] found that a "crowd within" effect can be obtained by repeated trials. ...
... Following this account, people's initial estimates represent samples drawn from an internal distribution of possible estimates, where second estimates are re-sampled guesses from that same distribution (Wallsten et al., 1997; Vul & Pashler, 2008). When second, re-sampled estimates are sufficiently diverse, averaging increases accuracy by canceling out errors across estimates (Ariely et al., 2000; Herzog & Hertwig, 2009; Keck & Tang, 2020; Litvinova et al., 2020). ...
Article
Full-text available
Many decisions rest on people’s ability to make estimates of unknown quantities. In these judgments, the aggregate estimate of a crowd of individuals is often more accurate than most individual estimates. Remarkably, similar principles apply when multiple estimates from the same person are aggregated, and a key challenge is to identify strategies that improve the accuracy of people’s aggregate estimates. Here, we present the following strategy: Combine people’s first estimate with their second estimate, made from the perspective of someone they often disagree with. In five preregistered experiments (N = 6,425 adults; N = 53,086 estimates) with populations from the United States and United Kingdom, we found that such a strategy produced accurate estimates (compared with situations in which people made a second guess or when second estimates were made from the perspective of someone they often agree with). These results suggest that disagreement, often highlighted for its negative impact, is a powerful tool in producing accurate judgments.
... In the current contexts, however, the collective structure of the decision did not protect candidates from an unfairly biased assessment, regardless of the strategy used to reach a final grade. In experiment 1, we find evidence for the generosity-erosion effect despite the evaluators using a scoring-average strategy, commonly linked to improvements in judgment (22). In the same vein, we find evidence for the narrow-bracketing effect even with 10 evaluators and a grading strategy that included two steps. ...
Article
Full-text available
In recruitment processes, candidates are often judged one after another. This sequential procedure affects the outcome of the process. Here, we introduce the generosity-erosion effect, which states that evaluators might be harsher in their assessment of candidates after grading previous candidates generously. Generosity is defined as giving a candidate the lowest possible grade required to progress in the hiring process. Analyzing a high-stake hiring process, we find that for each candidate graded generously, the probability for subsequent candidates to pass decreased by 7.7% (experiment 1; N = 11,281). Testing the boundary conditions of the generosity-erosion effect, we explore a hiring process that, in contrast to the previous process, was very selective, because candidates were more likely to fail than to pass. In this scenario, no evidence is found for the generosity-erosion effect (experiment 2; N = 3171). Practical implications and mechanisms underlying the generosity-erosion effect are further discussed.
... This phenomenon has been replicated many times in a variety of different judgment task domains (Larrick & Soll, 2006; Surowiecki, 2004). Ariely et al. (2000) showed that, assuming pairwise conditional independence and random individual error distributions (although rare in many decision contexts), the average of J probability estimates (J = the number of estimators) will always be better than any of the component individual estimates, and as J increases, the average will tend toward perfect calibration and diagnosticity (accurate representation of the true state of affairs), even when information provided to the various estimators is less than optimal. In addition, Johnson, Budescu, and Wallsten (2001) empirically showed the accuracy of the average probability estimate to be robust over several conditions, even when individual estimates were not independent. ...
Preprint
Full-text available
In contemporary organizations, many—if not most—teams work on cognitive or information processing tasks (Hinsz, Tindale, & Vollrath, 1997). The past 50 years of research have taught us much about how information is accessed, created, attended to, and processed as teams attempt to complete various tasks. However, many of the information processing effects that have been observed are task specific, yet little research has focused specifically on tasks and how their information processing requirements differ. In this chapter, we discuss how task differences can impact how teams use and process information and how different information distribution patterns across members might impact performance. In addition, we address how constraints on the amount and type of interactions among the team members influences performance in different task domains. We hope our discussion demonstrates the importance of task differences for understanding team information processing and highlights where greater research focus will be fruitful.
... In some consensus-based MCDA workshops, the group is presented with a proposed aggregate of the individual appraisals to support and expedite its quest for a consensus. For instance, the mean/median is often seen as a good estimator of the unknown 'optimal' value (called 'wisdom of crowds' by Surowiecki 2005; see also Ariely et al. 2000, Sunstein 2005), from which the group should only diverge with good reasons. It is also possible to have more sophisticated proposed aggregates that consider fairness concerns among group members or uneven influence of participants on the collective decision based on positional power or expertise (e.g. ...
Preprint
Full-text available
In multi-criteria decision analysis workshops, participants often appraise the options individually before discussing the scoring as a group. The individual appraisals lead to score ranges within which the group then seeks the necessary agreement to identify their preferred option. Preference programming enables some options to be identified as dominated even before the group agrees on a precise scoring for them. Workshop participants usually face time pressure to make a decision. Decision support can be provided by flagging options for which further agreement on their scores seems particularly valuable. By valuable, we mean the opportunity to identify other options as dominated (using preference programming) without having their precise scores agreed beforehand. The present paper quantifies this Value of Agreement and extends the concept to portfolio decision analysis and criterion weights. The new concept is validated through a case study in recruitment.
... For instance, a weight of 0.4 means that the decision should be based on 40% quantitative thinking and 60% qualitative thinking (e.g., experience, common sense, logic or observations). Analogous to the weights, and again following the 'wisdom-of-the-crowd' thinking (Ariely et al. 2000, Surowiecki 2005), the geometric mean of the participants' data support estimates is later assigned to the decision. ...
Preprint
Full-text available
The popularity of business intelligence (BI) systems to support business analytics has tremendously increased in the last decade. The determination of data items that should be stored in the BI system is vital to ensure the success of an organisation's business analytic strategy. Expanding conventional BI systems often leads to high costs of internally generating, cleansing and maintaining new data items, whilst the additional data storage costs are in many cases of minor concern, which is a conceptual difference from big data systems. Thus, potential additional insights resulting from a new data item in the BI system need to be balanced with the often high costs of data creation. While the literature acknowledges this decision problem, no model-based approach to inform this decision has hitherto been proposed. The present research describes a prescriptive framework to prioritise data items for business analytics and applies it to human resources. To achieve this goal, the proposed framework captures core business activities in a comprehensive process map and assesses their relative importance and possible data support with multi-criteria decision analysis.
... There is evidence that, under the assumption of mutual independence of judgements, estimates made from the opinions of a crowd of N judges can be accurate (Galton 1907; Wallis 2014). This methodology has always attracted the interest of statisticians, but experimental and quasi-experimental research on this topic surged after 2000 (Ariely et al. 2000; Soll and Larrick 2009; Müller-Trede et al. 2018). However, we have to highlight major differences in core assumptions between experimental methodology and crowd rating information systems (observational methodology): ...
Article
Full-text available
Crowd rating is a continuous and public process of data gathering that allows the display of general quantitative opinions on a topic from anonymous online networks, which behave as crowds. Online platforms have leveraged these technologies to improve predictive tasks in marketing. However, we argue for a different employment of crowd rating as a tool of public utility to support social contexts suffering from adverse selection, like tourism. This aim requires dealing with issues in both the method of measurement and the analysis of data, and with common biases associated with public disclosure of rating information. We propose an evaluative method to investigate the fairness of common measures of rating procedures, with the particular perspective of assessing the linearity of the ranked outcomes. This is tested on a longitudinal observational case of 7 years of customer satisfaction ratings, for a total of 26,888 reviews. According to the results obtained from the sampled dataset, analysed with the proposed evaluative method, there is a trade-off between the loss of (potentially) biased information on ratings and the fairness of the resulting rankings. However, when an ad hoc unbiased ranking is computed, the ranking outcome obtained through the time-weighted measure is not significantly different from that unbiased case.
... An important prerequisite for more applied research on WoC in clinical decision making is to identify real-life choice contexts that fulfil the key assumptions of WoC and thus provide a suitable decision environment to enable successful WoC applications. The main assumption outlined in the Introduction is a diverse crowd of decision makers with independent judgement errors [18, 45]. Below, we discuss this assumption in the context of three common prescribing contexts: outpatient consultations, hospital ward rounds, and multidisciplinary team meetings. ...
Article
Full-text available
Antibiotic overprescribing is a global challenge contributing to rising levels of antibiotic resistance and mortality. We test a novel approach to antibiotic stewardship. Capitalising on the concept of “wisdom of crowds”, which states that a group’s collective judgement often outperforms the average individual, we test whether pooling treatment durations recommended by different prescribers can improve antibiotic prescribing. Using international survey data from 787 expert antibiotic prescribers, we run computer simulations to test the performance of the wisdom of crowds by comparing three data aggregation rules across different clinical cases and group sizes. We also identify patterns of prescribing bias in recommendations about antibiotic treatment durations to quantify current levels of overprescribing. Our results suggest that pooling the treatment recommendations (using the median) could improve guideline compliance in groups of three or more prescribers. Implications for antibiotic stewardship and the general improvement of medical decision making are discussed. Clinical applicability is likely to be greatest in the context of hospital ward rounds and larger, multidisciplinary team meetings, where complex patient cases are discussed and existing guidelines provide limited guidance.
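The pooling rule the authors found effective, taking the median of the recommended treatment durations, is straightforward to sketch; the example recommendations are hypothetical:

```python
import statistics

def pooled_recommendation(durations):
    """Pool prescribers' recommended treatment durations (in days) with
    the median, which is robust to a single extreme recommendation."""
    return statistics.median(durations)

# hypothetical recommendations from five prescribers; one is extreme
recs = [5, 7, 7, 8, 21]
print(pooled_recommendation(recs))  # 7
print(sum(recs) / len(recs))        # 9.6: the mean is pulled up by the outlier
```

The contrast with the mean shows why the median suits a setting with a known overprescribing bias: a single extreme recommendation cannot drag the pooled duration upward.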
... In many situations, it is possible to have access to several probabilistic forecasts of the same event (Clemen 1989; Graham 1996; Ariely et al. 2000; Winkler and Poses 1993). As these forecasts might be provided by independent models, nonnegligible differences can be observed. ...
Article
In this paper, a new model for the combination of two or more probabilistic forecasts is presented. The proposed combination model is based on a logit transformation of the underlying initial forecasts involving interaction terms. The combination aims at approximating the ideal calibration of the forecasts which is shown to be calibrated and to maximize the sharpness. The proposed combination model is applied to two precipitation forecasts, Ensemble-MOS and RadVOR, which were developed by Deutscher Wetterdienst. The proposed combination model shows significant improvements in various forecast scores for all considered lead times compared to both initial forecasts. In particular, the proposed combination model is calibrated, even if both initial forecasts are not calibrated. It is demonstrated that the method enables a seamless transition between both initial forecasts across several lead times to be created. Moreover, the method has been designed in such a way that it allows for fast updates in nearly real time.
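A minimal sketch of this kind of combination model, assuming a logistic form over the two forecasts' logits with an interaction term; the coefficients shown are illustrative placeholders, not fitted values from the paper:

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(1 - eps, max(eps, p))
    return math.log(p / (1 - p))

def combine(p1, p2, b0, b1, b2, b12):
    """Combine two probabilistic forecasts via a logistic model on their
    logits, including an interaction term. The coefficients b0..b12 are
    placeholders that would be fitted to verifying observations."""
    z = b0 + b1 * logit(p1) + b2 * logit(p2) + b12 * logit(p1) * logit(p2)
    return 1 / (1 + math.exp(-z))

# illustrative, unfitted coefficients
combined = combine(0.3, 0.5, b0=0.0, b1=0.6, b2=0.5, b12=0.05)
```

In practice the coefficients would be estimated from forecast-observation pairs, which is how the combination can end up calibrated even when neither input forecast is.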
... the robust averaging strategy, relative to the more fickle maximizing strategy, can boost the accuracy of confidence judgments while requiring less knowledge about the kindness and wickedness of the items the decision maker faces (Ariely et al., 2000, Journal of Experimental Psychology: Applied, 6, 130-147). ...
... Figure 17. Calibration curve for probability assessment from Ariely et al. (2000). The horizontal axis shows the assessed subjective probability and the vertical axis shows the average frequency with which the corresponding events or outcomes occurred. ...
Presentation
Full-text available
Companion whitepaper to the 27th Buchanan Lecture, presented at Texas A&M University, College Station, October 18, 2019.
... The exact number of forecasters needed to improve the accuracy of a prediction is still debated, but there appears to be a "Goldilocks zone" between having too few and too many forecasters. Multiple advantages have been highlighted from applying the principles of "the wisdom of the crowd" and aggregating forecasts, such as (a) maximizing the amount of information available to craft a judgment, (b) reducing the potential impact of an extreme source of information that may be unreliable (Ariely et al., 2000; Johnson et al., 2001), and (c) increasing the credibility and validity of the aggregation process (Wallsten and Diederich, 2001). ...
Article
Full-text available
Coaches are faced with the difficult task of identifying and selecting athletes for their team. Despite its widespread practice in sport, there is still much to learn about improving the identification and selection process. Evidence to date suggests selection decisions (at different competitive levels) can be inaccurate, bias driven, and sometimes even illogical. These mistakes are believed to contribute to "talent wastage," the effect of a coach's wrongful selection and/or deselection of an athlete to/from a team. Errors of this scale can lead to negative repercussions for all stakeholders involved and therefore deserve further exploration. It is the purpose of this paper to shed light on the potential factors influencing talent wastage and to illuminate possible psychological pitfalls when making decisions under uncertainty.
... Compounding this problem, participants generally express their views as numerical values on linear scales. Studies have shown that people are nonlinear thinkers and that participants have different nonlinearities in the internal rating scales they employ [21][22][23]. This means the underlying data used by traditional sampling methods can be highly distorted, tracking numerical values that appear similar on the surface, but mean different things to different respondents. ...
Chapter
Swarm Intelligence (SI) is a natural phenomenon that enables social species to quickly converge on optimized group decisions by interacting as real-time closed-loop systems. This process, which has been shown to amplify the collective intelligence of biological groups, has been studied extensively in schools of fish, flocks of birds, and swarms of bees. This paper provides an overview of a new collaboration technology called Artificial Swarm Intelligence (ASI) that brings the same benefits to networked human groups. Sometimes referred to as "human swarming" or building "hive minds," the process involves groups of networked users being connected in real-time by AI algorithms modeled after natural swarms. This paper presents the basic concepts of ASI and reviews recently published research that shows its effectiveness in amplifying the collective intelligence of human groups, increasing accuracy when groups make forecasts, generate assessments, reach decisions, and form predictions. Examples include significant performance increases when human teams generate financial predictions, business forecasts, subjective judgments, and medical diagnoses.
... Similarly, if a single individual is tasked with retrieving and making corresponding confidence judgments several times, rather than just once, the answer obtained by combining the different retrieval events can be more accurate than any of the single retrieval events, as Herzog and Hertwig (2014) have demonstrated. Fraundorf and Benjamin (2014) estimated that the advantage gained from the inner crowd is about 1/10 that of using different people, and the effects can sometimes be small (Ariely et al., 2000). Even so, the inner crowd findings demonstrate that people may have untapped knowledge about the accuracy of their retrievals. ...
Article
Full-text available
In five experiments, we examined the conditions under which participants remembered true and false information given as feedback. Participants answered general information questions, expressed their confidence in the correctness of their answers, and were given true or false feedback. In all five experiments, participants hypercorrected when they had made a mistake; that is, they remembered better the correct feedback to errors made with high compared to low confidence. However, in none of the experiments did participants hyper'correct' when false feedback followed an initially correct response. Telling people whether the feedback was right or wrong made little difference, suggesting that people already knew whether the feedback was true or false and differentially encoded the true feedback compared to the false feedback. An exception occurred when false feedback followed an error: participants hyper'corrected' to this false feedback, suggesting that when people are wrong initially, they are susceptible to further incorrect information. These results indicate that people have some kind of privileged access to whether their answers are right or wrong, above and beyond their confidence ratings, and that they behave differently when trying to remember new “corrective” information depending upon whether they, themselves, were right or wrong initially. The likely source of this additional information is knowledge about the truth of the feedback, which they rapidly process and use to modulate memory encoding. Electronic supplementary material The online version of this article (10.1186/s41235-019-0153-8) contains supplementary material, which is available to authorized users.
... Crowd analysis (the method of pooling independent decisions from multiple viewers) has been shown to improve face matching performance in previous work (White, Burton, Kemp, & Jenkins, 2013; Jeckeln, Hahn, Noyes, Cavazos, & O'Toole, 2018). Given the generality of crowd effects (Galton, 1907; Ariely et al., 2000; Surowiecki, 2004), there are good reasons to expect that crowd analysis could improve identification accuracy for disguised faces too. ...
Article
Facial image comparison is difficult for unfamiliar faces and easy for familiar faces. Those conclusions are robust, but they arise from situations in which the people being identified cooperate with the effort to identify them. In forensic and security settings, people are often motivated to subvert identification by manipulating their appearance, yet little is known about deliberate disguise and its effectiveness. We distinguish two forms of disguise-Evasion (trying not to look like oneself) and Impersonation (trying to look like another person). We present a new set of disguised face images (the FAçADE image set), in which models altered their appearance to induce specific identification errors. In Experiment 1, unfamiliar observers were less accurate matching disguise items, especially evasion items, than matching undisguised items. A similar pattern held in Experiment 2, in which participants were informed about the disguise manipulations. In Experiment 3, familiar observers saw through impersonation disguise, but accuracy was lower for evasion disguise. Quantifying the performance cost of disguise reveals distinct performance profiles for impersonation and evasion. Evasion disguise was especially effective and reduced identification performance for familiar observers as well as for unfamiliar observers. We subsume these findings under a statistical framework of face learning.
... For example, Ariely et al. (2000) demonstrate that, under certain statistical conditions, the average of a group's probability estimates will always be superior to the individual estimate of any of its members, an argument corroborated by many empirical studies carried out under more realistic conditions. ...
Article
Full-text available
Important organizational decisions are commonly made in groups. In financial institutions, credit approvals are often decided by committees. In government, fundamental decisions, such as setting the economy's base interest rate, are likewise made collegially. In large private companies, the board of directors sits at the top of the organizational hierarchy and has the final word on investment, financing, and merger-and-acquisition strategies. Yet the peculiarities of group decision making are largely ignored in the finance literature, which effectively treats the collective as if it were an individual. For example, research in behavioral corporate finance emphasizes the cognitive processes and biases of the individual decision maker and pays little attention to how those processes interact to produce the group's decision (for a review of this literature, see Baker & Wurgler, 2013). In this essay, I present, concisely and selectively, the current state of the emerging multidisciplinary discussion on decision making in small groups, with emphasis on its behavioral aspects. First, I address the advantages and difficulties of group decision making compared with individual decision making. I then present recent contributions showing how the quality of a group's decision depends on context and how small changes in the decision environment can have significant consequences.
... A great deal of research on the phenomenon known as "wisdom-of-the-crowd" finds that more accurate outcomes result from aggregating the decisions of multiple decision-makers. Further, recent research into this phenomenon has demonstrated improved accuracy of decision-making when the decisions of facial examiners and facial recognition algorithms are amalgamated. An open evidence system (based on an OSF-like framework) could be implemented to aggregate the decisions made by multiple experts (blind to each other's decisions), as well as computer-based algorithms, in an effort to increase the accuracy of decisions. ...
Preprint
Full-text available
Both science and expert evidence law are undergoing significant changes. In this article, the authors compare these two movements – the open science movement and the evidence-based evidence movement. The open science movement is the recent discovery of many irreproducible findings in science and the subsequent move towards more transparent methods. The evidence-based evidence movement is the discovery that many forms of expert evidence are unreliable and that they have contributed to wrongful convictions. The authors identify many similarities between these movements, including misaligned incentives, cognitive bias, and too much weight accorded to eminence. These similarities suggest several ways in which courts and legal actors may learn from the open science movement. Expert witnesses should comport themselves as rigorous open scientists. Parties should be subjected to more specific and rigorous disclosure requirements because research has shown that even leading scientists find it easy to discount and suppress findings that do not support their hypotheses. And trial judges, as gatekeepers, should not defer to the generally accepted practices that have proven insufficient in the mainstream sciences. The authors end with a proposal for systemic reforms designed to further the ideal of open justice.
... While a "60% chance" has a precise mathematical meaning, in a seminal study, Lichtenstein and Newman (1967) found that "likely" was interpreted to mean anything from 25% to 99%. Second, the subjective interpretations of numeric probabilities are more context-dependent than verbal probabilities, which are processed more intuitively (Bilgin & Brenner, 2013; Teigen, 2001; Teigen & Brun, 1995, 1999, 2000; Windschitl & Weber, 1999; Windschitl & Wells, 1996). It is easier to evaluate verbal probabilities as a positive or negative sign than it is for numeric probabilities. ...
Article
Understanding decision making under uncertainty is crucial for researchers in the social sciences, policymakers, and anyone trying to make sense of another’s (or their own) choices. In this dissertation, my coauthors and I make three contributions to understanding preferences for uncertainty regarding (a) how preferences are measured, (b) how these preferences may (or may not) manifest in a consequential real-world context, and (c) how different types of advice influence opinions about uncertain events. In Chapter 1, we examine methods that researchers use to study preferences for uncertainty. We find that the presence of uncertainty is often confounded with the presence of “weird” transaction features, dramatically overstating the presence of uncertainty aversion in these experiments. In Chapter 2, we show that extreme uncertainty does not exist in the context of corporate experimentation, despite speculation by pundits and researchers. In fact, people judge experiments similarly to how they would judge simple gambles, with the experiment being judged near the “expected value” of the policies it implements. In Chapter 3, we find that the format in which uncertainty is presented impacts how people combine forecasts from multiple sources. Numeric probability forecasts are averaged, while verbal forecasts are combined additively, with people making more extreme judgments as they see additional forecasts.
... The fact that initially the mean opinion is closer to truth than the individual opinions corresponds to the wisdom of the crowd effect. The wisdom of the crowd is a statistical effect stating that averaging over several independent judgments yields a more accurate evaluation than most of the individual judgments would (see the early ox experiment by Galton in 1907 [40] or more recent work [41]). Since the aggregate performance does not consistently improve over rounds, it can be said that social influence does not consistently promote the wisdom of the crowd. ...
... Recent research proposes that the same principle applies to repeated judgements from the same person [14]. Laboratory experiments confirm that estimation accuracy can indeed be improved by aggregating estimates from a single individual [16,30–35]. The benefit of within-person aggregation reflects what has been dubbed the wisdom of the inner crowd, and can potentially boost the quality of individual decision making [36]. ...
Article
Full-text available
The quality of decisions depends on the accuracy of estimates of relevant quantities. According to the wisdom of crowds principle, accurate estimates can be obtained by combining the judgements of different individuals. This principle has been successfully applied to improve, for example, economic forecasts, medical judgements and meteorological predictions. Unfortunately, there are many situations in which it is infeasible to collect judgements of others. Recent research proposes that a similar principle applies to repeated judgements from the same person. This paper tests this promising approach on a large scale in a real-world context. Using proprietary data comprising 1.2 million observations from three incentivized guessing competitions, we find that within-person aggregation indeed improves accuracy and that the method works better when there is a time delay between subsequent judgements. However, the benefit pales against that of between-person aggregation: the average of a large number of judgements from the same person is barely better than the average of two judgements from different people.
... Bonner & Baumann, 2012). This finding supports research that points, for example, to the potential importance of error correction in groups (Orlitzky & Hirokawa, 2001), to the potential benefits of member discussion and knowledge aggregation (Lu, Yuan, & McLeod, 2011; Van Swol, 2007), and to the intrinsic benefits of member preference aggregation (Ariely et al., 2000). ...
... A large body of research in judgment and decision making rests on a premise that is elegant in its simplicity: namely, that the combined estimates of multiple individuals are more accurate than those made alone (Ariely et al. 2000, Clemen 1989, Hogarth 1978, Larrick and Soll 2006, Lorge et al. 1958, Makridakis and Winkler 1983, Simmons et al. 2011, Surowiecki 2004. This notion that "two heads are better than one" frequently guides how corporations, groups, and families go about making decisions. ...
Article
We evaluate the effect of discussion on the accuracy of collaborative judgments. In contrast to prior research, we show that discussion can either aid or impede accuracy relative to the averaging of collaborators’ independent judgments, as a systematic function of task type and interaction process. For estimation tasks with a wide range of potential estimates, discussion aided accuracy by helping participants prevent and eliminate egregious errors. For estimation tasks with a naturally bounded range, discussion following independent estimates performed on par with averaging. Importantly, if participants did not first make independent estimates, discussion greatly harmed accuracy by limiting the range of considered estimates, independent of task type. Our research shows that discussion can be a powerful tool for error reduction, but only when appropriately structured: Decision makers should form independent judgments to consider a wide range of possible answers, and then use discussion to eliminate extremely large errors. Data and the online appendix are available at https://doi.org/10.1287/mnsc.2017.2823 . This paper was accepted by Yuval Rottenstreich, judgment and decision making.
... As a statistical rule of thumb, pooling information across independent individuals leads to more reliable information [63,64]. For example, pooling unbiased but noisy numerical estimates causes uncorrelated errors to cancel out and therefore increases the precision of the pooled estimate (see appendix B1). ...
Article
Full-text available
We review the literature to identify common problems of decision-making in individuals and groups. We are guided by a Bayesian framework to explain the interplay between past experience and new evidence, and the problem of exploring the space of hypotheses about all the possible states that the world could be in and all the possible actions that one could take. There are strong biases, hidden from awareness, that enter into these psychological processes. While biases increase the efficiency of information processing, they often do not lead to the most appropriate action. We highlight the advantages of group decision-making in overcoming biases and searching the hypothesis space for good models of the world and good solutions to problems. Diversity of group members can facilitate these achievements, but diverse groups also face their own problems. We discuss means of managing these pitfalls and make some recommendations on how to make better group decisions.
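The statistical rule of thumb cited above, that pooling independent unbiased estimates cancels uncorrelated errors, is easy to verify numerically. In this sketch the group sizes, noise level, and trial count are illustrative; the root-mean-square error of the pooled estimate should shrink roughly as the noise divided by the square root of group size.

```python
import random
import statistics

random.seed(0)
TRUTH = 100.0     # true value of the quantity
NOISE_SD = 20.0   # standard deviation of each judge's independent error

def pooled_error(j, trials=5_000):
    """RMS error of the mean of j independent, unbiased estimates."""
    errs = []
    for _ in range(trials):
        estimates = [random.gauss(TRUTH, NOISE_SD) for _ in range(j)]
        errs.append(statistics.fmean(estimates) - TRUTH)
    return statistics.fmean(e * e for e in errs) ** 0.5

for j in (1, 2, 6, 25):
    # error falls roughly as NOISE_SD / sqrt(j)
    print(j, round(pooled_error(j), 2))
```

Note that most of the gain arrives with the first few judges, consistent with the source article's finding that two to six judges already yield substantial improvement.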
... To address this problem, Ranjan and Gneiting (2010) propose a simple scheme that extremizes the linear opinion pool by pushing its forecasts closer to the nearest extreme of either zero or one. Many others have employed similar schemes to extremize the linear opinion pool (Karmarkar 1978, Erev et al. 1994, Ariely et al. 2000, Shlomi and Wallsten 2010, Turner et al. 2014, Baron et al. 2014). Baron et al. (2014, p. 134) note that "If every forecaster said 0.6, and they were using different information, then someone who knew all of this would have a right to much higher confidence." ...
Article
Full-text available
Many organizations face critical decisions that rely on forecasts of binary events---events such as whether a borrower will default on a loan or not. In these situations, organizations often gather forecasts from multiple experts or models. This raises the question of how to aggregate the forecasts. Because linear combinations of probability forecasts are known to be underconfident, we introduce a class of aggregation rules, or Bayesian ensembles, that are non-linear in the experts' probabilities. These ensembles are generalized additive models of experts' probabilities. These models have three key properties. They are coherent, i.e., consistent with the Bayesian view. They can aggregate calibrated (or miscalibrated) forecasts. And they are often more extreme, and therefore more confident, than the commonly used linear opinion pool. Empirically, we demonstrate that our ensemble can be easily fit to real data using a generalized linear modeling framework. We use this framework to aggregate several forecasts of binary events in two publicly available datasets. The forecasts come from several leading statistical and machine learning algorithms. Our Bayesian ensemble offers an improvement out-of-sample over the linear opinion pool and over any one of the individual algorithms considered.
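One simple non-linear aggregation in this spirit is a power transform of the pooled probability in odds space. This is a generic sketch, not the specific Bayesian ensemble proposed in the paper above; the exponent `a` is an illustrative assumption and would normally be fit to data.

```python
def linear_pool(probs):
    """Linear opinion pool: the simple average of probability forecasts."""
    return sum(probs) / len(probs)

def extremize(p, a=2.0):
    """Odds-space power transform; a > 1 pushes p away from 1/2."""
    return p ** a / (p ** a + (1 - p) ** a)

forecasts = [0.6, 0.6, 0.6]       # judges agree, but use different information
pooled = linear_pool(forecasts)   # 0.6: underconfident if signals are independent
print(extremize(pooled))          # about 0.69: more confident than the pool
```

This captures the intuition quoted from Baron et al.: if every forecaster independently arrives at 0.6, someone who knew all their evidence could justifiably be more confident than 0.6.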
Article
Full-text available
The accuracy of human forecasters is often reduced because of incomplete information and cognitive biases that affect the judges. One approach to improve the accuracy of the forecasts is to recalibrate them by means of non-linear transformations that are sensitive to the direction and the magnitude of the biases. Previous work on recalibration has focused on binary forecasts. We propose an extension of this approach by developing an algorithm that uses a single free parameter to recalibrate complete subjective probability distributions. We illustrate the approach with data from the quarterly Survey of Professional Forecasters (SPF) conducted by the European Central Bank (ECB), document the potential benefits of this approach, and show how it can be used in practical applications.
Article
Many decisions rest on people’s ability to make estimates of unknown quantities. In these judgments, the aggregate estimate of a crowd of individuals is often more accurate than most individual estimates. Remarkably, similar principles apply when multiple estimates from the same person are aggregated, and a key challenge is to identify strategies that improve the accuracy of people’s aggregate estimates. Here, we present the following strategy: Combine people’s first estimate with their second estimate, made from the perspective of someone they often disagree with. In five preregistered experiments (N = 6,425 adults; N = 53,086 estimates) with populations from the United States and United Kingdom, we found that such a strategy produced accurate estimates (compared with situations in which people made a second guess or when second estimates were made from the perspective of someone they often agree with). These results suggest that disagreement, often highlighted for its negative impact, is a powerful tool in producing accurate judgments.
Chapter
What distinguishes collective intelligence from the best individual members is the amount of information and the degree of complexity that can be handled. Even if the distribution of answers is distorted, collective intelligence will, with high probability, reach and even exceed the best member. Using professional crowds as the source of collective intelligence almost always produces a better outcome than the best members. If no expert can be identified, majority voting can serve as a convenient approximation.
Article
Many organizations combine forecasts of probabilities of binary events to support critical business decisions, such as the approval of credit or the recommendation of a drug. To aggregate individual probabilities, we offer a new method based on Bayesian principles that can help identify why and when combined probabilities need to be extremized. Extremizing is typically viewed as shifting the average probability farther from one half; we emphasize that it is more suitable to define extremizing as shifting it farther from the base rate. We introduce the notion of antiextremizing, cases in which it might be beneficial to make average probabilities less extreme. Analytically, we find that our Bayesian ensembles often extremize the average forecast but sometimes antiextremize instead. On several publicly available data sets, we demonstrate that our Bayesian ensemble performs well and antiextremizes anywhere from 18% to 73% of the cases. Antiextremizing is required more often when there is bracketing with respect to the base rate among the probabilities being aggregated than with no bracketing.
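Extremizing away from the base rate, rather than away from one half, can be sketched as a shift in log-odds space. The function name, exponent values, and numbers below are illustrative assumptions, not the Bayesian ensemble of the paper above; an exponent above 1 extremizes relative to the base rate, and an exponent below 1 antiextremizes.

```python
import math

def logit(p):
    """Probability to log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Log-odds back to probability."""
    return 1 / (1 + math.exp(-x))

def shift_vs_base(p, base_rate, a):
    """Scale the log-odds distance between p and the base rate by a:
    a > 1 pushes p farther from the base rate, a < 1 pulls it back."""
    b = logit(base_rate)
    return inv_logit(b + a * (logit(p) - b))

base = 0.2
pooled = 0.35                              # average forecast, above the base rate
print(shift_vs_base(pooled, base, 1.5))    # extremized: pushed farther above 0.2
print(shift_vs_base(pooled, base, 0.7))    # antiextremized: pulled back toward 0.2
```

Defining extremizing relative to the base rate rather than one half matters whenever the base rate is far from 0.5: with a 20% base rate, moving an average forecast of 0.35 toward 0.5 would actually be extremizing, not moderating.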
Article
Full-text available
The unprecedented rise of mis- and disinformation surrounding conflicts around the globe poses an imminent threat to humanitarian assistance and conflict resolution. This paper reflects on previous approaches to harness the so-called "wisdom of the crowd" by collecting reliable and real-time information from primary sources during crises. Using insights from game theory, we present the design of a novel crowdsourcing platform thought to increase reliability and objectivity of conflict-related information. In contrast to previous approaches, our design incorporates a mechanism that sets incentives for citizen reporters to tell the truth disclosing privately held information as accurately as possible. Moreover, the proposed platform offers the possibility to certify news based on a multi-stage review process and directly connects reporters and media outlets via an integrated marketplace. By enabling media outlets to access original and verified information from primary sources, the platform is anticipated to become an attractive sourcing tool for media outlets that otherwise would purchase news from secondary sources. We argue that a platform with these design features can help counter the spread of fake news on conflicts and thereby contribute to more effective humanitarian assistance and peacebuilding.
Article
Full-text available
The spread of rumors and the misunderstanding of the reality of conflicts around the world pose an unprecedented threat to the delivery of humanitarian aid to those affected and to finding solutions to conflicts. Accordingly, this study reviews previous approaches to using the so-called "wisdom of the crowd" to obtain information from its primary sources during crises. We present our vision for a new crowdsourcing platform based on insights from game theory, with the aim of increasing the reliability and objectivity of conflict-related information. Unlike other platforms, ours offers a mechanism that incentivizes eyewitnesses to tell the truth as accurately as possible. The platform also enables the verification of information through a multi-stage review process and connects reporters directly with media outlets via integrated applications. By giving media outlets access to verified information from its primary source, the platform is expected to become an important sourcing tool for media outlets, instead of their purchasing news from secondary sources. We further argue that a platform with these features can halt the spread of rumors about conflicts, thereby contributing to peacebuilding and to the delivery of humanitarian aid to those who deserve it. Keywords: crowdsourcing, violent conflicts, misleading news, journalism, game theory.
Article
Full-text available
In this paper we present, and test in two realistic environments, collaborative Brain-Computer Interfaces (cBCIs) that can significantly increase both the speed and the accuracy of perceptual group decision-making. The key distinguishing features of this work are: (1) our cBCIs combine behavioural, physiological and neural data in such a way as to be able to provide a group decision at any time after the quickest team member casts their vote, but the quality of a cBCI-assisted decision improves monotonically the longer the group decision can wait; (2) we apply our cBCIs to two realistic scenarios of military relevance (patrolling a dark corridor and manning an outpost at night where users need to identify any unidentified characters that appear) in which decisions are based on information conveyed through video feeds; and (3) our cBCIs exploit Event-Related Potentials (ERPs) elicited in brain activity by the appearance of potential threats but, uniquely, the appearance time is estimated automatically by the system (rather than being unrealistically provided to it). As a result of these elements, in the two test environments, groups assisted by our cBCIs make both more accurate and faster decisions than when individual decisions are integrated in more traditional manners.
Article
How do we combine others’ probability forecasts? Prior research has shown that when advisors provide numeric probability forecasts, people typically average them (i.e., they move closer to the average advisor’s forecast). However, what if the advisors say that an event is “likely” or “probable?” In eight studies (n = 7,334), we find that people are more likely to act as if they “count” verbal probabilities (i.e., they move closer to certainty than any individual advisor’s forecast) than they are to “count” numeric probabilities. For example, when the advisors both say an event is “likely,” participants will say that it is “very likely.” This effect occurs for both probabilities above and below 50%, for hypothetical scenarios and real events, and when presenting the others’ forecasts simultaneously or sequentially. We also show that this combination strategy carries over to subsequent consumer decisions that rely on advisors’ likelihood judgments. We discuss and rule out several candidate mechanisms for our effect. This paper was accepted by Yuval Rottenstreich, decision analysis.
Article
Prior research suggests that averaging two guesses from the same person can improve quantitative judgments, a phenomenon known as the “wisdom of the inner crowd.” In this article, we find that this effect hinges on whether people explicitly decide in which direction their first guess had erred before making their second guess. In nine Studies (N = 8,465), we found that asking people to explicitly indicate whether their first guess was too high or too low before making their second guess made people more likely to provide a second guess that was more extreme (in the same direction) than their first guess. As a consequence, the introduction of that “Too High/Too Low” question reduced (and sometimes eliminated or reversed) the wisdom-of-the-inner-crowd effect for (the majority of) questions with non-extreme correct answers and increased the wisdom-of-the-inner-crowd effect for questions with extreme correct answers. Our findings suggest that the wisdom-of-the-inner-crowd effect is not inevitable but rather that it depends on the processes people use to generate their second guesses. This paper was accepted by Yuval Rottenstreich, decision analysis.
Article
This article presents a new risk model for estimating the probability of allision risk (the impact between a ship under way and a stationary installation) from passing vessels on the Norwegian Continental Shelf (NCS). Offshore petroleum operators on the NCS are required by the Norwegian Petroleum Safety Authority (PSA) to perform risk assessments to estimate the probability of impacts between ships and offshore installations, both for field related and passing (merchant) vessels. This has typically been done using the aging industry standard COLLIDE risk model, but this article presents a new risk model based on a Bayesian Belief Network (BBN) that can replace the old COLLIDE model for passing vessels. The new risk model incorporates a wider range of risk influencing factors (RIFs) and enables a holistic and detailed analysis of risk factors, barrier elements and dependencies. Even though the risk of allision with passing vessels is very small, the potential consequences can be critical. The new risk model is more transparent and provides a better understanding of the mechanisms behind allision risk calculations. The results from the new model are aligned with industry expectations, indicating an overall satisfactory performance. The article discusses several key elements, such as the use of expert judgement to estimate RIFs when no empirical data is available, model sensitivity, and a comparative assessment of the new risk model to the old COLLIDE model.
Article
We examined how individuals and groups behave in making judgmental forecasts when they are given external forecast advice. We compare individual and group advice‐taking behavior under different conditions: (a) when advice quality is fixed, (b) when advice quality is randomly varied, and (c) when there is feedback on advice quality or not. Participants in Study 1 received fixed advice of either reasonable or unreasonable quality while making their decisions. Participants in Study 2 randomly received both reasonable and unreasonable advice. We found in both studies that groups feel more confident than individuals. This greater confidence decreased the groups' reliance on advice. We also found that groups are better than individuals at discerning the quality of advice. In the group treatment, the group's reliance on advice increased according to the degree of disagreement with the initial decisions of the group members. In Study 3, participants randomly received both reasonable and unreasonable advice, and in addition, they received feedback on actual realizations that enabled them to learn about the quality of advice. In the presence of feedback on random advice quality, groups are no longer less receptive to advice than individuals; with feedback, both individuals and groups discount advice more than they do without feedback. Nevertheless, groups are still better than individuals at discerning the quality of advice. We conclude that group forecasting is better than individual forecasting across various conditions that we investigate except when advice quality is known to be consistently reliable.
Article
At its core, evaluation involves the generation of value judgments. These evaluative judgments are based on comparing an evaluand's performance to what the evaluand is supposed to do (criteria) and how well it is supposed to do it (standards). The aim of this four-phase study was to test whether criteria and standards can be set via crowdsourcing, a potentially cost- and time-effective approach to collecting public opinion data. In the first three phases, participants were presented with a program description, then asked to complete a task to either identify criteria (phase one), weigh criteria (phase two), or set standards (phase three). Phase four found that the crowd-generated criteria were high quality; more specifically, that they were clear and concise, complete, non-overlapping, and realistic. Overall, the study concludes that crowdsourcing has the potential to be used in evaluation for setting stable, high-quality criteria and standards.
Article
Full-text available
The aggregation of many independent estimates can outperform the most accurate individual judgement [1-3]. This century-old finding [1,2], popularly known as the 'wisdom of crowds' [3], has been applied to problems ranging from the diagnosis of cancer [4] to financial forecasting [5]. It is widely believed that social influence undermines collective wisdom by reducing the diversity of opinions within the crowd. Here, we show that if a large crowd is structured in small independent groups, deliberation and social influence within groups improve the crowd's collective accuracy. We asked a live crowd (N = 5,180) to respond to general-knowledge questions (for example, "What is the height of the Eiffel Tower?"). Participants first answered individually, then deliberated and made consensus decisions in groups of five, and finally provided revised individual estimates. We found that averaging consensus decisions was substantially more accurate than aggregating the initial independent opinions. Remarkably, combining as few as four consensus choices outperformed the wisdom of thousands of individuals. The collective wisdom of crowds often provides better answers to problems than individual judgements. Here, a large experiment that split a crowd into many small deliberative groups produced better estimates than the average of all answers in the crowd.
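The baseline averaging effect that this abstract builds on can be sketched with a toy simulation. All parameters below (a true value of 330, Gaussian judge noise with SD 80, a crowd of 5,000) are illustrative assumptions, not values from the study; the point is only that unbiased, independent errors cancel in the mean:

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 330.0  # illustrative target quantity (e.g., a height in metres)

# Simulate a crowd of unbiased but individually noisy estimates.
estimates = [random.gauss(TRUE_VALUE, 80.0) for _ in range(5000)]

# Error of the crowd average vs. the error of a typical individual.
crowd_error = abs(statistics.mean(estimates) - TRUE_VALUE)
individual_error = statistics.mean(abs(e - TRUE_VALUE) for e in estimates)

print(f"typical individual error: {individual_error:.1f}")
print(f"crowd-average error:      {crowd_error:.1f}")
```

Note that this sketch assumes independent, unbiased errors; the experiment above is precisely about what happens when that independence is broken by deliberation within small groups.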
Article
This exploratory study examines a novel tool for validating program theory through crowdsourced qualitative analysis. It combines a quantitative pattern matching framework traditionally used in theory-driven evaluation with crowdsourcing to analyze qualitative interview data. A sample of crowdsourced participants are asked to read an interview transcript and identify whether program theory components (Activities and Outcomes) are discussed and to highlight the most relevant passage about that component. The findings indicate that using crowdsourcing to analyze qualitative data can differentiate between program theory components that are supported by a participant's experience and those that are not. This approach expands the range of tools available to validate program theory using qualitative data, thus strengthening the theory-driven approach.
Chapter
In the past few decades, judgment and decision making research had focused on the social components of decision contexts and had led to both new theoretical developments and interesting research findings. Collective decisions can be made in various ways following a number of different procedures that can vary from each other in diverse ways. This chapter discusses the theory and research on group decision making. It begins with groups whose members do not interact to any great degree and will move toward those with both greater interaction and decision control. The chapter shows that a number of basic processes underlie virtually all group decision contexts, and also points out where different processes arise and how they may influence the types of decisions groups make. Finally, the chapter presents a discussion of how technology has been used to aid and influence group decision making, and puts forward conjectures about future developments.
Article
Full-text available
Presents a stochastic judgment model (SJM) as a framework for addressing a wide range of issues in statement verification and probability judgment. The SJM distinguishes between covert confidence in the truth of a proposition and the selection of an overt response. A series of experiments demonstrated the model's validity and yielded new results: Binary true–false responses were biased toward true relative to underlying judgment. Underlying judgment was also biased in that direction. Also, in a domain about which Ss had some knowledge, they discriminated true and false statements better when they compared complementary pairs before judging individual statements than when they performed those tasks in the opposite order. The results are interpreted in terms of the SJM and are discussed with respect to implications for theories of statement verification and for research on the accuracy of probability judgments. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Two experiments with 268 paid volunteers investigated the possibility that assessment of confidence is biased by attempts to justify one's chosen answer. These attempts include selectively focusing on evidence supporting the chosen answer and disregarding evidence contradicting it. Exp I presented Ss with 2-alternative questions and required them to list reasons for and against each of the alternatives prior to choosing an answer and assessing the probability of its being correct. This procedure produced a marked improvement in the appropriateness of confidence judgments. Exp II simplified the manipulation by asking Ss first to choose an answer and then to list (a) 1 reason supporting that choice, (b) 1 reason contradicting it, or (c) 1 reason supporting and 1 reason contradicting. Only the listing of contradicting reasons improved the appropriateness of confidence. Correlational analyses of the data of Exp I strongly suggested that the confidence depends on the amount and strength of the evidence supporting the answer chosen. (21 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
People's ability to assess probabilities of various events has been the topic of much interest in the areas of judgment, prediction, decision making, and memory. The evaluation of probabilistic judgments, however, raises interesting logical questions as to what it means to be a "good" judge. This article focuses on a normative concept of probabilistic accuracy called discrimination and presents a measure of a judge's discrimination skill. This measure builds on an earlier index (A. H. Murphy, 1973) and has the advantages that (1) it can be interpreted as the percentage of variance accounted for by the judge and (2) it is unbiased. By deriving this new discrimination measure, it is also shown to relate to Pearson's chi-square statistic, a result which may be useful in the future development of hypothesis testing and estimation procedures. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Despite the common reliance on numerical probability estimates in decision research and decision analysis, there is considerable interest in the use of verbal probability expressions to communicate opinion. A method is proposed for obtaining and quantitatively evaluating verbal judgments in which each analyst uses a limited vocabulary that he or she has individually selected and scaled. An experiment compared this method to standard numerical responding under three different payoff conditions. Response mode and payoff never interacted. Probability scores and their components were virtually identical for the two response modes and for all payoff groups. Also, judgments of complementary events were essentially additive under all conditions. The two response modes differed in that the central response category was used more frequently in the numerical than the verbal case, while overconfidence was greater verbally than numerically. Response distributions and degrees of overconfidence were also affected by payoffs. Practical and theoretical implications are discussed.
Article
Full-text available
In a recent issue of this journal, Baranski and Petrusic (1994) presented empirical data revealing overconfidence in sensory discrimination. In this paper, we propose an explanation of Baranski and Petrusic’s results, based on an idiosyncrasy in the experimental setting that misleads subjects who are using an unwarranted symmetry assumption. Experiment 1 showed that when this hypothesis is controlled for, a large underconfidence bias is obtained with Baranski and Petrusic’s procedure. The results of Experiment 2 confirmed that overconfidence is difficult to obtain in subject-controlled sensory discrimination tasks, even for a very low proportion of correct responses. The different results obtained in sensory and cognitive tasks suggest that one should not uncritically draw parallels between confidence in sensory and cognitive judgments.
Article
Full-text available
A binary detection task, free from sensory components, is investigated. A deterministic model prescribing a fixed cutoff point is confirmed; a probabilistic model, which generalizes Lee’s micromatching model for externally distributed stimuli, is rejected.
Article
Full-text available
Confidence-rating-based calibration and resolution indices were obtained in two experiments requiring perceptual comparisons and in a third with visual gap detection. Four important results were obtained. First, as in the general knowledge domain, subjects were underconfident when judgments were easy and overconfident when they were difficult. Second, paralleling the clear dependence of calibration on decisional difficulty, resolution decreased with increases in decision difficulty arising either from decreases in discriminability or from increasing demands for speed at the expense of accuracy. Third, providing trial-by-trial response feedback on difficult tasks improved resolution but had no effect on calibration. Fourth, subjects can accurately report subjective errors (i.e., trials in which they have indicated that they made an error) with their confidence ratings. It is also shown that the properties of decision time, conditionalized on confidence category, impose a rigorous set of constraints on theories of confidence calibration.
Article
Full-text available
This paper documents a very pervasive underconfidence bias in the area of sensory discrimination. In order to account for this phenomenon, a subjective distance theory of confidence in sensory discrimination is proposed. This theory, based on the law of comparative judgment and the assumption of confidence as an increasing function of the perceived distance between stimuli, predicts underconfidence—that is, that people should perform better than they express in their confidence assessments. Due to the fixed sensitivity of the sensory system, this underconfidence bias is practically impossible to avoid. The results of Experiment 1 confirmed the prediction of underconfidence with the help of present-day calibration methods and indicated a good quantitative fit of the theory. The results of Experiment 2 showed that prolonged experience of outcome feedback (160 trials) had no effect on underconfidence. It is concluded that the subjective distance theory provides a better explanation of the underconfidence phenomenon than do previous accounts in terms of subconscious processes.
Article
Full-text available
Research on people's confidence in their general knowledge has to date produced two fairly stable effects, many inconsistent results, and no comprehensive theory. We propose such a comprehensive framework, the theory of probabilistic mental models (PMM theory). The theory (a) explains both the overconfidence effect (mean confidence is higher than percentage of answers correct) and the hard-easy effect (overconfidence increases with item difficulty) reported in the literature and (b) predicts conditions under which both effects appear, disappear, or invert. In addition, (c) it predicts a new phenomenon, the confidence-frequency effect, a systematic difference between a judgment of confidence in a single event (i.e., that any given answer is correct) and a judgment of the frequency of correct answers in the long run. Two experiments are reported that support PMM theory by confirming these predictions, and several apparent anomalies reported in the literature are explained and integrated into the present framework.
Article
Full-text available
Is there a common and general basis for confidence in human judgment? Recently, we found that the properties of confidence judgments in the sensory domain mirror those previously established in the cognitive domain; notably, we found underconfidence on easy sensory judgments and overconfidence on hard sensory judgments. In contrast, data from the Uppsala laboratory in Sweden suggest that sensory judgments are unique; they found a pervasive underconfidence bias, with overconfidence being evident only on very hard sensory judgments. Olsson and Winman (1996) attempted to resolve the debate on the basis of methodological issues related to features of the stimulus display in a visual discrimination task. A reanalysis of the data reported in Baranski and Petrusic (1994), together with the findings of a new experiment that controlled stimulus display characteristics, supports the position that the difference between the Canadian and the Swedish data is real and, thus, may reflect cross-national differences in confidence in sensory discrimination.
Article
Full-text available
Two robust phenomena in research on confidence in one's general knowledge are the overconfidence phenomenon and the hard-easy effect. In this article, the authors propose that the hard-easy effect has been interpreted with insufficient attention to the scale-end effects, the linear dependency, and the regression effects in data and that the continued adherence to the idea of a "cognitive overconfidence bias" is mediated by selective attention to particular data sets. A quantitative review of studies with 2-alternative general knowledge items demonstrates that, contrary to widespread belief, there is (a) very little support for a cognitive-processing bias in these data; (b) a difference between representative and selected item samples that is not reducible to the difference in difficulty; and (c) near elimination of the hard-easy effect when there is control for scale-end effects and linear dependency.
Article
Full-text available
The overconfidence observed in calibration studies has recently been questioned on both psychological and methodological grounds. In the first part of the article we discuss these issues and argue that overconfidence cannot be explained as a selection bias, and that it is not eliminated by random sampling of questions. In the second part of the article, we compare probability judgments for single events with judgments of relative frequency. Subjects received a target individual's personality profile and then predicted the target's responses to a series of binary questions. One group predicted the responses of an individual target, while a second group estimated the relative frequency of responses among all target subjects who shared a given personality profile. Judgments of confidence and estimates of relative frequency were practically indistinguishable; both exhibited substantial overconfidence and were highly correlated with independent judgments of representativeness.
Chapter
It is often assumed that n heads are better than one, that a judgment obtained from a group will be of higher quality than could be expected from an individual. This chapter considers the effectiveness of methods that have been proposed for combining individual quantitative judgments into a group judgment. For the most part, it will be found that n heads are, indeed, better than one, and at least one investigator has concluded that it does not much matter how they are combined. But the potential for improving performance is so great and the problems of achieving it so subtle that a clear understanding of the issues is essential.
Article
Schmittlein discussed the lack of universality of regression toward the mean. The present note emphasizes the universality of a similar effect, dubbed “reversion” toward the mean, defined as the shift in conditional expectation of the upper or lower portion of a distribution. Reversion toward the mean is a useful concept for statistical reasoning in applications and is more self-evidently plausible than regression toward the mean.
Article
This paper briefly describes some results of operational and experimental programmes in the United States involving subjective probability forecasts of precipitation occurrence and of maximum and minimum temperatures. These results indicate that weather forecasters can formulate such forecasts in a reliable manner.
Article
This note examines the number of experts to be included in a prediction group where the criterion of predictive ability is the correlation between the uncertain event and the mean judgment of the group members. It is shown that groups containing between 8 and 12 members have predictive ability close to the “optimum” under a wide range of circumstances but provided (1) mean intercorrelation of experts' opinions is not low (<.3, approximately) and/or (2) mean expert validity does not exceed mean intercorrelation. Evidence indicates these exceptions will not be common in practice. The characteristics needed by an additional expert to increase the validity of an existing group are also derived.
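The criterion used in this note, the correlation between the uncertain event and the group mean, follows the standard aggregation formula R = r_v * sqrt(J) / sqrt(1 + (J - 1) * r_i), where r_v is the mean expert validity and r_i the mean intercorrelation. The parameter values below (r_v = 0.4, r_i = 0.3) are illustrative assumptions chosen to show the diminishing returns behind the 8-12 member recommendation:

```python
import math

def group_validity(J, r_v=0.4, r_i=0.3):
    """Correlation between the criterion and the mean judgment of J experts,
    each with validity r_v and pairwise intercorrelation r_i:
    R = r_v * sqrt(J) / sqrt(1 + (J - 1) * r_i)."""
    return r_v * math.sqrt(J) / math.sqrt(1 + (J - 1) * r_i)

for J in (1, 4, 8, 12, 50):
    print(J, round(group_validity(J), 3))
```

With these values the gain from adding experts flattens quickly: the 8-member group already achieves close to 90% of the asymptotic validity r_v / sqrt(r_i), which is why much larger panels buy little additional predictive ability.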
Article
Focuses on several practical issues in subjective probability for discrete events from the standpoint of decision analysis. Decision analysis sets the standards for the use of subjective probability and points the way for other applications; it has a high stake in the success of subjective probability methods and a high commitment to ensuring their reliability and validity. Examines elicitation, calibration, and combination of discrete subjective probabilities in the light of a model that explains and brings order to a considerable amount of confusing experimental data. Calibration, the extent to which the observed proportions of events that occur agree with the assigned probability values, directly affects the quality of decision analysis and is the central issue. Elicitation, the process by which judgments are obtained, and combination, the process by which probabilities of the same event from different judges are aggregated, are intimately related to calibration and are considered from that standpoint. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
The emergence of "overconfidence" in calibration studies is often understood as an indication of a general human bias. A cognitive approach is proposed that offers a different interpretation: miscalibration is not seen as a bias, but as a necessary consequence of task characteristics and the selection of items. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Two empirical judgment phenomena appear to contradict each other. In the revision-of-opinion literature, subjective probability (SP) judgments have been analyzed as a function of objective probability (OP) and generally have been found to be conservative, that is, to represent underconfidence. In the calibration literature, analyses of OP (operationalized as relative frequency correct) as a function of SP have led to the opposite conclusion, that judgment is generally overconfident. Reanalysis of 3 studies shows that both results can be obtained from the same set of data, depending on the method of analysis. The simultaneous effects are then generated and factors influencing them are explored by means of a model that instantiates a very general theory of how SP estimates arise from true judgments perturbed by random error. Theoretical and practical implications of the work are discussed. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Is there a difference between believing and merely understanding an idea? R. Descartes (e.g., 1641 [1984]) thought so. He considered the acceptance and rejection of an idea to be alternative outcomes of an effortful assessment process that occurs subsequent to the automatic comprehension of that idea. This article examined B. Spinoza's (1982) alternative suggestion that (1) the acceptance of an idea is part of the automatic comprehension of that idea and (2) the rejection of an idea occurs subsequent to, and more effortfully than, its acceptance. In this view, the mental representation of abstract ideas is quite similar to the mental representation of physical objects: People believe in the ideas they comprehend, as quickly and automatically as they believe in the objects they see. Research in social and cognitive psychology suggests that Spinoza's model may be a more accurate account of human belief than is that of Descartes. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
A two-stage within subjects design was used to compare decisions based on numerically and verbally expressed probabilities. In Stage 1, subjects determined approximate equivalences between vague probability expressions, numerical probabilities, and graphical displays. Subsequently, in Stage 2 they bid for (Experiment 1) or rated (Experiment 2) gambles based on the previously equated verbal, numerical, and graphical descriptors. In Stage 1, numerical and verbal judgments were reliable, internally consistent, and monotonically related to the displayed probabilities. However, the numerical judgments were significantly superior in all respects because they were much less variable within and between subjects. In Stage 2, response times, bids, and ratings were inconsistent with both of two opposing sets of predictions, one assuming that imprecise gambles will be avoided and the other that verbal probabilities will be preferred. The entire pattern of results is explained by means of a general model of decision making with vague probabilities which assumes that in the present task, when presented with a vague probability word, people focus on an implied probability interval and sample values within it to resolve the vagueness prior to forming a bid or a rating. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Provides a critical analysis of a number of substantive theories and models of subjective probability judgment for discrete propositions that have appeared in the last 14 years, together with a critical evaluation of the models described. Each model is assessed with respect to the empirical evidence and in terms of psychological plausibility. Topics covered include the overconfidence effect and the hard-easy effect, the locus of bias in probability judgments, and the theories and models themselves: the stage model, the detection model, the process model, the memory trace model, the ecological models, and the strength and weight model. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Erev, Wallsten, and Budescu (1994) and Budescu, Erev, and Wallsten (1997) demonstrated that over- and underconfidence often observed in judgment studies may be due, in part, to the presence of random error and its effects on the analysis of the judgments. To illustrate this fact they showed that a general model that assumes that overt responses representing (perfectly calibrated) true judgments perturbed by random error can replicate typical patterns observed in empirical studies. In this paper we provide a method for determining whether apparent overconfidence in empirical data reflects a systematic bias in judgment or is an artifact due solely to the presence of error. The approach is based, in part, on the Wallsten and González-Vallejo (1994) Stochastic Judgment Model (SJM). The new method is described in detail and is used to analyze results from a new study. The analysis indicates a clear overconfidence effect, above and beyond the level predicted by a model assuming perfect calibration perturbed by random error. © 1997 John Wiley & Sons, Ltd.
Article
Wallsten et al. (1997) developed a general framework for assessing the quality of aggregated probability judgments. Within this framework they presented a theorem regarding the effects of pooling multiple probability judgments regarding unique binary events. The theorem states that under reasonable conditions, and assuming conditional pairwise independence of the judgments, the average probability estimate is asymptotically perfectly diagnostic of the true event state as the number of estimates pooled goes to infinity. The purpose of the present study was to examine by simulation (1) the rate of convergence of averaged judgments to perfect diagnostic value under various conditions and (2) the robustness of the theorem to violations of its assumption that the covert probability judgments are conditionally pairwise independent. The results suggest that while the rate of convergence is sensitive to violations of the conditional pairwise independence, the asymptotic properties remain relatively robust under a large variety of conditions. The practical implications of these results are discussed. Copyright © 2001 John Wiley & Sons, Ltd.
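A minimal version of such a convergence simulation can be written down directly. The model here is an assumed stand-in, not the paper's: the true event state shifts each judge's covert log-odds by a weak signal of ±0.5, judges are conditionally independent with Gaussian noise of SD 1.5, and diagnosticity is summarized crudely as classification accuracy of the thresholded average:

```python
import math
import random

random.seed(1)

def simulate(J, n_events=2000):
    """Fraction of binary events correctly classified by thresholding the
    average of J conditionally independent probability estimates at 0.5."""
    correct = 0
    for _ in range(n_events):
        state = random.random() < 0.5       # true event state
        mu = 0.5 if state else -0.5         # weak diagnostic log-odds signal
        probs = [1 / (1 + math.exp(-(mu + random.gauss(0, 1.5))))
                 for _ in range(J)]         # each judge: signal + independent noise
        avg = sum(probs) / J
        if (avg > 0.5) == state:
            correct += 1
    return correct / n_events

for J in (1, 2, 6, 20):
    print(J, round(simulate(J), 3))
```

Even under these arbitrary parameter choices the qualitative pattern of the theorem appears: accuracy of the averaged estimate climbs steeply over the first few judges and continues toward perfect diagnosticity as J grows.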
Article
This paper concerns the evaluation and combination of subjective probability estimates for categorical events. We argue that the appropriate criterion for evaluating individual and combined estimates depends on the type of uncertainty the decision maker seeks to represent, which in turn depends on his or her model of the event space. Decision makers require accurate estimates in the presence of aleatory uncertainty about exchangeable events, diagnostic estimates given epistemic uncertainty about unique events, and some combination of the two when the events are not necessarily unique, but the best equivalence class definition for exchangeable events is not apparent. Following a brief review of the mathematical and empirical literature on combining judgments, we present an approach to the topic that derives from (1) a weak cognitive model of the individual that assumes subjective estimates are a function of underlying judgment perturbed by random error and (2) a classification of judgment contexts in terms of the underlying information structure. In support of our developments, we present new analyses of two sets of subjective probability estimates, one of exchangeable and the other of unique events. As predicted, mean estimates were more accurate than the individual values in the first case and more diagnostic in the second.
Article
Erev, Wallsten, and Budescu (1994) demonstrated that over- and underconfidence can be observed simultaneously in judgment studies, as a function of the method used to analyze the data. They proposed a general model to account for this apparent paradox, which assumes that overt responses represent true judgments perturbed by random error. To illustrate that the model reproduces the pattern of results, they assumed perfectly calibrated true opinions and a particular form (log-odds plus normally distributed error) of the model to simulate data from the full-range paradigm. In this paper we generalize these results by showing that they can be obtained with other instantiations of the same general model (using the binomial error distribution), and that they apply to the half-range paradigm as well. These results illustrate the robustness and generality of the model. They emphasize the need for new methodological approaches to determine whether observed patterns of over- or underconfidence represent real effects or are primarily statistical artifacts. © 1997 John Wiley & Sons, Ltd.
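The error-perturbation argument can be illustrated with a small simulation. Note two deliberate simplifications: the response error here is a truncated Gaussian rather than the binomial form this paper uses, and all parameter values are assumptions for illustration. Even with perfectly calibrated covert judgments, binning outcomes by the stated (error-perturbed) response produces apparent overconfidence at the top of the scale and underconfidence at the bottom:

```python
import random

random.seed(2)

bins = {}  # stated-confidence bin -> (hits, total)

for _ in range(100000):
    true_p = random.uniform(0.5, 1.0)          # perfectly calibrated covert judgment
    outcome = random.random() < true_p         # the event occurs with probability true_p
    # Overt response = covert judgment + Gaussian error, clipped to [0.5, 1.0].
    stated = min(1.0, max(0.5, true_p + random.gauss(0, 0.1)))
    b = round(stated, 1)
    hits, total = bins.get(b, (0, 0))
    bins[b] = (hits + outcome, total + 1)

for b in sorted(bins):
    hits, total = bins[b]
    print(f"stated {b:.1f}: observed hit rate {hits / total:.3f}")
```

The hit rate for responses stated near 1.0 falls short of 1.0 and the hit rate near 0.5 exceeds 0.5, purely as a regression artifact of the error, which is the pattern the paper's more general model reproduces.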
Article
An experiment is presented that explores the finding that a request to judge probabilities can bias subsequent decisions (Erev et al., 1993). Subjects chose among gambles whose outcomes were determined by the occurrence of events in a video game environment. The probabilities of the events could be assessed based on the visual display. In the no-probabilities condition the subjects simply indicated their choices. In the subjective-probability condition subjects first estimated probabilities and then made choices. In the objective-probability condition, subjects saw the actual probabilities instead of the events when making their choices. The results suggest that the availability of explicit probabilities (both subjective and objective) decreases the subjects' sensitivity to the outcome dimension and, hence, increases the reflection effect; i.e., subjects in the subjective- and objective-probability conditions showed stronger risk aversion when the gambles involved possible profit and stronger risk seeking when the gambles involved possible losses than in the no-probabilities condition. In addition, the subjective assessments impaired the quality of the decisions in terms of the subjects' expected profit. Theoretical and practical implications of the results are discussed.
Article
In many decision situations information is available from a number of different sources. Aggregating the diverse bits of information is an important aspect of the decision-making process but entails special statistical modeling problems in characterizing the information. Prior research in this area has relied primarily on the use of historical data as a basis for modeling the information sources. We develop a Bayesian framework that a decision maker can use to encode subjective knowledge about the information sources in order to aggregate point estimates of an unknown quantity of interest. This framework features a highly flexible environment for modeling the probabilistic nature and interrelationships of the information sources and requires straightforward and intuitive subjective judgments using proven decision-analysis assessment techniques. Analysis of the constructed model produces a posterior distribution for the quantity of interest. An example based on health risks due to ozone exposure demonstrates the technique.
Article
When two forecasters agree regarding the probability of an uncertain event, should a decision maker adopt that probability as his or her own? A decision maker who does so is said to act in accord with the unanimity principle. We examine a variety of Bayesian consensus models with respect to their conformance (or lack thereof) to the unanimity principle and a more general compromise principle. In an analysis of a large set of probability forecast data from meteorology, we show how well the various models, when fit to the data, reflect the empirical pattern of conformance to these principles.
Article
This paper presents a new approach to the problem of expert resolution. The proposed analytic structure provides a mechanism by which a decision maker can incorporate the possibly conflicting probability assessments of a group of experts. The approach is based upon the Bayesian inferential framework presented in [Morris, P. A. 1974. Decision analysis expert use. Management Sci. 20 (9, May)]. A number of specific results are derived from analysis of a generic model structure. In the single expert continuous variable case, we prove that the decision maker should process a calibrated expert's opinion by multiplying the expert's probability assessment by his own prior probability assessment and normalizing. A method for subjectively calibrating an expert is also presented. In the multi-expert case, we obtain a simple multiplicative rule for combining the expert judgments. We also prove the existence of a composite probability function which measures the joint information contained in the probability assessments generated by a panel of experts. The interesting result is that composite prior should be processed as if it were the probability statement of a single calibrated expert.
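The multiplicative rule mentioned in this abstract can be sketched compactly. The function below is an illustrative instantiation, assuming the event and its complement are treated symmetrically (posterior for A proportional to prior times the product of expert probabilities, and likewise for not-A); the paper itself derives the exact conditions under which this is the Bayesian answer.

```python
from math import prod

def combine_calibrated_experts(prior, expert_probs):
    """Multiplicative pooling of calibrated expert probabilities for event A.

    Posterior(A)  ~ prior     * q1 * ... * qn
    Posterior(~A) ~ (1-prior) * (1-q1) * ... * (1-qn),
    then normalize. Illustrative sketch of the product rule in the abstract.
    """
    num = prior * prod(expert_probs)
    den = num + (1 - prior) * prod(1 - q for q in expert_probs)
    return num / den
```

With a uniform prior and a single calibrated expert, the rule simply returns the expert's probability; two agreeing experts push the posterior more extreme than either report alone.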
A mathematical model is developed to describe the calibration of discrete subjective probabilities and is compared with published group calibration results and with new data. The model is appropriate to probability assessment tasks, with a variety of formats, that can be considered from a signal detection point of view, such as giving the probability that a particular two-category classification is correct. The model assumes that the respondent partitions the range of a decision variable and maps the set of response probabilities onto it. Such a model can account for the systematic effect of proportion correct on the degree of under- or overconfidence; it indicates the ways in which training can affect calibration; it makes specific predictions about base rate effects; it provides a measure of “knowing what one knows”; and it gives a unifying viewpoint for a large body of experimental work on calibration.
Article
Decision makers often must pool probability estimates from multiple experts before making a choice. Such pooling sometimes improves accuracy and other times diagnosticity. This article uses a cognitive model of the judge and the decision maker’s classification of the information to explain why. Given a very weak model of the judge, the important factor is the degree to which their information bases are independent. The article also relates this work to other models in the literature.
Two evaluative criteria for probabilistic forecasting performance, consistency with the axioms of probability theory and external correspondence with the events that ultimately occur, are distinguished. The mean probability score, or Brier score, is the scoring rule most commonly used to quantify external correspondence. A review is made of methods for decomposing the score into components that represent distinct and important aspects of external correspondence. Data from an empirical study of forecasting performance are used to illustrate the interpretation of the components of the most recent decomposition of the score (J. F. Yates, Forecasting performance: A covariance decomposition of the mean probability score. Paper presented at 22nd Annual Meeting of the Psychonomic Society, Philadelphia, November 1981; also an unpublished manuscript). Substantively, the most important finding of the study was a "collapsing" tendency in forecasting behavior, whereby subjects were inclined to report forecasts of .5 when they felt they knew little about the event in question. This finding is problematic because self-reported knowledge was only minimally related to the actual external correspondence of the subjects' forecasts. A survey of uses of decompositions suggests, among other things, that current research typically emphasizes calibration, perhaps to the neglect of other, more important dimensions of external correspondence.
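The standard decomposition of the mean probability score into reliability (calibration), resolution, and outcome uncertainty can be computed directly. This sketch uses the Murphy-style partition by distinct forecast values; the Yates covariance decomposition discussed in the abstract partitions the same score along different lines.

```python
import numpy as np

def brier(forecasts, outcomes):
    """Mean probability (Brier) score for binary outcomes."""
    f, y = np.asarray(forecasts, float), np.asarray(outcomes, float)
    return np.mean((f - y) ** 2)

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty); brier = rel - res + unc."""
    f, y = np.asarray(forecasts, float), np.asarray(outcomes, float)
    base = y.mean()
    rel = res = 0.0
    for v in np.unique(f):                 # one bin per distinct forecast value
        mask = f == v
        o = y[mask].mean()                 # observed frequency in the bin
        w = mask.mean()                    # bin weight
        rel += w * (v - o) ** 2            # calibration (reliability)
        res += w * (o - base) ** 2         # resolution
    unc = base * (1 - base)                # outcome uncertainty
    return rel, res, unc
```

The identity brier = reliability − resolution + uncertainty holds exactly when forecasts take discrete values, as here.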
Article
Considerable literature has accumulated over the years regarding the combination of forecasts. The primary conclusion of this line of research is that forecast accuracy can be substantially improved through the combination of multiple individual forecasts. Furthermore, simple combination methods often work reasonably well relative to more complex combinations. This paper provides a review and annotated bibliography of that literature, including contributions from the forecasting, psychology, statistics, and management science literatures. The objectives are to provide a guide to the literature for students and researchers and to help researchers locate contributions in specific areas, both theoretical and applied. Suggestions for future research directions include (1) examination of simple combining approaches to determine reasons for their robustness, (2) development of alternative uses of multiple forecasts in order to make better use of the information they contain, (3) use of combined forecasts as benchmarks for forecast evaluation, and (4) study of subjective combination procedures. Finally, combining forecasts should become part of the mainstream of forecasting practice. In order to achieve this, practitioners should be encouraged to combine forecasts, and software to produce combined forecasts easily should be made available.
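The robustness of simple combination that this literature documents is easy to demonstrate. The sketch below uses synthetic numbers: J forecasters each see the truth through independent zero-mean noise, and the equal-weight average has lower squared error than a typical individual forecast (roughly sigma-squared over J versus sigma-squared).

```python
import numpy as np

# Illustrative sketch with synthetic data: averaging independent-error
# forecasts shrinks mean squared error by about a factor of J.
rng = np.random.default_rng(1)

truth = 10.0
J, trials = 5, 100_000
forecasts = truth + rng.normal(0, 2.0, (trials, J))   # independent errors, sigma = 2

mse_individual = np.mean((forecasts - truth) ** 2)             # ~ sigma^2 = 4
mse_combined = np.mean((forecasts.mean(axis=1) - truth) ** 2)  # ~ sigma^2 / J = 0.8
```

This mirrors the practical conclusion in the parent article's abstract: substantial improvement is available with as few as 2–6 judges, provided their errors are not perfectly dependent.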
Article
The purpose of this paper is to briefly discuss some important current questions and problems related to the use of scoring rules (SRs) both in connection with the actual assessment of probabilities and with the evaluation of probability forecasts and probability assessors. With regard to the assessment process, we consider both the case in which the assessor's utility function is linear and the case in which his utility function is nonlinear. Under linear utility, important problems of concern are the sensitivity of SRs to deviations from optimality (with a strictly proper SR, optimality consists of the assessor making his statements correspond to his judgments) and the effect of psychological considerations arising from the use of different SRs. Under nonlinear utility, SRs should be modified to allow for the nonlinearity in such a manner that for a specific utility function, the modified SRs are strictly proper. This introduces the difficult question of the assessment of the assessor's utility function. With regard to the evaluation process (as opposed to the assessment process), we consider the process from an inferential viewpoint and from a decision-theoretic viewpoint. From an inferential viewpoint, attributes such as validity may be of interest, and in certain circumstances these attributes may be related to SRs. The attributes of interest, of course, depend on the framework within which the evaluation process is undertaken. From a decision-theoretic viewpoint, SRs may be related to a decision maker's utilities or expected utilities (under uncertainty about the utilities) if the decision maker uses the assessed probabilities in an actual decision situation. In summary, there are many important questions and problems related to SRs, and the need for future research on these problems seems clear. Such research should lead to a greatly improved understanding of the processes of probability assessment and evaluation.
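The notion of a strictly proper scoring rule invoked here (honest reporting uniquely minimizes expected penalty) can be checked numerically for the quadratic rule. This is a numerical illustration under a linear utility assumption, not a proof.

```python
import numpy as np

def expected_brier_penalty(belief, report):
    # E[(report - outcome)^2] when the outcome is Bernoulli(belief):
    # with probability `belief` the outcome is 1, else 0.
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2

# Scan all reports on a fine grid: the minimizer is the true belief itself,
# which is what "strictly proper" means for the quadratic (Brier) rule.
belief = 0.7
reports = np.linspace(0, 1, 1001)
penalties = expected_brier_penalty(belief, reports)
best_report = reports[np.argmin(penalties)]
```

Any deviation from the true belief raises the expected penalty, so an assessor with linear utility has no incentive to hedge.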
Article
Experiments have shown that, generally, people are overconfident about the correctness of their answers to questions. Cognitive psychologists have attributed this to biases in the way people generate and handle evidence for and against their views. The overconfidence phenomenon and cognitive psychologists' accounts of its origins have recently given rise to three debates. Firstly, ecological psychologists have proposed that overconfidence is an artefact that has arisen because experimenters have used question material not representative of the natural environment. However, it now appears that some overconfidence remains even after this problem has been remedied. Secondly, it has been proposed that overconfidence is an artefactual regression effect that arises because judgments contain an inherently random component. However, those claiming this appear to use the term overconfidence to refer to a phenomenon quite different from the one that the cognitive psychologists set out to explain. Finally, a debate has arisen about the status of perceptual judgments. Some claim that these evince only underconfidence and must, therefore, depend on mechanisms fundamentally different from those subserving other types of judgment. Others have obtained overconfidence with perceptual judgments and argue that a unitary theory is more appropriate. At present, however, no single theory provides an adequate account of the many diverse factors that influence confidence in judgment.
Article
This paper examines how a Bayesian decision maker would update his/her probability $p$ for the occurrence of an event $A$ in the light of a number of expert opinions expressed as probabilities $q_1, \cdots, q_n$ of $A$. It is seen, among other things, that the linear opinion pool, $\lambda_0p + \sum^n_{i = 1} \lambda_iq_i$, corresponds to an application of Bayes' Theorem when the decision maker has specified only the mean of the marginal distribution for $(q_1, \cdots, q_n)$ and requires his/her formula for the posterior probability of $A$ to satisfy a certain consistency condition. A product formula similar to that of Bordley (1982) is also derived in the case where the experts are deemed to be conditionally independent given $A$ (and given its complement).
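Both formulas discussed in this abstract are short enough to state in code. The sketch below gives the linear opinion pool exactly as written, and one common form of the product rule for conditionally independent experts (posterior odds = prior odds times the product of each expert's likelihood ratio relative to the prior); the paper derives the precise conditions under which each is the Bayesian update.

```python
from math import prod

def linear_opinion_pool(p, qs, weights):
    """lambda_0 * p + sum_i lambda_i * q_i, with weights summing to one."""
    lam0, lams = weights[0], weights[1:]
    return lam0 * p + sum(lam * q for lam, q in zip(lams, qs))

def independent_expert_pool(p, qs):
    """Product rule when experts are conditionally independent given A:
    posterior odds = prior odds * prod_i [odds(q_i) / odds(p)].
    Sketch of one common form, not the paper's exact derivation."""
    odds = lambda x: x / (1 - x)
    post_odds = odds(p) * prod(odds(q) / odds(p) for q in qs)
    return post_odds / (1 + post_odds)
```

With a single expert the product rule defers entirely to that expert; with several experts all more confident than the prior, the pooled probability is more extreme than any individual report, unlike the linear pool, which always stays inside the range of its inputs.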
Article
It has been reported that people are conservative processors of fallible information when data are sampled from 1 of a set of binomial populations. Here, male undergraduates observed samples of data and revised odds estimates about which of 2 normally distributed populations was being sampled. Revisions were nearly optimal when based on an individual datum, but became conservative when based on a sequence of data. The result points to misaggregation of the diagnostic impacts of 2 or more data as a major source of conservatism in information processing. Also Ss were more conservative when a datum favored the population that previous data had favored; overall conservatism was only about 1/2 as great as predicted from experiments that used binomial populations.
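The optimal odds revision against which conservatism is measured in this task has a simple closed form: for two normal populations with known means and a common standard deviation, each datum multiplies the prior odds by its likelihood ratio. A sketch with illustrative parameters (the means, sigma, and data values below are not from the study):

```python
import math

def posterior_odds(prior_odds, data, mu1, mu2, sigma):
    """Bayes-optimal odds for population 1 vs population 2 after seeing `data`,
    assuming N(mu1, sigma) vs N(mu2, sigma) and independent observations."""
    log_odds = math.log(prior_odds)
    for x in data:
        # log likelihood ratio of N(mu1, sigma) vs N(mu2, sigma) at x
        log_odds += ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
    return math.exp(log_odds)
```

A datum exactly midway between the means leaves the odds unchanged; a datum at one of the means pulls the odds toward that population. Subjects' revisions in the study fell systematically short of this benchmark when sequences of data were aggregated.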
Article
Confidence rating based calibration and resolution indices were obtained in two experiments requiring perceptual comparisons and in a third with visual gap detection. Four important results were obtained. First, as in the general knowledge domain, subjects were underconfident when judgments were easy and overconfident when they were difficult. Second, paralleling the clear dependence of calibration on decisional difficulty, resolution decreased with increases in decision difficulty arising either from decreases in discriminability or from increasing demands for speed at the expense of accuracy. Third, providing trial-by-trial response feedback on difficult tasks improved resolution but had no effect on calibration. Fourth, subjects can accurately report subjective errors (i.e., trials in which they have indicated that they made an error) with their confidence ratings. It is also shown that the properties of decision time, conditionalized on confidence category, impose a rigorous set of constraints on theories of confidence calibration.
Article
Many studies have reported that the confidence people have in their judgments exceeds their accuracy and that overconfidence increases with the difficulty of the task. However, some common analyses confound systematic psychological effects with statistical effects that are inevitable if judgments are imperfect. We present three experiments using new methods to separate systematic effects from the statistically inevitable. We still find systematic differences between confidence and accuracy, including an overall bias toward overconfidence. However, these effects vary greatly with the type of judgment. There is little general overconfidence with two-choice questions and pronounced overconfidence with subjective confidence intervals. Over- and underconfidence also vary systematically with the domain of questions asked, but not as a function of difficulty. We also find stable individual differences. Determining why some people, some domains, and some types of judgments are more prone to overconfidence will be important to understanding how confidence judgments are made. Copyright 1999 Academic Press.
Article
In a series of experiments, economically sophisticated subjects, including professional actuaries, priced insurance both as consumers and as firms under conditions of ambiguity. Findings support implications of the Einhorn-Hogarth ambiguity model: (1) for low probability-of-loss events, prices of both consumers and firms indicated aversion to ambiguity; (2) as probabilities of losses increased, aversion to ambiguity decreased, with consumers exhibiting ambiguity preference for high probability-of-loss events; and (3) firms showed greater aversion to ambiguity than consumers. The results are shown to be incompatible with traditional economic analysis of insurance markets and are discussed with respect to the effects of ambiguity on the supply and demand for insurance. Copyright 1989 by Kluwer Academic Publishers
Article
A philosophical basis for combining forecasts and some important current issues in the area of combining forecasts are discussed briefly. The philosophical basis views forecasts as information and the combination of forecasts in terms of aggregation of information. The current issues involve the form of the combining rule, cases with agreement among forecasts, cases with extensive disagreement, dependence, uncertainty about forecast characteristics, instability in the forecasting process, robustness and the role of simple rules, and the role of group interaction.
Article
The combining of forecasts involves more than issues of statistical aggregation. This comment focuses on situations where people have to combine forecasts in the form of diagnostic opinions concerning different states of nature. Taking a descriptive or psychological viewpoint, it is argued that people act upon the information they obtain and engage in considerable interpretation and imagination. Three specific problems are discussed: (1) the level at which opinions are aggregated; (2) the effects of redundancy and credibility of different sources; and (3) the manner in which the structure of information from multiple sources can lead to different diagnostic interpretations. The discussions are illustrated with experimental data.