Article

Distilling the wisdom of crowds: Prediction markets vs. prediction polls

Abstract

We report the results of the first large-scale, long-term, experimental test between two crowdsourcing methods: prediction markets and prediction polls. More than 2,400 participants made forecasts on 261 events over two seasons of a geopolitical prediction tournament. Forecasters were randomly assigned to either prediction markets (continuous double auction markets) in which they were ranked based on earnings, or prediction polls in which they submitted probability judgments, independently or in teams, and were ranked based on Brier scores. In both seasons of the tournament, prices from the prediction market were more accurate than the simple mean of forecasts from prediction polls. However, team prediction polls outperformed prediction markets when forecasts were statistically aggregated using temporal decay, differential weighting based on past performance, and recalibration. The biggest advantage of prediction polls was at the beginning of long-duration questions. Results suggest that prediction polls with proper scoring feedback, collaboration features, and statistical aggregation are an attractive alternative to prediction markets for distilling the wisdom of crowds.
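For reference, the Brier score used to rank poll forecasters is, in its standard multi-category form, the sum of squared differences between the reported probability vector and the realized outcome vector (lower is better, range 0 to 2). A minimal sketch follows; the tournament's exact scoring details (e.g., averaging over the days a question is open) are not reproduced here.

```python
def brier_score(forecast, outcome_index):
    """Multi-category Brier score: sum of squared differences between the
    forecast probability vector and the realized outcome vector
    (1 for the option that occurred, 0 elsewhere). Lower is better; range 0-2."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )

# A forecaster who put 80% on the option that occurred:
print(brier_score([0.8, 0.2], outcome_index=0))  # ~0.08
```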

... We discuss two potential moderators of differences in accuracy: the number of traders posting orders on the market for a given question and the timing within the question. Atanasov et al. (2017) showed that CDA markets underperform team-based prediction polls when question resolutions are months away but are approximately tied in accuracy in the last few weeks before question resolution. We assess if this result generalizes to the comparison of CDA and LMSR markets. ...
... In the context of geopolitical forecasting tournaments, previous research has demonstrated that individual differences in prediction poll accuracy scores are reliable over time (Mellers et al. 2015a). Such accuracy measures provide useful inputs to weighted aggregation algorithms, which consistently outperform unweighted aggregation (e.g., Atanasov et al. 2017). ...
... To combine individual estimates, we used the weighted mean algorithm described by Atanasov et al. (2017), with two additional weighting features that were based on a forecaster's psychometric test score as well as on the time she spent on the platform. A weighted logit algorithm (Satopää et al. 2014) was used in sensitivity analyses. ...
Preprint
Full-text available
Problem definition: Accurate forecasts are a key ingredient of effective operations. In fast-changing environments with little historical data, organizations rely on judgmental forecasts for decision making. But how should these forecasts be elicited and aggregated, and who should be asked to provide these forecasts in the first place? Academic/practical relevance: The COVID pandemic and geopolitical developments pose increasing challenges to organizations in managing uncertainty. We offer guidance on how to accurately quantify such uncertainty by assessing the performance of two widely used crowd prediction systems, prediction markets and prediction polls. Moreover, we show how the accuracy of these systems depends on whether the crowd of forecasters is large or small but elite. Methodology/Results: We use data from the ACE tournament, a large, multi-year forecasting competition. First, we experimentally evaluate two popular prediction market architectures, continuous double auction (CDA) markets and logarithmic market scoring rule (LMSR) markets. We find that the LMSR market structure produces more accurate forecasts than CDA markets, with especially pronounced differences on questions with fewer traders. Second, we compare the winning LMSR market against team-based prediction polls. Each of these two systems employs both a large and a small but elite crowd. Forecasters who place in the top 2% in a given competition season are considered elite in following seasons. Small crowds of elite forecasters outperform larger, sub-elite crowds, achieving aggregate Brier score improvements of 20% in prediction markets and 23% in prediction polls. LMSR prediction markets and team-based prediction polls are statistically tied. Finally, while prediction polls produce more reliable sub-elite rankings, the two systems are equally effective in identifying elite forecasters. Managerial implications: Our study provides clear guidelines for managers seeking to collect accurate judgmental forecasts: the accuracy benefits of “superforecasting” hold across prediction systems and managers should move towards deploying small, select crowds.
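For readers unfamiliar with the LMSR mechanism compared above, the sketch below shows the standard logarithmic market scoring rule market maker: a cost function C(q) = b * log(sum_i exp(q_i / b)) whose gradient gives the implied outcome probabilities. The liquidity parameter b and the trade sizes are illustrative choices, not values from the study.

```python
import math

def lmsr_cost(q, b=100.0):
    """LMSR market maker cost function C(q) = b * log(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def lmsr_price(q, i, b=100.0):
    """Implied probability of outcome i at outstanding share vector q."""
    total = sum(math.exp(qj / b) for qj in q)
    return math.exp(q[i] / b) / total

def lmsr_trade_cost(q, i, shares, b=100.0):
    """Amount a trader pays to buy `shares` of outcome i from state q."""
    q_new = list(q)
    q_new[i] += shares
    return lmsr_cost(q_new, b) - lmsr_cost(q, b)

# Two-outcome market starting at 50/50; buying 50 "yes" shares moves the price.
q = [0.0, 0.0]
print(lmsr_price(q, 0))             # 0.5
print(lmsr_trade_cost(q, 0, 50.0))  # ~28.1 paid for the trade
print(lmsr_price([50.0, 0.0], 0))   # ~0.62 new implied probability
```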
... forecasts) are important in organizational decision-making, such as strategic decisions informed by economic and industry forecasting as well as operational decisions informed by quantities such as project development cost. While such decisions are often made by the expert judgement of a single manager, both statistical and empirical findings show that the aggregated opinion of multiple people can be more accurate than even the most expert individuals (Ashton 1986, Hogarth 1978, Nofer and Hinz 2014), a phenomenon known as the "wisdom of crowds" (Atanasov et al. 2017, Budescu and Chen 2014, Da and Huang 2020, Frey and van de Rijt 2020, Mannes 2009, Palley and Soll 2019). ...
... One of the most widely studied estimation tasks is forecasting (e.g., Atanasov et al. 2017, Dalkey and Helmer 1963, Jansen et al. 2016), including forecasting revenue (Da and Huang 2020) and sales (Cowgill & Zitzewitz, 2015), predicting the success of an advertising campaign (Hartnett, Kennedy, Sharp, & Greenacre, 2016), or estimating future macroeconomic indicators (Jansen, Jin, & de Winter, 2016). In one classic case study, Cyert and March (1963) describe a construction firm for which expectations of future business volume played a central role in the decision to move operations to a new location. ...
... When crowdsourcing numeric estimates, it has been popularly argued (Surowiecki 2004) that group accuracy requires strictly independent individuals, and that interacting groups are subject to risks associated with "herding" (Lorenz et al. 2011) and "correlated error" (Hong et al. 2016). Thus one area of research has been the development of strategies designed to optimize the aggregation of beliefs from multiple independent contributors (Atanasov et al. 2017, Budescu and Chen 2014, Da and Huang 2020, Mannes et al. 2014, Palley and Soll 2019). This paradigm is motivated in part by the "diversity prediction theorem" by Page (2007), a reinterpretation of the variance-bias tradeoff in statistical estimation which states the following: group error = average individual error − diversity. ...
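In symbols (squared-error case, with individual estimates x_i, crowd mean x̄, and truth θ), the identity behind that statement can be written as:

```latex
\underbrace{(\bar{x}-\theta)^{2}}_{\text{group error}}
\;=\;
\underbrace{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\theta)^{2}}_{\text{average individual error}}
\;-\;
\underbrace{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}}_{\text{diversity}}
```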
Preprint
Full-text available
Decades of research suggest that information exchange in groups and organizations can reliably improve judgment accuracy in tasks such as financial forecasting, market research, and medical decision-making. However, we show that improving the accuracy of numeric estimates does not necessarily improve the accuracy of decisions. For binary choice judgments, also known as classification tasks--e.g. yes/no or build/buy decisions--social influence is most likely to grow the majority vote share, regardless of the accuracy of that opinion. As a result, initially inaccurate groups become increasingly inaccurate after information exchange even as they signal stronger support. We term this dynamic the "crowd classification problem." Using both a novel dataset as well as a reanalysis of three previous datasets, we study this process in two types of information exchange: (1) when people share votes only, and (2) when people form and exchange numeric estimates prior to voting. Surprisingly, when people exchange numeric estimates prior to voting, the binary choice vote can become less accurate even as the average numeric estimate becomes more accurate. Our findings recommend against voting as a form of decision-making when groups are optimizing for accuracy. For those cases where voting is required, we discuss strategies for managing communication to avoid the crowd classification problem. We close with a discussion of how our results contribute to a broader contingency theory of collective intelligence.
... Esports and its related brands occupy 24.2% of the hours watched on Twitch.tv. About 609 million people spent over 5 billion hours watching video game streams in 2016. Despite the popularity of these live shows, their quality varies significantly. We hypothesize that an audience's perceived quality for such live streamed content is, in part, derived from the surprise in the content. ...
... A few studies have compared the performance of prediction markets and polls, though there is no conclusive answer [8,16,1]. Both Goel et al. [8] and Rieg and Schoder [16] find no significant differences between these two methods. ...
... Both Goel et al. [8] and Rieg and Schoder [16] find no significant differences between these two methods. Atanasov et al. [1] find that the aggregation rules used in prediction polls affect their accuracy. For example, simply averaging all polls performs worse than a prediction market, while weighting the polls properly leads to better performance than a prediction market. ...
Preprint
Information flow measures, over the duration of a game, the audience's belief about who will win, and thus can reflect the amount of surprise in a game. To quantify the relationship between information flow and audiences' perceived quality, we conduct a case study where subjects watch one of the world's biggest esports events, LOL S10. In addition to eliciting information flow, we also ask subjects to report their rating for each game. We find that the amount of surprise at the end of the game plays a dominant role in predicting the rating. This suggests the importance of incorporating when the surprise occurs, in addition to the amount of surprise, in perceived quality models. For content providers, it implies that, everything else being equal, it is better for twists to be more likely to happen toward the end of a show rather than uniformly throughout.
... One recent exception is the work by Atanasov et al. (2017), who conducted a large-scale long-term experimental test of prediction markets and prediction polls. They found that market prices were more accurate than simple mean forecasts based on estimates from polls (Atanasov et al. 2017). However, polls outperformed markets when statistical aggregation techniques that included temporal decay, differential weighting based on past performance, and recalibration were used. ...
... In this paper, we build on this foundation and compare the performance of groups and markets to better understand their respective advantages and shortfalls (Budescu 2007, 2013; Atanasov et al. 2017). To achieve this goal, we seek to understand conditions that attenuate differences between performance levels of groups and markets. ...
Article
A crucial challenge for organizations is to pool and aggregate information effectively. Traditionally, organizations have relied on committees and teams, but recently many organizations have explored the use of information markets. In this paper, the authors compared groups and markets in their ability to pool and aggregate information in a hidden-profiles task. In Study 1, groups outperformed markets when there were no conflicts of interest among participants, whereas markets outperformed groups when conflicts of interest were present. Also, participants had more trust in groups to uncover hidden profiles than in markets. Study 2 generalized these findings to a simple prediction task, confirming that people had more trust in groups than in markets. These results were not qualified by conflicts of interest. Drawing on experienced forecasters from Good Judgment Open, Study 3 found that familiarity and experience with markets increased the endorsement and use of markets relative to traditional committees.
... But later research demonstrated that judgmental forecasting can also result in accurate predictions about the future as some forecasters consistently forecast the future more accurately than others in forecasting tournaments (Mellers et al., 2014;Mellers, Stone, Murray, et al., 2015;Tetlock & Gardner, 2016). Forecasting tournaments are prediction polls in which participants make probability forecasts on questions about the future state of the world at a specific point in time (Atanasov et al., 2017;Tetlock, Mellers, Rohrbaugh, & Chen, 2014). As forecasters can update their probability forecasts over time until the question is resolved, forecasters can integrate new information in their forecasts. ...
... Building on our reconceptualization of foresight, we propose the time-weighted Brier score (TWBS) to operationalize foresight in forecasting tournaments (Atanasov et al., 2017;Tetlock et al., 2014). The TWBS represents a mathematical generalization of the BS that weights forecast errors made closer to the event horizon more heavily, which is consistent with the argument that forecasting future states of the world becomes easier over time. ...
... Judgmental forecasting research has demonstrated that forecasters differ in their foresight--their ability to consistently forecast future states of the world accurately (Mellers et al., 2014;Mellers, Stone, Murray, et al., 2015;Tetlock & Gardner, 2016). In forecasting tournaments (Atanasov et al., 2017;Tetlock et al., 2014), the top 2% of forecasters--called superforecasters--were able to forecast future states of the world more accurately than the other forecasters across a variety of geopolitical forecasting questions two years in a row (Mellers, Stone, Murray, et al., 2015). ...
Conference Paper
Judgmental forecasting research has demonstrated that forecasters differ in their foresight––their ability to consistently forecast future states of the world accurately. However, the conceptualization of foresight underlying this research stream focuses exclusively on accuracy and thereby, neglects the time dimension of foresight. We develop a reconceptualization of foresight that integrates the dimensions of accuracy and time. To provide the theoretical basis for this integration, we propose a forecasting framework suggesting that forecasting future states of the world becomes easier over time as the availability of signals, which enable forecasters to forecast future states of the world more accurately, increases over time. Therefore, we reconceptualize foresight as the ability to forecast future states of the world accurately and early. To operationalize this reconceptualization of foresight in forecasting tournaments, we propose the time-weighted Brier score (TWBS), which weights errors made closer to the event horizon more heavily. We prove analytically that the TWBS is a strictly proper scoring rule that represents a mathematical generalization of the Brier score (BS). Furthermore, we show in a simulation study that the linear and square root TWBS, which weight the forecast errors over time heavier according to the linear and square root function, measure the true foresight of persons better than the BS under various theoretical signal trajectories. Taken together, we contribute to the emergent literature on foresight by providing a theoretical framework, reconceptualization, and generalized operationalization of foresight that integrate the dimensions of accuracy and time.
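A rough sketch of how such a time-weighted score could be computed from a forecaster's daily Brier scores on one question is given below; the weighting functions shown (linear and square root of elapsed time, normalized to sum to one) follow the description above, but the exact formulation in the paper may differ.

```python
def time_weighted_brier(daily_brier_scores, weighting="linear"):
    """Sketch of a time-weighted Brier score over one question's lifetime.

    daily_brier_scores: one forecaster's Brier scores ordered from the day
    the question opened to the day it resolved. Days closer to resolution
    (the event horizon) get larger weights, reflecting the idea that the
    question should be easier to call late in its life.
    """
    n = len(daily_brier_scores)
    if weighting == "linear":
        weights = [(t + 1) / n for t in range(n)]
    elif weighting == "sqrt":
        weights = [((t + 1) / n) ** 0.5 for t in range(n)]
    else:  # plain, unweighted mean Brier score
        weights = [1.0] * n
    return sum(w * s for w, s in zip(weights, daily_brier_scores)) / sum(weights)

# Inaccurate early but accurate late: late accuracy counts for more under time weighting.
scores = [0.9, 0.6, 0.3, 0.1, 0.02]
print(time_weighted_brier(scores, "none"))    # ~0.38 (plain mean)
print(time_weighted_brier(scores, "linear"))  # ~0.23
```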
... One simple but effective strategy to improve the accuracy of numeric estimates such as forecasts is to use the average of multiple estimates (Ashton 1986, Clemen 1989, Hogarth 1978), taking advantage of a statistical phenomenon that has been popularized as the "wisdom of crowds" (Almaatouq, Noriega-Campero, et al. 2020, Atanasov et al. 2017, Becker et al. 2017, Budescu and Chen 2014, Da and Huang 2020, Keuschnigg and Ganser 2016, Minson et al. 2018, Palley and Soll 2019). A central practical question is how communication between group members impacts the accuracy of the resulting "crowd" estimate (Atanasov et al. 2017, Becker et al. 2017, Da and Huang 2020). ...
... Despite a common theoretical expectation that groups produce the most accurate estimates when people are independent (Budescu and Chen 2014, Hogarth 1978, Lorenz et al. 2011, Palley and Soll 2019, Surowiecki 2004), experimental research has found that communication can increase the accuracy of numeric estimates under carefully controlled conditions (Almaatouq, Noriega-Campero, et al. 2020, Atanasov et al. 2017, Becker et al. 2017, Jayles et al. 2017, Minson et al. 2018). While this body of research has provided strong evidence that social exchange can sometimes improve the wisdom of crowds, these experiments have produced contradictory results on what exactly is required for estimate accuracy to improve (Hastie 1986). ...
Preprint
Full-text available
Research on belief formation has produced contradictory findings on whether and when communication between group members will improve the accuracy of numeric estimates such as economic forecasts, medical diagnoses, and job candidate assessments. While some evidence suggests that carefully mediated processes such as the "Delphi method" produce more accurate beliefs than unstructured discussion, others argue that unstructured discussion outperforms mediated processes. Still others argue that independent individuals produce the most accurate beliefs. This paper shows how network theories of belief formation can resolve these inconsistencies, even when groups lack apparent structure as in informal conversation. Emergent network structures of influence interact with the pre-discussion belief distribution to moderate the effect of communication on belief formation. As a result, communication sometimes increases and sometimes decreases the accuracy of the average belief in a group. The effects differ for mediated processes and unstructured communication, such that the relative benefit of each communication format depends on both group dynamics as well as the statistical properties of pre-interaction beliefs. These results resolve contradictions in previous research and offer practical recommendations for teams and organizations.
... One practical limitation of prediction markets is that many potential participants lack a background in commodities trading and, as a result, have difficulty expressing their forecasts. An alternative method of crowdsourcing forecasts for infectious disease surveillance is the use of prediction polls that aggregate individual forecasts statistically using recency-based subsetting, differential weighting based on past performance, and recalibration [11]. This method allows forecasters to make predictions using a more intuitive format in which they express beliefs by providing probabilities for potential outcomes. ...
... Outcomes are eventually resolved using ground truth and forecasters are scored on both accuracy and timeliness. In large-scale head-to-head comparisons of geopolitical forecasts, such prediction polls have proven to be as accurate as prediction markets [11]. Prediction polls are conducted to generate forecasts about future outcomes of interest and differ from classic "opinion polling." ...
... Individual forecasts for each question were aggregated using an algorithm developed and tested through experience with several IARPA research programs in geopolitical crowd forecasting [11]. Individual forecasts were first weighted, with greater weight given to forecasters who update their predictions frequently and who have a better past record of accuracy. ...
Article
Full-text available
Background: The global spread of COVID-19 has shown that reliable forecasting of public health related outcomes is important but lacking. Methods: We report the results of the first large-scale, long-term experiment in crowd-forecasting of infectious-disease outbreaks, where a total of 562 volunteer participants competed over 15 months to make forecasts on 61 questions with a total of 217 possible answers regarding 19 diseases. Results: Consistent with the “wisdom of crowds” phenomenon, we found that crowd forecasts aggregated using best-practice adaptive algorithms are well-calibrated, accurate, timely, and outperform all individual forecasters. Conclusions: Crowd forecasting efforts in public health may be a useful addition to traditional disease surveillance, modeling, and other approaches to evidence-based decision making for infectious disease outbreaks.
... Yemen (Atanasov et al. 2016). Here, proper scoring is important to incentivize forecasters to invest effort and report truthfully, and to judge the relative accuracy of forecasts. ...
... This is important for a number of applications. In forecast aggregation, for example, it is known that using weighted averages, where more weight is put on more accurate forecasters, outperforms simple averaging of forecasts (Atanasov et al. 2016). To make this work with proper scoring rules, one requires early-closing questions, which can then inform the weighting of forecasters for other questions. ...
... In practice, forecasters often report on more than one event, in which case scores are typically averaged over questions (e.g., Brier 1950, Gneiting and Raftery 2007, Atanasov et al. 2016). With the quadratic scoring rule, for example, a forecaster reporting y1, ...
Article
Full-text available
Proper scoring rules can be used to incentivize a forecaster to truthfully report her private beliefs about the probabilities of future events and to evaluate the relative accuracy of forecasters. While standard scoring rules can score forecasts only once the associated events have been resolved, many applications would benefit from instant access to proper scores. In forecast aggregation, for example, it is known that using weighted averages, where more weight is put on more accurate forecasters, outperforms simple averaging of forecasts. We introduce proxy scoring rules, which generalize proper scoring rules and, given access to an appropriate proxy, allow for immediate scoring of probabilistic forecasts. In particular, we suggest a proxy-scoring generalization of the popular quadratic scoring rule, and characterize its incentive and accuracy evaluation properties theoretically. Moreover, we thoroughly evaluate it experimentally using data from a large real world geopolitical forecasting tournament, and show that it is competitive with proper scoring rules when the number of questions is small.
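The core idea, scoring a forecast against an immediately available proxy (for example, an independent crowd consensus) instead of waiting for the event to resolve, can be sketched as below; the actual proxy-scoring rule proposed in the paper may include correction terms that this simplified version omits.

```python
def quadratic_score(p, outcome):
    """Standard quadratic (Brier-type) score of a binary forecast p against a
    resolved outcome in {0, 1}; higher is better in this formulation."""
    return 1.0 - (p - outcome) ** 2

def proxy_quadratic_score(p, proxy):
    """Proxy scoring sketch: the resolved outcome is replaced by a proxy
    probability (e.g., an independent crowd consensus), so the forecast can
    be scored before the event resolves."""
    return 1.0 - (p - proxy) ** 2

# Score a 0.8 forecast immediately against a 0.7 consensus proxy,
# instead of waiting for ground truth.
print(proxy_quadratic_score(0.8, proxy=0.7))  # 0.99
print(quadratic_score(0.8, outcome=1))        # 0.96 once the event resolves
```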
... In general, survey respondents struggle with questions that ask for shares of groups in the population (e.g., Kunovich, 2017;Joslyn and Haider-Markel, 2018). Mistakes in reporting probabilities can take the form of overly extreme probabilities (e.g., Kahneman, 2011) or probabilities overly close to 50 percent, depending on circumstances (Baron et al., 2014;Atanasov et al., 2017). In terms of the specific information required to answer accurately, the compositional question p(X | vote) is more difficult than the behavioral question p(vote | X ), as only the latter is typically reported in the media when presenting demographic breakdowns of election results. ...
... Baron et al., 2014;Atanasov et al., 2017). ...
Article
Full-text available
How well do citizens understand the associations between social groups and political divisions in their societies? Previous research has indicated systematic biases in how the demographic composition of party supporters are perceived, but this need not imply that citizens misperceive the likely voting behavior of specific individuals. We report results from two experiments where subjects were provided with randomly selected demographic profiles of respondents to the 2017 British Election Study (BES) and then asked to assess either (1) which party that individual was likely to have voted for in the 2017 UK election or (2) whether that individual was likely to have voted Leave or Remain in the 2016 UK referendum on EU membership. We find that, despite substantial overconfidence in individual responses, on average citizens’ guesses broadly reflect the actual distribution of groups supporting the parties and referendum positions.
... Previous work has examined such direct human forecasts in various contexts, such as geopolitics (Atanasov et al., 2016;Tetlock et al., 2014), meta-science (Hoogeveen et al., 2020;ReplicationMarkets, 2020), sports (Servan-Schreiber et al., 2004) and epidemiology (Farrow et al., 2017;McAndrew & Reich, 2020;Recchia et al., 2021). Several prediction platforms (CSET Foretell, 2021;Hypermind, 2021;Metaculus, 2020) and prediction markets (PredictIt, 2021) have been created to collate expert and non-expert predictions. ...
... However, most of these studies either focused only on the evaluation of (expert-tuned) model-based approaches (Cramer et al., 2021;Cramer et al., 2020;e.g. Funk et al., 2020), or exclusively on human forecasts (Atanasov et al., 2016;McAndrew & Reich, 2020;Recchia et al., 2021;Tetlock et al., 2014). In contrast, we directly compared human and model-based forecasts. ...
Preprint
Full-text available
Forecasts based on epidemiological modelling have played an important role in shaping public policy throughout the COVID-19 pandemic. This modelling combines knowledge about infectious disease dynamics with the subjective opinion of the researcher who develops and refines the model and often also adjusts model outputs. Developing a forecast model is difficult, resource- and time-consuming. It is therefore worth asking what modelling is able to add beyond the subjective opinion of the researcher alone. To investigate this, we analysed different real-time forecasts of cases of and deaths from COVID-19 in Germany and Poland over a 1-4 week horizon submitted to the German and Polish Forecast Hub. We compared crowd forecasts elicited from researchers and volunteers, against a) forecasts from two semi-mechanistic models based on common epidemiological assumptions and b) the ensemble of all other models submitted to the Forecast Hub. We found crowd forecasts, despite being overconfident, to outperform all other methods across all forecast horizons when forecasting cases (weighted interval score relative to the Hub ensemble 2 weeks ahead: 0.89). Forecasts based on computational models performed comparably better when predicting deaths (rel. WIS 1.26), suggesting that epidemiological modelling and human judgement can complement each other in important ways.
... But judgmental and crowdsourced forecasting techniques have progressed as well, enabled by internet connectivity [29], as well as methods such as proper scoring rules [5], training, teaming [33], forecaster tracking, i.e. "superforecasting" [32,43], reference class forecasting [11], peer prediction [38], prediction markets [46] and prediction polling with statistical aggregation [3]. ...
... Again, all human crowdsourcing methods significantly outperformed the RSF model. Addressing question 1, the human crowdsourcing methods produced more accurate forecasts than the Machine model in each tournament, as well as in the combined set of questions. Regarding question 2, the human elicitation methods did not significantly differ in accuracy as measured by Brier scores. ...
Preprint
Full-text available
How do we effectively combine historical data and human insights to predict complex outcomes? How well do human crowds compete with predictive algorithms? We provide the first description of the Human Forest method, which enables forecasters to define custom reference classes, query a historical database and review base rates specific to their selections. These base rates, and adjusted probabilistic estimates, are then aggregated statistically. Forecasters receive proper scoring feedback and accuracy incentives. Human Forest together with a new aggregation algorithm, Most Popular Selections, addresses the classic reference class problem by employing the wisdom of crowds: eliciting and aggregating reference class selection judgments. We assess the performance of Human Forest against basic Control Polls and a Random Survival Forest machine model. Methods are evaluated in two 6-month forecasting tournaments with a total of 60 questions focused on time-specific clinical trial success prediction for vaccines and treatments for COVID-19 and other infectious diseases. Results show that human crowdsourcing methods significantly outperform the predictive model, registering mean Brier score improvements between 37% and 49% for Human Forest over the predictive model. Human Forest and Control Polls exhibit approximately equivalent performance. Including Human Forest-derived base rate estimates at the aggregation stage improves overall performance. Interactive access to base-rate data through Human Forest does not hurt forecasting performance, even when new events strongly deviate from historical patterns. Even simple crowdsourcing tools can add value above predictive models in dynamic settings, while eliciting forecasters’ outside-view judgments may be especially helpful to accuracy in stable environments.
... Top-k + Weighted Mean: the aggregator computes a weighted mean using the forecasts of the top k users in terms of the number of updates. Top-k + Weighted Mean + Extremize: after computing the weighted mean of the top-k users, the aggregator extremizes the forecast using the formula of Atanasov et al. (2016): f_e = f^a / (f^a + (1 − f)^a). Atanasov et al. (2016) found that the optimal value of ... The top row shows the three statistics (updates, returns, and change) for the four treatments. Treatments SR+PP and PP perform significantly better than SR for all three statistics; however, SR+PPRank performs significantly worse than SR for the number of daily returns. ...
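A minimal sketch of that extremizing transform; the exponent a is a tuning parameter (values around 1.5 are reported in related excerpts on this page), with a > 1 pushing the aggregate away from 0.5.

```python
def extremize(p, a=1.5):
    """Recalibration step: push an aggregate probability p away from 0.5.
    Implements f_e = p**a / (p**a + (1 - p)**a); a > 1 sharpens the forecast,
    a = 1 leaves it unchanged."""
    return p ** a / (p ** a + (1 - p) ** a)

# A mildly confident crowd mean becomes more decisive.
print(extremize(0.70))  # ~0.78 with a = 1.5
```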
... Previous work has examined such direct human forecasts in various contexts, such as geopolitics [19,20], meta-science [21,22], sports [23] and epidemiology [11,24,25]. Several prediction platforms [26][27][28] and prediction markets [29] have been created to collate expert and non-expert predictions. ...
... However, most of these studies either focused only on the evaluation of (expert-tuned) model-based approaches [e.g. 12,13,14], or exclusively on human forecasts [19,20,24,25]. In contrast, we directly compared human and model-based forecasts. ...
Article
Full-text available
Forecasts based on epidemiological modelling have played an important role in shaping public policy throughout the COVID-19 pandemic. This modelling combines knowledge about infectious disease dynamics with the subjective opinion of the researcher who develops and refines the model and often also adjusts model outputs. Developing a forecast model is difficult, resource- and time-consuming. It is therefore worth asking what modelling is able to add beyond the subjective opinion of the researcher alone. To investigate this, we analysed different real-time forecasts of cases of and deaths from COVID-19 in Germany and Poland over a 1-4 week horizon submitted to the German and Polish Forecast Hub. We compared crowd forecasts elicited from researchers and volunteers, against a) forecasts from two semi-mechanistic models based on common epidemiological assumptions and b) the ensemble of all other models submitted to the Forecast Hub. We found crowd forecasts, despite being overconfident, to outperform all other methods across all forecast horizons when forecasting cases (weighted interval score relative to the Hub ensemble 2 weeks ahead: 0.89). Forecasts based on computational models performed comparably better when predicting deaths (rel. WIS 1.26), suggesting that epidemiological modelling and human judgement can complement each other in important ways.
... are still widely used today to motivate and measure forecasting accuracy (e.g., Atanasov et al. 2017) as well as an active area of research in decision analysis (e.g., Jose 2017, Grushka-Cockayne et al. 2017). ...
... Lichtendahl et al. (2013) show that under a commonly-known public-private signal model, a simple average of "gamed" forecasts is more accurate than a simple average of truthful forecasts. However, state-of-the-art aggregation algorithms, such as the extremized mean (Atanasov et al. 2017) and the logit aggregator (Satopää et al. 2014), consistently outperform simple averaging in practice and can take advantage of truthful reports. ...
Preprint
Full-text available
We initiate the study of incentive-compatible forecasting competitions in which multiple forecasters make predictions about one or more events and compete for a single prize. We have two objectives: (1) to incentivize forecasters to report truthfully, so that forecasts are informative and forecasters need not spend any cognitive effort strategizing about reports, and (2) to award the prize to the most accurate forecaster. Proper scoring rules incentivize truthful reporting if all forecasters are paid according to their scores. However, incentives become distorted if only the best-scoring forecaster wins a prize, since forecasters can often increase their probability of having the highest score by reporting more extreme beliefs. In this paper, we introduce two novel forecasting competition mechanisms. Our first mechanism is dominant strategy incentive compatible and guaranteed to select the most accurate forecaster with probability higher than any other forecaster. Moreover, we show that in the standard single-event, two-forecaster setting and under mild technical conditions, no other incentive-compatible mechanism selects the most accurate forecaster with higher probability. Our second mechanism is incentive compatible when forecasters' beliefs are such that information about one event does not lead to a belief update on the other events, and it selects the best forecaster with probability approaching 1 as the number of events grows. Our mechanisms are easy to implement and can be generalized to the related problems of outputting a ranking over forecasters and hiring a forecaster with high accuracy on future events.
... Market overreaction is a common occurrence (De Bondt and Thaler 1985) and could be due to the discounting of stable cues (e.g., base rates) in favor of noisy inside-view cues (e.g., case-specific information), especially when the inside cues are extreme (Griffin and Tversky 1992). Institutional practices may contribute as well. For example, US intelligence training emphasizes the need to avoid underreaction to new evidence, which could increase the risks of overreaction, and bring about advantages to incremental updaters, i.e., forecasters who tend to make smaller revisions, and may be less prone to overreact to new information. ...
... Information cascades may produce aggregate-level overreaction even without individual-level overreaction. Koehler (1996) notes that base-rate neglect depends on the structure and representation of the task, and argues in favor of ignoring base rates that are ambiguous, unreliable or unstable. ...
... From an economic perspective, the aim of many professional football clubs is to buy undervalued players to achieve both higher performance and higher returns on investment [18]. Moreover, a rapidly growing body of literature emphasizes the importance of collective judgements for assessing actual and future values [17,19]. Recent studies showed that the variance of actual transfer fees paid (for players) in the German Bundesliga can almost entirely be explained (R² = 0.90) by the market values reported on transfermarkt.de ...
... [17]. Current literature suggests that player market values on transfermarkt.de are good proxy estimate indicators of current as well as future players' real market values and will, therefore, play an increasing role in talent recruitment, sports economics and talent development [17,19]. ...
Article
Full-text available
Background: In football, annual age-group categorization leads to relative age effects (RAEs) in talent development. Given such trends, relative age may also associate with market values. This study analyzed the relationship between RAEs and market values of youth players. Methods: Age category, birthdate, and market values of 11,738 youth male football players were obtained from the "transfermarkt.de" database, which delivers a good proxy for real market values. RAEs were calculated using odds ratios (OR) with 95% confidence intervals (95%CI). Results: Significant RAEs were found across all age-groups (p < 0.05). The largest RAEs occurred in U18 players (Q1 [relatively older] v Q4 [relatively younger] OR = 3.1). ORs decreased with age category, i.e., U19 (2.7), U20 (2.6), U21 (2.4), U22 (2.2), and U23 (1.8). At U19s, Q1 players were associated with significantly higher market values than Q4 players. However, by U21, U22, and U23 RAEs were inversed, with correspondingly higher market values for Q4 players apparent. While large typical RAEs for all playing positions were observed in younger age categories (U18-U20), inversed RAEs were only evident for defenders (small-medium) and for strikers (medium-large) in U21-U23 (not goalkeepers and midfielders). Conclusions: Assuming an equal distribution of football talent exists across annual cohorts, results indicate the selection and market value of young professional players are dynamic. Findings suggest a potential biased selection, and undervaluing of Q4 players in younger age groups, as their representation and market value increased over time. By contrast, the changing representations and market values of Q1 players suggest initial overvaluing in performance and monetary terms. Therefore, this inefficient talent selection and the accompanying waste of money should be improved.
... With artificial intelligence (AI) systems being an important part of our lives, combining crowdsourcing workflows with AI tools, hybrid intelligence, promises great potential for improving human-only workflows. Therefore, researchers have developed intelligent hybrid systems for real-time speech transcription (Lasecki et al., 2013, 2017), clustering data points (Gomes et al., 2011; Tamuz et al., 2011; Heikinheimo and Ukkonen, 2013), forecasting political or economic events (Baron et al., 2014; Mellers et al., 2015; Atanasov et al., 2017) or scheduling conference meetings (Bhardwaj et al., 2014; Chilton et al., 2014). These hybrid workflows have been proven to perform better than human-only and machine-only systems. ...
Conference Paper
Full-text available
In recent years, crowdsourcing has gained much attention from researchers to generate data for the Natural Language Generation (NLG) tools or to evaluate them. However, the quality of crowdsourced data has been questioned repeatedly because of the complexity of NLG tasks and crowd workers' unknown skills. Moreover, crowdsourcing can also be costly and often not feasible for large-scale data generation or evaluation. To overcome these challenges and leverage the complementary strengths of humans and machine tools, we propose a hybrid human-machine workflow designed explicitly for NLG tasks with real-time quality control mechanisms under budget constraints. This hybrid methodology is a powerful tool for achieving high-quality data while preserving efficiency. By combining human and machine intelligence, the proposed workflow decides dynamically on the next step based on the data from previous steps and given constraints. Our goal is to provide not only the theoretical foundations of the hybrid workflow but also to provide its implementation as open-source in future work.
... Promising follow-up research is beginning to combine the predictions of 'nonexpert' forecasters from Metaculus and the Good Judgment Project with those of epidemiological modelers to produce consensus forecasts of hopefully greater accuracy than either in isolation, as well as a 'meta-forecast' which combines this consensus forecast with an ensemble of forecasts from computational models [34]; the results have yet to be systematically evaluated. Other initiatives to solicit and evaluate a wide range of approaches to epidemiological forecasting, such as the DARPA Chikungunya challenge [35], in combination with research on approaches to aggregating forecasts of subject-matter experts [36] and nonexperts [37][38][39], have also established promising routes toward improving forecasting of epidemics. In other words, we are not all doomed to be overconfident: there is much that can be done to improve the accuracy and calibration of forecasts, at least in the context of forecasting tournaments. ...
Article
Full-text available
Throughout the COVID-19 pandemic, social and traditional media have disseminated predictions from experts and nonexperts about its expected magnitude. How accurate were the predictions of 'experts'-individuals holding occupations or roles in subject-relevant fields, such as epidemiologists and statisticians-compared with those of the public? We conducted a survey in April 2020 of 140 UK experts and 2,086 UK laypersons; all were asked to make four quantitative predictions about the impact of COVID-19 by 31 Dec 2020. In addition to soliciting point estimates, we asked participants for lower and higher bounds of a range that they felt had a 75% chance of containing the true answer. Experts exhibited greater accuracy and calibration than laypersons, even when restricting the comparison to a subset of laypersons who scored in the top quartile on a numeracy test. Even so, experts substantially underestimated the ultimate extent of the pandemic, and the mean number of predictions for which the expert intervals contained the actual outcome was only 1.8 (out of 4), suggesting that experts should consider broadening the range of scenarios they consider plausible. Predictions of the public were even more inaccurate and poorly calibrated, suggesting that an important role remains for expert predictions as long as experts acknowledge their uncertainty.
... Research on human computation provides a solution for these problems (Quinn and Bederson 2011). Collective intelligence leverages the "wisdom of crowds" to aggregate the evaluations of a large group of humans, thereby reducing the noise and biases of individual predictions (Atanasov et al. 2017; Cowgill and Zitzewitz 2015; Blohm et al. 2016). The value of crowds compared to individuals rests on two basic principles: error reduction and knowledge aggregation (Larrick et al. 2011; Mellers et al. 2015). ...
Preprint
Full-text available
Artificial intelligence is an emerging topic and will soon be able to make decisions better than humans. In more complex and creative contexts such as innovation, however, the question remains whether machines are superior to humans. Machines fail in two kinds of situations: processing and interpreting soft information (information that cannot be quantified) and making predictions in unknowable risk situations of extreme uncertainty. In such situations, the machine does not have representative information for a certain outcome. Thus, humans are still the gold standard for assessing soft signals and making use of intuition. To predict the success of startups, we thus combine the complementary capabilities of humans and machines in a Hybrid Intelligence method. To reach our aim, we follow a design science research approach to develop a Hybrid Intelligence method that combines the strength of both machine and collective intelligence to demonstrate its utility for predictions under extreme uncertainty.
... Prediction markets are designed to aggregate information that is widely dispersed amongst agents. The market price is expected to converge to a relatively stable value, which is interpreted as a probability of the outcome occurring [30,31]. For replication markets it is unknown how quickly the market can converge. ...
Article
Full-text available
The reproducibility of published research has become an important topic in science policy. A number of large-scale replication projects have been conducted to gauge the overall reproducibility in specific academic fields. Here, we present an analysis of data from four studies which sought to forecast the outcomes of replication projects in the social and behavioural sciences, using human experts who participated in prediction markets and answered surveys. Because the number of findings replicated and predicted in each individual study was small, pooling the data offers an opportunity to evaluate hypotheses regarding the performance of prediction markets and surveys at a higher power. In total, peer beliefs were elicited for the replication outcomes of 103 published findings. We find there is information within the scientific community about the replicability of scientific findings, and that both surveys and prediction markets can be used to elicit and aggregate this information. Our results show prediction markets can determine the outcomes of direct replications with 73% accuracy (n = 103). Both the prediction market prices, and the average survey responses are correlated with outcomes (0.581 and 0.564 respectively, both p < .001). We also found a significant relationship between p-values of the original findings and replication outcomes. The dataset is made available through the R package “pooledmaRket” and can be used to further study community beliefs towards replications outcomes as elicited in the surveys and prediction markets.
... More recently, motivated by the emergence of external and internal prediction markets, Bassamboo et al. (2018) empirically explore the effect of group size on forecast accuracy, finding that aggregation across larger groups improves accuracy. The notion that aggregation of a large number of estimates can improve estimation, sometimes described as the wisdom of crowds, has also received significant attention in the decision analysis, economics, forecasting, social network, and other literatures (e.g., Bates and Granger 1969; Ashton and Ashton 1985; Winkler and Clemen 2004; Wallis 2011; Acemoglu et al. 2014a, b; Atanasov et al. 2016; Tsoukalas and Falk 2020). Through that lens, one can view our work as exploring a related but different question: when each individual in a crowd wants to improve his or her own estimate (but cannot ask everyone in the crowd), then who in the crowd should an individual target? ...
Article
Problem definition: Autonomous sensors connected through the internet of things (IoT) are deployed by different firms in the same environment. The sensors measure an important operating-condition state variable, but their measurements are noisy, so estimates are imperfect. Sensors can improve their own estimates by soliciting estimates from other sensors. The choice of which sensors to communicate with (target) is challenging because sensors (1) are constrained in the number of sensors they can target and (2) only have partial knowledge of how other sensors operate—that is, they do not know others’ underlying inference algorithms/models. We study the targeting problem, examine the evolution of interfirm sensor communication patterns, and explore what drives the patterns. Academic/practical relevance: Many industries are increasingly using sensors to drive improvements in key performance metrics (e.g., asset uptime) through better information on operating conditions. Sensors will communicate among themselves to improve estimation. This IoT vision will have a major impact on operations management (OM), and OM scholars need to develop and examine models and frameworks to better understand sensor interactions. Methodology: Analytic modeling combining decision-making, estimation, optimization, and learning is used. Results: We show that when selecting its target(s), each sensor needs to consider both the measurement quality of the other sensors and its level of familiarity with their inference models. We establish that the state of the environment plays a key role in mediating quality and familiarity. When sensor qualities are public, we show that each sensor eventually settles on a constant target set, but this long-run target set is sample-path dependent (i.e., dependent on past states) and varies by sensor. The long-run network, however, can be fully defined at time zero as a random directed graph, and hence, one can probabilistically predict it. This prediction can be made perfect (i.e., the network can be identified in a deterministic way) after observing the state values for a limited number of periods. When sensor qualities are private, our results reveal that sensors may not settle on a constant target set but the subset among which it cycles can still be stochastically predicted. Managerial implications: Our work allows managers to predict (and influence) the set of other firms with which their sensors will form information links. Analogous to a manufacturer mapping its supplier base to help manage supply continuity, our work enables a firm to map its sensor-based-information suppliers to help manage information continuity.
... Prediction markets are designed to aggregate information that is widely dispersed amongst agents. The market price is expected to converge to a relatively stable value, which is interpreted as a probability of the outcome occurring [28,29]. For replication markets it is unknown how quickly the market can converge. ...
Preprint
Full-text available
The reproducibility of published research has become an important topic in science policy. A number of large-scale replication projects have been conducted to gauge the overall reproducibility in specific academic fields. Here, we present an analysis of data from four studies which sought to forecast the outcomes of replication projects in the social and behavioural sciences, using human experts who participated in prediction markets and answered surveys. Because the number of findings replicated and predicted in each individual study was small, pooling the data offers an opportunity to evaluate hypotheses regarding the performance of prediction markets and surveys at a higher power. In total, peer beliefs were elicited for the replication outcomes of 103 published findings. We find there is information within the scientific community about the replicability of scientific findings, and that both surveys and prediction markets can be used to elicit and aggregate this information. Our results show prediction markets can determine the outcomes of direct replications with 73% accuracy (n=103). Both the prediction market prices and the average survey responses are correlated with outcomes (0.581 and 0.564 respectively, both p < .001). We also found a significant relationship between p-values of the original findings and replication outcomes. The dataset is made available through the R package pooledmaRket and can be used to further study community beliefs towards replications outcomes as elicited in the surveys and prediction markets.
... Second, individuals in organizations often engage in repeated decision-making through which they have the opportunity to learn from feedback on prior decisions (Christensen and Knudsen 2009). Scholars have increasingly sought to apply the underlying ideas of the wisdom-of-crowds to organizational decisions (Page 2007), implying that the wisdom-of-crowds idea (Surowiecki 2005, Atanasov et al. 2016) is directly applicable. Our study suggests, however, that a naive application of the wisdom-of-crowds logic may lead to decision-making structures with inferior long-run performance outcomes. ...
Article
Full-text available
Organizational decision-making that leverages the collective wisdom and knowledge of multiple individuals is ubiquitous in management practice, occurring in settings such as top management teams, corporate boards, and the teams and groups that pervade modern organizations. Decision-making structures employed by organizations shape the effectiveness of knowledge aggregation. We argue that decision-making structures play a second crucial role in that they shape the learning of individuals that participate in organizational decision-making. In organizational decision making, individuals do not engage in learning-by-doing, but rather, in what we call learning-by-participating, which is distinct in that individuals learn by receiving feedback not on their own choices, but rather on the choice made by the organization. We examine how learning-by-participating influences the efficacy of aggregation and learning across alternative decision-making structures and group sizes. Our central insight is that learning-by-participating leads to an aggregation-learning tradeoff in which structures that are effective in aggregating information can be ineffective in fostering individual learning. We discuss implications for research on organizations in the areas of learning, microfoundations, teams, and crowds.
... The consensus is constructed as the aggregate of individual estimates elicited from the same group of forecasters. The aggregation follows the algorithm used in Atanasov et al. (2017), and features subsetting of the 72% most recent forecasts, higher weights placed on more frequent updaters on a given question, and an extremizing constant of a = 1.5. The aggregation algorithm was not optimized to produce maximally accurate estimates or serve as an optimal basis for proxy score calculation. ...
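A rough sketch of an aggregator combining these three ingredients (recency subsetting, heavier weights for frequent updaters, and extremization); the specific weight function and thresholds below are illustrative stand-ins rather than the cited algorithm's exact choices.

```python
def aggregate(forecasts, recency_share=0.72, a=1.5):
    """Sketch of a weighted, extremized crowd forecast for one binary question.

    forecasts: list of (timestamp, probability, n_updates_by_forecaster).
    Steps mirror the description above: keep the most recent share of
    forecasts, weight frequent updaters more heavily, take the weighted
    mean, then extremize toward 0 or 1.
    """
    # 1) Recency subsetting: keep roughly the most recent `recency_share` of forecasts.
    ordered = sorted(forecasts, key=lambda f: f[0])
    recent = ordered[int(len(ordered) * (1 - recency_share)):]

    # 2) Weighted mean; the weight function (1 + update count) is illustrative.
    weights = [1.0 + n_updates for _, _, n_updates in recent]
    mean = sum(w * p for w, (_, p, _) in zip(weights, recent)) / sum(weights)

    # 3) Extremization (recalibration).
    return mean ** a / (mean ** a + (1 - mean) ** a)

# Three forecasters; the more active ones lean toward "yes".
print(aggregate([(1, 0.55, 1), (2, 0.70, 5), (3, 0.65, 3)]))  # ~0.73
```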
Preprint
Full-text available
Who is good at prediction? Addressing this question is key to recruiting and cultivating accurate crowds and effectively aggregating their judgments. Recent research on superforecasting has demonstrated the importance of individual, persistent skill in crowd prediction. This chapter takes stock of skill identification measures in probability estimation tasks, and complements the review with original analyses, comparing such measures directly within the same dataset. We classify all measures in five broad categories: 1) accuracy-related measures, such as proper scores, model-based estimates of accuracy and excess volatility scores; 2) intersubjective measures, including proxy, surrogate and similarity scores; 3) forecasting behaviors, including activity, belief updating, extremity, coherence, and linguistic properties of rationales; 4) dispositional measures of fluid intelligence, cognitive reflection, numeracy, personality and thinking styles; and 5) measures of expertise, including demonstrated knowledge, confidence calibration, biographical, and self-rated expertise. Among non-accuracy-related measures, we report a median correlation coefficient with outcomes of r = 0.20. In the absence of accuracy data, we find that intersubjective and behavioral measures are most strongly correlated with forecasting accuracy. These results hold in a LASSO machine-learning model with automated variable selection. Two focal applications provide context for these assessments: long-term, existential risk prediction and corporate forecasting tournaments.
... Q7 and Q7b both score below average for feasibility, but Q7a scores the highest of all questions with respect to feasibility. These results are perhaps unsurprising given the empirical success of aggregation methods in forecasting (Atanasov et al., 2017) and the seeming amenability of these methods to theoretical analysis (Satopää et al., 2016). Because this topic is also scored as important, we interpret it to suggest that work drawing on this existing literature to develop AI-specific techniques for the aggregation and reporting of metrics should be prioritized. ...
Article
Full-text available
Forecasting AI progress is essential to reducing uncertainty in order to appropriately plan for research efforts on AI safety and AI governance. While this is generally considered to be an important topic, little work has been conducted on it and there is no published document that gives a balanced overview of the field. Moreover, the field is very diverse and there is no published consensus regarding its direction. This paper describes the development of a research agenda for forecasting AI progress which utilized the Delphi technique to elicit and aggregate experts’ opinions on what questions and methods to prioritize. Experts indicated that a wide variety of methods should be considered for forecasting AI progress. Moreover, experts identified salient questions that were both general and completely unique to the problem of forecasting AI progress. Some of the highest priority topics include the validation of (partially unresolved) forecasts, how to make forecasts action-guiding, and the quality of different performance metrics. While statistical methods seem more promising, there is also recognition that supplementing judgmental techniques can be quite beneficial.
... For each question, the advisory forecasts were simply the mean values assigned to each response option by the 65 judges in the pilot study. This ensured that the forecasts would be plausible and, based on the theory of the wisdom of crowds (Atanasov et al., 2017; Budescu & Chen, 2014; Surowiecki, 2005), they would be likely to be accurate as well. ...
Article
Full-text available
Past research has found that people treat advice differently depending on its source. In many cases, people seem to prefer human advice to algorithms, but in others, there is a reversal, and people seem to prefer algorithmic advice. Across two studies, we examine the persuasiveness of, and judges' preferences for, advice from different sources when forecasting geopolitical events. We find that judges report domain-specific preferences, preferring human advice in the domain of politics and algorithmic advice in the domain of economics. In Study 2, participants report a preference for hybrid advice, which combines human and algorithmic sources, over either one on its own, regardless of domain. More importantly, we find that these preferences did not affect the persuasiveness of advice from these different sources, regardless of domain. Judges were primarily sensitive to quantitative features pertaining to the similarity between their initial beliefs and the advice they were offered, such as the distance between them and the relative advisor confidence, when deciding whether to revise their initial beliefs in light of advice, rather than the source that generated the advice.
... For discussions of the success of prediction markets in aggregating information see Pennock et al. (2001), Chen and Plott (2002), Tetlock (2004), Gürkaynak and Wolfers (2006), Berg et al. (2008), Cowgill et al. (2009), Cowgill and Zitzewitz (2015), and Atanasov et al. (2017). In naturally occurring prediction markets, aggregate information is unlikely to identify an outcome with certainty and in such cases differential risk attitudes may provide a motive to trade. ...
Article
The efficient market hypothesis predicts that asset prices reflect all available information. A seminal experiment reported that contingent claim markets could yield market outcomes consistent with information aggregation when traders hold heterogeneous state-contingent values. However, a recent experiment found that the rational expectations model outperformed the prior information and maxi-min models in contingent claim markets when traders hold homogeneous values, despite the no-trade equilibrium in that setting. But that same study failed to replicate the original result, calling into question when, if ever, prices reliably reflect the aggregate information of traders with heterogeneous values. In this paper, we show contingent claim markets can robustly yield prices consistent with the efficient market hypothesis when traders hold heterogeneous values in certain circumstances. The key distinction between our environment and that of the previous studies is that we consider trader values that are correlated and not too dissimilar.
... The advice itself was generated from a prior pilot study, in which different participants forecasted the same questions. The average values across forecasts from the pilot were used as the advice values for both the human expert and statistical algorithm conditions, to ensure the advice was sensible, equivalent regardless of condition, and likely to be accurate based on the principle of the Wisdom of Crowds (Atanasov et al., 2017; Budescu & Chen, 2014; Himmelstein, Atanasov, & Budescu, 2021; Surowiecki, 2005). ...
Article
Full-text available
Research on advice utilization often operationalizes the construct via Judge Advisor Systems (JAS), where a judge’s belief is elicited, they are provided advice, and given an opportunity to revise their belief. Belief change, or weight of advice (WOA), is measured as the shift in the judge’s belief proportional to the difference between their original belief and the advice. Several JAS studies have found WOA typically takes on a trimodal distribution, with inflation at the boundary values of 0 (indicating a judge declined advice) and 1 (adoption of advice). A dual hurdle beta model is proposed to account for these inflations. In addition to being an innovative computational model to address this methodological challenge, it also serves as a descriptive theoretical model which posits that the decision process happens in two stages: an initial discrete “choosing” stage, where the judge opts to either decline, adopt, or compromise with advice; and a subsequent continuous “averaging” stage, which occurs only if the judge opts to compromise. The approach was assessed via reanalysis of three recent JAS studies reflective of popular topics in the literature, such as algorithmic advice utilization, egocentric discounting effects, and judgmental forecasting. In each case new results were uncovered about how different correlates of advice utilization influence the decision process at either or both of the discrete and continuous stages, often in quite different ways, providing support for the descriptive theoretical model. A Bayesian graphical analysis framework is provided that can be applied to future research on advice utilization.
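The weight-of-advice measure and the two-stage (decline/adopt/compromise, then average) structure described above can be written down directly. This is a minimal sketch of the standard WOA computation, not the dual hurdle beta model itself; the tolerance used to detect the boundary cases is an arbitrary choice.

```python
def weight_of_advice(initial, advice, revised):
    """WOA = (revised - initial) / (advice - initial).
    0 means the advice was declined, 1 means it was fully adopted,
    intermediate values indicate a compromise (averaging)."""
    if advice == initial:
        return None  # WOA is undefined when the advice equals the initial belief
    return (revised - initial) / (advice - initial)

def choosing_stage(woa, tol=1e-9):
    """Discrete 'choosing' stage implied by a WOA value."""
    if woa is None:
        return "undefined"
    if abs(woa) < tol:
        return "decline"
    if abs(woa - 1.0) < tol:
        return "adopt"
    return "compromise"

for initial, advice, revised in [(0.40, 0.70, 0.40), (0.40, 0.70, 0.70), (0.40, 0.70, 0.55)]:
    woa = weight_of_advice(initial, advice, revised)
    print(round(woa, 2), choosing_stage(woa))
```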
... As before, we begin with the primary domain of immigrant crime rates. There are many ways to aggregate estimates to a single set of values by taking into account the prior history of and the correlations among estimators (Atanasov et al., 2016; Lyon & Pacuit, 2013; Navajas et al., 2018). Surprisingly, the simplest is among the best: take the arithmetic mean. ...
Article
Full-text available
In this pre-registered study, we gathered two online samples totaling 615 subjects. The first sample was nationally representative with regards to age, sex and education, the second was an online convenience sample with mostly younger people. We measured intelligence (vocabulary and science knowledge, 20 items each) using newly constructed Dutch language tests. We measured stereotypes in three domains: 68 national origin-based immigrant crime rates, 54 occupational sex distributions, and 12 provincial incomes. We additionally measured other covariates such as employment status and political voting behaviors. Results showed substantial stereotype accuracy for each domain. Aggregate (average) stereotype Pearson correlation accuracies were strong: immigrant crime .65, occupations .94, and provincial incomes .85. Results of individual accuracies found there was a weak general factor of stereotype accuracy measures, reflecting a general social perception ability. We found that intelligence moderately but robustly predicted more accurate stereotypes across domains as well as general stereotyping ability (r’s .20, .25, .26, .39, β’s 0.17, 0.25, 0.21, 0.37 from the full regression models). Other variables did not have robust effects across all domains, but had some reliable effects for one or two domains. For immigrant crime rates, we also measured the immigration preferences for the same groups, i.e. whether people would like more or fewer people from these groups. We find that actual crime rates predict net opposition at r = .55, i.e., subjects were more hostile to immigration from origins that had higher crime rates. We examined a rational immigration preference path model where actual crime rates→stereotypes of crime rates→immigrant preferences. We found that about 84% of the effect of crime rates was mediated this way, and this result was obtained whether or not one included Muslim% as a covariate in the model. Overall, our results support rational models of social perception and policy preferences for immigration.
... Prediction markets are an alternative to prediction polls and have advantages, such as continuous intermediate feedback (oscillating prices). We see, however, three reasons to prefer prediction polls as the major elicitation mechanism for X-Risk tournaments: (a) polls tend to beat markets on longer-duration questions (months or years vs. days or weeks), as shown by Atanasov et al. 2017; (b) polls permit more flexibility in accuracy scoring and feedback provision; (c) polls are better at spotting skilled forecasters. See the second edition of Philip Tetlock's (2017) Expert Political Judgment, and the description of expert accuracy vs. algorithms of varying sophistication in Chapter 2, The Ego-deflating Challenge of Radical Skepticism. ...
Article
Full-text available
Forecasting tournaments are misaligned with the goal of producing actionable forecasts of existential risk, an extreme-stakes domain with slow accuracy feedback and elusive proxies for long-run outcomes. We show how to improve alignment by measuring facets of human judgment that play central roles in policy debates but have long been dismissed as unmeasurable. The key is supplementing traditional objective accuracy metrics with intersubjective metrics that test forecasters' skill at predicting other forecasters' judgments on topics that resist objective scoring, such as long-range scenarios, probativeness of questions, insightfulness of explanations, and impactfulness of risk-mitigation options. We focus on the value of Reciprocal Scoring, an intersubjective method grounded in micro-economic research that challenges top forecasters to predict each other's judgments. Even if cumulative information gains prove modest and are confined to a 1-to-5 year planning horizon, the expected value of lives saved would be massive.
... Senior analysts also showed better discrimination skill than junior analysts, contrary to what might be inferred from Tetlock (2005). Finally, forecasts, on average, were underconfident rather than overconfident, once again contrary to what Tetlock (2005) observed, but in line with independent prediction poll results from the ACE forecasting tournament noted earlier (Atanasov et al., 2017). ...
Article
Forecasting plays a vital role in intelligence assessment and contributes to national security decision-making by improving strategic foresight. Remarkably, most intelligence organizations do not proactively track their forecasting accuracy and, therefore, do not know how accurate their forecasts are or what types of biases intelligence analysts (or organizations) might exhibit. We review research on geopolitical forecasting and a roughly decade-long program of research to assess the accuracy of strategic intelligence forecasts produced by and for the Government of Canada. This research is described in three phases corresponding to previously published research, following which novel analyses (drawing from the data used in the earlier phases) are reported. The findings reveal a high degree of forecasting accuracy as well as significant underconfidence. These results were evident regardless of whether analysts assigned numeric probabilities to their forecasts. However, the novel analyses clarified that there is a substantial cost to accuracy if end-users rely on their own interpretations of verbal probability terms used in the forecasts. We recommend that intelligence organizations proactively track forecasting accuracy as a means of supporting accountability and organizational learning. We further recommend that intelligence organizations use numeric probabilities in their forecasts to support better comprehension of these estimates by end-users.
Article
How much can rational people really disagree? If we can understand the limits of such disagreement, can we remove noise by labeling excess disagreement as irrational and then construct a group belief based on everyone's rational beliefs? Based on this idea, “Regularized Aggregation of One-Off Probability Predictions” by Satopää proposes a Bayesian aggregator that requires no user intervention and can be computed efficiently even for a large number of one-off probability predictions. To illustrate, the aggregator is evaluated on predictions collected during a four-year forecasting tournament sponsored by the U.S. intelligence community. The aggregator improves the squared error (a.k.a., the Brier score) of simple averaging by around 20% and other commonly used aggregators by 10%−25%. This advantage stems almost exclusively from improved calibration. An R package called braggR implements the method and is available on CRAN.
Article
Crowdsourcing contests allow organisations to engage with an external workforce. Over the years, the phenomenon has attracted considerable research interest. In the present review, we synthesise the crowdsourcing contest literature by adopting the social mechanism lens. We begin by observing that stakeholders in crowdsourcing contests range from individuals (solvers) to large-scale organisations (seekers). Given that such vastly different entities interact during a crowdsourcing contest, it is expected that their behaviour, too, can have a varying range of predictors, such as individual and organisational factors. However, prior reviews of crowdsourcing contests, and of crowdsourcing in general, have not explored the phenomenon's multi-layered nature. In addressing this gap, we synthesise 127 scholarly articles and identify underlying social mechanisms that explain key behavioural outcomes of seekers and solvers. Our review makes two specific contributions. First, we identify three distinct tensions that emerge from key design decisions that might be at odds with the central principle of crowdsourcing contests: broadcast search for solutions from a long tail of solvers. Second, we provide three recommendations for future research that, we believe, could provide a richer understanding of seeker and solver behaviour.
Article
Decades of research suggest that information exchange in groups and organizations can reliably improve judgment accuracy in tasks such as financial forecasting, market research, and medical decision making. However, we show that improving the accuracy of numeric estimates does not necessarily improve the accuracy of decisions. For binary-choice judgments, also known as classification tasks (for example, yes/no or build/buy decisions), social influence is most likely to grow the majority vote share, regardless of the accuracy of that opinion. As a result, initially inaccurate groups become increasingly inaccurate after information exchange, even as they signal stronger support. We term this dynamic the "crowd classification problem." Using both a novel data set and a reanalysis of three previous data sets, we study this process in two types of information exchange: (1) when people share votes only, and (2) when people form and exchange numeric estimates prior to voting. Surprisingly, when people exchange numeric estimates prior to voting, the binary-choice vote can become less accurate, even as the average numeric estimate becomes more accurate. Our findings recommend against voting as a form of decision making when groups are optimizing for accuracy. For those cases where voting is required, we discuss strategies for managing communication to avoid the crowd classification problem. We close with a discussion of how our results contribute to a broader contingency theory of collective intelligence. This paper was accepted by Lamar Pierce, organizations.
Article
Full-text available
Compared to the past literature on prediction markets that uses small-scale observational field data or experiments, the present research examines the efficiency of such markets by studying catastrophe (CAT) bonds. We collect actual catastrophe loss data, match them with the defined trigger events of each CAT bond contract, and then employ an empirical pricing framework to obtain the excess CAT premiums in order to represent the market-based forecasts. Our results indeed show that market-based forecasts have more significant predictive content for future CAT losses than professional forecasts that use natural catastrophe risk models. Although the predictive information for CAT events is specialized and complex, our evidence supports that CAT bond markets are successful prediction markets that efficiently aggregate information about future CAT losses. Our results also highlight that actual CAT losses in future periods can explain the excess CAT bond spreads in the primary market and provide support for market efficiency when pricing CAT risk.
Article
We attempt to replicate a seminal paper that offered support for the rational expectations hypothesis and reported evidence that markets with certain features aggregate dispersed information. The original results are based on only a few observations, and our attempt to replicate the key findings with an appropriately powered experiment largely fails. The resulting poststudy probability that market performance is better described by rational expectations than the prior information (Walrasian) model under the conditions specified in the original paper is very low. As a result of our failure to replicate, we investigate an alternate set of market features that combines aspects of the original experimental design. For these markets, which include both contingent claims and homogeneous dividend payments (as in many prediction markets), we do find robust evidence of information aggregation in support of the rational expectations model. In total, our results indicate that information aggregation in asset markets is fragile and should only be expected in limited circumstances. This paper was accepted by Bruno Biais, finance.
Article
What makes some managers and entrepreneurs better at forecasting the industry context than others? We argue that, regardless of experience or expertise, a learning‐based forecasting behavior in which individuals attend to and incorporate new relevant information from the environment into an updated belief that aligns with the Bayesian belief updating process is likely to generate superior industry foresight. However, the effectiveness of such a cognitively demanding process diminishes under high levels of uncertainty. We find support for these arguments using an experimental design of forecasting tournaments in the managerially relevant context of the global automotive industry from 2016‐2019. The study provides a novel account of individual‐level forecasting behavior and its effectiveness in an evolving industry, and suggests important implications for managers and entrepreneurs. How a focal industry will evolve is a key forecasting problem faced by managers and entrepreneurs as they seek to identify opportunities and make strategic decisions. However, developing superior industry foresight in the face of significant change, and limited and often contradictory information, can be especially challenging. We study how individuals forecast the ongoing transformation of the global automotive industry with respect to electrification and autonomy, using a novel research design of forecasting tournaments. A forecasting process in which individuals update their beliefs by neither ignoring prior information nor overacting to new information helps to generate superior industry foresight. There was a significant penalty to forecasting accuracy when individuals did not update their beliefs at all, or when they updated, but overreacted to new information.
Article
Aggregating predictions from multiple judges often yields more accurate predictions than relying on a single judge, which is known as the wisdom-of-the-crowd effect. However, a wide range of aggregation methods are available, which range from one-size-fits-all techniques, such as simple averaging, prediction markets, and Bayesian aggregators, to customized (supervised) techniques that require past performance data, such as weighted averaging. In this study, we applied a wide range of aggregation methods to subjective probability estimates from geopolitical forecasting tournaments. We used the bias–information–noise (BIN) model to disentangle three mechanisms that allow aggregators to improve the accuracy of predictions: reducing bias and noise, and extracting valid information across forecasters. Simple averaging operates almost entirely by reducing noise, whereas more complex techniques such as prediction markets and Bayesian aggregators exploit all three pathways to allow better signal extraction as well as greater noise and bias reduction. Finally, we explored the utility of a BIN approach for the modular construction of aggregators.
Article
A four-year series of subjective probability forecasting tournaments sponsored by the U.S. intelligence community revealed a host of replicable drivers of predictive accuracy, including experimental interventions such as training in probabilistic reasoning, anti‐groupthink teaming, and tracking of talent. Drawing on these data, we propose a Bayesian BIN model (Bias, Information, Noise) for disentangling the underlying processes that enable forecasters and forecasting methods to improve—either by tamping down bias and noise in judgment or by ramping up the efficient extraction of valid information from the environment. The BIN model reveals that noise reduction plays a surprisingly consistent role across all three methods of enhancing performance. We see the BIN method as useful in focusing managerial interventions on what works when and why in a wide range of domains. An R-package called BINtools implements our method and is available on the first author’s personal website. This paper was accepted by Manel Baucells, decision analysis.
Article
Equity‐based crowdfunding platforms enable investors to come together to invest in startups and help lay‐investors to follow the lead of investors with good startup evaluation skills. Crowdfunding platforms often gather users’ inputs to evaluate investors and startups, but such inputs are quite noisy and often rely on past performance. Many investors with good evaluation skills do not have substantial past investment experience but still can lead investment rounds. This helps provide investment opportunities to lay‐investors who otherwise do not get to join investors with proven records. Without identifying such investors with potential, platforms lose the opportunity to put together investors to fund worthy startups and lose business. We develop a Bayesian model to address this problem and improve funding operations of equity‐based crowdfunding platforms. Specifically, the model helps platforms to better assess investors’ evaluation skills, identify lead investors for lay‐investors to follow, and increase funding opportunities on the platforms. To test the effectiveness of the proposed model, we gathered data from 319 actual investors listed on one of the largest crowdfunding platforms in the United States, picked startups randomly for investors to evaluate, and had investors evaluate startups in two ways—our approach and the conventional approach. We also discuss an extension of this Bayesian model that penalizes investors in case investors perform well by randomness. Furthermore, we used a Bayesian framework to help platforms better predict startup valuations accounting for investors’ evaluation skills.
Article
A growing body of research indicates that forecasting skill is a unique and stable trait: forecasters with a track record of high accuracy tend to maintain this record. But how does one identify skilled forecasters effectively? We address this question using data collected during two seasons of a longitudinal geopolitical forecasting tournament. Our first analysis, which compares psychometric traits assessed prior to forecasting, indicates intelligence consistently predicts accuracy. Next, using methods adapted from classical test theory and item response theory, we model latent forecasting skill based on the forecasters' past accuracy, while accounting for the timing of their forecasts relative to question resolution. Our results suggest these methods perform better at assessing forecasting skill than simpler methods employed by many previous studies. By parsing the data at different time points during the competitions, we assess the relative importance of each information source over time. When past performance information is limited, psychometric traits are useful predictors of future performance, but, as more information becomes available, past performance becomes the stronger predictor of future accuracy. Finally, we demonstrate the predictive validity of these results on out-of-sample data, and their utility in producing performance weights for wisdom-of-crowds aggregations.
Article
We initiate the study of incentive-compatible forecasting competitions in which multiple forecasters make predictions about one or more events and compete for a single prize. We have two objectives: (1) to incentivize forecasters to report truthfully and (2) to award the prize to the most accurate forecaster. Proper scoring rules incentivize truthful reporting if all forecasters are paid according to their scores. However, incentives become distorted if only the best-scoring forecaster wins a prize, since forecasters can often increase their probability of having the highest score by reporting more extreme beliefs. In this paper, we introduce two novel forecasting competition mechanisms. Our first mechanism is incentive compatible and guaranteed to select the most accurate forecaster with probability higher than any other forecaster. Moreover, we show that in the standard single-event, two-forecaster setting and under mild technical conditions, no other incentive-compatible mechanism selects the most accurate forecaster with higher probability. Our second mechanism is incentive compatible when forecasters’ beliefs are such that information about one event does not lead to belief updates on other events, and it selects the best forecaster with probability approaching one as the number of events grows. Our notion of incentive compatibility is more general than previous definitions of dominant strategy incentive compatibility in that it allows for reports to be correlated with the event outcomes. Moreover, our mechanisms are easy to implement and can be generalized to the related problems of outputting a ranking over forecasters and hiring a forecaster with high accuracy on future events. This paper was accepted by Yan Chen, behavioral economics and decision analysis.
Article
Taleb et al. (2022) portray the superforecasting research program as a masquerade that purports to build “survival functions for tail assessments via sports-like tournaments.” But that never was the goal. The program was designed to help intelligence analysts make better probability judgments, which required posing rapidly resolvable questions. From a signal detection theory perspective, the superforecasting and Taleb et al. programs are complementary, not contradictory (a point Taleb and Tetlock (2013) recognized). The superforecasting program aims at achieving high hit rates at low cost in false-positives, whereas Taleb et al. prioritize alerting us to systemic risk, even if that entails a high false-positive rate. Proponents of each program should, however, acknowledge weaknesses in their cases. It is unclear: (a) how Taleb et al. (2022) can justify extreme error-avoidance trade-offs, without tacit probability judgments of rare, high-impact events; (b) how much superforecasting interventions can improve probability judgments of such events.
Article
Full-text available
This research examines the development of confidence and accuracy over time in the context of forecasting. Although overconfidence has been studied in many contexts, little research examines its progression over long periods of time or in consequential policy domains. This study employs a unique data set from a geopolitical forecasting tournament spanning three years in which thousands of forecasters predicted the outcomes of hundreds of events. We sought to apply insights from research to structure the questions, interactions, and elicitations to improve forecasts. Indeed, forecasters' confidence roughly matched their accuracy. As information came in, accuracy increased. Confidence increased at approximately the same rate as accuracy, and good calibration persisted. Nevertheless, there was evidence of a small amount of overconfidence (3%), especially on the most confident forecasts. Training helped reduce overconfidence, and team collaboration improved forecast accuracy. Together, teams and training reduced overconfidence to 1%. Our results provide reason for tempered optimism regarding confidence calibration and its development over time in consequential field contexts.
Article
Full-text available
Across a wide range of tasks, research has shown that people make poor probabilistic predictions of future events. Recently, the U.S. Intelligence Community sponsored a series of forecasting tournaments designed to explore the best strategies for generating accurate subjective probability estimates of geopolitical events. In this article, we describe the winning strategy: culling off top performers each year and assigning them into elite teams of superforecasters. Defying expectations of regression toward the mean 2 years in a row, superforecasters maintained high accuracy across hundreds of questions and a wide array of topics. We find support for four mutually reinforcing explanations of superforecaster performance: (a) cognitive abilities and styles, (b) task-specific skills, (c) motivation and commitment, and (d) enriched environments. These findings suggest that superforecasters are partly discovered and partly created, and that the high-performance incentives of tournaments highlight aspects of human judgment that would not come to light in laboratory paradigms focused on typical performance.
Article
Full-text available
Forecasting tournaments are level-playing-field competitions that reveal which individuals, teams, or algorithms generate more accurate probability estimates on which topics. This article describes a massive geopolitical tournament that tested clashing views on the feasibility of improving judgmental accuracy and on the best methods of doing so. The tournament’s winner, the Good Judgment Project, outperformed the simple average of the crowd by (a) designing new forms of cognitive-debiasing training, (b) incentivizing rigorous thinking in teams and prediction markets, (c) skimming top talent into elite collaborative teams of “super forecasters,” and (d) fine-tuning aggregation algorithms for distilling greater wisdom from crowds. Tournaments have the potential to open closed minds and increase assertion-to-evidence ratios in polarized scientific and policy debates.
Article
Full-text available
This article extends psychological methods and concepts into a domain that is as profoundly consequential as it is poorly understood: intelligence analysis. We report findings from a geopolitical forecasting tournament that assessed the accuracy of more than 150,000 forecasts of 743 participants on 199 events occurring over 2 years. Participants were above average in intelligence and political knowledge relative to the general population. Individual differences in performance emerged, and forecasting skills were surprisingly consistent over time. Key predictors were (a) dispositional variables of cognitive ability, political knowledge, and open-mindedness; (b) situational variables of training in probabilistic reasoning and participation in collaborative teams that shared information and discussed rationales (Mellers, Ungar, et al., 2014); and (c) behavioral variables of deliberation time and frequency of belief updating. We developed a profile of the best forecasters; they were better at inductive reasoning, pattern detection, cognitive flexibility, and open-mindedness. They had greater understanding of geopolitics, training in probabilistic reasoning, and opportunities to succeed in cognitively enriched team environments. Last but not least, they viewed forecasting as a skill that required deliberate practice, sustained effort, and constant monitoring of current affairs.
Article
Full-text available
Social psychologists have long recognized the power of statisticized groups. When individual judgments about some fact (e.g., the unemployment rate for next quarter) are averaged together, the average opinion is typically more accurate than most of the individual estimates, a pattern often referred to as the wisdom of crowds. The accuracy of averaging also often exceeds that of the individual perceived as most knowledgeable in the group. However, neither averaging nor relying on a single judge is a robust strategy; each performs well in some settings and poorly in others. As an alternative, we introduce the select-crowd strategy, which ranks judges based on a cue to ability (e.g., the accuracy of several recent judgments) and averages the opinions of the top judges, such as the top 5. Through both simulation and an analysis of 90 archival data sets, we show that select crowds of 5 knowledgeable judges yield very accurate judgments across a wide range of possible settings; the strategy is both accurate and robust. Following this, we examine how people prefer to use information from a crowd. Previous research suggests that people are distrustful of crowds and of mechanical processes such as averaging. We show in 3 experiments that, as expected, people are drawn to experts and dislike crowd averages, but, critically, they view the select-crowd strategy favorably and are willing to use it. The select-crowd strategy is thus accurate, robust, and appealing as a mechanism for helping individuals tap collective wisdom.
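The select-crowd strategy described above, which ranks judges on a cue to ability and averages the top few, takes only a few lines to express. The sketch below uses mean absolute error on recent judgments as the ability cue; the data and names are hypothetical.

```python
import numpy as np

def select_crowd(past_error, current_estimates, k=5):
    """Average the current estimates of the k judges with the lowest past error.

    past_error and current_estimates are dicts keyed by judge id."""
    ranked = sorted(past_error, key=past_error.get)  # most accurate judges first
    top = ranked[:k]
    return float(np.mean([current_estimates[j] for j in top])), top

past_error = {"j1": 0.12, "j2": 0.30, "j3": 0.08, "j4": 0.25,
              "j5": 0.18, "j6": 0.40, "j7": 0.22}
current_estimates = {"j1": 0.60, "j2": 0.40, "j3": 0.70, "j4": 0.50,
                     "j5": 0.65, "j6": 0.30, "j7": 0.55}
estimate, chosen = select_crowd(past_error, current_estimates, k=5)
print(chosen, round(estimate, 2))  # the five selected judges and their average
```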
Article
Full-text available
Most subjective probability aggregation procedures use a single probability judgment from each expert, even though it is common for experts studying real problems to update their probability estimates over time. This paper advances into unexplored areas of probability aggregation by considering a dynamic context in which experts can update their beliefs at random intervals. The updates occur very infrequently, resulting in a sparse data set that cannot be modeled by standard time-series procedures. In response to the lack of appropriate methodology, this paper presents a hierarchical model that takes into account the expert's level of self-reported expertise and produces aggregate probabilities that are sharp and well calibrated both in- and out-of-sample. The model is demonstrated on a real-world data set that includes over 2300 experts making multiple probability forecasts over two years on different subsets of 166 international political events.
Conference Paper
Full-text available
We describe a hybrid forecasting method called marketcast. Marketcasts are based on bid and ask orders from prediction markets, aggregated using techniques associated with survey methods, rather than market matching algorithms. We discuss the process of conversion from market orders to probability estimates, and simple aggregation methods. The performance of marketcasts is compared to a traditional prediction market and a traditional opinion poll. Overall, marketcasts perform approximately as well as prediction markets and opinion poll methods on most questions, and performance is stable across model specifications.
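The core move in a marketcast, turning unmatched bid and ask orders into individual probability statements and then pooling them with survey-style statistics instead of a matching engine, can be sketched as follows. Treating each order's limit price on a 0-100 binary contract as the trader's implied probability, and pooling with a median, are simplifying assumptions for illustration rather than the authors' exact conversion rules.

```python
import numpy as np

def marketcast(orders):
    """orders: list of (side, limit_price) tuples for a binary contract paying 100.
    A bid at price p suggests the trader's probability is at least p/100;
    an ask at price p suggests it is at most p/100. Here each limit price is
    read as a probability statement and the set is pooled like survey responses."""
    implied = [price / 100.0 for _, price in orders]
    return float(np.median(implied))

orders = [("bid", 62), ("ask", 70), ("bid", 55), ("ask", 68), ("bid", 60)]
print(marketcast(orders))  # 0.62
```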
Article
Full-text available
Five university-based research groups competed to recruit forecasters, elicit their predictions, and aggregate those predictions to assign the most accurate probabilities to events in a 2-year geopolitical forecasting tournament. Our group tested and found support for three psychological drivers of accuracy: training, teaming, and tracking. Probability training corrected cognitive biases, encouraged forecasters to use reference classes, and provided forecasters with heuristics, such as averaging when multiple estimates were available. Teaming allowed forecasters to share information and discuss the rationales behind their beliefs. Tracking placed the highest performers (top 2% from Year 1) in elite teams that worked together. Results showed that probability training, team collaboration, and tracking improved both calibration and resolution. Forecasting is often viewed as a statistical problem, but forecasts can be improved with behavioral interventions. Training, teaming, and tracking are psychological interventions that dramatically increased the accuracy of forecasts. Statistical algorithms (reported elsewhere) improved the accuracy of the aggregation. Putting both statistics and psychology to work produced the best forecasts 2 years in a row.
Article
Full-text available
This paper begins by presenting a simple model of the way in which experts estimate probabilities. The model is then used to construct a likelihood-based aggregation formula for combining multiple probability forecasts. The resulting aggregator has a simple analytical form that depends on a single, easily-interpretable parameter. This makes it computationally simple, attractive for further development, and robust against overfitting. Based on a large-scale dataset in which over 1300 experts tried to predict 69 geopolitical events, our aggregator is found to be superior to several widely-used aggregation algorithms.
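The one-parameter form of such an aggregator is easiest to see in log-odds space: average the forecasts' log odds and scale by a single systematizing parameter before converting back to a probability. The sketch below assumes that commonly used logit-mean form and an arbitrary parameter value; see the paper for the actual likelihood-based derivation.

```python
import numpy as np

def logit_aggregate(probs, a=2.0):
    """Aggregate probability forecasts by averaging their log odds and
    scaling by a single parameter a (a > 1 extremizes the consensus)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    mean_log_odds = np.mean(np.log(probs / (1.0 - probs)))
    return float(1.0 / (1.0 + np.exp(-a * mean_log_odds)))

forecasts = [0.60, 0.70, 0.65]
print(round(logit_aggregate(forecasts, a=1.0), 3))  # plain mean of log odds
print(round(logit_aggregate(forecasts, a=2.0), 3))  # extremized consensus
```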
Article
Full-text available
A fundamental debate in social sciences concerns how individual judgments and choices, resulting from psychological mechanisms, are manifested in collective economic behavior. Economists emphasize the capacity of markets to aggregate information distributed among traders into rational equilibrium prices. However, psychologists have identified pervasive and systematic biases in individual judgment that they generally assume will affect collective behavior. In particular, recent studies have found that judged likelihoods of possible events vary systematically with the way the entire event space is partitioned, with probabilities of each of N partitioned events biased toward 1/N. Thus, combining events into a common partition lowers perceived probability, and unpacking events into separate partitions increases their perceived probability. We look for evidence of such bias in various prediction markets, in which prices can be interpreted as probabilities of upcoming events. In two highly controlled experimental studies, we find clear evidence of partition dependence in a 2-h laboratory experiment and a field experiment on National Basketball Association (NBA) and Federation Internationale de Football Association (FIFA World Cup) sports events spanning several weeks. We also find evidence consistent with partition dependence in nonexperimental field data from prediction markets for economic derivatives (guessing the values of important macroeconomic statistics) and horse races. Results in any one of the studies might be explained by a specialized alternative theory, but no alternative theories can explain the results of all four studies. We conclude that psychological biases in individual judgment can affect market prices, and understanding those effects requires combining a variety of methods from psychology and economics.
Article
Full-text available
We conducted laboratory experiments for analyzing the accuracy of three structured approaches (nominal groups, Delphi, and prediction markets) relative to traditional face-to-face meetings (FTF). We recruited 227 participants (11 groups per method) who were required to solve a quantitative judgment task that did not involve distributed knowledge. This task consisted of ten factual questions, which required percentage estimates. While we did not find statistically significant differences in accuracy between the four methods overall, the results differed somewhat at the individual question level. Delphi was as accurate as FTF for eight questions and outperformed FTF for two questions. By comparison, prediction markets did not outperform FTF for any of the questions and were inferior for three questions. The relative performances of nominal groups and FTF were mixed and the differences were small. We also compared the results from the three structured approaches to prior individual estimates and staticized groups. The three structured approaches were more accurate than participants' prior individual estimates. Delphi was also more accurate than staticized groups. Nominal groups and prediction markets provided little additional value relative to a simple average of the forecasts. In addition, we examined participants' perceptions of the group and the group process. The participants rated personal communications more favorably than computer-mediated interactions. The group interactions in FTF and nominal groups were perceived as being highly cooperative and effective. Prediction markets were rated least favourably: prediction market participants were least satisfied with the group process and perceived their method as the most difficult.
Article
Full-text available
The application of Internet-based virtual stock markets (VSMs) is an additional approach that can be used to predict short- and medium-term market developments. The basic concept involves bringing a group of participants together via the Internet and allowing them to trade shares of virtual stocks. These stocks represent a bet on the outcome of future market situations, and their value depends on the realization of these market situations. In this process, a VSM elicits and aggregates the assessments of its participants concerning future market developments. The aim of this article is to evaluate the potential use and the different design possibilities as well as the forecast accuracy and performance of VSMs compared to expert predictions for their application to business forecasting. After introducing the basic idea of using VSMs for business forecasting, we discuss the different design possibilities for such VSMs. In three real-world applications, we analyze the feasibility and forecast accuracy of VSMs, compare the performance of VSMs to expert predictions, and propose a new validity test for VSM forecasts. Finally, we draw conclusions and provide suggestions for future research.
Article
Full-text available
We review 74 experiments with no, low, or high performance-based financial incentives. The modal result is no effect on mean performance (though variance is usually reduced by higher payment). Higher incentives do often improve performance, typically in judgment tasks that are responsive to better effort. Incentives also reduce presentation effects (e.g., generosity and risk-seeking). Incentive effects are comparable to effects of other variables, particularly cognitive capital and task production demands, and interact with those variables, so a narrow-minded focus on incentives alone is misguided. We also note that no replicated study has made rationality violations disappear purely by raising incentives.
Conference Paper
Full-text available
Citing recent successes in forecasting elections, movies, products, and other outcomes, prediction market advocates call for widespread use of market-based methods for government and corporate decision making. Though theoretical and empirical evidence suggests that markets do often outperform alternative mechanisms, less attention has been paid to the magnitude of improvement. Here we compare the performance of prediction markets to conventional methods of prediction, namely polls and statistical models. Examining thousands of sporting and movie events, we find that the relative advantage of prediction markets is surprisingly small, as measured by squared error, calibration, and discrimination. Moreover, these domains also exhibit remarkably steep diminishing returns to information, with nearly all the predictive power captured by only two or three parameters. As policy makers consider adoption of prediction markets, costs should be weighed against potentially modest benefits.
Article
Full-text available
The accuracy of prediction markets has been documented both for markets based on real money and those based on play money. To test how much extra accuracy can be obtained by using real money versus play money, we set up a real-world online experiment pitting the predictions of TradeSports.com (real money) against those of NewsFutures.com (play money) regarding American Football outcomes during the 2003-2004 NFL season. As expected, both types of markets exhibited significant predictive powers, and remarkable performance compared to individual humans. But, perhaps surprisingly, the play-money markets performed as well as the real-money markets. We speculate that this result reflects two opposing forces: real-money markets may better motivate information discovery while play-money markets may yield more efficient information aggregation.
Article
Full-text available
Do professional forecasters provide their true unbiased estimates, or do they behave strategically? In our model, forecasters have common information, confer actively, and thus know the true pdf of future outcomes. Intensive users of economic forecasts monitor forecasters' performance closely; occasional users are drawn to the forecaster who fared best in the previous period. In the resulting Nash equilibrium, even though economists have identical expectations, they make a range of projections that mimics the true probability distribution of the forecast variable. Those whose wages depend most on publicity produce forecasts that differ most from the consensus. Empirical evidence supports the model.
Article
Full-text available
This target article is concerned with the implications of the surprisingly different experimental practices in economics and in areas of psychology relevant to both economists and psychologists, such as behavioral decision making. We consider four features of experimentation in economics, namely, script enactment, repeated trials, performance-based monetary payments, and the proscription against deception, and compare them to experimental practices in psychology, primarily in the area of behavioral decision making. Whereas economists bring a precisely defined "script" to experiments for participants to enact, psychologists often do not provide such a script, leaving participants to infer what choices the situation affords. By often using repeated experimental trials, economists allow participants to learn about the task and the environment; psychologists typically do not. Economists generally pay participants on the basis of clearly defined performance criteria; psychologists usually pay a flat fee or grant a fixed amount of course credit. Economists virtually never deceive participants; psychologists, especially in some areas of inquiry, often do. We argue that experimental standards in economics are regulatory in that they allow for little variation between the experimental practices of individual researchers. The experimental standards in psychology, by contrast, are comparatively laissez-faire. We believe that the wider range of experimental practices in psychology reflects a lack of procedural regularity that may contribute to the variability of empirical findings in the research fields under consideration. We conclude with a call for more research on the consequences of methodological preferences, such as the use of monetary payments, and propose a "do-it-both-ways" rule regarding the enactment of scripts, repetition of trials, and performance-based monetary payments. We also argue, on pragmatic grounds, that the default practice should be not to deceive participants.
Article
Full-text available
Results from the Iowa Political Stock Market are analyzed to ascertain how well markets work as aggregators of information. The authors find that the market worked extremely well, dominating opinion polls in forecasting the outcome of the 1988 presidential election, even though traders in the market exhibited substantial amounts of judgment biases. Their explanation is that judgment bias refers to average behavior, while in markets it is marginal traders who influence price. They present evidence that in this market a sufficient number of traders were free of judgment bias so that the market was able to work well.
Article
Full-text available
Information Aggregation Mechanisms are economic mechanisms designed explicitly for the purpose of collecting and aggregating information. The modern theory of rational expectations, together with the techniques and results of experimental economics, suggests that a set of properly designed markets can be a good information aggregation mechanism. The paper reports on the deployment of such an information aggregation mechanism inside Hewlett-Packard Corporation for the purpose of making sales forecasts. Results show that IAMs performed better than traditional methods employed inside Hewlett-Packard. The structure of the mechanism, the methodology, and the results are reported.
Chapter
Prediction markets have captured the public’s imagination with their ability to predict the future by pooling the guesswork of many. This paper summarizes the evidence and examines the economic, mathematical, and neurological foundations of this form of collective wisdom. Rather than the particular trading mechanism used, the ultimate driver of accuracy seems to be the betting proposition itself: on the one hand, a wager attracts contrarians, which enhances the diversity of opinions that can be aggregated. On the other hand, the mere prospect of reward or loss promotes more objective, less passionate thinking, thereby enhancing the quality of the opinions that can be aggregated.
Article
The intelligence failures surrounding the invasion of Iraq dramatically illustrate the necessity of developing standards for evaluating expert opinion. This book fills that need. Here, Philip E. Tetlock explores what constitutes good judgment in predicting future events, and looks at why experts are often wrong in their forecasts. Tetlock first discusses arguments about whether the world is too complex for people to find the tools to understand political phenomena, let alone predict the future. He evaluates predictions from experts in different fields, comparing them to predictions by well-informed laity or those based on simple extrapolation from current trends. He goes on to analyze which styles of thinking are more successful in forecasting. Classifying thinking styles using Isaiah Berlin's prototypes of the fox and the hedgehog, Tetlock contends that the fox (the thinker who knows many little things, draws from an eclectic array of traditions, and is better able to improvise in response to changing events) is more successful in predicting the future than the hedgehog, who knows one big thing, toils devotedly within one tradition, and imposes formulaic solutions on ill-defined problems. He notes a perversely inverse relationship between the best scientific indicators of good judgment and the qualities that the media most prizes in pundits: the single-minded determination required to prevail in ideological combat. Clearly written and impeccably researched, the book fills a huge void in the literature on evaluating expert opinion. It will appeal across many academic disciplines as well as to corporations seeking to develop standards for judging expert decision-making.
Article
In this paper, we examine the relative forecast accuracy of information markets versus expert aggregation. We leverage a unique data source of almost 2000 people's subjective probability judgments on 2003 US National Football League games and compare with the "market probabilities" given by two different information markets on exactly the same events. We combine assessments of multiple experts via linear and logarithmic aggregation functions to form pooled predictions. Prices in information markets are used to derive market predictions. Our results show that, at the same time point ahead of the game, information markets provide as accurate predictions as pooled expert assessments. In screening pooled expert predictions, we find that the arithmetic average is a robust and efficient pooling function; weighting expert assessments according to their past performance does not improve the accuracy of pooled predictions; and logarithmic aggregation functions offer bolder predictions than linear aggregation functions. The results provide insights into the predictive performance of information markets, and the relative merits of selecting among various opinion pooling methods.
Article
When aggregating the probability estimates of many individuals to form a consensus probability estimate of an uncertain future event, it is common to combine them using a simple weighted average. Such aggregated probabilities correspond more closely to the real world if they are transformed by pushing them closer to 0 or 1. We explain the need for such transformations in terms of two distorting factors: the first factor is the compression of the probability scale at the two ends, so that random error tends to push the average probability toward 0.5. This effect does not occur for the median forecast, or, arguably, for the mean of the log odds of individual forecasts. The second factor, which affects the mean, the median, and the mean of log odds, is the result of forecasters taking into account their individual ignorance of the total body of information available. Individual confidence in the direction of a probability judgment (high/low) thus fails to take into account the wisdom of crowds that results from combining different evidence available to different judges. We show that the same transformation function can approximately eliminate both distorting effects with different parameters for the mean and the median. And we show how, in principle, use of the median can help distinguish the two effects.
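The first distorting factor is easy to reproduce numerically: if forecasters share the same belief but report it with symmetric noise (placed in log-odds space here, purely as an illustrative assumption), the simple mean of the reported probabilities is pulled toward 0.5, while the median and the mean of the log odds stay near the shared belief.

```python
import numpy as np

rng = np.random.default_rng(0)
true_log_odds = np.log(0.9 / 0.1)           # shared belief of 0.9, in log odds
noise = rng.normal(0.0, 1.0, size=10_000)   # symmetric reporting noise
reports = 1.0 / (1.0 + np.exp(-(true_log_odds + noise)))

simple_mean = reports.mean()
median = float(np.median(reports))
log_odds_mean = 1.0 / (1.0 + np.exp(-np.log(reports / (1.0 - reports)).mean()))

print(round(simple_mean, 3))    # noticeably below 0.9, pulled toward 0.5
print(round(median, 3))         # close to 0.9
print(round(log_odds_mean, 3))  # close to 0.9
```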
Article
Despite the popularity of prediction markets among economists, businesses and policymakers have been slow to adopt them in decision-making. Most studies of prediction markets outside the lab are from public markets with large trading populations. Corporate prediction markets face additional issues, such as thinness, weak incentives, limited entry, and the potential for traders with biases or ulterior motives, raising questions about how well these markets will perform. We examine data from prediction markets run by Google, Ford Motor Company, and an anonymous basic materials conglomerate (Firm X). Despite theoretically adverse conditions, we find these markets are relatively efficient, and improve upon the forecasts of experts at all three firms by as much as a 25% reduction in mean-squared error. The most notable inefficiency is an optimism bias in the markets at Google. The inefficiencies that do exist generally become smaller over time. More experienced traders and those with higher past performance trade against the identified inefficiencies, suggesting that the markets' efficiency improves because traders gain experience and less skilled traders exit the market.
Article
Many important decisions are routinely made by transient and temporary teams, which perform their duty and disperse. Team members often continue making similar decisions as individuals. We study how the experience of team decision making affects subsequent individual decisions in two seminal probability and reasoning tasks, the Monty Hall problem and the Wason selection task. Both tasks are hard and involve a general rule, thus allowing for knowledge transfers, and can be embedded in the context of markets that offer identical incentives to teams and individuals. Our results show that teams trade closer to the rational level, learn the solution faster, and achieve this with weaker, less specific performance feedback than individuals. Most importantly, we observe significant knowledge transfers from team decision making to subsequent individual performances that take place up to five weeks later, indicating that exposure to team decision making has strong positive spillovers on the quality of individual decisions.
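As an aside, a quick Python simulation (ours) of the Monty Hall problem mentioned above illustrates the rational benchmark toward which the teams converge: switching wins roughly two thirds of the time.

import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car, pick = random.randrange(3), random.randrange(3)
        # The host opens a goat door that is neither the contestant's pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))   # ~0.33
print(monty_hall(switch=True))    # ~0.67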
Article
This article presents new theoretical and empirical evidence on the forecasting ability of prediction markets. We develop a model that predicts that the time until expiration of a prediction market should negatively affect the accuracy of prices as a forecasting tool, in the direction of a ‘favourite/longshot bias’. That is, high-likelihood events are underpriced, and low-likelihood events are overpriced. We confirm this result using a large data set of prediction market transaction prices. Prediction markets are reasonably well calibrated when time to expiration is relatively short, but prices are significantly biased for events farther in the future. When the time value of money is considered, the miscalibration can be exploited to earn excess returns only when the trader has a relatively low discount rate.
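One way to check such calibration from transaction data is to bin market prices and compare each bin's mean price with the observed outcome frequency; the sketch below (ours) assumes a hypothetical DataFrame with 'price' and 'outcome' columns.

import pandas as pd

def calibration_table(df, n_bins=10):
    # 'price' is the market price in [0, 1]; 'outcome' is 1 if the event occurred, else 0.
    bins = pd.cut(df["price"], bins=n_bins)
    return df.groupby(bins, observed=True).agg(
        mean_price=("price", "mean"),
        outcome_rate=("outcome", "mean"),
        n=("outcome", "size"),
    )

# A favourite/longshot bias shows up as outcome_rate > mean_price in the top bins
# and outcome_rate < mean_price in the bottom bins.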
Article
Most pollsters base their election projections on questions about voter intentions, which ask “If the election were held today, who would you vote for?” By contrast, we examine the value of questions probing voters’ expectations, which typically ask: “Regardless of who you plan to vote for, who do you think will win the upcoming election?” We demonstrate that polls of voter expectations yield consistently more accurate forecasts than polls of voter intentions. A small-scale structural model reveals that this is because we are polling from a broader information set, and voters respond as if they had polled ten of their friends. This model also provides a rational interpretation for why respondents’ forecasts are correlated with their intentions. We use our structural model to extract accurate election forecasts from non-random samples.
Article
A new vector partition of the probability, or Brier, score (PS) is formulated and the nature and properties of this partition are described. The relationships between the terms in this partition and the terms in the original vector partition of the PS are indicated. The new partition consists of three terms: 1) a measure of the uncertainty inherent in the events, or states, on the occasions of concern (namely, the PS for the sample relative frequencies); 2) a measure of the reliability of the forecasts; and 3) a new measure of the resolution of the forecasts. The new measure of reliability is equivalent (i.e., linearly related) to the reliability term of the original partition, whereas the new measure of resolution is not equivalent to the original resolution term. Two sample collections of probability forecasts are used to illustrate the differences and relationships between these partitions. Finally, the two partitions are compared, with particular reference to the attributes of the forecasts with which the partitions are concerned, the interpretation of the partitions in geometric terms, and the use of the partitions as the bases for the formulation of measures to evaluate probability forecasts. The results of these comparisons indicate that the new partition offers certain advantages vis-à-vis the original partition.
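For reference, the partition can be written as PS = reliability - resolution + uncertainty when forecasts of binary events are grouped by distinct forecast value; the Python sketch below (ours) computes the three terms and confirms that they recombine to the ordinary Brier score.

import numpy as np

def brier_partition(forecasts, outcomes):
    f, o = np.asarray(forecasts, dtype=float), np.asarray(outcomes, dtype=float)
    n, base_rate = len(o), o.mean()
    rel = res = 0.0
    for fk in np.unique(f):
        mask = (f == fk)
        nk, ok = mask.sum(), o[mask].mean()
        rel += nk * (fk - ok) ** 2          # reliability: forecast vs. observed frequency
        res += nk * (ok - base_rate) ** 2   # resolution: observed frequency vs. base rate
    unc = base_rate * (1 - base_rate)       # uncertainty: Brier score of the base rate itself
    return rel / n, res / n, unc

f, o = [0.8, 0.8, 0.2, 0.6], [1, 0, 0, 1]
rel, res, unc = brier_partition(f, o)
print(rel - res + unc)                                  # 0.22
print(np.mean((np.asarray(f) - np.asarray(o)) ** 2))    # 0.22, the same Brier score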
Article
Prediction markets are viewed as the most accurate instrument for collective forecasts. However, empirical studies, mostly based on political elections, deliver mixed results. An experimental study was conducted to avoid certain biases and problems and to better control the conditions under which information is elicited from individuals. One typical problem, for example, is comparing prediction markets that focus on judging future public opinion with polls asking for individual election preferences at a certain point in time. Our study therefore compared forecast accuracy between prediction markets and a simple survey for the same forecasting item. The results showed roughly the same accuracy for all employed methods, with the survey, surprisingly, delivering slightly better results at lower cost. The experiments also demonstrated that it is possible to obtain highly accurate forecasts with a relatively small number of participants (6-17) taking part continuously.
Article
Using the 2008 elections, I explore the accuracy and informational content of forecasts derived from two different types of data: polls and prediction markets. Both types of data suffer from inherent biases, and this is the first analysis to compare the accuracy of these forecasts after adjusting for those biases. Moreover, the analysis expands on previous research by evaluating state-level forecasts in Presidential and Senatorial races, rather than just the national popular vote. Utilizing several different estimation strategies, I demonstrate that, early in the cycle and in races whose outcomes are not certain, debiased prediction market-based forecasts provide more accurate probabilities of victory and more information than debiased poll-based forecasts. These results are significant because accurately documenting the underlying probabilities, on any given day before the election, is critical for enabling academics to determine the impact of shocks to the campaign, for the public to invest wisely, and for practitioners to spend efficiently.
Article
We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like the LARS algorithm does for the lasso.
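A minimal sketch of an elastic net fit in the p >> n setting, using scikit-learn's coordinate-descent ElasticNet rather than the LARS-EN algorithm proposed in the paper; the simulated data and penalty settings are illustrative.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 200                          # far more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                          # only five truly relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

# l1_ratio balances the lasso (L1) and ridge (L2) penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))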
Article
The favorite–long shot bias describes the long-standing empirical regularity that betting odds provide biased estimates of the probability of a horse winning: long shots are overbet whereas favorites are underbet. Neoclassical explanations of this phenomenon focus on rational gamblers who overbet long shots because of risk-love. The competing behavioral explanations emphasize the role of misperceptions of probabilities. We provide novel empirical tests that can discriminate between these competing theories by assessing whether the models that explain gamblers’ choices in one part of their choice set (betting to win) can also rationalize decisions over a wider choice set, including compound bets in the exacta, quinella, or trifecta pools. Using a new, large-scale data set ideally suited to implement these tests, we find evidence in favor of the view that misperceptions of probability drive the favorite–long shot bias, as suggested by prospect theory.
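As a rough illustration of how a compound bet can be priced from win probabilities, the snippet below uses the Harville formula, a common reduced-form device in this literature; this is our illustration with hypothetical numbers, not the paper's structural model.

def exacta_prob(p_win, i, j):
    # Harville-style approximation of P(horse i finishes first and horse j second).
    return p_win[i] * p_win[j] / (1 - p_win[i])

p_win = {"A": 0.5, "B": 0.3, "C": 0.2}   # hypothetical win probabilities
print(exacta_prob(p_win, "A", "B"))      # 0.5 * 0.3 / (1 - 0.5) = 0.30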
Article
Probability forecasters who are rewarded via a proper scoring rule may care not only about the score, but also about their performance relative to other forecasters. We model this type of preference and show that a competitive forecaster who wants to do better than another forecaster typically should report more extreme probabilities, exaggerating toward zero or one. We consider a competitive forecaster's best response to truthful reporting and also investigate equilibrium reporting functions in the case where another forecaster also cares about relative performance. We show how a decision maker can revise probabilities of an event after receiving reported probabilities from competitive forecasters and note that the strategy of exaggerating probabilities can make well-calibrated forecasters (and a decision maker who takes their reported probabilities at face value) appear to be overconfident. However, a decision maker who adjusts appropriately for the misrepresentation of probabilities by one or more forecasters can still be well-calibrated. Finally, to try to overcome the forecasters' competitive instincts and induce cooperative behavior, we develop the notion of joint scoring rules based on business sharing and show that these scoring rules are strictly proper.
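A small Monte Carlo sketch (ours, not the authors' equilibrium model) of the central incentive: when two forecasters who share a calibrated belief of 0.7 are compared on a single event's Brier score, the one who exaggerates to 0.9 wins the head-to-head comparison about 70% of the time, even though the exaggerated report has a worse expected Brier score.

import numpy as np

rng = np.random.default_rng(1)
p_true = 0.7                      # shared, well-calibrated belief
honest, bold = 0.7, 0.9           # truthful report vs. exaggerated report
outcomes = (rng.random(100_000) < p_true).astype(float)

brier = lambda forecast, outcome: (forecast - outcome) ** 2
wins_bold = np.mean(brier(bold, outcomes) < brier(honest, outcomes))
print(f"bold report beats honest report in {wins_bold:.0%} of single-event contests")
print(np.mean(brier(honest, outcomes)), np.mean(brier(bold, outcomes)))  # ~0.21 vs. ~0.25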