Conference Paper

Characterizing and aggregating agent estimates

Authors:
  • ABC Research, LLC

Abstract

In many applications, agents (whether human or computational) provide estimates that must be combined at a higher level. Recent research distinguishes two kinds of such estimates: interpreted and generated data. These two kinds of data require different aggregation processes, which behave differently from an information-geometric perspective: interpreted estimates require methods such as voting that can leave the convex hull of the individual estimates, while the optimal aggregation for generated estimates lies within the convex hull and thus is accessible by methods such as weighted averages. We motivate our analysis in the context of a crowdsourced forecasting application, demonstrate the central insights theoretically, and show how these insights manifest themselves in actual data.
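To make the convex-hull distinction concrete, here is a minimal sketch with made-up numbers: any weighted average is trapped between the smallest and largest individual estimate, while a voting-style aggregator can land outside that interval.

```python
# Minimal sketch (hypothetical numbers): a weighted average can never leave
# the convex hull of the individual estimates, while a voting-style
# aggregator on a binary question can.
import numpy as np

estimates = np.array([0.6, 0.7, 0.8])   # three agents' probability estimates
weights = np.array([0.2, 0.3, 0.5])     # convex weights: nonnegative, sum to 1

weighted_avg = float(weights @ estimates)      # always stays in [0.6, 0.8]
vote_share = float(np.mean(estimates > 0.5))   # fraction voting "yes"

print(weighted_avg)  # 0.73, inside the convex hull
print(vote_share)    # 1.0, outside the convex hull [0.6, 0.8]
```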


... This is a very reasonable assumption that has been analyzed and utilized in many other settings. For example, Parunak et al. (2013) demonstrate that optimal aggregation of interpreted forecasts is not constrained to the convex hull of the forecasts; Broomell and Budescu (2009) analyze inter-forecaster correlation under the assumption that the cues can be mapped to the individual forecasts via different linear regression functions. To the best of our knowledge, no previous work has discussed a formal framework that explicitly links the interpreted forecasts to their target quantity. ...
... For one thing, Hong and Page (2009) demonstrate that the standard assumption of conditional independence imposes an unrealistic structure on interpreted forecasts. Any averaging aggregator is also constrained to the convex hull of the individual forecasts, which further contradicts the interpreted signal framework (Parunak et al., 2013) and can be far from optimal on many datasets. ...
... For instance, predictions from forecasters working in close collaboration can be averaged, while predictions from forecasters strategically accessing and studying disjoint sources of information should be aggregated via more extreme techniques such as voting. See Parunak et al. (2013) for a discussion of voting-like techniques. ...
Article
Full-text available
Randomness in scientific estimation is generally assumed to arise from unmeasured or uncontrolled factors. However, when combining subjective probability estimates, heterogeneity stemming from people's cognitive or information diversity is often more important than measurement noise. This paper presents a novel framework that models the heterogeneity arising from experts who use partially overlapping information sources, and applies that model to the task of aggregating the probabilities given by a group of experts who forecast whether an event will occur or not. Our model describes the distribution of information across experts in terms of easily interpretable parameters and shows how the optimal amount of extremizing of the average probability forecast (shifting it closer to its nearest extreme) varies as a function of the experts' information overlap. Our model thus gives a more principled understanding of the historically ad hoc practice of extremizing average forecasts.
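The optimal amount of extremizing depends on parameters the paper estimates from the experts' information overlap; the sketch below only illustrates the mechanic itself, with a hand-picked exponent `a` standing in for that derived quantity.

```python
# Hedged sketch of extremizing an average probability forecast: average in
# log-odds space, then scale by `a`. Here `a` is a placeholder for the
# overlap-dependent amount of extremizing the paper derives; a = 1 leaves
# the log-odds average unchanged, a > 1 pushes it toward the nearest extreme.
import numpy as np

def extremize(probs, a=2.0):
    logits = np.log(np.asarray(probs) / (1 - np.asarray(probs)))
    return float(1 / (1 + np.exp(-a * logits.mean())))

print(extremize([0.6, 0.7, 0.8], a=1.0))  # ~0.71, plain log-odds average
print(extremize([0.6, 0.7, 0.8], a=2.0))  # ~0.85, extremized toward 1
```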
... This is a very reasonable assumption that has been discussed by many authors. For example, Broomell and Budescu (2009) analyze a model that maps the cues to the individual forecasts via different linear regression functions; Parunak et al. (2013) demonstrate that the optimal aggregate of interpreted forecasts can be outside the convex hull of the forecasts. No previous work, however, has discussed a formal framework that links the interpreted forecasts to their target quantity in an explicit yet flexible manner. ...
... Second, the standard assumption of conditional independence of the observations forces a specific and highly unrealistic structure on interpreted forecasts (Hong and Page, 2009). Measurement-error aggregators also cannot leave the convex hull of the individual forecasts, which further contradicts the interpreted signal framework (Parunak et al., 2013) and can result in poor empirical performance. Third, the underlying model is rather implausible. ...
Article
Full-text available
Prediction polling is an increasingly popular form of crowdsourcing in which multiple participants estimate the probability or magnitude of some future event. These estimates are then aggregated into a single forecast. Historically, randomness in scientific estimation has been generally assumed to arise from unmeasured factors which are viewed as measurement noise. However, when combining subjective estimates, heterogeneity stemming from differences in the participants' information is often more important than measurement noise. This paper formalizes information diversity as an alternative source of such heterogeneity and introduces a novel modeling framework that is particularly well-suited for prediction polls. A practical specification of this framework is proposed and applied to the task of aggregating probability and point estimates from two real-world prediction polls. In both cases our model outperforms standard measurement-error-based aggregators, hence providing evidence in favor of information diversity being the more important source of heterogeneity.
... For one, it is unclear whether the inefficiency is particular to weighted arithmetic means of probability forecasts, or whether the phenomenon is more widespread than that. Parunak et al. (2013) discuss aggregators beyond the arithmetic mean. They consider probability predictions that are based on (equally large) subsets of some available partial information G ⊆ F. They illustrate under a specific model how the prediction E(Y | G) can be outside the convex hull of the individual predictions. ...
Article
Even though the forecasting literature agrees that combining multiple predictions of some future outcome outperforms the individual predictions, there is no general consensus about the right way to do this. Ideally, individuals would predict based on correct information. The aggregator would then combine the predictions without losing or distorting any of the forecasters' information. We analyze whether the most common aggregators, namely the measures of central tendency, can ever behave in this manner. Our results show, among other things, that all weighted arithmetic, harmonic, geometric, quadratic, and cubic means distort the forecasters' information and hence are not consistent with correct information about the future outcome. More generally, no aggregator that remains strictly within the convex hull of the forecasts uses the forecasters' information efficiently as long as the individuals rely on finite information. Given that the physical universe only contains a finite amount of information, for all practical purposes, measures of central tendency do not use the forecasters' information efficiently. We conclude by discussing two concrete ways to construct efficient aggregators in practice.
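A quick numerical check of the convex-hull claim (the forecast values are arbitrary): every one of the listed means stays between the smallest and largest forecast.

```python
# All of the classical means named above lie within the convex hull
# [min(x), max(x)] of the forecasts, so none can reach an efficient
# aggregate that falls outside that interval.
import numpy as np

x = np.array([0.2, 0.5, 0.6])   # arbitrary example forecasts
means = {
    "arithmetic": x.mean(),
    "geometric": x.prod() ** (1 / len(x)),
    "harmonic": len(x) / (1 / x).sum(),
    "quadratic": np.sqrt((x ** 2).mean()),
    "cubic": ((x ** 3).mean()) ** (1 / 3),
}
for name, m in means.items():
    assert x.min() <= m <= x.max()   # never leaves the hull
    print(f"{name}: {m:.3f}")
```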
Research
Full-text available
This article introduces a frequentist formalism for combining the forecasts of multiple experts who base their forecasts on information from different sources. The authors use the formalism to provide a theoretical justification for the method of reporting a combined value that lies between the average value and the extreme closest to that average. The authors also consider a Bayesian approach to combining forecasts, expecting it to have advantages for smaller data sets. However, the great variety of seemingly applicable priors poses a formidable challenge.
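One simple way to realize "between the average and the nearest extreme" is a convex mix of the two; the mixing weight `lam` below is an illustrative assumption, not the weight the formalism actually derives.

```python
# Sketch: move the plain average part of the way toward whichever extreme
# forecast lies closest to it. `lam` is a hypothetical mixing weight.
def toward_nearest_extreme(forecasts, lam=0.5):
    avg = sum(forecasts) / len(forecasts)
    nearest = min((min(forecasts), max(forecasts)), key=lambda e: abs(e - avg))
    return (1 - lam) * avg + lam * nearest

print(toward_nearest_extreme([0.6, 0.7, 0.9]))  # ~0.67, pulled toward 0.6
```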
Article
Objective This paper identifies general properties of language style in social media to help identify areas of need in disasters. Background In the search for metrics of need in social media data, much of the existing literature ignores processes of language usage. Psychological concepts, such as narrative breach, Gricean maxims, and lexical marking in cognition, may assist the recovery of disaster-relevant metrics from altered patterns of word prevalence. Method We analyzed several hundred thousand location-specific microblogs from Twitter for Hurricane Sandy, Oklahoma tornadoes, and the Boston Marathon bombing along with a fantasy football control corpus, examining the relative frequency of words in 36 antonym pairs. We compared the ratio of words within these pairs to the corresponding ratios recovered from an online word norm database. Results Partial rank correlation values between observed antonym ratios demonstrate consistent patterns across disasters. For Hurricane Sandy data, 25 antonym pairs have moderate to large effect sizes for discrepancies between observed and normative ratios. Across disasters, 7 pairs are stable and meet effect size criteria. Sentiment analysis, supplementary word frequency counts with respect to disaster proximity, and examples support a “breach” account for the observed results. Conclusion Lexical choice between antonyms, only somewhat related to sentiment, suggests that social media capture wide-ranging breaches of normal functioning. Application Antonym selection contributes to screening tools based on language style for identifying relevant content and quantifying disruption using social media without the a priori specification of content keywords.
Article
We generalize the results of Satopää et al. (in press, 2015) by showing how the Gaussian aggregator may be computed in a setting where parameter estimation is not required. We proceed to provide an explicit formula for a “one-shot” aggregation problem with two forecasters.
Article
Randomness in scientific estimation is generally assumed to arise from unmeasured or uncontrolled factors. However, when combining subjective probability estimates, heterogeneity stemming from people's cognitive or information diversity is often more important than measurement noise. This paper presents a novel framework in which forecasters draw on partially overlapping information sources. A specific model is proposed within that framework and applied to the task of aggregating the probabilities given by a group of forecasters who predict whether an event will occur or not. Our model describes the distribution of information across forecasters in terms of easily interpretable parameters and shows how the optimal amount of extremizing of the average probability forecast (shifting it closer to its nearest extreme) varies as a function of the forecasters' information overlap. Our model thus gives a more principled understanding of the historically ad hoc practice of extremizing average forecasts. Supplementary material for this article is available online.
Conference Paper
In many contexts, people generate forecasts about events of interest, and decision-makers wish to aggregate these forecasts to improve their accuracy. These forecasts differ from signals in the physical sciences. In particular, sensor signals are noisy samples from a common underlying distribution, while human-generated forecasts are based on cognitive models that vary from one informant to another. As a result, human forecasts, unlike physical signals, are not guaranteed to be statistically independent conditioned on the true outcome. These differences both provide new opportunities for aggregation and impose restrictions that do not apply to physical signals. This paper describes the difference between forecasts and physical signals, outlines a strategy for exploiting these differences in aggregation, and demonstrates modest but statistically significant gains in the accuracy of aggregated forecasts using data from a large ongoing experiment in forecasting world events.
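The conditional-independence contrast described above can be sketched in a toy simulation; the generative model here (a shared cue common to both human forecasters) is an assumption chosen purely to illustrate the point.

```python
# Toy simulation: sensor signals are independent draws around the truth,
# while interpreted human forecasts share a common cognitive cue and so
# stay correlated even after conditioning on the true outcome.
import numpy as np

rng = np.random.default_rng(0)
truth, n = 1.0, 10_000

# Generated signals: truth plus independent per-sensor noise.
sensor_a = truth + rng.normal(0, 1, n)
sensor_b = truth + rng.normal(0, 1, n)

# Interpreted forecasts: both informants reuse the same shared cue.
shared_cue = rng.normal(0, 1, n)
human_a = truth + shared_cue + rng.normal(0, 1, n)
human_b = truth + shared_cue + rng.normal(0, 1, n)

print(np.corrcoef(sensor_a, sensor_b)[0, 1])  # ~0.0: conditionally independent
print(np.corrcoef(human_a, human_b)[0, 1])    # ~0.5: conditionally dependent
```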
Article
Full-text available
We show how machine learning and inference can be harnessed to leverage the complementary strengths of humans and computational agents to solve crowdsourcing tasks. We construct a set of Bayesian predictive models from data and describe how the models operate within an overall crowdsourcing architecture that combines the efforts of people and machine vision on the task of classifying celestial bodies defined within a citizens' science project named Galaxy Zoo. We show how learned probabilistic models can be used to fuse human and machine contributions and to predict the behaviors of workers. We employ multiple inferences in concert to guide decisions on hiring and routing workers to tasks so as to maximize the efficiency of large-scale crowdsourcing processes based on expected utility.
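The abstract does not spell out the fusion rule, so the following is only an illustrative sketch: combining one human label probability and one machine-vision score under a naive conditional-independence assumption by adding log-odds.

```python
# Naive-Bayes-style fusion of a human estimate and a machine estimate for a
# binary classification (e.g., a galaxy morphology label). The rule and the
# uniform prior are illustrative assumptions, not the paper's learned models.
import math

def logit(p):
    return math.log(p / (1 - p))

def fuse(p_human, p_machine, prior=0.5):
    # Each source contributes its evidence relative to the shared prior.
    evidence = (logit(prior)
                + (logit(p_human) - logit(prior))
                + (logit(p_machine) - logit(prior)))
    return 1 / (1 + math.exp(-evidence))

print(fuse(0.8, 0.7))  # ~0.90, stronger than either source alone
```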
Article
Full-text available
Neural networks are statistical models and learning rules are estimators. In this paper a theory for measuring generalisation is developed by combining Bayesian decision theory with information geometry. The performance of an estimator is measured by the information divergence between the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the majority of error measures currently in use. The optimal estimators also reveal some intricate interrelationships among information geometry, Banach spaces and sufficient statistics.
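In symbols (the notation below is an assumed paraphrase, not lifted from the paper), the performance measure is the information divergence from the true distribution to the estimate, averaged over the Bayesian posterior:

```latex
% Hedged restatement: estimator \hat{p} scored by posterior-averaged divergence
% from the true distribution p_\theta, given data D.
\[
  R(\hat{p}) \;=\; \mathbb{E}_{\theta \sim \pi(\theta \mid D)}
    \bigl[\, D\!\left(p_\theta \,\|\, \hat{p}\right) \bigr],
  \qquad
  D(p \,\|\, q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx .
\]
```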
Article
We are interested in aggregating forecasts of multinomial problems elicited from multiple experts. A common approach is to assign a weight to each expert, then form a weighted sum over their forecasts. Theoretical studies suggest that an important factor in such weighting is the diversity among experts. However, diversity is intrinsically a pairwise measure over experts, and does not lend itself naturally to a single weight that can be applied to an expert's forecast in a weighted average. We suggest a way to take advantage of such pairwise measures in aggregating forecasts.
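One simple way to collapse a pairwise diversity matrix into per-expert weights (the specific reduction below is an assumption for illustration, not necessarily the paper's construction):

```python
# Weight each expert by how much they disagree, on average, with the rest
# of the panel; then aggregate with those weights.
import numpy as np

forecasts = np.array([        # rows: experts; columns: multinomial outcomes
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# Pairwise diversity: mean absolute difference between forecast vectors.
diversity = np.abs(forecasts[:, None, :] - forecasts[None, :, :]).mean(axis=2)
weights = diversity.sum(axis=1)
weights /= weights.sum()      # normalize to convex weights

print(weights)                # the dissenting third expert gets the most weight
print(weights @ forecasts)    # weighted-sum aggregate forecast
```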
Conference Paper
We describe methods for routing a prediction task on a network where each participant can contribute information and route the task onwards. Routing scoring rules bring truthful contribution of information about the task and optimal routing of the task into a Perfect Bayesian Equilibrium under common knowledge about the competencies of agents. Relaxing the common knowledge assumption, we address the challenge of routing in situations where each agent's knowledge about other agents is limited to a local neighborhood. A family of local routing rules isolate in equilibrium routing decisions that depend only on this local knowledge, and are the only routing scoring rules with this property. Simulation results show that local routing rules can promote effective task routing.
Article
We show how the quality of decisions based on the aggregated opinions of the crowd can be conveniently studied using a sample of individual responses to a standard IQ questionnaire. We aggregated the responses to the IQ questionnaire using simple majority voting and a machine learning approach based on a probabilistic graphical model. The score for the aggregated questionnaire, Crowd IQ, serves as a quality measure of decisions based on aggregating opinions, which also allows quantifying individual and crowd performance on the same scale. We show that Crowd IQ grows quickly with the size of the crowd but saturates, and that for small homogeneous crowds the Crowd IQ significantly exceeds the IQ of even their most intelligent member. We investigate alternative ways of aggregating the responses and the impact of the aggregation method on the resulting Crowd IQ. We also discuss Contextual IQ, a method of quantifying the individual participant’s contribution to the Crowd IQ based on the Shapley value from cooperative game theory.
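A minimal sketch of the simple-majority aggregation step described above (the questions and answers are fabricated placeholders):

```python
# Crowd answer per question = the modal response across participants.
from collections import Counter

responses = {                   # question id -> one answer per participant
    "q1": ["B", "B", "C", "B", "A"],
    "q2": ["D", "A", "D", "D", "D"],
}

crowd_answers = {q: Counter(ans).most_common(1)[0][0]
                 for q, ans in responses.items()}
print(crowd_answers)            # {'q1': 'B', 'q2': 'D'}
```

Scoring these aggregated answers against the questionnaire's key then yields the Crowd IQ on the same scale as an individual score.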
Article
A challenge with the programmatic access of human talent via crowdsourcing platforms is the specification of incentives and the checking of the quality of contributions. Methodologies for checking quality include providing a payment if the work is approved by the task owner and hiring additional workers to evaluate contributors' work. Both of these approaches place a burden on people and on the organizations commissioning tasks, and may be susceptible to manipulation by workers and task owners. Moreover, neither a task owner nor the task market may know the task well enough to be able to evaluate worker reports. Methodologies for incentivizing workers without external quality checking include rewards based on agreement with a peer worker or with the final output of the system. These approaches are vulnerable to strategic manipulations by workers. Recent experiments on Mechanical Turk have demonstrated the negative influence of manipulations by workers and task owners on crowdsourcing systems [3]. We address this central challenge by introducing incentive mechanisms that promote truthful reporting in crowdsourcing and discourage manipulation by workers and task owners without introducing additional overhead.
Article
Voting procedures focus on the aggregation of individuals' preferences to produce collective decisions. In practice, a voting procedure is characterized by ballot responses and the way ballots are tallied to determine winners. Voters are assumed to have clear preferences over candidates and attempt to maximize satisfaction with the election outcome by their ballot responses. Such responses can include strategic misrepresentation of preferences.

Voting procedures are formalized by social choice functions, which map ballot response profiles into election outcomes. We discuss broad classes of social choice functions as well as special cases such as plurality rule, approval voting, and Borda's point-count method. The simplest class is voting procedures for two-candidate elections. Conditions for social choice functions are presented for simple majority rule, the class of weighted majority rules, and for what are referred to as hierarchical representative systems.

The second main class, which predominates in the literature, embraces all procedures for electing one candidate from three or more contenders. The multicandidate elect-one social choice functions in this broad class are divided into nonranked one-stage procedures, nonranked multistage procedures, ranked voting methods, and positional scoring rules. Nonranked methods include plurality check-one voting and approval voting, where each voter casts either no vote or a full vote for each candidate. On ballots for positional scoring methods, voters rank candidates from most preferred to least preferred. Topics for multicandidate methods include axiomatic characterizations, susceptibility to strategic manipulation, and voting paradoxes that expose questionable aspects of particular procedures.

Other social choice functions are designed to elect two or more candidates for committee memberships from a slate of contenders. Proportional representation methods, including systems that elect members sequentially from a single ranked ballot with vote transfers in successive counting stages, are primary examples of this class.
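To set three of the surveyed social choice functions side by side, here is a sketch on a made-up profile of five ranked ballots (the approval rule's "top two" cutoff is an assumption for the example):

```python
# Plurality, approval, and Borda tallies over candidates A, B, C.
from collections import Counter

ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "C", "A"],
           ["B", "C", "A"], ["C", "B", "A"]]

# Plurality: count first-place votes only.
plurality = Counter(b[0] for b in ballots)

# Approval (assuming each voter approves their top two candidates).
approval = Counter(c for b in ballots for c in b[:2])

# Borda: with three candidates, 2/1/0 points per ballot position.
borda = Counter()
for b in ballots:
    for points, c in enumerate(reversed(b)):
        borda[c] += points

print(plurality)  # A and B tie on first-place votes
print(approval)   # B and C tie under top-two approval
print(borda)      # B wins the point count
```

Even on this small profile the three rules disagree about the winner, echoing the voting paradoxes mentioned above.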
Article
Private information is typically modeled as signals. A joint probability distribution captures relationships between signals and between signals and relevant variables. In this paper, we define and contrast two types of signals: generated and interpreted. We demonstrate that even though the standard assumption of conditional independence is a reasonable benchmark assumption for generated signals, it imposes a specific and unlikely structure on interpreted signals. We also show that independent interpreted signals are negatively correlated in their correctness, whereas generated signals can be independent. Our findings may limit the contexts in which many models of information aggregation and strategic choices in auctions, markets, and voting apply.
Statistical Decision Rules and Optimal Inference
  • N. N. Cencov
Factor Based Regression Models for Forecasting
  • C.-C. Cheng
  • R. Sasseen