Expert Measurement and Mechanical Combination

Article in Organizational Behavior and Human Performance 7(1):86-106 · February 1972
DOI: 10.1016/0030-5073(72)90009-8
Abstract
The expert can and should be used as a provider of input for a mechanical combining process since most studies show mechanical combination to be superior to clinical combination. However, even in expert measurement, the global judgment is itself a clinical combination of other judgmental components and as such it may not be as efficient as a mechanical combination of the components. The superiority of mechanically combining components as opposed to using the global judgment for predicting some external criterion is discussed. The use of components is extended to deal with multiple judges since specific judges may be differentially valid with respect to subsets of components for predicting the criterion. These ideas are illustrated by using the results of a study dealing with the prediction of survival on the basis of information contained in biopsies taken from patients having a certain type of cancer. Judgments were made by three highly trained pathologists. Implications and extensions for using expert measurement and mechanical combination are discussed.
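To make the core idea concrete, here is a minimal sketch (in Python) of mechanically combining one judge's component ratings with a fitted linear rule and comparing its validity against the judge's own global rating. All data, cue scales, noise levels, and weights below are invented for illustration; the article's actual study used three pathologists' ratings of biopsy characteristics to predict survival, not this simulation.

```python
# Hypothetical illustration of mechanical combination of judgmental components
# versus a global clinical judgment. All numbers and weights are invented;
# this is not the article's model or data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_cases = 200

# One judge's ratings of three component cues (e.g., on 1-9 scales).
components = rng.integers(1, 10, size=(n_cases, 3)).astype(float)

# Simulated external criterion (e.g., survival time): driven by the cues plus noise.
criterion = components @ np.array([0.6, 0.3, 0.1]) + rng.normal(0.0, 1.5, n_cases)

# The judge's global judgment: the same cues combined "in the head",
# with extra noise representing inconsistent intuitive weighting.
global_judgment = components @ np.array([0.5, 0.4, 0.1]) + rng.normal(0.0, 2.5, n_cases)

# Mechanical combination: fit a linear rule to the component judgments.
model = LinearRegression().fit(components, criterion)
r_mechanical = np.corrcoef(model.predict(components), criterion)[0, 1]
r_global = np.corrcoef(global_judgment, criterion)[0, 1]

print(f"validity (r) of mechanical combination of components: {r_mechanical:.2f}")
print(f"validity (r) of the global clinical judgment:          {r_global:.2f}")
```

Under these assumed noise levels, the consistent mechanical rule typically correlates more strongly with the criterion than the noisier global judgment, which is the pattern the abstract describes.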


  • ... Inter-rater reliability (IRR) levels on Q2-Q4, however, were much lower, with Krippendorff's Alpha scores of 0.17, 0.12, and 0.20. These values would be low for non-expert raters on simple tasks, but are consistent with many studies of expert agreement and IRR in the medical literature (e.g., Einhorn 1972; Elmore et al. 1994; Beam et al. 1996; Chiang et al. 2007). For example, a study assessing the diagnostic abilities of 32 medical interns by 12 full-time medical faculty using 128 different clinical evaluation exercise methods found IRRs ranging from 0.00 to 0.63, with a mean of only 0.23 (Kroboth et al. 1992). ...
  • ... Dawes, Faust, & Meehl, 1989; Dawes & Kagan, 1988; Grove, Zald, Lebow, Snitz, & Nelson, 2000). Whereas models (or machines) are better at information processing and are consistent (Einhorn, 1972; Goldberg, 1970), we humans suffer cognitive and other biases that make us poor judges of probabilities (cf. Kahneman, Slovic, & Tversky, 1982; Kahneman & Tversky, 1973; Lichtenstein, Baruch, & Phillips, 1982; Rabin, 1996). ...
    ... Nevertheless, human intelligence and judgments are often still valuable in predicting events in real-life situations, as we still possess three critical elements: first, human intelligence still outperforms even the most advanced computational systems when it comes to the acquisition and understanding of many kinds of information. This is especially true for unstructured information (Einhorn, 1972; Kleinmuntz, 1990), and this human advantage, while eroding surprisingly quickly in some domains, is not likely to completely disappear anytime soon. ...
    ... Third, this knowledge of the world also enables us to easily identify "broken-leg" situations (Camerer & Johnson, 1991; Johnson, 1988; Meehl, 1954). Several researchers (e.g., Blattberg & Hoch, 1990; Bunn & Wright, 1991; Einhorn, 1972; Seifert & Hadida, 2013) considered these different strengths of human experts and models, and contemplated the complementary nature of humans and models in making predictions. For instance, in Blattberg and Hoch's study, a "50% Model + 50% Manager" ...
  • ... sum of exhaustive and mutually exclusive events often exceeds one). Some researchers (e.g., Tversky & Kahneman, 1974) attribute these observed biases to the operation of various heuristics. Others (e.g., Costello & Watts, 2014; Erev et al., 1994) demonstrate that some of these empirical (ir)regularities can be explained without invoking heuristics. Einhorn (1972) argued that direct judgments are inferior because of the following: (i) they can leave out important ...
    ... components; (ii) components might be weighted in suboptimal ways; and/or (iii) because of limited cognitive resources, internal combination rules take away attention from the proper assessment of events. An alternative approach is to decompose an event into its components, assess the probabilities of these basic components, and then combine these judgments (Dawes, Faust & Meehl, 1989; Einhorn, 1972; Grove, 2005; Kleinmuntz, 1990; Zhao, Shah & Osherson, 2009); a small numeric sketch of this decompose-and-recombine idea appears after this list. This approach is akin to the "Divide And Conquer" (DAC) strategy that suggests that complex decision problems benefit from being decomposed into smaller and more manageable components that can then be logically aggregated to derive an overall value (e.g., Morera & Budescu, 1998). ...
    ... Our studies suggest that the pair-wise comparison approach is a promising alternative to the traditional approach of eliciting subjective probabilities from subject matter experts, who are often called upon to provide probabilistic estimates for uncertain events with serious consequences (e.g., predicting the likelihood of containment failure in nuclear plant accidents; U.S. Nuclear Regulatory Commission, 1975). Experts, although knowledgeable in their respective fields, are not immune to biases induced by the use of heuristics (Einhorn, 1972; Morgan & Keith, 1995). A slight, but insignificant, majority preferred the traditional direct elicitations to the pair-wise comparisons, but significantly fewer chose to return to review and revise their responses in the pair-wise comparisons. ...
  • ... The literature on performance assessment distinguishes between linear and non-linear models (e.g., Dawes & Corrigan, 1974; Einhorn, 1970, 1971, 1972; Hogarth & Karelaia, 2007; Ogilvie & Schmitt, 1979). Linear models are characterised by the premise that assessors combine important information in an additive, linear way. ...
    ... A framework that incorporates these assumptions represents a noncompensatory model (e.g., Einhorn, 1971). In a state of information overload, assessors may use a non-compensatory model as a cognitive simplification (Einhorn, 1971, 1972). ...
    ... Assessors apply various criteria that are relevant to the performance being assessed (Einhorn, 1972). However, assessors do not always combine criteria in a linear, predictive way. ...
  • ... Substantial evidence from multiple domains suggests that models usually yield better (and almost never worse) predictions than do individual human experts [3] [4]. Whereas models (or machines) are better at information processing and are consistent [5], humans suffer cognitive and other biases that make them bad judges of probabilities [6] [7]. In addition, factors such as fatigue can produce random fluctuations in judgment [3]. ...
    ... Nevertheless, humans are still valuable in real-life prediction situations, for at least two good reasons. First, humans are still better at tasks requiring the handling of various types of information – especially unstructured information – including retrieval and acquisition [5] [13], categorizing [14], and pattern recognition [15] [16]. Second, humans' common sense is required to identify and respond to "broken-leg" situations [17] in which the rules normally characterizing the phenomenon of interest do not hold. ...
    ... The scarcity of both theoretical and empirical work to that end is conspicuous. Previous work [5] [18] [19] emphasized the complementary nature of humans and models, but did not stress the potential of improving predictions by combining predictions from multiple humans and models. We know, however, that combining forecasts from multiple independent, uncorrelated forecasters leads to increased forecast accuracy whether the forecasts are judgmental or statistical [20] [21] [22]. ...
  • ... Putting forward a criterion did not necessarily lead to a certain score. Based on the present data set, it may be concluded that assessors combine criteria in a nonlinear, non-additive way (e.g., Einhorn, 1970, 1972; Brannick & Brannick, 1989). They made certain observations, combined information in various ways, searched for additional information to reinforce their opinion, or stopped after having found some critical piece of evidence. ...
  • ... professionals whom they consider an expert. One such example is that of Phelps (1978), who asked professionals in agriculture whom they considered the best expert (yielding four subjects). A critical flaw to this approach is the 'popularity effect' – an individual better known to their peers is more likely to be seen as an expert (Shanteau et al., 2002). Einhorn (1972, 1974) proposed that consensus, that is, agreement between subjects, is a necessary condition for expertise. If there is disagreement between subjects, then at least some of the would-be experts are not really what they claim to be. Ashton (1985) studied the relation between consensus and calibration for predictions of interest ...
  • ... Mechanical aggregation refers to the idea that once one makes judgments about components, one can combine these judgments into an overall decision in a mechanistic and a priori defined way, which itself leads to improved decision making (Einhorn 1972). ...
  • ... Right from the beginning, CAP2 tended towards a very low SA score, yet it followed an extensive discussion before they arrived at a decision: "On a difficult, bad weather, circling approach into an area where there's high terrain, the fact that you can't recall what you were going to descend to, what way you were going to turn on the missed approach, definitely puts [...]" They knew that a score of 1 in any of the performance components would immediately lead to a failing rating, which in turn makes a repeat of the whole simulator session unavoidable. Here, "not turning the correct way in an area with high terrain" is treated as a non-compensatory criterion (e.g., EINHORN, 1972): no matter how good the captain performed previously or following this situation, he cannot compensate for his behavior with excellence in another area of his performance. ...
    ... Instead, the interpretation of the same criteria may vary considerably. Our data support the contention that assessors combine criteria in a nonlinear, nonadditive way (e.g., EINHORN, 1970, 1972; BRANNICK & BRANNICK, 1989). They make certain observations, combine information in different ways, apply compensational and noncompensational criteria (even to the same criteria), search for further information to back up their opinion, or stop after having found some critical evidence. ...
  • ... Information integration is one weakness of human information processing (Libby and Libby 1989). Mathematical models perform better in integrating information than do humans (Einhorn 1972; Dawes 1979; Jiambalvo and Waller 1984; Kachelmier and Messier 1990). A number of studies show that improving subjects' ability to integrate information enhances their decision making and reduces their cognitive biases (Nelson et al. 1995; Bonner et al. 1996). ...
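Several of the excerpts above (notably the Por and Budescu material) describe decomposing an event into components, judging the components, and recombining the judgments mechanically. The following sketch illustrates that decompose-and-recombine idea with invented numbers; the probabilities are hypothetical and are not taken from any of the cited studies.

```python
# Hypothetical "divide and conquer" sketch: rather than asking a judge for the
# joint probability P(smoking and lung cancer) directly, elicit simpler
# component judgments and recombine them with the rules of probability.
# All values are invented for illustration.

p_smoking = 0.30                  # judged P(S)
p_cancer_given_smoking = 0.15     # judged P(C | S)
p_cancer_given_nonsmoking = 0.02  # judged P(C | not S)

# Mechanical recombination of the component judgments:
p_joint = p_smoking * p_cancer_given_smoking                        # P(S and C)
p_cancer = p_joint + (1.0 - p_smoking) * p_cancer_given_nonsmoking  # P(C)

print(f"P(smoking and cancer) = {p_joint:.3f}")   # 0.045
print(f"P(cancer)             = {p_cancer:.3f}")  # 0.059

# A direct, intuitive judgment of P(S and C) can violate these identities
# (for example, exceed P(S) or P(C)); the mechanical recombination cannot.
```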
Article
    Studied characteristics of expertise in situations where judges deal with multidimensional information. Psychometric criteria were advocated as being indicative of expert judgment: (a) experts should tend to cluster variables in the same way when identifying and organizing cues; (b) expert judgment should be highly reliable (intrajudge reliability), show both convergent and discriminant...
Article
    Discusses the factors that affect decision making and presents a theoretical unity model developed to approximate the conjunctive and disjunctive models and to provide for part of the a priori classification scheme necessary to go from a purely ipsative model of behavior to some more general scheme. These models are shown to give a better fit to certain decision data than the linear model, and...
Article
    The present research was designed to assess the effect of two variables as they affect the use of nonlinear, noncompensatory models in decision making. These two variables were type of decision task and amount of information. The former variable was found to have a marked effect on the kind of combination model used by subjects, while the latter variable had a significant effect on the...
Article
    A major problem in naturalistic studies is the defining of cues. The problem of cue definition leads to a consideration of the residual variance in modelling the judge. The part of judgment that cannot be modelled (i.e., the residual) is assumed to consist of cues not adequately defined and mis-specification of the integration function. The major question asked is whether the residual portion...