Performance-based assessment of expertise: How to decide if someone is an expert or not

James Shanteau^a,*, David J. Weiss^b, Rickey P. Thomas^a, Julia C. Pounds^c

^a Department of Psychology, Kansas State University, Bluemont Hall 492, 1100 Mid-Campus Drive, Manhattan, KS 66506-5302, USA
^b California State University, Los Angeles, CA, USA
^c Federal Aviation Administration, Oklahoma City, OK, USA

Received 15 December 1999; accepted 15 April 2000
Abstract

The identification of an expert is vital to any study or application involving expertise. If an external criterion (a "gold standard") exists, then identification is straightforward: simply compare people against the standard and select whoever is closest. However, such criteria are seldom available for the domains where experts work; that is why experts are needed in the first place. The purpose here is to explore various methods for identifying experts in the absence of a gold standard. One particularly promising approach (labeled CWS for Cochran-Weiss-Shanteau) is explored in detail. We illustrate CWS through reanalyses of three previous studies of experts. In each case, CWS provided new insights into identifying experts. When applied to auditors, CWS correctly detected group differences in expertise. For agricultural judges, CWS revealed subtle distinctions between subspecialties of experts. In personnel selection, CWS showed that irrelevant attributes were more informative than relevant attributes. We believe CWS provides a valuable tool for the identification and evaluation of experts. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Psychology; Expert systems; Auditing; Management; Agriculture
1. Introduction

Although experts have been studied for over a century (Shanteau, 1999), a critical question remains unanswered: how can we decide who is, and who is not, an expert? If there is an external criterion (a "gold standard"), the answer is straightforward. All we have to do is compare a would-be expert's judgments to the correct answer. If a person's answers are close to correct, then he or she is an "expert." If not, not.

This validity-based approach is compelling in its simplicity. Unfortunately, it is problematic in application. The difficulty is that experts are needed precisely in domains where correct answers seldom exist (Gigerenzer et al., 1999; Shanteau, 1995). Indeed, if we could compute (or look up) correct answers, why would we need an expert at all?
European Journal of Operational Research 136 (2002) 253-263
* Corresponding author. Tel.: +1-785-532-0618; fax: +1-785-532-5401.
E-mail address: shanteau@ksu.edu (J. Shanteau).
The purpose of this paper is to explore the application of a new measure of expertise (labeled CWS for Cochran-Weiss-Shanteau) for identifying expertise in the absence of external criteria. The measure is based on the behavior of would-be experts, using their performance in the domain. In effect, this is a bootstrap approach in which the individual's own decisions are used to validate (or invalidate) his or her claim to expertise.

The remainder of this paper is organized into five sections: In Section 2, we review approaches used in prior research to identify would-be experts. In Section 3, we introduce our proposed approach to the identification of expertise. In Section 4, we apply this approach to several previously conducted studies of experts. In Section 5, we note caveats and restrictions that should be considered when applying CWS. Finally, we offer our conclusions about the future of CWS.
2. Prior approaches

Many approaches have been used by previous investigators to identify experts. Nine of these traditional approaches will be summarized here. We also consider the advantages and, more importantly, the disadvantages of each approach.
2.1. Experience

In many studies, the number of years of job-relevant experience is used as a surrogate for expertise. Participants with many years of experience are classified as "experts," while others with little experience are labeled "novices." On the surface, this approach appears convincing. After all, no one can function as an "expert" for any length of time while being totally incompetent.

Although the argument can be made that experts almost always have considerable experience, the converse does not necessarily follow. There are many examples of professionals with considerable experience who never become experts. Such individuals may even work with top experts, but they seldom rise to the performance levels required for true expertise.
In a study of grain judges, for instance, Trumbo et al. (1962) found that the number of years of experience did not correlate with accuracy of wheat grading. Instead, their results showed a different trend: judges with more experience systematically overrated grain quality (an interesting form of "grade inflation"). Similarly, Goldberg (1968) asked clinical psychologists with varying degrees of experience to diagnose psychiatric patients. He found no relation between experience and accuracy of the diagnoses; however, the confidence of clinicians in their diagnoses did increase with experience.

Although there are undoubtedly instances where a positive relationship exists between experience and performance, there is little reason to expect this to apply universally. At best, experience is an uncertain predictor of degree of expertise. At worst, experience reflects seniority, and little more.
2.2. Certification

In many professions, individuals receive some form of accreditation or title as a reflection of their skill. For instance, doctors may be "board certified" and university faculty may be "full professors." Generally, it is safe to say that a certified individual is more likely to be an expert than someone who is uncertified.

The problem with certification is that it is more often tied to years on the job than to professional performance. This can be particularly true in bureaucracies. In military photo interpretation, for instance, the rank of the individuals can vary from Sergeant to Major. Yet performance is unrelated to rank (Tod Levitt, personal communication).
Another example occurs in the Israeli Air Force, where the lead pilot in a battle is identified by skill rather than rank; that means a General may follow a Captain. This has been cited as one reason for the superiority of the Israelis in air combat against Arab air forces (where lead pilots are usually determined by rank). The Israelis recognized that talent is not always reflected by formal certification (R. Lipshitz, personal communication).

Another problem with certification is the "ratchet-up effect": people generally move up the certification ladder, but seldom down. Once certified, the recipient is accredited for life. Even if the skill level of the individual suffers a serious decline, the title or rank remains. (Just ask students about the teaching ability of some senior professors.)
2.3. Social acclamation

One method used by many researchers (including the present authors) has been to rely on identification of experts by people working in the field. That is, professionals are asked whom they consider to be an expert. When there is some agreement about the identification of such an individual, that person is then labeled an expert by "social acclamation."

In her analysis of livestock judges, for example, Phelps (1977) asked professionals in agriculture whom they considered the best. From their answers, she identified four top livestock judges to be the experts in her investigation (for further details on this study, see below).
Absent other means of identifying experts, acclamation is a reasonable strategy to follow. It is unlikely that multiple professionals working in a field would identify the same unqualified person as an expert. If they agree, it seems safe to assume that the agreed-upon person is an expert. The problem with this approach is a "popularity effect": someone better known by his or her peers is more likely to be identified as an expert. Meanwhile, another person outside the peer group is unlikely to be seen as an expert, even though that person may be on the cutting edge of new knowledge. Indeed, those who make new discoveries in a field are frequently unpopular in the eyes of their peers at the time of their breakthroughs.
2.4. Consistency (within-person reliability)

Einhorn (1972, 1974) argued that intra-person (within) reliability is a necessary condition for expertise. That is, an expert's judgments should be internally consistent. Conversely, inconsistency would be prima facie evidence that the person is not an expert.
Table 1 lists within-person consistency values from eight prior studies of experts. The four vertical categories correspond to a classification of task difficulty proposed by Shanteau (1999). There are two domains listed for each category, with internal consistency correlations. For example, the average consistency for weather forecasters (a decision-aided task) is quite high at 0.98. For stockbrokers (an unaided task), the average consistency is less than 0.40.

As might be expected, aided tasks produce higher internal consistency values than unaided tasks. To a first approximation, therefore, it appears that intra-person reliability corresponds closely to the performance level of experts in different domains.
The diculty with this approach is that some-
one can be consistent by following some simple,
but incorrect rule. As long as the rule is followed
routinely, the person's behavior will exhibit high
consistency. For example, by always answering
``yes'' and ``no'' to alternate questions, one can be
perfectly repeatable. But such answers would gen-
erally be inappropriate. Thus, internal consistency
Table 1
Reliability (consistency) values across levels of expert performancea
Highest levels of performance Lowest levels of performance
Aided decisions Competent Restricted Unaided decisions
Weather forecasters Livestock judges Clinical psychologists Stockbrokers
r0:98 r0:96 r0:44 r<0.40
Auditors Grain inspectors Pathologists Polygraphers
r0:90 r0:62 r0:50 r0:91
a
The values cited in this table (left±right and top±bottom) were drawn from the following: Stewart et al. (1997), Phelps and Shanteau
(1978), Goldberg and Werts (1966), Slovic (1969), Kida (1980), Trumbo et al. (1962), Einhorn (1974), and Raskin and Podlesny (1979).
J. Shanteau et al. / European Journal of Operational Research 136 (2002) 253±263 255
is a necessary condition ± an expert could
hardly behave randomly ± but not sucient for
expertise.
2.5. Consensus (between-person reliability)

Einhorn (1972, 1974) argued that agreement between individuals is a necessary condition for expertise. That is, he believed that experts in a given field should agree with each other (also see Ashton, 1985). If there is disagreement, then it suggests that one, some, or all of the would-be experts are not really what they claim to be.
Table 2 lists average between-expert correlations for the same studies listed in Table 1. For instance, the consensus correlations for weather forecasters and stockbrokers are 0.95 and 0.32, respectively. Except for pathologists, the consensus values are similar to, but lower than, the corresponding consistency values in Table 1.
Livestock judges and polygraphers display quite different consistency and consensus results. Further analysis reveals that there are several schools of thought in these domains about how to make decisions. Thus, experts from each school may be internally consistent, but show sizable disagreement with experts from another school. This could explain the discrepancy between the high consistency values and the low consensus values in these two domains.
On the surface, consensus appears to be a
compelling property for experts. After all, we feel
quite uncomfortable when two or more experts
(such as doctors) argue about which is the correct
procedure to follow. When the experts agree, on
the other hand, we feel more comfortable with the
mutually agreed-upon course of action.
The problem with consensus is that agreement can result from premature closure, e.g., groupthink (Janis, 1972). There are many illustrations where the best answer was not the one identified by a group of experts because they focused initially on an inferior alternative and thus became blind to better options. Therefore, many experts may agree, but they may all be wrong (Shanteau, in press; Weiss and Shanteau, in press).
2.6. Discrimination ability

Hammond (1996) and others have pointed out that the ability to make fine discriminations between similar, but not equivalent, cases is a defining skill of experts. That is, an expert must be able to perceive and act on subtle differences that a non-expert may often overlook. In the study of livestock judges by Phelps described below, the researchers were able to develop quantitative models of the experts' judgments. However, it proved impossible for these researchers to apply the models to actual livestock, due to the difficulty of perceiving the appropriate characteristics of animals. Thus, knowing how to combine information is of no value without knowing what information to combine.
Although it seems clear that discrimination is a necessary condition for expertise, there is a catch. A non-expert may well differentiate between cases using some easily identifiable, but irrelevant, attribute. For instance, it is easy to distinguish between livestock based on the length or curliness of their tails. However, tail characteristics play no role in the meat quality of farm animals (Bill Able, personal communication). Thus, discrimination ability is a necessary, but not sufficient, condition for identifying experts.

Table 2
Reliability (consensus) values across levels of expert performance^a

              Highest levels of performance              Lowest levels of performance
              Aided decisions       Competent            Restricted                Unaided decisions
Domain        Weather forecasters   Livestock judges     Clinical psychologists    Stockbrokers
Consensus     r = 0.95              r = 0.50             r = 0.40                  r = 0.32
Domain        Auditors              Grain inspectors     Pathologists              Polygraphers
Consensus     r = 0.76              r = 0.60             r = 0.55                  r = 0.33

^a The values cited in this table (left-right and top-bottom) were drawn from the following: Stewart et al. (1997), Phelps and Shanteau (1978), Goldberg and Werts (1966), Slovic (1969), Kida (1980), Trumbo et al. (1962), Einhorn (1974), and Lykken (1979).
2.7. Behavioral characteristics

Research by Abdolmohammadi and Shanteau (1992; also see Shanteau, 1989) found that expert auditors share many common behavioral characteristics. Some examples are self-confidence, creativity, perceptiveness, communication skills, and stress tolerance. A complete list of characteristics (along with their definitions) appears in the original paper.

Because many experts exhibit such traits, Abdolmohammadi and Shanteau proposed that behavioral characteristics might be used to develop a "trait profile" of experts. If appropriate tests can be identified or constructed, then would-be experts would take such tests. Those who score closest to the profile of established experts would then become potential experts.

Although this approach has considerable potential, there are three critical problems. First, the required tests for several of these characteristics (e.g., communication skills or tolerance of stress) do not exist. Second, even if they did, the tests would have to be normalized for a domain (e.g., auditors). Third, the extent to which non-experts may also share these same characteristics is unclear. Thus, although this approach holds promise, more work is needed before experts can be identified using their behavioral characteristics.
2.8. Knowledge tests

In studies of problem solving or game playing, experts are often identified based on tests of factual knowledge. For example, Chi (1978) used knowledge about dinosaurs to separate children into experts and novices.

Knowledge of relevant facts is clearly a prerequisite for expertise. Someone who knows nothing about a domain will be unable to make competent decisions. Yet knowledge alone is not sufficient to establish that someone is an expert. In the Chi study, for example, knowledge about different types of dinosaurs is not enough to know what they ate, where they lived, how long they survived, or why they died out.

The problem is that it takes more than knowledge of facts to achieve expertise. It is also necessary to see which facts to apply in a given situation. In most domains, that is the hard part.
2.9. Creation of experts

In certain contexts, it is possible for experts to be "created" through extensive training by researchers. This approach has significant advantages, including the fact that the development of expertise can be studied longitudinally. Moreover, the skills learned are under the direct control of the researchers.

One notable example of this approach is a student who worked with William Chase at Carnegie Mellon University to enhance his short-term memory span (Chase and Ericsson, 1981). Because the student was a track athlete, he learned to translate groups of digits into times for various running distances. When asked to retrieve the digits, he recalled the times in clusters tied to running. Using this strategy, the student broke the old record for short-term memory span of 18 digits established by a German mathematician. The new record: over 80! (Other students have since extended the record beyond 100.)

Experts can be created in this way for certain narrow tasks, e.g., to play computer games or work in a simulated microworld environment. In most realms of expertise, however, a broad range of skills is required, based on years of training and experience. For instance, becoming a medical doctor can take a dozen years just to get started. Obviously, training students for a few months cannot simulate such expertise.
3. A new approach

As the preceding survey shows, many approaches have been advanced for identifying experts. Each of these approaches, however, has one or more serious flaws. No generally acceptable approach exists at the present time. To fill this gap, the two senior authors (Weiss and Shanteau, submitted) proposed a new approach for defining expertise. They combined two necessary, but not sufficient, measures into a single index.

First, they agreed with Hammond (1996) that discrimination is critical for an expert. The ability to differentiate between similar, but not identical, cases is a hallmark of expertise. That is, experts perceive and act on subtle distinctions that others miss. Second, they followed Einhorn's (1974) suggestion that consistency, or within-person reliability, is necessary in an expert. If someone cannot repeat his or her judgment in a similar situation, then he or she is unlikely to be an expert.

Discrimination refers to a judge's differential evaluation of different stimulus cases. Consistency refers to a judge's evaluation of the same stimuli over time; inconsistency is its complement.
3.1. CWS ratio

As shown in Eq. (1), Weiss and Shanteau combine discrimination and consistency into a ratio. The CWS ratio will be large when a judge discriminates consistently, but will be small if the judge either discriminates less or has lower consistency:

    CWS = Discrimination / Inconsistency.    (1)

Our construction of this index parallels Cochran's (1943) suggestion to use a ratio of variances to assess the quality of a response instrument. (Another reason for using variance ratios is that they are asymptotically efficient (I.R. Goodman, personal communication).) Cochran argued that an effective instrument should allow participants to express perceived differences among stimuli in a consistent way. We view an effective expert in the same way. We acknowledge our intellectual debt to Cochran by referring to our performance-based index as CWS.
The intuition underlying the index is that a good measuring tool necessarily has a high CWS ratio. That is, a proper instrument yields different measures for different objects, and gives the same measure whenever it is applied to the same object. A ruler, for example, discriminates among objects of varying length, and produces identical scores for the same objects. Thus, a proper measuring instrument will produce a high CWS value as defined in Eq. (1).

Similarly, an expert must be both discriminating and consistent. It is easy to display one or the other, but hard to do both. One can show discrimination by generating a wide variety of responses over stimuli; one can exhibit consistency by repeating the same response to all stimuli. But adopting either of these strategies alone means that the other property will be lost. To display both properties simultaneously requires careful assessment of the stimuli, the essence of expert judgment.
3.2. Using CWS

CWS can be estimated by asking would-be experts to make judgments of a series of stimulus cases; this allows for assessment of their discrimination ability. In addition, at least some of the cases should be repeated; this allows for assessment of their consistency.

Discrimination and inconsistency values can be estimated using a variety of analytic procedures, such as analysis of variance or multiple regression. It is important to emphasize that the use of ratios is descriptive, not inferential. That is, CWS is more a qualitative tool than a quantitative one. There are no comparisons to statistical tables and no determinations of significance. Rather, CWS is used to establish that someone behaves more (high value) or less (low value) like an expert.

To rank-order two (or more) would-be experts, CWS ratios can be compared using a procedure developed by Schumann and Bradley (1959). This allows the researcher to determine whether one individual is performing better than another (Weiss, 1985).
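As a concrete illustration of this estimation procedure, the sketch below computes a CWS ratio from replicated judgments, with discrimination taken as the variance of the per-case mean judgments and inconsistency as the average within-case variance. This is one minimal reading of the ANOVA-style estimation described above, not the authors' actual analysis code; the function name and data are hypothetical.

```python
from statistics import mean, pvariance

def cws_ratio(ratings_by_stimulus):
    """Estimate CWS = discrimination / inconsistency for one judge.

    ratings_by_stimulus: list of lists, each inner list holding the
    judge's repeated ratings of a single stimulus case.
    """
    case_means = [mean(reps) for reps in ratings_by_stimulus]
    discrimination = pvariance(case_means)  # between-case variance
    inconsistency = mean(pvariance(reps)    # mean within-case variance
                         for reps in ratings_by_stimulus)
    return discrimination / inconsistency

# A judge whose repeated ratings agree and spread across cases scores high...
consistent_judge = [[10, 11], [50, 49], [90, 91]]
# ...while a judge whose repeated ratings wander scores low.
erratic_judge = [[10, 45], [50, 15], [90, 55]]

print(cws_ratio(consistent_judge) > cws_ratio(erratic_judge))  # True
```

Note that a judge who gives the identical rating to every case earns zero discrimination, while one who cannot repeat a rating inflates the denominator; only doing both well yields a high ratio, which is the point of the index.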
4. Reanalyses of prior studies
In this section, we apply CWS to three previous
studies of experts. By reanalyzing these results, we
hope to show the utility of CWS in a variety of
contexts.
4.1. Audit judgment

Ettenson (1984) asked two groups of auditors to evaluate 24 financial cases described by a common set of cues. One group of 15 expert auditors was recruited from Big Six accounting firms in Omaha, Nebraska. The expert group included audit seniors and partners, with 4-25 years of audit experience. For comparison, 15 novice accounting students were recruited from two large Midwestern universities.
Every financial case was described using 16 cues, each of which was given either a high or a low value. For example, net income was set at either a high or a low number. For each case, participants were asked to make a going-concern assessment. A fractional factorial design was used to generate 16 cases. Eight of these cases were then replicated to produce a total of 24 stimuli; participants were not told that some cases were identical. The order of presentation of the cases was randomized.

Based on feedback from an auditor collaborator, the cues were classified as either "diagnostic" (e.g., net income), "partially diagnostic" (e.g., aging of receivables), or "non-diagnostic" (e.g., prior audit results). From analysis of the fractional design, discrimination was estimated from the mean square values for each cue (high variance implies high discrimination). Inconsistency was estimated from the average of the within-cell variances (low variance implies high consistency). The ratio of discrimination variance divided by inconsistency variance was computed to form separate CWS values for diagnostic, partially diagnostic, and non-diagnostic cues.
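The per-cue computation just described can be sketched as follows. Assuming a two-level (high/low) cue, the mean square for the cue (one degree of freedom) serves as discrimination, and the average within-cell variance over the replicated cases serves as inconsistency. This is a hypothetical illustration of the mean-square ratio, not Ettenson's analysis; the function name and toy data are invented.

```python
from statistics import mean, pvariance

def cue_cws(high_ratings, low_ratings, replicated_cases):
    """CWS for one two-level cue: MS(cue) / mean within-cell variance."""
    grand = mean(high_ratings + low_ratings)
    # Mean square for a two-level cue (1 df): sum of n * (level mean - grand)^2.
    ms_cue = (len(high_ratings) * (mean(high_ratings) - grand) ** 2
              + len(low_ratings) * (mean(low_ratings) - grand) ** 2)
    # Average within-cell variance across the replicated cases.
    inconsistency = mean(pvariance(case) for case in replicated_cases)
    return ms_cue / inconsistency

# Going-concern ratings when a hypothetical diagnostic cue is high vs. low,
# plus two cases rated twice each to estimate inconsistency.
high = [80, 85, 78, 83]
low = [20, 25, 22, 23]
repeats = [[80, 82], [22, 24]]
print(cue_cws(high, low, repeats))  # large ratio: the judge tracks this cue
```

A judge who responds strongly and reliably to a cue yields a large ratio for that cue; a cue the judge ignores yields a ratio near zero, which is how the diagnostic/non-diagnostic contrast below arises.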
The results in Table 3 show that average CWS values decline systematically as the diagnosticity of the cues declines. For the expert group (first row in Table 3), the differences are notable, especially between diagnostic and partially diagnostic cues. For the novice group (second row in the table), there is a similar but less pronounced decline. More important, there is a sizable difference between experts and novices for diagnostic cues. The size of this difference is smaller for partially diagnostic cues, and non-existent for non-diagnostic cues.

For diagnostic cues, CWS clearly distinguishes between experts and novices. Moreover, the size of the difference between the groups declines for less diagnostic cues. These results show that CWS can distinguish between expert and novice groups.
4.2. Livestock judgment

Phelps (1977) had four professional livestock judges evaluate 27 drawings of gilts (female pigs). These drawings were created by an artist to yield a 3 x 3 x 3 (size x breeding x meat quality) factorial design. The judges independently evaluated each gilt for breeding quality (how good is the animal for reproduction) and slaughter quality (how good is the meat from the animal). All stimuli were presented three times, although the judges were not told that they were being shown the same drawings.
Two of the judges were nationally recognized experts in the assessment of swine and were very familiar with gilts of the sort shown in the drawings. The other two were nationally recognized experts as cattle judges; although they were knowledgeable about swine judging, they lacked day-to-day familiarity and experience.
Table 3
Average CWS values for two groups of auditors with three categories of cues^a

           Diagnostic   Partially diagnostic   Non-diagnostic
Experts    13.10        6.42                   3.32
Novices    8.08         5.13                   3.03

^a Results based on a reanalysis of Ettenson (1984).

For breeding judgments (upper panel of Table 4), swine experts produced the largest CWS values for the breeding and meat cues. In comparison, cattle experts produced large CWS values only for the meat cue. This apparently reflects the cattle judges' unfamiliarity with the breeding characteristics of swine; meat quality characteristics, however, were readily emphasized by the cattle judges.

For slaughter judgments (lower panel of Table 4), the meat cue dominates for both swine and cattle judges. However, there is more than a 2-to-1 difference in the magnitude of CWS for meat between swine and cattle judges. The breeding and size dimensions were small for both types of judges.

Interestingly, for cattle judges, there is little difference in CWS between breeding and slaughter judgments. For swine judges, however, there is a considerable difference between breeding and slaughter judgments, especially for the breeding cue. Thus, it appears that swine judges are more sensitive to changes in the task. In all, CWS provides a revealing picture of the difference between these two highly skilled types of experts. This study also highlights the role that specific tasks play in expertise.
4.3. Personnel hiring

Nagy (1981) used summary descriptions of job candidates for the position of computer programmer at a large company in the state of Washington. She asked four professional personnel selectors (experts) and 20 management students (novices) to evaluate these candidates. Each candidate was described by legally relevant attributes (recommendations from prior employers and amount of job-relevant experience) and legally irrelevant attributes (age, gender, and physical attractiveness). Filler information from local phone books was used to supply background information, such as phone number and home address, on the application summaries.

Each participant evaluated 32 applicants (generated from a 2 x 2 x 2 x 2 x 2 factorial design) twice. Before the evaluations, participants were reminded about the legal requirements for hiring, i.e., what information should and should not be used. The importance of the five attributes was determined for each participant on a 0-100 normalized scale; average CWS values are reported for each group.
Table 4
Average CWS values for swine judgments for two types of livestock experts^a

                      Size    Breeding   Meat
Breeding judgments
  Swine experts       15.9    53.8       65.6
  Cattle experts      <1.0    3.4        79.2
Slaughter judgments
  Swine experts       <1.0    3.2        212.7
  Cattle experts      <1.0    7.5        98.0

^a Results based on a reanalysis of Phelps (1977).

Table 5
Average CWS values for two groups of personnel selectors^a

Relevant attributes     Recommendations   Experience
  Professionals         88.25             86.17
  Students              88.81             86.88

Irrelevant attributes   Age      Attractiveness   Gender
  Professionals         0.99     1.58             0.00
  Students              28.12    25.19            13.32

^a Results based on a reanalysis of Nagy (1981).

As can be seen for the relevant attributes (upper panel of Table 5), average CWS values are nearly identical for the two groups. This is not surprising given that participants were told immediately before the study about hiring guidelines. In contrast, CWS values for the irrelevant attributes (lower panel) reveal a different pattern. For professionals, CWS approaches zero (as it should). For students, CWS values are considerably larger. Despite being reminded that age, gender, and attractiveness are not legally allowed, the business students had sizable CWS values for these irrelevant attributes.

Certainly, it is not easy to ignore something as obvious as age or gender, although that is what the legal guidelines require. Experts, however, apparently have developed strategies to do just that. Thus, there are tasks where CWS values for irrelevant attributes may be more diagnostic of expertise than relevant attributes.
5. Caveats

Five caveats and precautions deserve mention. First, the application of CWS to these three prior studies is encouraging, as far as it goes. However, more evidence is needed before CWS can be used by itself to identify experts. For now, it is clear that CWS can serve as a useful supplement to other approaches, e.g., social acclamation.

Second, the stimuli used in these studies were abstractions of real-world problems. Specifically, cases were presented in static (non-changing) environments, with no feedback and no dynamic or temporal changes. We are now applying CWS in complex, real-time environments.
Third, CWS was applied here to individuals whose results were combined to produce group averages. However, most experts work in teams. If a team is treated as a decision-making unit, then it is possible to apply CWS in the same way as with individuals. Preliminary efforts to apply CWS to team decision making have been encouraging.
Fourth, CWS assumes that there are real differences in the stimuli to be judged. If the stimuli are not different, then there is nothing to discriminate. If multiple patients have the same disease, for instance, then there will be no differential diagnoses. Therefore, there must be a range of stimuli before CWS can be used to identify experts.
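This caveat follows directly from the arithmetic: assuming CWS rewards between-stimulus variance, identical stimuli zero out the numerator of the ratio. A small illustration with hypothetical ratings of indistinguishable patients:

```python
import statistics

# Hypothetical case set: three patients who present identically, each rated
# twice. The variance of the case means -- the discrimination term of the
# CWS ratio -- is zero, so there is nothing for the index to reward.
identical_cases = {
    "patient1": [6.0, 6.0],
    "patient2": [6.0, 6.0],
    "patient3": [6.0, 6.0],
}
case_means = [statistics.mean(r) for r in identical_cases.values()]
print(statistics.variance(case_means))  # 0.0: no discrimination is possible
```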
Finally, it is possible for CWS to yield high values for non-experts who use a consistent, but incorrect rule. Suppose all job candidates with short names (e.g., Ann) get high recommendations while all job candidates with long names (e.g., Georgette) get low recommendations. Because of high consistency, such an inappropriate rule would produce high CWS values. One way around this "catch" is to ask judges to evaluate the same cases in different contexts, e.g., recommendations for a different job. If the judgments are the same as before, then the participant is not likely to be an expert, despite having a high CWS value.
6. Conclusions
The present application of CWS leads to five conclusions: First, in the analyses above, CWS proved superior to any previously proposed approach for identifying experts. If CWS continues to be successful, it may provide an answer to the long-standing question of how to identify expertise in the absence of external criteria.
Second, the success of CWS across different domains is noteworthy. In addition to auditing, livestock judging, and personnel selection, we have applied CWS to wine judging, medical decision making, soil judging, microworld simulations, sensory food evaluations, and air traffic control. Thus far, CWS has worked well in every domain.
Third, in addition to identifying experts, CWS has provided new insights into the interpretation of previous research. In the Phelps study of livestock judges, for example, CWS clarified a long-standing question about how to distinguish among experts from closely related specialty areas.
Fourth, by focusing on discrimination and consistency, CWS may have important implications for selection and training of novices to become experts. It is unclear, for example, whether discrimination and consistency can be learned, or whether novices should be preselected for these skills. Either way, CWS offers new perspectives on what it means to be an expert.
Finally, we are now applying CWS to data sets
where there is no prior information about the
relevance of attributes. The question is whether CWS can identify experts in the absence of any knowledge of what is relevant and what is irrelevant. In preliminary analyses, the differences do not appear to be as large as those shown in the present tables. However, CWS does consistently separate experts from non-experts. In all, the future for CWS looks hopeful.
Acknowledgements
Preparation of this manuscript was supported, in part, by grant 96-12126 from the National Science Foundation and by grant 98-G-026 from the Federal Aviation Administration in the Department of Transportation (in the USA).
References
Abdolmohammadi, M.J., Shanteau, J., 1992. Personal characteristics of expert auditors. Organizational Behavior and Human Decision Processes 58, 158–172.
Ashton, A.H., 1985. Does consensus imply accuracy in accounting studies of decision making? Accounting Review 60, 173–185.
Chase, W.G., Ericsson, K.A., 1981. Skilled memory. In: Anderson, J.R. (Ed.), Cognitive Skills and Their Acquisition. Erlbaum Associates, Hillsdale, NJ, pp. 141–189.
Chi, M.T.H., 1978. Knowledge structures and memory development. In: Siegler, R.S. (Ed.), Children's Thinking: What Develops? Erlbaum Associates, Hillsdale, NJ, pp. 73–96.
Cochran, W.G., 1943. The comparison of different scales of measurement for experimental results. Annals of Mathematical Statistics 14, 205–216.
Einhorn, H.J., 1972. Expert measurement and mechanical combination. Organizational Behavior and Human Performance 7, 86–106.
Einhorn, H.J., 1974. Expert judgment: Some necessary conditions and an example. Journal of Applied Psychology 59, 562–571.
Ettenson, R., 1984. A schematic approach to the examination of the search for and use of information in expert decision making. Unpublished Doctoral Dissertation, Kansas State University, Manhattan, KS.
Gigerenzer, G., Todd, P., the ABC Research Group, 1999. Simple Heuristics that Make Us Smart. Oxford University Press, London.
Goldberg, L.R., 1968. Simple models or simple processes? Some research on clinical judgments. American Psychologist 23, 482–496.
Goldberg, L.R., Werts, C.E., 1966. The reliability of clinicians' judgments: A multitrait–multimethod approach. Journal of Consulting Psychology 30, 199–206.
Hammond, K.R., 1996. Human Judgment and Social Policy. Oxford University Press, New York.
Janis, I.L., 1972. Victims of Groupthink. Houghton-Mifflin, Boston.
Kida, T., 1980. An investigation into auditor's continuity and related qualification judgments. Journal of Accounting Research 22, 145–152.
Lykken, D.T., 1979. The detection of deception. Psychological Bulletin 86, 47–53.
Nagy, G.F., 1981. How are personnel selection decisions made? An analysis of decision strategies in a simulated personnel selection. Unpublished Doctoral Dissertation, Kansas State University, Manhattan, KS.
Phelps, R.H., 1977. Expert livestock judgment: A descriptive analysis of the development of expertise. Unpublished Doctoral Dissertation, Kansas State University, Manhattan, KS.
Phelps, R.H., Shanteau, J., 1978. Livestock judges: How much information can an expert use? Organizational Behavior and Human Performance 21, 209–219.
Raskin, D.C., Podlesny, J.A., 1979. Truth and deception: A reply to Lykken. Psychological Bulletin 86, 54–59.
Schumann, D.E.W., Bradley, R.A., 1959. The comparison of the sensitivities of similar experiments: Model II of the analysis of variance. Biometrics 15, 405–416.
Shanteau, J., 1989. Psychological characteristics and strategies of expert decision makers. In: Rohrmann, B., Beach, L.R., Vlek, C., Watson, S.R. (Eds.), Advances in Decision Research. North-Holland, Amsterdam, pp. 203–215.
Shanteau, J., 1995. Expert judgment and financial decision making. In: Green, B. (Ed.), Risky Business. University of Stockholm School of Business, Stockholm, pp. 16–32.
Shanteau, J., 1999. Decision making by experts: The GNAHM effect. In: Shanteau, J., Mellers, B.A., Schum, D.A. (Eds.), Decision Science and Technology: Reflections on the Contributions of Ward Edwards. Kluwer Academic Publishers, Boston, pp. 105–130.
Shanteau, J., in press. What does it mean when experts disagree? In: Salas, E., Klein, G. (Eds.), Linking Expertise and Naturalistic Decision Making. Erlbaum Associates, Hillsdale, NJ.
Slovic, P., 1969. Analyzing the expert judge: A descriptive study of a stockbroker's decision processes. Journal of Applied Psychology 53, 255–263.
Stewart, T.R., Roebber, P.J., Bosart, L.F., 1997. The importance of the task in analyzing expert judgment. Organizational Behavior and Human Decision Processes 69, 205–219.
Trumbo, D., Adams, C., Milner, M., Schipper, L., 1962. Reliability and accuracy in the inspection of hard red winter wheat. Cereal Science Today 7.
Weiss, D.J., 1985. SCHUBRAD: The comparison of the sensitivities of similar experiments. Behavior Research Methods, Instruments, & Computers 17, 572.
Weiss, D.J., Shanteau, J., in press. The vice of consensus and the virtue of consistency. In: Shanteau, J., Johnson, P., Smith, C. (Eds.), Psychological Explorations of Competent Decision Making. Cambridge University Press, New York.
Weiss, D.J., Shanteau, J., submitted. Empirical assessment of expertise.