Article
Assessing Interrater Agreement via the Average Deviation Index Given a Variety of Theoretical and Methodological Problems

Kristin Smith-Crowe¹, Michael J. Burke², Maryam Kouchaki¹, and Sloane M. Signal³

¹ University of Utah, David Eccles School of Business, Salt Lake City, UT, USA
² Tulane University, Freeman School of Business, New Orleans, LA, USA
³ College of Education and Human Development, Jackson State University, Jackson, MS, USA

Corresponding Author: Kristin Smith-Crowe, University of Utah, David Eccles School of Business, 1655 East Campus Center Drive, Salt Lake City, UT 84112, USA. Email: kristin.smith-crowe@business.utah.edu

Organizational Research Methods 16(1) 127-151, © The Author(s) 2012. DOI: 10.1177/1094428112465898
Abstract
Currently, guidelines do not exist for applying interrater agreement indices to the vast majority of
methodological and theoretical problems that organizational and applied psychology researchers
encounter. For a variety of methodological problems, we present critical values for interpreting the
practical significance of observed average deviation (AD) values relative to either single items or
scales. For a variety of theoretical problems, we present null ranges for AD values, relative to either
single items or scales, to be used for determining whether an observed distribution of responses
within a group is consistent with a theoretically specified distribution of responses. Our discussion
focuses on important ways to extend the usage of interrater agreement indices beyond problems
relating to the aggregation of individual-level data.
Keywords
average deviation (AD), interrater agreement, multilevel research, aggregation, null distribution
Assessments of interrater agreement, or the degree to which raters are interchangeable (Kozlowski & Hattrup, 1992),¹ are integral to many types of organizational and applied psychology research. For
instance, interrater agreement assessments have recently been central with respect to addressing
substantive questions within domains such as organizational climate and leadership (e.g., Dawson,
Gonzalez-Roma, Davis, & West, 2008; Walumbwa & Schaubroeck, 2009), conducting quantitative
and qualitative research, as well as laboratory and field studies (e.g., Katz-Navon, Naveh, & Stern,
2009; Kreiner, Hollensbe, & Sheep, 2009; Van Kleef et al., 2009), developing measures (e.g.,
Bledow & Frese, 2009; Lawrence, Lenk, & Quinn, 2009), dealing with various types of data analysis
problems (e.g., Grant & Mayer, 2009; Nicklin & Roch, 2009; Trougakos, Beal, Green, & Weiss,
2008), and deciding whether or not to aggregate data (e.g., Borucki & Burke, 1999; Takeuchi, Chen,
& Lepak, 2009). Further, usage of interrater agreement statistics is on the rise. In the Journal of
Applied Psychology and Personnel Psychology alone, there has been a largely linear increase in the
use of these statistics over the past decade (see Figure 1). Notably, in 2010 almost half of the articles
published in these journals used interrater agreement statistics.
Despite the relevance of interrater agreement assessments for dealing with a broad array of
theoretical and methodological issues and their widespread usage, systematically derived guidelines
for applying interrater agreement indices to the vast majority of problems that researchers and
practitioners encounter do not exist. The primary objective of this article is to derive practical guide-
lines to assist researchers using the average deviation (AD) index in making more informed
decisions about interrater agreement problems. We focus on the AD index, the average deviation
from the mean or median of ratings, for two primary reasons. First, AD is straightforward. It
measures agreement, while intraclass correlations (ICC) measure both agreement and reliability
simultaneously (LeBreton & Senter, 2008), potentially complicating inferences. Further, for both
ICC and r_WG, researchers must choose from among numerous variations to employ the statistic (see LeBreton & Senter, 2008, for a review). Second, AD performs well. In a simulation study, Roberson, Sturman, and Simons (2007) found that the AD index performs as well as other, similar statistics.
Kline and Hambley (2007) reported similar findings.
Importantly, we are concerned with practical significance, or "whether an index indicates that interrater agreement is sufficiently strong or disagreement is sufficiently weak so that one can trust that the average opinion of a group is interpretable or representative"² (Dunlap, Burke, & Smith-Crowe, 2003, p. 356), as practical significance is the basis on which agreement is typically evaluated.
Figure 1. Percentage of articles published in Personnel Psychology and the Journal of Applied Psychology that used interrater agreement statistics, including r_WG, average deviation (AD), intraclass correlation (ICC), percentage agreement, and Cohen's kappa. (Values by year, 2000-2010: 23%, 17%, 20%, 33%, 22%, 27%, 29%, 33%, 36%, 43%, 47%.)
We present critical values for addressing the frequently asked methodological question concerning practical significance, "How much agreement/dispersion is there?"
These critical values can be used to assess agreement on a single item or a scale. This question
concerns the level of agreement in a set of ratings. An answer to this question often informs
decisions about the quality of a measure of central tendency, such as a group’s mean, as an
indicator of the group’s standing on a phenomenon or construct of interest. While previous
work has also addressed this question, as we will discuss in what follows, the guidelines pro-
vided are of very limited use.
In particular, we go beyond the work of Burke and Dunlap (2002), who previously provided a
decision rule for interpreting the practical significance of observed AD values, to provide decision
rules that cover many more circumstances. As we detail in what follows, though the calculation of
AD does not require the specification of a null distribution representing no agreement, the inter-
pretation of AD does. In other words, while one can calculate AD in the absence of a specified null
distribution, one cannot draw conclusions regarding observed AD values without comparing them
to some notion of "no agreement." Burke and Dunlap's guideline is based exclusively on the uni-
form distribution as the null distribution; there are no guidelines for interpreting the practical
significance of AD relative to any other null distributions. In what follows, we discuss the criti-
cisms of researchers’ overreliance on the uniform distribution despite other distributions often
being more appropriate. Herein, we provide guidelines for interpreting AD in terms of the level
of agreement relative to numerous other distributions. Our guidelines will allow researchers to
interpret interrater agreement relative to null distributions more appropriate to their research than
the uniform distribution.
Furthermore, we present guidelines for addressing the less commonly posed yet theoretically
important question of "How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?" These guidelines can be used in relation
to either agreement on a single item or a scale. An answer to this question informs decisions
regarding the scoring of the group as consistent or not with the theoretically specified distribution
and, thus, the use of such scores in subsequent analyses at the group level of analysis. Addressing
questions related to the pattern of dispersion will be of increasing importance as researchers
attempt to test new theories concerning group and other higher level phenomena that specify pat-
terns of dispersion as variables (e.g., see DeRue, Hollenbeck, Ilgen, & Feltz, 2010; Harrison &
Klein, 2007). By focusing on the pattern in addition to level of agreement/dispersion, our work
promotes conceptual advances in research and goes beyond previous work on interrater agreement
(e.g., Burke & Dunlap, 2002).
For the purpose of demonstrating how our guidelines would be used to address problems relating
to the pattern of dispersion, we will focus on notions of diversity and team efficacy dispersion, as
theories relating to these phenomena have recently been presented. For the purpose of demonstrating
how our guidelines would be applied to the assessment of the level of agreement, we focus our
discussion on the common use of interrater agreement indices for data aggregation decisions. The
guidelines we present, however, would apply to the study of a broad array of interrater agreement
problems.
To unfold our discussion, we begin with a brief summary of research on multilevel modeling and
data aggregation to set the stage for a discussion related to assessments of the level of agreement.
This discussion also includes an overview of the relevance of interrater agreement assessments for
determining whether or not the observed pattern of dispersion matches a theoretically specified pat-
tern of dispersion. Then, we present interpretive standards for assessments of interrater agreement
for both the level of agreement and pattern of dispersion, with detailed discussions of how the
derived guidelines can be applied to a variety of research problems.
Issues Related to the Level of Agreement and Pattern of Dispersion
In this section we discuss the use of interrater agreement in multilevel research to justify the
aggregation of lower level data to higher levels of analysis based on the level of observed
agreement. Then, we discuss a second possible usage of interrater agreement statistics, which is to assess the goodness of fit between an observed pattern of dispersion and a theoretically specified pattern of dispersion. We give examples of recent multilevel theories that predict outcomes based on patterns of dispersion. Related to both level of agreement and pattern of dispersion, we discuss the limited guidelines available to researchers for interpreting agreement.
Level of Agreement
Multilevel research commonly entails researchers aggregating data so as to create measures or indi-
cators of higher level constructs. The appropriateness of representing higher level constructs by
aggregating individual-level data is established by a composition model, which represents theory
on how multilevel constructs are related at each level of analysis (Chan, 1998; Kozlowski & Klein,
2000; see also Klein, Dansereau, & Hall, 1994; Rousseau, 1985). For instance, Chan’s (1998, p. 236)
direct consensus model is the idea that the "meaning of [the] higher level construct is in the consensus among lower levels"; the referent-shift consensus model is the idea that the "lower level units being composed by consensus are conceptually distinct though derived from the original individual-level units"; and the dispersion model is the idea that the "meaning of [the] higher level construct is in the dispersion or variance among lower level units." Importantly, composition arguments indicate
the type of evidence needed to justify the aggregation of individual-level data, with several models,
including the direct consensus and referent-shift models (Chan, 1998), specifying interrater agree-
ment, or the interchangeability of raters, as the appropriate type of evidence. Interrater agreement is
also important for dispersion models (Chan, 1998); in this case, the degree of agreement itself rep-
resents the higher level construct.
Essentially, interrater agreement via the average deviation index is established by demonstrating
that observed agreement is sufficiently greater than no agreement. Thus, though it is not necessary to
the calculation of AD, in order to assess, or interpret, observed AD values, researchers must identify
an appropriate random response distribution, or null distribution, to which observed variability in
responses can be compared. A number of scholars have cited the choice of a null distribution as key
to interpreting indices of interrater agreement, and thus drawing appropriate inferences from data
(e.g., Brown & Hauenstein, 2005; A. Cohen, Doveh, & Nahum-Shani, 2009; James, Demaree, &
Wolf, 1984; LeBreton & Senter, 2008; Lindell & Brandt, 1997; Lüdtke & Robitzsch, 2009; Meyer,
Mumford, & Campion, 2010). In practice, however, researchers routinely rely on the uniform dis-
tribution as the null distribution, though doing so is likely often inappropriate (e.g., Brown & Hauen-
stein, 2005; Meyer et al., 2010). In fact, LeBreton and Senter (2008) recently called for a moratorium
on the unconditional reliance on the uniform distribution.
The consequences of inappropriately comparing observed data to the uniform null distribution
can be (a) that researchers mistakenly do not read interrater agreement as being sufficient for aggre-
gation to higher levels of analysis, (b) that researchers mistakenly read interrater agreement as being
sufficient for aggregation to higher levels of analysis (e.g., see Meyer et al., 2010), or (c) that
researchers fail to appropriately interpret a group’s standing on a variable of interest. Thus, compar-
ing observed data to an inappropriate null distribution can lead to erroneous inferences that have
important implications for researchers. Nonetheless, the only decision rule for interpreting the
practical significance of observed AD values is Burke and Dunlap’s (2002) decision rule, which
is based on the uniform distribution as the null distribution. Currently, there are no guidelines for
interpreting practical significance relative to any other distributions.
While assessments of within-group agreement for methodological purposes, such as data aggre-
gation as discussed previously, address the question, "How much agreement/dispersion is there?" another question researchers can answer using interrater agreement indices is "How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?" In the following we discuss the issue of the pattern of dispersion and the theoretical
distributions to which observed patterns can be compared.
Pattern of Dispersion
Harrison and Klein (2007) recently argued for the theoretical import of considering the pattern of
dispersion. They distinguished among separation diversity (e.g., differences in opinions, beliefs,
or attitudes), variety diversity (e.g., differences in knowledge or experience), and disparity diversity
(e.g., differences in proportionate ownership or control over socially valued assets). They argued
that depending on the type of diversity, minimum, moderate, and maximum diversity would be asso-
ciated with differently shaped distributions; that is, both the type and degree of diversity determine
the shapes of distributions. For instance, maximum separation diversity is characterized by a
bimodal distribution, maximum variety diversity is characterized by a uniform distribution, and
maximum disparity diversity is characterized by a skewed distribution. For separation diversity,
minimum, moderate, and maximum degrees of diversity are characterized as unimodal, uniform, and
bimodal, respectively. Considering both type of diversity and pattern of dispersion, they argued that
maximum separation diversity (bimodal distribution) and maximum disparity diversity (skewed
distribution) will have negative outcomes, such as reduced cohesion and group member input,
respectively, while maximum variety diversity (uniform distribution) will have positive outcomes,
such as increased creativity.
Importantly, according to their theory, both the type of diversity and the pattern of dispersion
must be known in order to effectively predict outcomes. For example, separation diversity could
be measured with regard to team members’ opinions about what their teams’ goals are (Harrison
& Klein, 2007). For each team, the pattern of the distribution of these opinions would be compared
to unimodal, uniform, and bimodal distributions as these are the distributions theoretically specified
by Harrison and Klein (2007) as representing minimum, moderate, and maximum separation diver-
sity. The degree of separation diversity, then, would be indicated by the theoretical distribution that
is most similar to the observed distribution. With this measure of degree of separation diversity for
each team, in addition to measures of cohesion, conflict, trust, and performance, researchers could
test Harrison and Klein’s hypothesis that as the degree of separation diversity increases, team out-
comes will be more negative: less cohesion and trust, more conflict, and lower performance.
DeRue et al.’s (2010) work on team efficacy provides another example of the potential theoretical
importance of the pattern of dispersion above and beyond the level of dispersion. They argued that
teams could have the same level of dispersion in their team efficacy ratings, but have different
theoretically meaningful patterns of dispersion. These different patterns of dispersion, they argued,
would predict different outcomes. Thus, according to DeRue et al.’s theory of team efficacy disper-
sion, assessing the pattern of dispersion in team efficacy ratings is essential for making predictions
about team effectiveness. For instance, they argued that while a bimodal distribution of team
efficacy ratings would lead to both positive and negative outcomes, a uniform distribution would
lead to positive outcomes. Regarding the effects of a uniform distribution, their argument was that team members' disagreement will lead them to share their differing views, thus enhancing team structuring, planning, and learning, while simultaneously allowing the team to avoid both extreme magnitudes of efficacy, which can lead either to overconfidence or helplessness, and social factions,
which create dysfunctional conflict. In contrast, they argued that a bimodal distribution will similarly
lead to team members sharing their differing views and thus enhancing team processes, but due to
the existence of social factions, will also lead to dysfunctional conflict.
While the question of the level of dispersion has been important for various reasons, especially
justifying aggregating individual-level data to form higher level variables, it is likely that the ques-
tion of the pattern of dispersion will become increasingly important as more researchers consider the
theoretical import of response distributions in and of themselves. This forecast is consistent with a
recent call from Edwards and Berry (2010) to increase the theoretical precision in management
research by developing hypotheses that specify effects in terms of magnitude, form (linear,
nonlinear, etc.), and conditions (i.e., moderators). In reviewing 25 years (1985-2009) of articles pub-
lished in the Academy of Management Review, Edwards and Berry (2010) found that 10.4% of the propositions stated only that a relationship would exist, and 89.6% only indicated the direction of
the relationship. The theories presented by DeRue et al. (2010) and Harrison and Klein (2007) are
important steps toward more precise management theories because they consider the shapes of
distributions rather than simply measures of central tendency.
In cases for which the pattern of dispersion is of interest, it will be necessary to specify a "null response range," analogous to a null range with regard to a formal test of the null hypothesis (see Greenwald, 1975), to determine whether the observed pattern of responses, or the relative percentages of individuals within the respective categories, is consistent with the theoretical distribution.
To date, though researchers have suggested that observed patterns of dispersion can be quantitatively
assessed (DeRue et al., 2010; Harrison & Klein, 2007), no one has developed practical guidelines for
drawing inferences about the goodness of fit between an observed distribution and a theoretically spec-
ified distribution. As such, practical guidelines are needed for addressing both the methodological ques-
tion of the level of agreement/dispersion and the theoretical question of the pattern of responses.
Summary
In order to address this dearth of guidelines, we specify a variety of response distributions that
researchers could use to address a number of theoretical and methodological issues, and we
derive decision rules for the AD index relevant to each of these distributions to aid researchers
in making inferences about interrater agreement. We explain why and how the critical values
presented must be used differently to answer different research questions. Our intention is to help researchers interpret interrater agreement under the specified conditions; importantly, the results will help researchers make more appropriate decisions, including those regarding the aggregation of data, and draw more appropriate inferences regarding the interpretation of group phenomena.
interpretive standards for the AD index.
The AD Index of Interrater Agreement
Burke, Finkelstein, and Dusig (1999) introduced the average deviation as an index of interrater agreement, which represents the average absolute deviation in ratings from the mean rating of an item (AD_M),³ and as such is interpretable in the metric of the original scale. AD_M for an item is calculated as follows:

AD_{M(j)} = \frac{\sum_{k=1}^{N} |x_{jk} - \bar{x}_j|}{N},    (1)

where N is the number of judges, or observations, of item j, x_{jk} is equal to the kth judge's rating of item j, and \bar{x}_j is equal to the mean rating of item j (Burke et al., 1999). The scale AD_{M(J)} is the mean of AD_{M(j)} for essentially parallel items. Because the AD index is a measure of dispersion, lower values indicate greater agreement.
As noted previously, Burke and Dunlap (2002) derived a decision rule for inferring the practical
significance of observed AD values. This decision rule has two critical limitations. First, it only
addresses assessments of the level of agreement, not the pattern of distributions, which may be the-
oretically important. With the advance of theories such as DeRue et al.’s (2010) theory of team
efficacy dispersion and Harrison and Klein’s (2007) theoretical classification of types of diversity,
multilevel researchers will need to consider agreement/dispersion as a theoretically meaningful
issue. As such, guidelines addressing interpretations of the shapes of distributions are needed.
Second, this decision rule applies only when the uniform distribution is the appropriate null distri-
bution. As discussed previously, though the uniform distribution is widely applied, it is thought to be
quite often inappropriately applied. There is a mounting push from the scholarly community to jus-
tify the choice of a particular null distribution, rather than using the uniform distribution uncondi-
tionally, yet too few guidelines exist for researchers who do opt to use alternative null distributions.
In what follows, we identify the null and theoretical distributions used in our article. Then we
explain how we derived critical values for evaluating the practical significance of interrater agree-
ment in relation to these null distributions. These critical values can be used to assess the level of
interrater agreement in regard to data aggregation, which is a within-group assessment, as well as
a host of other problems relating to interrater agreement. Further, based on these critical values,
we calculated null ranges to be used in relation to studying theoretical problems; that is, assessing
the fit between an observed pattern of dispersion and a theoretical distribution.
Interpretive Standards for the AD Index
Here we present our derivations and resulting critical values and null ranges for the AD index given a
number of different response distributions. First, the distributions are described in brief. Then, we
explain our derivations of interpretive standards for the AD index. Finally, detailed discussions of
problems that relate to these theoretical and methodological issues are presented.
Distributions
The distributions and their methodological and theoretical bases are listed in Table 1. The propor-
tions endorsing each value for 5-point and 7-point scales are listed for each distribution in Tables 2
and 3. Graphical depictions of these distributions are presented in Figures 2 through 5.
Table 1. Example Theoretical and Methodological Bases for Different Response Distributions.

Distribution              Theoretical Basis                                Methodological Basis
Skew                      Social interaction; work interdependence;       Social desirability; leniency
                          shared schemas; maximum disparity diversity
Bimodal                   Equal subgroups; maximum separation             Factions
                          diversity
Subgroup                  Minority belief, or unequal subgroups           Response formats and unintended
                                                                          question interpretation
Triangular/Bell-Shaped                                                    Central tendency
Uniform                   Fragmentation; maximum variety diversity        Absence of bias; conceptual
                                                                          ambiguity
First, we developed critical values for three basic forms of skewed distributions: slight skew, mod-
erate skew, and heavy skew (see Table 2 and Figure 2). Second, while one could model many forms of
bimodal distributions, here we simplify our presentation by suggesting two: "moderate" bimodal and "extreme" bimodal. As shown in Table 3 and Figure 3, the size of the subgroups in both cases is 50% of the raters; the difference between the two distributions is that in the moderate bimodal distribution, the subgroups are less divergent than they are in the extreme bimodal distribution. Third, though there are numerous possible ways in which one could model subgroup distributions, we have simplified our presentation by considering four possibilities based on two dimensions: the size of the subgroup and the distance between the subgroup ratings and the majority of ratings. These distributions are shown in Table 3, and they are graphically depicted for a 5-point scale in Figure 4. We define a smaller subgroup as 10% of the raters (labeled as "A" in Table 3 and Figure 4) and a more moderately sized subgroup to be 20% of the raters (labeled as "B" in Table 3 and Figure 4). We define extreme distance as the subgroup responses and the majority of responses being on opposite ends of the Likert-type scale and moderate distance as the subgroup responses being at the midpoint of the scale, while the majority of responses are at one extreme of the scale (these are labeled "extreme" and "moderate" in Table
3 and Figure 4). Finally, we present triangular-shaped, bell-shaped, and uniform distributions in Table
3; they are graphically represented in Figure 5. The triangular-shaped distributions are based on a for-
mula presented by Messick (1982) and the bell-shaped distributions are based on LeBreton and Senter
(2008). Note that the upper limits for the uniform distribution (presented in both Tables 2 and 3) are
consistent with Burke and Dunlap’s (2002) c/6 decision rule for assessing the practical significance of
AD, where c is equal to the number of response categories.
Critical Values for Level of Agreement
In order to simplify our derivations, we begin with the basic case of agreement across judges on a single item with respect to two categories.⁴ In the case of a dichotomy, AD can be calculated based on the proportion of judges falling into one of the two categories (Burke & Dunlap, 2002).⁵ Based on an upper limit for AD of .35, where .35 or lower represents meaningful agreement, Burke and Dunlap (2002) demonstrated that meaningful agreement could be defined as 77% of the judges endorsing one category.
Table 2. Critical Values and Null Ranges for AD_M Given Distributions Defined by Skew.

                  Proportion Endorsing Each Value          Critical Values               Null Range
Distribution      1    2    3    4    5    6    7      s^2    AD_M   s/AD_M   AD_M^UL    -      +

5-Point Scale
Slight Skew      .05  .15  .20  .35  .25               1.34   0.98   1.18     0.69      0.84   1.12
Moderate Skew    .00  .10  .15  .40  .35               0.90   0.70   1.36     0.49      0.60   0.80
Heavy Skew       .00  .00  .10  .40  .50               0.44   0.60   1.11     0.42      0.51   0.69
Uniform          .20  .20  .20  .20  .20               2.00   1.20   1.18     0.85      1.02   1.38

7-Point Scale
Slight Skew      .05  .08  .12  .15  .20  .25  .15     2.92   1.44   1.19     1.02      1.23   1.65
Moderate Skew    .00  .06  .10  .14  .28  .22  .20     2.09   1.16   1.25     0.82      0.99   1.33
Heavy Skew       .00  .00  .05  .10  .15  .30  .40     1.39   0.94   1.25     0.66      0.80   1.08
Uniform^a        .14  .14  .14  .14  .14  .14  .14     4.00   1.71   1.17     1.21      1.46   1.97

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
^a The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
Table 3. Critical Values and Null Ranges for AD_M Given Distributions Defined by Kurtosis and Variance.

                           Proportion Endorsing Each Value          Critical Values               Null Range
Distribution               1    2    3    4    5    6    7      s^2    AD_M   s/AD_M   AD_M^UL    -      +

5-Point Scale
Moderate Bimodal^a        .00  .50  .00  .50  .00               1.00   1.00   1.00     0.71      0.85   1.15
Extreme Bimodal^a         .50  .00  .00  .00  .50               4.00   2.00   1.00     1.41      1.71   2.29
Moderate Subgroup A^a,b   .00  .00  .10  .00  .90               0.36   0.36   1.67     0.25      0.31   0.41
Extreme Subgroup A^a,b    .10  .00  .00  .00  .90               1.44   0.72   1.67     0.51      0.61   0.83
Moderate Subgroup B^a,b   .00  .00  .20  .00  .80               0.64   0.64   1.25     0.45      0.55   0.73
Extreme Subgroup B^a,b    .20  .00  .00  .00  .80               2.56   1.28   1.25     0.91      1.09   1.47
Triangular-Shaped         .11  .22  .34  .22  .11               1.32   0.88   1.31     0.62      0.75   1.01
Bell-Shaped               .07  .24  .38  .24  .07               1.04   0.76   1.34     0.54      0.65   0.87
Uniform                   .20  .20  .20  .20  .20               2.00   1.20   1.18     0.85      1.02   1.38

7-Point Scale
Moderate Bimodal^a        .00  .50  .00  .00  .00  .50  .00     4.00   2.00   1.00     1.41      1.71   2.29
Extreme Bimodal^a         .50  .00  .00  .00  .00  .00  .50     9.00   3.00   1.00     2.12      2.56   3.44
Moderate Subgroup A^a,b   .00  .00  .00  .10  .00  .00  .90     0.81   0.54   1.67     0.38      0.46   0.62
Extreme Subgroup A^a,b    .10  .00  .00  .00  .00  .00  .90     3.24   1.08   1.67     0.76      0.92   1.24
Moderate Subgroup B^a,b   .00  .00  .00  .20  .00  .00  .80     1.44   0.96   1.25     0.68      0.82   1.10
Extreme Subgroup B^a,b    .20  .00  .00  .00  .00  .00  .80     5.76   1.92   1.25     1.36      1.64   2.20
Triangular-Shaped         .06  .13  .19  .24  .19  .13  .06     2.50   1.26   1.25     0.89      1.08   1.44
Bell-Shaped               .02  .08  .20  .40  .20  .08  .02     1.40   0.84   1.41     0.59      0.72   0.96
Uniform^c                 .14  .14  .14  .14  .14  .14  .14     4.00   1.71   1.17     1.21      1.46   1.97

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
^a "Moderate" and "extreme" refer to the distance between subgroups.
^b "A" and "B" refer to the differential proportion of subgroup responses.
^c The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
Based on a more stringent upper limit of AD, .33, they indicated that meaningful agreement could be defined as 79% agreement.⁶ They noted that this notion of 77% to 79% agreement being meaningful is consistent with many practical examples and problems relating to proportional agreement, such as 60% to 80% agreement being required for including critical incidents when creating behaviorally anchored rating scales (BARS; Cascio, 1998). Based on Burke and Dunlap's calculations for the AD index, as well as conventional interpretations of meaningful agreement in percentage or proportional terms, we adopted a starting value of 80% agreement.
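For a two-category item coded 0/1, Equation 1 reduces algebraically to 2p(1 - p), where p is the proportion of judges endorsing one category. A minimal sketch of that arithmetic (ours, not Burke and Dunlap's code) reproduces the cut-offs above:

```python
def ad_dichotomous(p):
    """AD for a dichotomous (0/1) item: Equation 1 reduces to 2p(1 - p),
    where p is the proportion of judges endorsing one category."""
    return 2 * p * (1 - p)

print(round(ad_dichotomous(0.77), 2))  # 0.35 -> the .35 upper limit
print(round(ad_dichotomous(0.79), 2))  # 0.33 -> the stricter .33 upper limit
```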
We note that assumptions are necessary to the derivation process. By making ours explicit,
readers can readily revise the starting value as needed; yet, we suggest that a starting value of
80% agreement, or 20% disagreement, relative to the AD index is likely to suit most readers' situations. Notably, the value of 20% disagreement is comparable to the upper limits of acceptable disagreement for the AD index for scales that range from 3 to 99 response options (see Burke & Dunlap,
2002); that is, they are comparable in the sense of being approximately equal to the maximum level
of allowable disagreement. In other words, our use of the dichotomous case here does not limit the
applicability of our derivations to dichotomies.
Given our intent of proposing interrater agreement cut-offs and null ranges for response distribu-
tions for Likert-type scales with markedly different dispersion, which have different numbers of
response options and reflect a variety of response patterns including non-normal distributions, we next
convert a proportion of .80 (or 80%) to a standardized effect size (a correlation coefficient, r) to work
further with variances as indicators of dispersion. Since non-normal response distributions are
expected in many cases for theoretical reasons, we initially employ an arcsine transformation to con-
vert the proportion of .80 to a standardized effect size (i.e., a d-statistic; see Lipsey & Wilson, 2001)
and then use a maximum likelihood transformation of this d-value to obtain a correlation coefficient.
Figure 2. Slight, moderate, and heavy skew distributions for a 5-point scale.
Figure 3. Bimodal distributions for a 5-point scale.
Figure 4. Subgroup distributions for a 5-point scale.
The d-value is computed as the difference between the arcsine of the proportion representing
meaningful agreement (i.e., .80) and the arcsine of the proportion representing no agreement
(.00) using Lipsey and Wilson’s (2001) formula:
d = 2 \arcsin(\sqrt{p_1}) - 2 \arcsin(\sqrt{p_2}).    (2)

The resulting d-value is 2.214. Next, we transform the value of 2.214 to a correlation coefficient via the maximum likelihood formula (Hunter & Schmidt, 2004):⁷

r = \frac{d/2}{\left[1 + (d/2)^2\right]^{1/2}}.    (3)

Given that the proportions in the two groups are unequal (i.e., .80 and .20), the number "2" in Equation 3 is replaced by 1/(pq)^{1/2}, where p and q are the proportions in each group. The result is a
correlation of approximately .66. By rounding this value up to .7, our derivations continue at the
starting point for Burke and Dunlap’s (2002) derivations for practical cut-offs for the AD index for
the restricted case of the uniform distribution.
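This two-step conversion can be sketched as follows (our own illustration; the variable names are ours):

```python
import math

def arcsine_d(p1, p2):
    """Equation 2: d = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def d_to_r(d, p=0.5, q=0.5):
    """Equation 3, with the constant 2 replaced by 1/sqrt(p*q) when the
    group proportions p and q are unequal (1/sqrt(.5 * .5) = 2)."""
    a = 1.0 / math.sqrt(p * q)
    return (d / a) / math.sqrt(1 + (d / a) ** 2)

d = arcsine_d(0.80, 0.00)      # 2.214
r = d_to_r(d, p=0.80, q=0.20)  # approximately .66, rounded up to .7
```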
We note that arriving at approximately the same value for a correlation as Burke and Dunlap
(2002) does not indicate circularity in our derivations, but it does reflect our explicit assump-
tion that the underlying response distribution may meaningfully deviate from normality, thus
calling for the arcsine transformation of percentage agreement to produce a correlation. As
we discuss in the appendix, assuming that the underlying distribution of responses is normal would call for a probit transformation of the proportion to produce a correlation. The resulting value for the correlation would become approximately .8 (Lipsey & Wilson, 2001). Furthermore, Burke and Dunlap's starting point of .7 for a correlation was, in large part, based on empirical data relating to stability coefficients and correlations based on ratings of targets by alternate sources.
Figure 5. Triangular-shaped, bell-shaped, and uniform distributions for a 5-point scale.
Their judgment and ours that a correlation of .7 is a reasonably high correlation is consistent with J. Cohen (1977), who indicated that correlations greater than or equal to
.5 can be considered large.
Next, as is recognized in a number of quantitative fields (e.g., see Burke & Dunlap, 2002; Greene, 1997; Guion, 1998; McCall, 1970; Parsons, 1978), we define a correlation (r) in terms of variances as

r = \sqrt{1 - s_e^2 / s_T^2},    (4)

where s_e^2 is the error variance, here representing disagreement, and s_T^2 is the total variance. Given that the average deviation is a reasonable approximation to the standard deviation (we discuss the more specific relationship in what follows), and the square of the standard deviation is the variance, we can let s_e^2 equal AD². Consistent with James et al.'s (1984, 1993) work, we then set s_T^2 to be equal to the variance of the chance responding in the population (s_{crpop}^2). Then, setting r equal to .7, as a reasonable value for the correlation in Equation 4, we can rewrite Equation 4 as

.7 = \sqrt{1 - AD^2 / s_{crpop}^2}.    (5)

Squaring both sides and solving for the ratio of variances, we obtain

AD^2 / s_{crpop}^2 = 1 - .7^2 = .51.    (6)

Rounding .51 to .50 as did Burke and Dunlap (2002) and rewriting Equation 6, we get

AD^2 = s_{crpop}^2 / 2.    (7)

We used Equation 7 to calculate AD² for the different response distributions we identified. That is, we calculated the variance of each response distribution and then divided the variance by 2 in order to calculate AD². By then taking the square root of this resulting value, we calculated AD:

\sqrt{AD^2} = AD.    (8)
Recall that in Equation 5, AD² was substituted as an approximation for the observed variance; that is to say that AD approximates the standard deviation (s). In fact, the standard and average deviations vary by a constant that is dependent on the specified response distribution. As Burke and Dunlap (2002) noted, for the uniform distribution the s:AD ratio is 1.2. Thus, in order to calculate the upper limits, or critical values, for the AD index, assuming a uniform distribution, they divided AD (the result of Equation 8) by 1.2. That is, they corrected for the difference between s and AD introduced in Equation 5. The same adjustment is needed here. The resulting value of Equation 8 must be divided by the s:AD ratio relevant to a given response distribution. Thus, upper limits for acceptable interrater agreement for AD_M must be calculated separately for each null or theoretical response distribution using the following equation:

AD_M^{UL} = \frac{AD}{(s_e / AD_M)},    (9)

where AD is calculated according to Equations 7 and 8 and AD_M is calculated according to Equation 1.
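Chaining Equations 7 through 9, a critical value can be computed directly from the response proportions of a specified null distribution. The sketch below is our own illustration; small rounding differences from the tabled values are possible because the article rounds AD_M to two decimal places before further calculation:

```python
import math

def critical_value(props):
    """AD_M^UL for a null distribution given as proportions endorsing
    response options 1..c (Equations 7-9)."""
    options = range(1, len(props) + 1)
    mean = sum(p * x for p, x in zip(props, options))
    var = sum(p * (x - mean) ** 2 for p, x in zip(props, options))  # s^2_crpop
    ad_m = sum(p * abs(x - mean) for p, x in zip(props, options))   # expected AD_M
    ad = math.sqrt(var / 2)                 # Equations 7 and 8
    return ad / (math.sqrt(var) / ad_m)     # Equation 9: divide by the s:AD ratio

# Reproduces the tabled values: uniform 5-point -> 0.85; slight skew 5-point -> 0.69
print(round(critical_value([.20, .20, .20, .20, .20]), 2))
print(round(critical_value([.05, .15, .20, .35, .25]), 2))
```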
The resulting critical values are listed in Tables 2 and 3 along with the pattern of responses for
each of the distributions identified and the relevant statistics. For use as decision heuristics, we have
rounded the critical values in Tables 2 and 3 to two decimal places; they can be applied to individual
items or to multi-item scales. For the purpose of assessing the level of agreement, the critical values
should be used in the conventional way (Burke & Dunlap, 2002): An observed AD_M value equal to or less than the relevant critical value (AD_M^UL) indicates practically significant agreement. For example, referring to Table 2, under the condition of slight skew for a 5-point scale, AD_M^UL = .69. Thus, for researchers to infer a practically significant level of observed agreement, one's observed AD_M value must be less than or equal to .69. Based on such an indication of practically significant
agreement, researchers would be justified in using the mean score as an indicator of the group’s
standing on a construct of interest and as a data point for further, multilevel analysis.
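In code, this decision is a single comparison (a minimal sketch with a hypothetical observed value):

```python
observed_ad_m = 0.62   # hypothetical observed AD_M for a 5-point measure
ad_m_ul = 0.69         # critical value under slight skew (Table 2)
can_aggregate = observed_ad_m <= ad_m_ul  # True: practically significant agreement
```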
Null Ranges for Pattern of Dispersion
In addition to developing critical values, in response to recent advances in multilevel theory, we also
developed null ranges to facilitate researchers’ ability to assess how well the shape of an observed
distribution fits a theoretically specified distribution. This issue of comparing the pattern of observed
dispersion with a theoretical distribution is analogous to judging the goodness of fit between one’s
data and the theoretical response distribution. Cortina and Folger (1998) described tests of goodness
of fit as a matter of accepting the null hypothesis of no statistically significant difference between
observed data and theoretical models. Here, we are dealing with practical significance rather than
statistical significance meaning that goodness of fit in this context is a matter of concluding that
there is no meaningful difference between an observed distribution and the theoretically specified
response distribution.
The values for AD_M shown in Tables 2 and 3 quantify the dispersion of different distributions. Therefore, if an observed value is equal to the relevant tabled AD_M value, then the pattern of observed dispersion should fit perfectly with the pattern of theoretical dispersion. It is unlikely, however, that observed and tabled values will perfectly match; thus, the question becomes what is the "null range"? In other words, how far can an observed value be from the tabled value before
researchers must conclude that their observed distribution has a poor fit with the theoretical
distribution?
Analogous to Greenwald’s (1975) discussion of how to accept a null hypothesis gracefully
(also see discussions by Cashen & Geiger, 2004; Cortina & Folger, 1998), researchers would
need to decide in advance of collecting data what magnitude of effect, in this case, the magnitude of AD_M, would be considered nontrivial. We suggest defining this magnitude as the difference between the expected AD_M value for a distribution and the respective upper limit for that distribution. While the decision to specify this magnitude is arguably somewhat arbitrary, it is nevertheless made in advance of collecting data and tied to our derivations for assessments of practical agreement. Consistent with Greenwald's arguments about establishing a null range for the formal test of a null hypothesis, this minimum magnitude of AD_M that the researcher is willing to consider nontrivial is then the boundary of the null range. That is, for observed AD_M values, this magnitude would be the difference between the tabled AD_M value and the relevant tabled critical value, AD_M^UL. The general equation for establishing a null range for a theoretical response distribution is as follows:

AD_M^{null\,range} = AD_M \pm (AD_M - AD_M^{UL}) / w,    (10)

where w is used to define the width of the null range. Herein, following Greenwald's (1975, pp. 16-18) logic regarding establishing a "two-tailed" null range that is symmetric around the zero point of a test statistic, we define w as equal to 2:

AD_M^{null\,range} = AD_M \pm (AD_M - AD_M^{UL}) / 2.    (11)
The resulting values are presented in Tables 2 and 3. For use as decision heuristics, we have rounded
the lower (symbolized by a minus sign) and upper (symbolized by a plus sign) limits of the null
range to two decimal points.
Although we present null ranges that are symmetrical around the expected AD_M value, researchers can readily define w and the width of the null range relative to the purposes of their investigations. In these cases, larger values for w will result in smaller, more conservative null
ranges than those reported in Tables 2 and 3 for the respective response distributions. In addi-
tion, researchers may desire to consider the construction of nonsymmetrical, one-tailed null
ranges for some types of theoretical response distributions. As with the use of critical values,
the researcher may desire to consider a priori several theoretical response distributions when
making judgments about whether the observed and theoretical response distributions are mean-
ingfully different.
Using the null ranges to gauge the goodness of fit between an observed and a theoretical distri-
bution is straightforward. Using the previous example of a slightly skewed null distribution and a
5-point scale, an observed AD_M of .98 would suggest a perfect match between the observed and theoretical distributions (see Table 2). Yet, how can a researcher interpret an observed AD_M of .70? The relevant range is .84 to 1.12.⁸ Thus, a researcher who observes an AD_M of .70 would conclude that the observed distribution is meaningfully different from the theoretical distribution. That is, the observed AD_M of .70 falls outside of the null range.
What researchers would do after determining a lack of fit would depend upon the theoretical con-
text. In some cases it may be that a lack of fit suggests that the phenomena researchers are attempting
to study are not represented in the data. This eventuality is analogous to researchers who do multi-
level research finding a lack of agreement such that aggregation to a higher level of analysis cannot
be justified (e.g., Chan, 1998). Or, it may be the case that shapes of observed distributions are com-
pared to multiple theoretically specified distributions. While .70 does not fall into the null range for
slight skew, it does fall into the range for moderate skew. In this case, the researcher would be able to
categorize the group as a "moderate skew" group and make theoretically based predictions accord-
ingly. More broadly, researchers can use the null ranges provided in Tables 2 and 3 in order to
classify groups according to the pattern of their distributions of scores, and then based on this clas-
sification, make theoretically derived predictions about group outcomes. These null ranges and those
relevant to the other distributions discussed in the following can be used relative to a single item or a
scale.
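Such a classification step might look as follows (a hypothetical sketch using the 5-point null ranges from Table 2; adjacent null ranges can overlap, so an observed value may match more than one distribution, and the shape should still be inspected visually, as discussed next):

```python
# Null ranges for a 5-point scale, taken from Table 2
NULL_RANGES = {
    "slight skew":   (0.84, 1.12),
    "moderate skew": (0.60, 0.80),
    "heavy skew":    (0.51, 0.69),
}

def classify(observed_ad_m):
    """Return every theoretical distribution whose null range contains
    the observed AD_M value."""
    return [name for name, (lo, hi) in NULL_RANGES.items()
            if lo <= observed_ad_m <= hi]

print(classify(0.70))  # ['moderate skew'], as in the worked example
```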
It is important to note that researchers must visually check the observed distribution of responses.
For instance, the direction of skew may be of theoretical relevance. Because the AD index is calcu-
lated via absolute values, the direction of skew cannot be determined from AD values. Quantifying
agreement as well as visually checking the direction of skew is necessary. This point holds for other
distributions discussed as well.
Distribution Choice
In the following, we discuss examples of when these different response distributions might be rel-
evant (see also Table 1). Note that we do not assume that only one distribution is relevant in any
given research context; rather, as others have suggested (e.g., James et al., 1984), we think it is rea-
sonable that multiple distributions may be appropriate. We organize our discussion by first consid-
ering the issue of level of agreement and then considering the issue of pattern of dispersion. Within
these sections, we refer to distributions defined by skew (Table 2) and those defined by kurtosis and
variance (Table 3).
Level of Agreement
A number of response biases suggest that the appropriate null distribution is a skewed distribution.
James et al. (1984) and LeBreton and Senter (2008) have discussed the likelihood of leniency and
social desirability in contexts of assessing interrater agreement. Leniency may apply, for instance, in
the performance appraisal domain where subordinates tend to judge their supervisors in relatively
positive terms (Schriesheim, 1981). Klein, Conn, Smith, and Sorra (2001) found social desirability
to be applicable in a survey of organizational members’ workplace perceptions. Agreement among
members was related to the social desirability of the survey items (e.g., "The supervisor to whom I report praises me for excellent performance" and "My work here is enjoyable"; Klein et al., 2001, p.
11). To the extent that these biases are expected to be strong versus weak, and to the extent that mul-
tiple biases are expected to be relevant, researchers could utilize moderately to heavily skewed dis-
tributions as their null distributions.
Though skewed distributions have most often been suggested as alternatives to the uniform null
distribution, other distributions are relevant as well. Likert-type response formats that convey or
have different informational value may result in subgroups or small to moderate percentages of
respondents using particular response options. For instance, Schwarz, Knauper, Hippler, Noelle-
Neumann, and Clark (1991) showed that participants responded differently to the question "How successful would you say you have been in life?" when the 11-point scale ranged from –5 to 5 rather than 0 to 10, even though the anchors were identical (not at all successful to extremely successful). For the former scale, 34% endorsed –5 to 0; for the latter scale, 13% endorsed 0 to 5.
argued that the question of success is somewhat ambiguous in that success could be marked by the
presence of positive features or the absence of negative features and that participants use the scale
numbers as well as the anchors to interpret the items. In addition, Lindell and Brandt (1997) dis-
cussed the possibility of distinct factions among raters due to characteristics such as clinical orienta-
tions in assessments of psychotherapy, raters’ academic disciplines in rating research proposals,
raters’ functional department in ratings of organizational climate, and so on. The possibility of such
factions might call for the use of a bimodal response distribution as the baseline distribution for
assessing level of agreement among a set of raters. We present four different subgroup distributions and two different bimodal distributions (see Table 3). In addition, triangular-shaped or bell-shaped distributions⁹ are applicable null distributions if one expects raters to succumb to the central ten-
dency bias (e.g., James et al., 1984; LeBreton & Senter, 2008). For instance, James et al. (1984,
p. 91) suggested that the central tendency bias may occur "when judges are purposefully cautious or evasive because responses to items are not collected on a confidential basis, and political reasons exist for not departing from the neutral alternatives on the scales." They also suggested that naïve
and unmotivated participants may exhibit the central tendency bias when responding to ambiguous
or complicated items.
Finally, while the uniform distribution has been described as an often inappropriate null distribu-
tion, there are circumstances under which it is the appropriate null distribution. It is applicable if no
rater bias is expected. It may also be applicable if raters face conceptual ambiguity. For instance,
Heidemeier and Moser (2009) found that raters demonstrated less agreement in job performance
ratings when the work being evaluated was less straightforward; that is, there was less agreement
regarding white-collar work and work high in job complexity compared to agreement regarding
blue-collar work and work low in job complexity.
Pattern of Dispersion
There are also theoretical bases for modeling agreement on most of these distributions. DeRue
et al. (2010) provided an example theoretical basis for choosing a bimodal distribution as a
theoretical response distribution: Equally sized subgroups within teams that judge team efficacy
differently will have mixed effects on team effectiveness by impairing social processes, but enhan-
cing task processes. They went on to propose that the greater the divergence between the sub-
groups, the more negative the effect on team effectiveness will be. In discussing maximum
separation diversity, such as diversity in team members’ judgments of team efficacy, Harrison and
Klein (2007) discussed an extreme bimodal distribution, where subgroups exist on opposite ends
of a continuum. Consistent with DeRue et al., they argued that this extreme bimodal distribution
would have negative outcomes: reduced cohesiveness, interpersonal conflict, distrust, and
decreased task performance.
Related to bimodal distributions are unimodal distributions that have distinct subgroups. From a
theoretical perspective, DeRue et al. (2010) discussed "minority belief" dispersion where one team
member rates team efficacy differently than the other team members. We previously reported their
proposition that when minority belief dispersion is characterized by one individual rating team effi-
cacy lower than everyone else, the effect on team effectiveness will be negative. DeRue et al. also
theorized about the opposite distribution: One individual rates team efficacy more highly than the
other team members. They proposed that this pattern of dispersion, which is the mirror-image of the
first scenario, will have mixed effects on team effectiveness because the dispersion is likely to
impair social processes, but enhance task processes.
Finally, DeRue et al. (2010) provided an example of when the uniform distribution would be the-
oretically specified as the expected response distribution: Fragmentation, characterized by a uniform
distribution of team efficacy beliefs, should augment team effectiveness by positively impacting
social and task processes. Their argument is based on the idea that fragmented teams may commu-
nicate more effectively than other teams because they do not have subgroups, coalitions, and
factions that can hinder effective communication in teams and they are motivated to create a shared
understanding of team efficacy. These teams are likely to openly discuss issues like goals and expec-
tations that can help in teams’ task-related processes, as well as helping to establish a shared belief
about team efficacy. Harrison and Klein (2007) proposed similarly positive effects for variety
diversity, such as diversity in educational background, when it is at a maximum level, which is
characterized by a uniform distribution: more creativity, greater innovation, higher decision quality,
more task conflict, and increased unit flexibility.
Conclusion
Given numerous calls for researchers who use interrater agreement indices to stop their uncondi-
tional use of the uniform response distribution, a primary purpose of our study was to provide
researchers with guidelines for using alternative null distributions and theoretical distributions
to make judgments about practical significance, when addressing both methodological and theo-
retical issues. In doing so, we derived critical values for a variety of response distributions that
vary in terms of skew, kurtosis, and variance. We also discussed how to use the critical values
shown in Tables 2 and 3 differently depending on whether one seeks to ascertain the level of agree-
ment or the pattern of dispersion. While the question of the level of agreement is familiar, the
question of the pattern of dispersion is more novel, but likely to become more and more important
with advances in multilevel theory and research. The current paper stands to promote such con-
ceptual advances.
Although we focused the substantive discussion of interrater agreement problems on data
aggregation in relation to the level of agreement and team efficacy dispersion and diversity
in relation to the pattern of dispersion, the derived critical values and null ranges can be
applied to numerous other research questions. For instance, the alternative null distributions and
critical values could assist in addressing interrater agreement questions related to job analysts’
ratings of task items for a job, or judges’ ratings of critical or cut-off scores on the items of a
test (e.g., using the Angoff method whereby cut-off scores are based on subject matter experts’
estimates of the probability that a competent person will respond to an item accurately; e.g., see
Hudson & Campion, 1994) just to name a few types of pertinent research questions. As another
example, the notion of a theoretical distribution could be used to specify the demographic
makeup of a community (e.g., racial/ethnic composition in terms of percentages within each
category), thereby permitting the quantification of demographic similarity/dissimilarity between
employees and residents (i.e., the difference between an observed AD_M value and the relevant
tabled value for a theoretical distribution). Quantifying the effects of community demographic
similarity in this manner may meaningfully extend the measurement and study of employee-
community racial/ethnic similarity from the individual level of analysis (e.g., see Avery,
McKay, & Wilson, 2008; Brief et al., 2005) to the organization or business unit levels of anal-
ysis. Importantly, irrespective of the group phenomena under study, the AD index itself and the
derived null ranges provide a means for tracking or studying expected changes in group phenomena, possibly relative to stages of group development or shocks that the group might
encounter. Further, practical significance critical values could be similarly developed for other
interrater agreement indices.
Future research should also address the problem of assessing the statistical significance of AD
values relative to a variety of null or theoretical distributions. As discussed earlier, the work to date
in this area is limited. Burke and Dunlap (2002) and Dunlap et al. (2003) used an approximate ran-
domization test to establish statistical significance cut-offs for AD for judges’ ratings of a single item
relative to the uniform distribution. Cohen et al. (2009) built upon this work to establish statistical
significance cut-offs for AD for judges’ agreement on multi-item scales relative to the uniform dis-
tribution and a slightly skewed distribution. In order to assist researchers in inferring whether levels
of agreement (i.e., AD values) are due to chance, cut-off values for statistical significance should be
established relative to more distributions, such as those identified in Tables 2 and 3. Without addi-
tional guidelines, researchers are likely to continue to over-rely on the uniform distribution when
making inferences about their data.
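To illustrate the general logic of simulation-based cut-offs, the following Python sketch is our own, and only in the spirit of the work cited above rather than a reproduction of its exact randomization procedure; the function names, defaults, and number of replications are assumptions. It simulates groups of raters responding at random under a uniform null and takes a lower percentile of the resulting AD_M values as a significance cut-off for a single item.

import random

def ad_m(ratings):
    """Average deviation about the mean for a single item."""
    m = sum(ratings) / len(ratings)
    return sum(abs(x - m) for x in ratings) / len(ratings)

def ad_cutoff(n_raters, scale_points=5, alpha=.05, reps=100000, seed=1):
    """Simulated cut-off: only alpha of the groups responding at random
    under a uniform null obtain an AD_M at or below this value."""
    rng = random.Random(seed)
    sims = sorted(ad_m([rng.randint(1, scale_points) for _ in range(n_raters)])
                  for _ in range(reps))
    return sims[int(alpha * reps)]

# For example, with 10 raters on one 5-point item, an observed AD_M at or
# below the printed cut-off would suggest agreement beyond chance.
print(round(ad_cutoff(10), 2))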
In closing, we emphasize that the practical guidelines presented herein are just that: guide-
lines. As others have advised, it is important that researchers take a common sense approach to
interpreting observed agreement. Speaking in terms of whether interrater agreement is suffi-
cient to justify the aggregation of individual-level data, LeBreton and Senter (2008, p. 836)
asserted that ‘‘the value used to justify aggregation ultimately should be based on a researcher’s
consideration of (a) the quality of the measures, (b) the seriousness of the consequences result-
ing from the use of aggregate scores, and (c) the particular composition model to be tested.''
James et al. (1984), in addressing the problem of uncertainty over which null distribution
applies in a given situation, suggested interpreting observed agreement on the basis of several
null distributions: ‘‘The rationale here is that even though we cannot pinpoint a particular null
with a high degree of confidence, we can place bounds on the most likely types of nulls and
thereby increase the likelihood that the true null lies somewhere in this range of distributions’’
(p. 95).
Similarly, we urge researchers to consider their particular circumstances when assessing
interrater agreement and to consider the use of a range of critical values based on several dif-
ferent null or theoretical distributions. For instance, researchers should consider whether they
have missing data (e.g., see Newman & Sin, 2009). Our guidelines do not account for system-
atically missing data and thus may be sensitive to this problem, particularly in the cases of
certain distributions, such as the bimodal distribution, which may appear as a unimodal distri-
bution if data are systematically missing from one of the two subgroups. In other cases,
researchers may need to apply a null or theoretical distribution not included in Tables 2 and
3, or they may need to adjust the starting value of 80% agreement used in the present derivations. Moving away from 80% agreement or considering another transfor-
mation of percentage agreement for derivational purposes, such as a probit transformation of a
proportion, will result in more stringent or more lenient critical values and decisions concerning
interrater agreement depending on whether one adjusts this value upward or downward, or
whether one employs a more versus less conservative transformation of the proportion, such
as arcsine versus probit transformations. Recognizing the possibility that the research context
may dictate the consideration of other assumptions or response distributions than those used
in the study, we present in the appendix a general procedure for researchers to use in establish-
ing critical values and null ranges based on other assumptions not considered here. In this
regard, our proposed guidelines offer a uniform and parsimonious means for studying interrater
agreement given a variety of methodological and theoretical problems.
Appendix
Calculating Critical Values and Null Ranges for the AD Index
Although we presented a number of different response distributions in Tables 2 and 3, researchers
may find that their methodologically or theoretically specified response distribution is not listed. In
this case, researchers can follow our procedures to calculate the relevant critical values and null
ranges. The first step is to express the response distribution in terms of the proportion of individuals
endorsing each value of a scale as we did in Tables 2 and 3. The second step is to calculate the var-
iance for the specified distribution. Third, as per Equation 7, divide the variance by 2 to calculate AD²; then, take the square root of the resulting value (i.e., AD; see Equation 8). Finally, as per Equation 9, divide AD by the s:AD_M ratio to calculate AD_MUL; follow Equation 10 to calculate the null range (see also Equation 11). The agree.exe program available at http://www.tulane.edu/~dunlap/psylib.html can be used to calculate AD_M for items or multi-item scales. The calculations conducted by the software are based on Burke and Dunlap (2002) and Dunlap, Burke, and Smith-Crowe (2003). Note that the ‘‘actual variance’’ reported by the software is calculated for a sample, whereas our calculations are based on the variance calculated for a population.
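Stated compactly (our own consolidated restatement of the steps above, not a new equation from the article), with s² the population variance of the specified distribution, s its standard deviation, and w the width parameter discussed in connection with Equation 11:

$$AD = \sqrt{\frac{s^2}{2}}, \qquad AD_{M_{UL}} = \frac{AD}{\left( s / AD_M \right)}, \qquad \text{null range} = AD_M \pm \frac{AD_M - AD_{M_{UL}}}{w}.$$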
In addition, researchers may find that starting with a percentage agreement of 80% and using a
correlation of .7 (within Equation 5) is either too lenient or too stringent given their particular
circumstances. In such cases, researchers can derive their own critical values and null ranges based
on different initial assumptions. Beginning with either a different percentage agreement or continu-
ing derivations with another value for the correlation will produce different cut-offs and null ranges.
One can substitute another reasonable value in Equation 5 and follow the sequence through Equation
10 to arrive at new critical values and null ranges (see also Equation 11). For example, replacing .7 with .8 in Equations 5 and 6 results in 1 − .8² = .36. Thus, Equation 7 would be rewritten such that the variance would be divided by 2.78 (i.e., 1/.36). To calculate their own critical values then, researchers using r = .8 would calculate the variance of a given distribution and divide that variance by 2.78 (as per the revised version of Equation 7). Then, they would follow Equations 8-9 to calculate AD_MUL and Equation 10 to calculate the null range (see also Equation 11).
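To make the procedure concrete, the following Python sketch (our own illustration; the function and variable names are not from the article) carries out the four steps for an arbitrary specified distribution. With the default divisor of 2 it follows the r = .7 derivations; with a divisor of 2.78 it follows the r = .8 derivations and reproduces the uniform 5-point row of Table A1 below (AD_MUL = 0.72, null range 0.96 to 1.44).

import math

def ad_critical_values(values, props, divisor=2.0, w=2.0):
    """Critical value (AD_MUL) and null range for a specified response
    distribution, following the appendix steps (Equations 7 through 10).
    Use divisor=2.0 for the r = .7 derivations and 2.78 for r = .8."""
    mean = sum(v * p for v, p in zip(values, props))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, props))  # population variance
    ad_m = sum(p * abs(v - mean) for v, p in zip(values, props))        # AD_M of the distribution
    ad = math.sqrt(variance / divisor)                                  # Equations 7 and 8
    ad_mul = ad / (math.sqrt(variance) / ad_m)                          # Equation 9: AD / (s:AD_M)
    half_width = (ad_m - ad_mul) / w                                    # Equations 10 and 11
    return ad_mul, (ad_m - half_width, ad_m + half_width)

# Uniform 5-point distribution with the r = .8 divisor:
ad_mul, null_range = ad_critical_values([1, 2, 3, 4, 5], [.2] * 5, divisor=2.78)
print(round(ad_mul, 2), tuple(round(x, 2) for x in null_range))  # 0.72 (0.96, 1.44)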
There are several reasons for which researchers may want to use cut-offs associated with a
correlation of .8. For instance, from our starting point of defining meaningful agreement as 80%
agreement, one could use a probit transformation to convert this proportion to an effect size. A probit
transformation of the proportion may be called for if the underlying distribution of scores is expected
to be normally distributed. Also the probit transformation may be a particularly good choice in esti-
mating a standardized effect from proportions if the cut-point between the two groups is in the tail
portion of a skewed distribution (Lipsey & Wilson, 2001). A probit transformation of a proportion of
.8 will produce a correlation equal to .78501 (Lipsey & Wilson), which rounds to .8. Also, as
discussed in Burke and Dunlap (2002), a correlation of .8 would correspond to a high level of
stability in scores. Substituting .8 for .7 in Equation 6 and then following the remainder of the
equations, one would arrive at more stringent critical values than those presented in Tables 2 and
3. We present this more stringent set of criteria in Tables A1 and A2.
Finally, some researchers may want to use AD calculated from the median (AD_Md) rather than the mean (AD_M). These different versions of the AD index are equal when the mean and median of a distribution are equal, and otherwise they tend to be highly correlated (Burke, Finkelstein, & Dusig, 1999). Though AD_M has been used more often by researchers, Burke et al. (1999) argued that AD_Md is more sensitive in detecting agreement because the median of a distribution is the point at which the sum of the absolute deviations is minimized, and smaller deviations indicate higher agreement. AD_Md for an item is calculated as follows:
$$AD_{Md(j)} = \frac{\sum_{k=1}^{N} \left| x_{jk} - Md_j \right|}{N}, \qquad (12)$$
where Md_j is equal to the median rating of item j and all other notations are consistent with those in Equation 1. The scale AD_Md(J) is the mean of AD_Md(j) for essentially parallel items. The upper limit for AD_Md would be calculated as follows:
$$AD_{Md_{UL}} = \frac{AD}{\left( s / AD_{Md} \right)}, \qquad (13)$$
where AD is calculated according to Equations 5 through 8. Finally, the null range for AD_Md would be calculated as follows:
$$AD_{Md\ \mathrm{null\ range}} = AD_{Md} \pm \left( AD_{Md} - AD_{Md_{UL}} \right) / w, \qquad (14)$$
Table A1. Critical Values and Null Ranges for AD_M Given Distributions Defined by Skew and r = .8.

5-Point Scale
Distribution   | Proportion Endorsing Values 1-5 | s²   | AD_M | s:AD_M | AD_MUL (Critical Value) | Null Range
Slight Skew    | .05 .15 .20 .35 .25             | 1.34 | 0.98 | 1.18   | 0.59 | 0.78 to 1.18
Moderate Skew  | .00 .10 .15 .40 .35             | 0.90 | 0.70 | 1.36   | 0.42 | 0.56 to 0.84
Heavy Skew     | .00 .00 .10 .40 .50             | 0.44 | 0.60 | 1.11   | 0.36 | 0.48 to 0.72
Uniform        | .20 .20 .20 .20 .20             | 2.00 | 1.20 | 1.18   | 0.72 | 0.96 to 1.44

7-Point Scale
Distribution   | Proportion Endorsing Values 1-7 | s²   | AD_M | s:AD_M | AD_MUL (Critical Value) | Null Range
Slight Skew    | .05 .08 .12 .15 .20 .25 .15     | 2.92 | 1.44 | 1.19   | 0.86 | 1.15 to 1.72
Moderate Skew  | .00 .06 .10 .14 .28 .22 .20     | 2.09 | 1.16 | 1.25   | 0.69 | 0.92 to 1.39
Heavy Skew     | .00 .00 .05 .10 .15 .30 .40     | 1.39 | 0.94 | 1.25   | 0.56 | 0.75 to 1.13
Uniform^a      | .14 .14 .14 .14 .14 .14 .14     | 4.00 | 1.71 | 1.17   | 1.03 | 1.37 to 2.06

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
a. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
Table A2. Critical Values and Null Ranges for AD_M Given Distributions Defined by Kurtosis and Variance and r = .8.

5-Point Scale
Distribution            | Proportion Endorsing Values 1-5 | s²   | AD_M | s:AD_M | AD_MUL (Critical Value) | Null Range
Moderate Bimodal^a      | .00 .50 .00 .50 .00             | 1.00 | 1.00 | 1.00   | 0.60 | 0.80 to 1.20
Extreme Bimodal^a       | .50 .00 .00 .00 .50             | 4.00 | 2.00 | 1.00   | 1.20 | 1.60 to 2.40
Moderate Subgroup A^a,b | .00 .00 .10 .00 .90             | 0.36 | 0.36 | 1.67   | 0.22 | 0.29 to 0.43
Extreme Subgroup A^a,b  | .10 .00 .00 .00 .90             | 1.44 | 0.72 | 1.67   | 0.43 | 0.58 to 0.86
Moderate Subgroup B^a,b | .00 .00 .20 .00 .80             | 0.64 | 0.64 | 1.25   | 0.38 | 0.51 to 0.77
Extreme Subgroup B^a,b  | .20 .00 .00 .00 .80             | 2.56 | 1.28 | 1.25   | 0.77 | 1.02 to 1.54
Triangular-Shaped       | .11 .22 .34 .22 .11             | 1.32 | 0.88 | 1.31   | 0.53 | 0.70 to 1.06
Bell-Shaped             | .07 .24 .38 .24 .07             | 1.04 | 0.76 | 1.34   | 0.46 | 0.61 to 0.91
Uniform                 | .20 .20 .20 .20 .20             | 2.00 | 1.20 | 1.18   | 0.72 | 0.96 to 1.44

7-Point Scale
Distribution            | Proportion Endorsing Values 1-7 | s²   | AD_M | s:AD_M | AD_MUL (Critical Value) | Null Range
Moderate Bimodal^a      | .00 .50 .00 .00 .00 .50 .00     | 4.00 | 2.00 | 1.00   | 1.20 | 1.60 to 2.40
Extreme Bimodal^a       | .50 .00 .00 .00 .00 .00 .50     | 9.00 | 3.00 | 1.00   | 1.80 | 2.40 to 3.60
Moderate Subgroup A^a,b | .00 .00 .00 .10 .00 .00 .90     | 0.81 | 0.54 | 1.67   | 0.32 | 0.43 to 0.65
Extreme Subgroup A^a,b  | .10 .00 .00 .00 .00 .00 .90     | 3.24 | 1.08 | 1.67   | 0.65 | 0.86 to 1.30
Moderate Subgroup B^a,b | .00 .00 .00 .20 .00 .00 .80     | 1.44 | 0.96 | 1.25   | 0.58 | 0.77 to 1.15
Extreme Subgroup B^a,b  | .20 .00 .00 .00 .00 .00 .80     | 5.76 | 1.92 | 1.25   | 1.15 | 1.54 to 2.30
Triangular-Shaped       | .06 .13 .19 .24 .19 .13 .06     | 2.50 | 1.26 | 1.25   | 0.76 | 1.01 to 1.51
Bell-Shaped             | .02 .08 .20 .40 .20 .08 .02     | 1.40 | 0.84 | 1.41   | 0.50 | 0.67 to 1.01
Uniform^c               | .14 .14 .14 .14 .14 .14 .14     | 4.00 | 1.71 | 1.17   | 1.03 | 1.37 to 2.06

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
a. ‘‘Moderate’’ and ‘‘extreme’’ refer to the distance between subgroups.
b. ‘‘A’’ and ‘‘B’’ refer to the differential proportion of subgroup responses.
c. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
where w is used to define the width of the null range. For the same reasons given previously in the discussion of Equation 11, we suggest defining w as equal to 2.
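As a minimal illustration of Equation 12 and the scale-level index (our own sketch; the ratings shown are hypothetical), AD_Md can be computed from raw ratings as follows.

import statistics

def ad_md_item(ratings):
    """AD about the median for a single item (Equation 12)."""
    md = statistics.median(ratings)
    return sum(abs(x - md) for x in ratings) / len(ratings)

def ad_md_scale(item_ratings):
    """Scale-level AD_Md(J): the mean of AD_Md(j) across essentially parallel items."""
    return statistics.mean(ad_md_item(r) for r in item_ratings)

# Five judges rating two essentially parallel 5-point items (hypothetical data)
print(ad_md_item([4, 4, 5, 3, 4]))                      # 0.4
print(ad_md_scale([[4, 4, 5, 3, 4], [5, 4, 4, 4, 3]]))  # 0.4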
Acknowledgments
We would like to thank Greg Oldham and Isaac Smith for helpful comments on previous drafts of the article and
Julie Seidel and Teng Zhang for research assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publi-
cation of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
1. Interrater agreement is distinct from interrater reliability (e.g., Wagner, Rau, & Lindermann, 2010). While
the former is concerned with the extent to which ratings are the same across raters, the latter is concerned
with consistency in the rank order of ratings across raters. Kozlowski and Hattrup (1992) helpfully distin-
guished ‘‘consensus’’ (agreement) from ‘‘consistency’’ (reliability).
2. While practical significance concerns whether agreement is meaningful, statistical significance concerns
whether it is due to chance (Dunlap et al., 2003; Smith-Crowe & Burke, 2003).
3. The average deviation (AD) can also be calculated from the median (AD_Md) rather than the mean (AD_M). These different versions of the AD index are equal when the mean and median of a distribution are equal, and otherwise they tend to be highly correlated (Burke, Finkelstein, & Dusig, 1999). Because AD_M is used more often by researchers than AD_Md, we focus our article on AD_M. The appendix, however, provides the information researchers would need to calculate critical values for AD_Md as needed.
4. We base our work in part on equations presented by Burke and Dunlap (2002).
5. Burke and Dunlap (2002) demonstrated in their Equation 12 (p. 165) how to calculate AD from a proportion in the case of a dichotomy: AD(2 categories) = 2p(1 − p).
6. For the assessment of interrater agreement where judges rate a single target with respect to only two categories (e.g., on a yes–no or agree–disagree dichotomous item format), Burke and Dunlap (2002, p. 164) obtained an upper limit value for AD of .35 using their Equation 9:

$$AD_{UL} = \sqrt{\left( c^2 - 1 \right) / 24},$$

where c is equal to the number of categories. When c equals 2, AD_UL equals .35. Burke and Dunlap (p. 164) also presented an approximation or simplification of Equation 9, which was their Equation 10:

$$\sqrt{c^2 / 25} = c / 5.$$

Using this approximation or simplification of their Equation 9 (c/5) and dividing this quantity by 1.2, which is the constant by which AD and the standard deviation of responses on an item differ relative to the uniform distribution, yields the value of .33 as the upper limit of AD for a dichotomous item.
7. When d is within the range of −.40 to .40, a close approximation of r is d divided by 2. Yet, when d falls outside of this range, the relationship between d and r becomes nonlinear. For the latter cases, an accurate approximation of d to r is obtained by the maximum likelihood estimate (see Hunter & Schmidt, 2004, pp. 277-279). Because our d of 2.214 is greater than .40, we used Equation 3 to obtain an accurate estimate of r.
8. Note that if readers were to plug the tabled values into Equation 9, they would arrive at a slightly different
range due to rounding error.
9. Here we are dealing with multinomial distributions, and as a result, we refer to them as triangular-shaped and bell-shaped as they are similar in a figurative sense to the triangular and normal probability distributions, respectively.
References
Avery, D. R., McKay, P. F., & Wilson, D. C. (2008). What are the odds? How demographic similarity affects
the prevalence of perceived employment discrimination. Journal of Applied Psychology,93, 235-249.
Bledow, R., & Frese, M. (2009). A situational judgment test of personal initiative and its relationship to
performance. Personnel Psychology,62, 229-258.
Borucki, C. C., & Burke, M. J. (1999). An examination of service-related antecedents to retail store perfor-
mance. Journal of Organizational Behavior,20, 943-962.
Brief, A. P., Umphress, E. E., Dietz, J., Burrows, J. W., Butz, R. M., & Scholten, L. (2005). Community matters:
Realistic group conflict theory and the impact of diversity. Academy of Management Journal,48, 830-844.
Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater agreement reconsidered: An alternative to the r_wg indices. Organizational Research Methods, 8, 165-184.
Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A
user’s guide. Organizational Research Methods,5, 159-172.
Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater
agreement. Organizational Research Methods,2, 49-68.
Cascio, W. F. (1998). Applied psychology in human resource management. Upper Saddle River, NJ: Prentice
Hall.
Cashen, L. H., & Geiger, S. W. (2004). Statistical power and the testing of null hypotheses: A review of con-
temporary management research and recommendations for future studies. Organizational Research
Methods,7(2), 151-167.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of
analysis: A typology of composition models. Journal of Applied Psychology,83, 234-246.
Cohen, A., Doveh, E., & Nahum-Shani, I. (2009). Testing agreement for multi-item scales with the indices r_WG(J) and AD_M(J). Organizational Research Methods, 12, 148-164.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.
Cortina, J. M., & Folger, R. G. (1998). When is it acceptable to accept a null hypothesis: No way, Jose?
Organizational Research Methods,1, 334-350.
Dawson, J. F., Gonzalez-Roma, V., Davis, A., & West, M. A. (2008). Organizational climate and climate
strength in UK hospitals. European Journal of Work and Organizational Psychology,17, 89-111.
DeRue, D. S., Hollenbeck, J. R., Ilgen, D. R., & Feltz, D. (2010). Efficacy dispersion in teams: Moving beyond
agreement and aggregation. Personnel Psychology,63, 1-40.
Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance for r_WG and average deviation indexes. Journal of Applied Psychology, 88, 356-362.
Edwards, J. R., & Berry, J. W. (2010). The presence of something or the absence of nothing: Increasing theo-
retical precision in management research. Organizational Research Methods,13, 668-689.
Grant, A. M., & Mayer, D. M. (2009). Good soldiers and good actors: Prosocial and impression management
motives as interactive predictors of affiliate citizenship behaviors. Journal of Applied Psychology,94,
900-912.
Greene, W. H. (1997). Econometric analysis. Upper Saddle River, NJ: Prentice Hall.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin,82,
1-20.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ:
Lawrence Erlbaum.
Harrison, D. A., & Klein, K. J. (2007). What’s the difference? Diversity constructs as separation, variety, or
disparity in organizations. Academy of Management Review,32, 1199-1228.
Heidemeier, H., & Moser, K. (2009). Self-other agreement in job performance ratings: A meta-analytic test of a
process model. Journal of Applied Psychology,94, 353-370.
Hudson, J. P., & Campion, J. E. (1994). Hindsight bias in an application of the Angoff method for setting cutoff
scores. Journal of Applied Psychology,79, 860-865.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research find-
ings. Thousand Oaks, CA: Sage.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and with-
out response bias. Journal of Applied Psychology,69, 85-98.
James, L. R., Demaree, R. G., & Wolf, G. (1993). r_wg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306-309.
Katz-Navon, T., Naveh, E., & Stern, Z. (2009). Active learning: When is more better? The case of resident
physicians’ medical errors. Journal of Applied Psychology,94, 1200-1209.
Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of
within-group agreement in employee perceptions of the work environment. Journal of Applied
Psychology,86, 3-16.
Klein, K. J., Dansereau, F., & Hall, R. J. (1994). Levels issues in theory development, data collection, and
analysis. Academy of Management Review,19, 195-229.
Kline, T. J. B., & Hambley, L. A. (2007). Four multi-item interrater agreement options: Comparisons and
outcomes. Psychological Reports,101, 1001-1010.
Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues
of consistency versus consensus. Journal of Applied Psychology,77, 161-167.
Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations:
Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory,
research, and methods in organizations (pp. 3-90). San Francisco, CA: Jossey-Bass.
Kreiner, G. E., Hollensbe, E. C., & Sheep, M. L. (2009). Balancing borders and bridges: Negotiating the work-
home interface via boundary work tactics. Academy of Management Journal,52, 704-730.
Lawrence, K. L., Lenk, P., & Quinn, R. E. (2009). Behavioral complexity in leadership: The psycho-
metric properties of a new instrument to measure behavioral repertoire. Leadership Quarterly,20,
87-102.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agree-
ment. Organizational Research Methods,11, 815-852.
Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied
Psychological Measurement,21, 271-278.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lüdtke, O., & Robitzsch, A. (2009). Assessing within-group agreement: A critical examination of a random-group resampling approach. Organizational Research Methods, 12, 461-487.
McCall, R. B. (1970). Fundamental statistics for psychology. New York, NY: Harcourt, Brace, & World, Inc.
Messick, D. M. (1982). Some cheap tricks for making inferences about distribution shapes from variances.
Educational and Psychological Measurement,42, 749-758.
Meyer, R. D., Mumford, T. V., & Campion, M. A. (2010, August). The practical consequences of null distribution choice on r_wg. Paper presented at the annual meeting of the Academy of Management, Montreal, Canada.
Newman, D. A., & Sin, H.-P. (2009). How do missing data bias estimates of within-group agreement? Sensitivity of SD_WG, CV_WG, r_WG(J), r_WG(J)*, and ICC to systematic nonresponses. Organizational Research Methods, 12, 113-147.
Nicklin, J. M., & Roch, S. G. (2009). Letters of recommendation: Controversy and consensus from expert
perspectives. International Journal of Selection and Assessment,17, 76-91.
Parsons, R. (1978). Statistical analysis. New York, NY: Harper & Row.
Roberson, Q. M., Sturman, M. C., & Simons, T. L. (2007). Does the measure of dispersion matter in multilevel
research? A comparison of the relative performance of dispersion indexes. Organizational Research
Methods,10, 564-588.
Rousseau, D. M. (1985). Issues of level in organizational research: Multi-level and cross-level perspectives.
Research in Organizational Behavior,7, 1-37.
Schriesheim, C. A. (1981). The effect of grouping or randomizing items on leniency response bias. Educational
and Psychological Measurement,41, 401-411.
Schwarz, N. (1999). Self-reports: How questions shape the answers. American Psychologist,54, 93-105.
Schwarz, N., Knauper, B., Hippler, H. J., Noelle-Neumann, E., & Clark, F. (1991). Rating scales: Numeric
values may change the meaning of scale labels. Public Opinion Quarterly,55, 570-582.
Smith-Crowe, K., & Burke, M.J. (2003). Interpreting the statistical significance of observed AD interrater
agreement values: Corrections to Burke and Dunlap (2002). Organizational Research Methods,6, 129-131.
Takeuchi, R., Chen, G., & Lepak, D. P. (2009). Through the looking glass of a social system: Cross-level effects
of high performance work systems on employees’ attitudes. Personnel Psychology,62, 1-29.
Trougakos, J. P., Beal, D. J., Green, S. G., & Weiss, H. M. (2008). Making the break count: An episodic exam-
ination of recovery activities, emotional experiences, and positive affective displays. Academy of
Management Journal,51, 131-146.
Van Kleef, G. A., Homan, A. C., Beersma, B., Van Knippenberg, D., Knippenberg, B. V., & Damen, F. (2009).
Searing sentiment or cold calculation? The effects of leader emotional displays on team performance depend
on follower epistemic motivation. Academy of Management Journal,52, 562-580.
Wagner, S. M., Rau, C., & Lindermann, E. (2010). Multiple informant methodology: A critical review and
recommendations. Sociological Methods and Research,38, 582-618.
Walumbwa, F. O., & Schaubroeck, J. (2009). Leader personality traits and employee voice behavior: Mediating
roles of ethical leadership and work group psychological safety. Journal of Applied Psychology,94,
1275-1286.
Author Biographies
Kristin Smith-Crowe is an associate professor of organizational behavior in the David Eccles School of Busi-
ness, University of Utah. Her research focuses on behavioral ethics, interrater agreement, and worker safety.
Michael J. Burke is the Lawrence Martin Chair in Business, Freeman School of Business, Tulane University.
His current research focuses on learning and the efficacy of workplace safety and health interventions as well as
the meaning of employee perceptions of work environment characteristics (psychological and organizational
climate).
Maryam Kouchaki received her PhD in organizational behavior from the David Eccles School of Business,
University of Utah, and is currently a postdoctoral fellow at the Edmond J. Safra Center for Ethics, Harvard
University. Her research focuses on the moral dimension of social life, in particular, ethical behavior in the
workplace.
Sloane M. Signal is a doctoral student in the College of Education and Human Development at Jackson State
University. Her research interests include communicating across cultures both inside and outside of the United
States, diversity and multiculturalism in the workplace, and the scholarship of teaching and learning.