
Organizational Research Methods
http://orm.sagepub.com/

Assessing Interrater Agreement via the Average Deviation Index Given a Variety of Theoretical and Methodological Problems
Kristin Smith-Crowe, Michael J. Burke, Maryam Kouchaki and Sloane M. Signal
Organizational Research Methods 2013 16: 127, originally published online 27 November 2012
DOI: 10.1177/1094428112465898
The online version of this article can be found at: http://orm.sagepub.com/content/16/1/127

Published by SAGE (http://www.sagepublications.com) on behalf of The Research Methods Division of The Academy of Management.

Additional services and information for Organizational Research Methods can be found at:
Email Alerts: http://orm.sagepub.com/cgi/alerts
Subscriptions: http://orm.sagepub.com/subscriptions
Reprints: http://www.sagepub.com/journalsReprints.nav
Permissions: http://www.sagepub.com/journalsPermissions.nav

OnlineFirst Version of Record: Nov 27, 2012
Version of Record: Feb 8, 2013

Downloaded from orm.sagepub.com at TULANE UNIV on May 29, 2013

Article

Assessing Interrater Agreement via the Average Deviation Index Given a Variety of Theoretical and Methodological Problems

Kristin Smith-Crowe¹, Michael J. Burke², Maryam Kouchaki¹, and Sloane M. Signal³

Abstract

Currently, guidelines do not exist for applying interrater agreement indices to the vast majority of methodological and theoretical problems that organizational and applied psychology researchers encounter. For a variety of methodological problems, we present critical values for interpreting the practical significance of observed average deviation (AD) values relative to either single items or scales. For a variety of theoretical problems, we present null ranges for AD values, relative to either single items or scales, to be used for determining whether an observed distribution of responses within a group is consistent with a theoretically specified distribution of responses. Our discussion focuses on important ways to extend the usage of interrater agreement indices beyond problems relating to the aggregation of individual-level data.

Keywords

average deviation (AD), interrater agreement, multilevel research, aggregation, null distribution

Assessments of interrater agreement, or the degree to which raters are interchangeable (Kozlowski & Hattrup, 1992),¹ are integral to many types of organizational and applied psychology research. For instance, interrater agreement assessments have recently been central with respect to addressing substantive questions within domains such as organizational climate and leadership (e.g., Dawson, Gonzalez-Roma, Davis, & West, 2008; Walumbwa & Schaubroeck, 2009), conducting quantitative

¹ University of Utah, David Eccles School of Business, Salt Lake City, UT, USA
² Tulane University, Freeman School of Business, New Orleans, LA, USA
³ College of Education and Human Development, Jackson State University, Jackson, MS, USA

Corresponding Author:
Kristin Smith-Crowe, University of Utah, David Eccles School of Business, 1655 East Campus Center Drive, Salt Lake City, UT 84112, USA.
Email: kristin.smith-crowe@business.utah.edu

Organizational Research Methods 16(1) 127-151
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/1094428112465898
orm.sagepub.com


and qualitative research, as well as laboratory and field studies (e.g., Katz-Navon, Naveh, & Stern, 2009; Kreiner, Hollensbe, & Sheep, 2009; Van Kleef et al., 2009), developing measures (e.g., Bledow & Frese, 2009; Lawrence, Lenk, & Quinn, 2009), dealing with various types of data analysis problems (e.g., Grant & Mayer, 2009; Nicklin & Roch, 2009; Trougakos, Beal, Green, & Weiss, 2008), and deciding whether or not to aggregate data (e.g., Borucki & Burke, 1999; Takeuchi, Chen, & Lepak, 2009). Further, usage of interrater agreement statistics is on the rise. In the Journal of Applied Psychology and Personnel Psychology alone, there has been a largely linear increase in the use of these statistics over the past decade (see Figure 1). Notably, in 2010 almost half of the articles published in these journals used interrater agreement statistics.

Despite the relevance of interrater agreement assessments for dealing with a broad array of theoretical and methodological issues and their widespread usage, systematically derived guidelines for applying interrater agreement indices to the vast majority of problems that researchers and practitioners encounter do not exist. The primary objective of this article is to derive practical guidelines to assist researchers using the average deviation (AD) index in making more informed decisions about interrater agreement problems. We focus on the AD index, the average deviation from the mean or median of ratings, for two primary reasons. First, AD is straightforward. It measures agreement, while intraclass correlations (ICC) measure both agreement and reliability simultaneously (LeBreton & Senter, 2008), potentially complicating inferences. Further, for both ICC and r_WG, researchers must choose from among numerous variations to employ the statistic (see LeBreton & Senter, 2008, for a review). Second, AD performs well. In a simulation study, Roberson, Sturman, and Simons (2007) found that the AD index performs as well as other similar statistics. Kline and Hambley (2007) reported similar findings.

Importantly, we are concerned with practical significance, or "whether an index indicates that interrater agreement is sufficiently strong or disagreement is sufficiently weak so that one can trust that the average opinion of a group is interpretable or representative"² (Dunlap, Burke, & Smith-Crowe, 2003, p. 356), as practical significance is the basis on which agreement is typically evaluated. We present critical values for addressing the frequently asked methodological question concerning practical significance, "How much agreement/dispersion is there?" These critical values can be used to assess agreement on a single item or a scale. This question concerns the level of agreement in a set of ratings. An answer to this question often informs decisions about the quality of a measure of central tendency, such as a group's mean, as an indicator of the group's standing on a phenomenon or construct of interest. While previous work has also addressed this question, as we will discuss in what follows, the guidelines provided are of very limited use.

Figure 1. Percentage of articles published in Personnel Psychology and the Journal of Applied Psychology that used interrater agreement statistics, including r_WG, average deviation (AD), intraclass correlation (ICC), percentage agreement, and Cohen's kappa. [The plotted percentages rise roughly linearly from 23% in 2000 to 47% in 2010.]

In particular, we go beyond the work of Burke and Dunlap (2002), who previously provided a decision rule for interpreting the practical significance of observed AD values, to provide decision rules that cover many more circumstances. As we detail in what follows, though the calculation of AD does not require the specification of a null distribution representing no agreement, the interpretation of AD does. In other words, while one can calculate AD in the absence of a specified null distribution, one cannot draw conclusions regarding observed AD values without comparing them to some notion of "no agreement." Burke and Dunlap's guideline is based exclusively on the uniform distribution as the null distribution; there are no guidelines for interpreting the practical significance of AD relative to any other null distributions. In what follows, we discuss the criticisms of researchers' overreliance on the uniform distribution despite other distributions often being more appropriate. Herein, we provide guidelines for interpreting AD in terms of the level of agreement relative to numerous other distributions. Our guidelines will allow researchers to interpret interrater agreement relative to null distributions more appropriate to their research than the uniform distribution.

Furthermore, we present guidelines for addressing the less commonly posed yet theoretically important question of "How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?" These guidelines can be used in relation to either agreement on a single item or a scale. An answer to this question informs decisions regarding the scoring of the group as consistent or not with the theoretically specified distribution and, thus, the use of such scores in subsequent analyses at the group level of analysis. Addressing questions related to the pattern of dispersion will be of increasing importance as researchers attempt to test new theories concerning group and other higher level phenomena that specify patterns of dispersion as variables (e.g., see DeRue, Hollenbeck, Ilgen, & Feltz, 2010; Harrison & Klein, 2007). By focusing on the pattern in addition to the level of agreement/dispersion, our work promotes conceptual advances in research and goes beyond previous work on interrater agreement (e.g., Burke & Dunlap, 2002).

For the purpose of demonstrating how our guidelines would be used to address problems relating to the pattern of dispersion, we will focus on notions of diversity and team efficacy dispersion, as theories relating to these phenomena have recently been presented. For the purpose of demonstrating how our guidelines would be applied to the assessment of the level of agreement, we focus our discussion on the common use of interrater agreement indices for data aggregation decisions. The guidelines we present, however, would apply to the study of a broad array of interrater agreement problems.

To unfold our discussion, we begin with a brief summary of research on multilevel modeling and data aggregation to set the stage for a discussion related to assessments of the level of agreement. This discussion also includes an overview of the relevance of interrater agreement assessments for determining whether or not the observed pattern of dispersion matches a theoretically specified pattern of dispersion. Then, we present interpretive standards for assessments of interrater agreement for both the level of agreement and pattern of dispersion, with detailed discussions of how the derived guidelines can be applied to a variety of research problems.


Issues Related to the Level of Agreement and Pattern of Dispersion

In this section we discuss the use of interrater agreement in multilevel research to justify the aggregation of lower level data to higher levels of analysis based on the level of observed agreement. Then, we discuss a second possible usage of interrater agreement statistics, which is to assess the goodness of fit between an observed pattern of dispersion and a theoretically specified pattern of dispersion. We give examples of recent multilevel theories that predict outcomes based on patterns of dispersion. Related to both the level of agreement and the pattern of dispersion, we discuss the limited guidelines available to researchers for interpreting agreement.

Level of Agreement

Multilevel research commonly entails researchers aggregating data so as to create measures or indicators of higher level constructs. The appropriateness of representing higher level constructs by aggregating individual-level data is established by a composition model, which represents theory on how multilevel constructs are related at each level of analysis (Chan, 1998; Kozlowski & Klein, 2000; see also Klein, Dansereau, & Hall, 1994; Rousseau, 1985). For instance, Chan's (1998, p. 236) direct consensus model is the idea that the "meaning of [the] higher level construct is in the consensus among lower levels"; the referent-shift consensus model is the idea that the "lower level units being composed by consensus are conceptually distinct though derived from the original individual-level units"; and the dispersion model is the idea that the "meaning of [the] higher level construct is in the dispersion or variance among lower level units." Importantly, composition arguments indicate the type of evidence needed to justify the aggregation of individual-level data, with several models, including the direct consensus and referent-shift models (Chan, 1998), specifying interrater agreement, or the interchangeability of raters, as the appropriate type of evidence. Interrater agreement is also important for dispersion models (Chan, 1998); in this case, the degree of agreement itself represents the higher level construct.

Essentially, interrater agreement via the average deviation index is established by demonstrating that observed agreement is sufficiently greater than no agreement. Thus, though it is not necessary to the calculation of AD, in order to assess, or interpret, observed AD values, researchers must identify an appropriate random response distribution, or null distribution, to which observed variability in responses can be compared. A number of scholars have cited the choice of a null distribution as key to interpreting indices of interrater agreement, and thus drawing appropriate inferences from data (e.g., Brown & Hauenstein, 2005; A. Cohen, Doveh, & Nahum-Shani, 2009; James, Demaree, & Wolf, 1984; LeBreton & Senter, 2008; Lindell & Brandt, 1997; Lüdtke & Robitzsch, 2009; Meyer, Mumford, & Campion, 2010). In practice, however, researchers routinely rely on the uniform distribution as the null distribution, though doing so is likely often inappropriate (e.g., Brown & Hauenstein, 2005; Meyer et al., 2010). In fact, LeBreton and Senter (2008) recently called for a moratorium on the unconditional reliance on the uniform distribution.

The consequences of inappropriately comparing observed data to the uniform null distribution can be (a) that researchers mistakenly do not read interrater agreement as being sufficient for aggregation to higher levels of analysis, (b) that researchers mistakenly read interrater agreement as being sufficient for aggregation to higher levels of analysis (e.g., see Meyer et al., 2010), or (c) that researchers fail to appropriately interpret a group's standing on a variable of interest. Thus, comparing observed data to an inappropriate null distribution can lead to erroneous inferences that have important implications for researchers. Nonetheless, the only decision rule for interpreting the practical significance of observed AD values is Burke and Dunlap's (2002) decision rule, which


is based on the uniform distribution as the null distribution. Currently, there are no guidelines for interpreting practical significance relative to any other distributions.

While assessments of within-group agreement for methodological purposes, such as data aggregation as discussed previously, address the question, "How much agreement/dispersion is there?" another question researchers can answer using interrater agreement indices is "How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?" In the following we discuss the issue of the pattern of dispersion and the theoretical distributions to which observed patterns can be compared.

Pattern of Dispersion

Harrison and Klein (2007) recently argued for the theoretical import of considering the pattern of dispersion. They distinguished among separation diversity (e.g., differences in opinions, beliefs, or attitudes), variety diversity (e.g., differences in knowledge or experience), and disparity diversity (e.g., differences in proportionate ownership or control over socially valued assets). They argued that depending on the type of diversity, minimum, moderate, and maximum diversity would be associated with differently shaped distributions; that is, both the type and degree of diversity determine the shapes of distributions. For instance, maximum separation diversity is characterized by a bimodal distribution, maximum variety diversity is characterized by a uniform distribution, and maximum disparity diversity is characterized by a skewed distribution. For separation diversity, minimum, moderate, and maximum degrees of diversity are characterized as unimodal, uniform, and bimodal, respectively. Considering both type of diversity and pattern of dispersion, they argued that maximum separation diversity (bimodal distribution) and maximum disparity diversity (skewed distribution) will have negative outcomes, such as reduced cohesion and group member input, respectively, while maximum variety diversity (uniform distribution) will have positive outcomes, such as increased creativity.

Importantly, according to their theory, both the type of diversity and the pattern of dispersion must be known in order to effectively predict outcomes. For example, separation diversity could be measured with regard to team members' opinions about what their teams' goals are (Harrison & Klein, 2007). For each team, the pattern of the distribution of these opinions would be compared to unimodal, uniform, and bimodal distributions, as these are the distributions theoretically specified by Harrison and Klein (2007) as representing minimum, moderate, and maximum separation diversity. The degree of separation diversity, then, would be indicated by the theoretical distribution that is most similar to the observed distribution. With this measure of degree of separation diversity for each team, in addition to measures of cohesion, conflict, trust, and performance, researchers could test Harrison and Klein's hypothesis that as the degree of separation diversity increases, team outcomes will be more negative: less cohesion and trust, more conflict, and lower performance.
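The comparison step described above can be sketched in code. The following is an illustrative sketch only, not the authors' procedure: the unimodal proportions, the total-absolute-difference metric, and the `closest_shape` helper are all hypothetical choices we introduce for the example (the bimodal proportions match the extreme bimodal row of Table 3).

```python
# Illustrative sketch (not the authors' procedure): score a team's observed
# response proportions against the three shapes Harrison and Klein (2007)
# associate with minimum, moderate, and maximum separation diversity.
THEORETICAL_SHAPES = {
    "minimum (unimodal)": [0.05, 0.15, 0.60, 0.15, 0.05],  # assumed example
    "moderate (uniform)": [0.20, 0.20, 0.20, 0.20, 0.20],
    "maximum (bimodal)":  [0.50, 0.00, 0.00, 0.00, 0.50],
}

def closest_shape(observed, shapes=THEORETICAL_SHAPES):
    """Return the label of the theoretical distribution whose proportions
    are closest (by total absolute difference) to the observed proportions."""
    return min(
        shapes,
        key=lambda label: sum(
            abs(o - t) for o, t in zip(observed, shapes[label])
        ),
    )

# A team whose goal opinions cluster at both scale endpoints:
print(closest_shape([0.45, 0.05, 0.00, 0.05, 0.45]))  # → maximum (bimodal)
```

A team classified this way receives a degree-of-separation-diversity score that could then be related to cohesion, conflict, trust, and performance measures.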

DeRue et al.'s (2010) work on team efficacy provides another example of the potential theoretical importance of the pattern of dispersion above and beyond the level of dispersion. They argued that teams could have the same level of dispersion in their team efficacy ratings, but have different theoretically meaningful patterns of dispersion. These different patterns of dispersion, they argued, would predict different outcomes. Thus, according to DeRue et al.'s theory of team efficacy dispersion, assessing the pattern of dispersion in team efficacy ratings is essential for making predictions about team effectiveness. For instance, they argued that while a bimodal distribution of team efficacy ratings would lead to both positive and negative outcomes, a uniform distribution would lead to positive outcomes. Regarding the effects of a uniform distribution, their argument was that team members' disagreement will lead them to share their differing views, thus enhancing team structuring, planning, and learning, while simultaneously allowing the team to avoid problems of extreme magnitudes of efficacy, which can lead either to overconfidence or helplessness, and social factions,


which create dysfunctional conflict. In contrast, they argue that a bimodal distribution will similarly lead to team members sharing their differing views and thus enhancing team processes, but due to the existence of social factions, will also lead to dysfunctional conflict.

While the question of the level of dispersion has been important for various reasons, especially justifying aggregating individual-level data to form higher level variables, it is likely that the question of the pattern of dispersion will become increasingly important as more researchers consider the theoretical import of response distributions in and of themselves. This forecast is consistent with a recent call from Edwards and Berry (2010) to increase the theoretical precision in management research by developing hypotheses that specify effects in terms of magnitude, form (linear, nonlinear, etc.), and conditions (i.e., moderators). In reviewing 25 years (1985-2009) of articles published in the Academy of Management Review, Edwards and Berry (2010) found that 10.4% of the propositions stated only that a relationship would exist, and 89.6% only indicated the direction of the relationship. The theories presented by DeRue et al. (2010) and Harrison and Klein (2007) are important steps toward more precise management theories because they consider the shapes of distributions rather than simply measures of central tendency.

In cases for which the pattern of dispersion is of interest, it will be necessary to specify a "null response range," analogous to a null range with regard to a formal test of the null hypothesis (see Greenwald, 1975), to determine whether the observed pattern of responses, or the relative percentages of individuals within the respective categories, is consistent with the theoretical distribution. To date, though researchers have suggested that observed patterns of dispersion can be quantitatively assessed (DeRue et al., 2010; Harrison & Klein, 2007), no one has developed practical guidelines for drawing inferences about the goodness of fit between an observed distribution and a theoretically specified distribution. As such, practical guidelines are needed for addressing both the methodological question of the level of agreement/dispersion and the theoretical question of the pattern of responses.

Summary

In order to address this dearth of guidelines, we specify a variety of response distributions that researchers could use to address a number of theoretical and methodological issues, and we derive decision rules for the AD index relevant to each of these distributions to aid researchers in making inferences about interrater agreement. We explain why and how the critical values presented must be used differently to answer different research questions. Our intention is to help researchers interpret interrater agreement under the specified conditions; importantly, the results will help researchers make more appropriate decisions, including those regarding the aggregation of data, and draw more appropriate inferences regarding the interpretation of group phenomena. In what follows, we discuss the AD index, relevant distributions, and interpretive standards for the AD index.

The AD Index of Interrater Agreement

Burke, Finkelstein, and Dusig (1999) introduced the average deviation as an index of interrater agreement, which represents the average absolute deviation in ratings from the mean rating of an item (AD_M),³ and as such is interpretable in the metric of the original scale. AD_M for an item is calculated as follows:

    AD_{M(j)} = \frac{\sum_{k=1}^{N} |x_{jk} - \bar{x}_j|}{N},    (1)

where N is the number of judges, or observations, of item j, x_{jk} is equal to the kth judge's rating of item j, and \bar{x}_j is equal to the mean rating of item j (Burke et al., 1999). The scale AD_{M(J)} is the mean of AD_{M(j)} for essentially parallel items. Because the AD index is a measure of dispersion, lower values indicate greater agreement.
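Equation 1 and the scale-level average translate directly into code. This is a minimal sketch; the function names `ad_m` and `ad_m_scale` are our own, not from the article.

```python
# Minimal sketch of Equation 1: the item-level AD_M is the mean absolute
# deviation of judges' ratings from the item's mean rating; the scale-level
# AD_M(J) is the mean of the item-level values across essentially parallel
# items.
def ad_m(ratings):
    """Item-level AD_M: average absolute deviation from the mean rating."""
    mean = sum(ratings) / len(ratings)
    return sum(abs(x - mean) for x in ratings) / len(ratings)

def ad_m_scale(items):
    """Scale-level AD_M(J): mean of AD_M(j) over the scale's items."""
    return sum(ad_m(item) for item in items) / len(items)

# Five judges rating one item on a 5-point scale:
print(ad_m([1, 2, 3, 4, 5]))          # → 1.2
print(ad_m_scale([[1, 2, 3, 4, 5],    # item with maximal spread
                  [3, 3, 3, 3, 3]]))  # item with perfect agreement → 0.6
```

Because AD is a dispersion measure, the perfectly agreeing second item contributes 0 and pulls the scale-level value down.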

As noted previously, Burke and Dunlap (2002) derived a decision rule for inferring the practical significance of observed AD values. This decision rule has two critical limitations. First, it only addresses assessments of the level of agreement, not the pattern of distributions, which may be theoretically important. With the advance of theories such as DeRue et al.'s (2010) theory of team efficacy dispersion and Harrison and Klein's (2007) theoretical classification of types of diversity, multilevel researchers will need to consider agreement/dispersion as a theoretically meaningful issue. As such, guidelines addressing interpretations of the shapes of distributions are needed. Second, this decision rule applies only when the uniform distribution is the appropriate null distribution. As discussed previously, though the uniform distribution is widely applied, it is thought to be quite often inappropriately applied. There is a mounting push from the scholarly community to justify the choice of a particular null distribution, rather than using the uniform distribution unconditionally, yet too few guidelines exist for researchers who do opt to use alternative null distributions.

In what follows, we identify the null and theoretical distributions used in our article. Then we explain how we derived critical values for evaluating the practical significance of interrater agreement in relation to these null distributions. These critical values can be used to assess the level of interrater agreement in regard to data aggregation, which is a within-group assessment, as well as a host of other problems relating to interrater agreement. Further, based on these critical values, we calculated null ranges to be used in relation to studying theoretical problems; that is, assessing the fit between an observed pattern of dispersion and a theoretical distribution.

Interpretive Standards for the AD Index

Here we present our derivations and the resulting critical values and null ranges for the AD index given a number of different response distributions. First, the distributions are described in brief. Then, we explain our derivations of interpretive standards for the AD index. Finally, we present detailed discussions of problems relating to the theoretical and methodological bases of these distributions.

Distributions

The distributions and their methodological and theoretical bases are listed in Table 1. The proportions endorsing each value for 5-point and 7-point scales are listed for each distribution in Tables 2 and 3. Graphical depictions of these distributions are presented in Figures 2 through 5.

Table 1. Example Theoretical and Methodological Bases for Different Response Distributions.

Skew
  Theoretical basis: social interaction; work interdependence; shared schemas; maximum disparity diversity
  Methodological basis: social desirability; leniency

Bimodal
  Theoretical basis: equal subgroups; maximum separation diversity; factions

Subgroup
  Theoretical basis: minority belief, or unequal subgroups
  Methodological basis: response formats and unintended question interpretation

Triangular/Bell-Shaped
  Methodological basis: central tendency

Uniform
  Theoretical basis: fragmentation; maximum variety diversity
  Methodological basis: absence of bias; conceptual ambiguity


First, we developed critical values for three basic forms of skewed distributions: slight skew, moderate skew, and heavy skew (see Table 2 and Figure 2). Second, while one could model many forms of bimodal distributions, here we simplify our presentation by suggesting two: "moderate" bimodal and "extreme" bimodal. As shown in Table 3 and Figure 3, the size of the subgroups in both cases is 50% of the raters; the difference between the two distributions is that in the moderate bimodal distribution, the subgroups are less divergent than they are in the extreme bimodal distribution. Third, though there are numerous possible ways in which one could model subgroup distributions, we have simplified our presentation by considering four possibilities based on two dimensions: the size of the subgroup and the distance between the subgroup ratings and the majority of ratings. These distributions are shown in Table 3, and they are graphically depicted for a 5-point scale in Figure 4. We define a smaller subgroup as 10% of the raters (labeled as "A" in Table 3 and Figure 4) and a more moderately sized subgroup as 20% of the raters (labeled as "B" in Table 3 and Figure 4). We define extreme distance as the subgroup responses and the majority of responses being on opposite ends of the Likert-type scale, and moderate distance as the subgroup responses being at the midpoint of the scale while the majority of responses are at one extreme of the scale (these are labeled "extreme" and "moderate" in Table 3 and Figure 4). Finally, we present triangular-shaped, bell-shaped, and uniform distributions in Table 3; they are graphically represented in Figure 5. The triangular-shaped distributions are based on a formula presented by Messick (1982) and the bell-shaped distributions are based on LeBreton and Senter (2008). Note that the upper limits for the uniform distribution (presented in both Tables 2 and 3) are consistent with Burke and Dunlap's (2002) c/6 decision rule for assessing the practical significance of AD, where c is equal to the number of response categories.
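Each specified distribution implies an expected AD_M and variance, computable directly from its response proportions. The sketch below (the helper name `expected_ad_and_var` is our own) treats the scale points as 1 through c and reproduces the 5-point uniform entries AD_M = 1.20 and s² = 2.00 that appear in Tables 2 and 3.

```python
# Sketch: population AD_M and variance implied by a discrete response
# distribution, where proportions[i] is the proportion endorsing scale
# point i + 1.
def expected_ad_and_var(proportions):
    """Return (expected AD_M, variance) for a discrete response distribution."""
    points = range(1, len(proportions) + 1)
    mean = sum(p * x for p, x in zip(proportions, points))
    ad = sum(p * abs(x - mean) for p, x in zip(proportions, points))
    var = sum(p * (x - mean) ** 2 for p, x in zip(proportions, points))
    return ad, var

# 5-point uniform null distribution:
print(tuple(round(v, 2) for v in expected_ad_and_var([0.2] * 5)))  # → (1.2, 2.0)
```

The same function applied to, for example, the heavy-skew 5-point proportions (.00, .00, .10, .40, .50) returns the AD_M of 0.60 tabled for that distribution.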

Critical Values for Level of Agreement

In order to simplify our derivations, we begin with the basic case of agreement across judges on a single item with respect to two categories.⁴ In the case of a dichotomy, AD can be calculated based on the proportion of judges falling into one of the two categories (Burke & Dunlap, 2002).⁵ Based on an upper limit for AD of .35, where .35 or lower represents meaningful agreement, Burke and Dunlap (2002) demonstrated that meaningful agreement could be defined as 77% of the judges endorsing

Table 2. Critical Values and Null Ranges for AD_M Given Distributions Defined by Skew.

Columns: proportions endorsing each scale point; s² (variance); AD_M (expected value); s/AD_M; AD_M UL (critical value, upper limit); − and + (bounds of the null range).

5-Point Scale
Distribution      1    2    3    4    5              s²    AD_M  s/AD_M  AD_M UL   −     +
Slight Skew      .05  .15  .20  .35  .25            1.34  0.98   1.18    0.69   0.84  1.12
Moderate Skew    .00  .10  .15  .40  .35            0.90  0.70   1.36    0.49   0.60  0.80
Heavy Skew       .00  .00  .10  .40  .50            0.44  0.60   1.11    0.42   0.51  0.69
Uniform          .20  .20  .20  .20  .20            2.00  1.20   1.18    0.85   1.02  1.38

7-Point Scale
Distribution      1    2    3    4    5    6    7    s²    AD_M  s/AD_M  AD_M UL   −     +
Slight Skew      .05  .08  .12  .15  .20  .25  .15  2.92  1.44   1.19    1.02   1.23  1.65
Moderate Skew    .00  .06  .10  .14  .28  .22  .20  2.09  1.16   1.25    0.82   0.99  1.33
Heavy Skew       .00  .00  .05  .10  .15  .30  .40  1.39  0.94   1.25    0.66   0.80  1.08
Uniform (a)      .14  .14  .14  .14  .14  .14  .14  4.00  1.71   1.17    1.21   1.46  1.97

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
a. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
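One way the tabled entries might be applied in practice is sketched below. This is a hedged illustration, not a prescribed procedure from the article: the dictionary keys, the `interpret_ad` helper, and the branching order are our own, with the numbers taken from Table 2's 5-point uniform row.

```python
# Hedged sketch: an observed AD_M at or below the critical value (AD_M UL)
# indicates practically significant agreement; an observed AD_M inside the
# null range is consistent with the specified null pattern. Values below
# are Table 2's 5-point uniform entries.
UNIFORM_5PT = {"critical": 0.85, "null_low": 1.02, "null_high": 1.38}

def interpret_ad(observed, table_row):
    """Classify an observed AD_M value against tabled cutoffs."""
    if observed <= table_row["critical"]:
        return "agreement practically significant"
    if table_row["null_low"] <= observed <= table_row["null_high"]:
        return "consistent with the null (uniform) pattern"
    return "neither significant agreement nor a null-pattern match"

print(interpret_ad(0.60, UNIFORM_5PT))  # low dispersion: meaningful agreement
print(interpret_ad(1.10, UNIFORM_5PT))  # dispersion near the null expectation
```

For other null distributions, the corresponding row of Table 2 or Table 3 would supply the cutoffs.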


Table 3. Critical Values and Null Ranges for AD_M Given Distributions Defined by Kurtosis and Variance.

Columns: proportions endorsing each scale point; s² (variance); AD_M (expected value); s/AD_M; AD_M UL (critical value, upper limit); − and + (bounds of the null range).

5-Point Scale
Distribution               1    2    3    4    5              s²    AD_M  s/AD_M  AD_M UL   −     +
Moderate Bimodal (a)      .00  .50  .00  .50  .00            1.00  1.00   1.00    0.71   0.85  1.15
Extreme Bimodal (a)       .50  .00  .00  .00  .50            4.00  2.00   1.00    1.41   1.71  2.29
Moderate Subgroup A (a,b) .00  .00  .10  .00  .90            0.36  0.36   1.67    0.25   0.31  0.41
Extreme Subgroup A (a,b)  .10  .00  .00  .00  .90            1.44  0.72   1.67    0.51   0.61  0.83
Moderate Subgroup B (a,b) .00  .00  .20  .00  .80            0.64  0.64   1.25    0.45   0.55  0.73
Extreme Subgroup B (a,b)  .20  .00  .00  .00  .80            2.56  1.28   1.25    0.91   1.09  1.47
Triangular-Shaped         .11  .22  .34  .22  .11            1.32  0.88   1.31    0.62   0.75  1.01
Bell-Shaped               .07  .24  .38  .24  .07            1.04  0.76   1.34    0.54   0.65  0.87
Uniform                   .20  .20  .20  .20  .20            2.00  1.20   1.18    0.85   1.02  1.38

7-Point Scale
Distribution               1    2    3    4    5    6    7    s²    AD_M  s/AD_M  AD_M UL   −     +
Moderate Bimodal (a)      .00  .50  .00  .00  .00  .50  .00  4.00  2.00   1.00    1.41   1.71  2.29
Extreme Bimodal (a)       .50  .00  .00  .00  .00  .00  .50  9.00  3.00   1.00    2.12   2.56  3.44
Moderate Subgroup A (a,b) .00  .00  .00  .10  .00  .00  .90  0.81  0.54   1.67    0.38   0.46  0.62
Extreme Subgroup A (a,b)  .10  .00  .00  .00  .00  .00  .90  3.24  1.08   1.67    0.76   0.92  1.24
Moderate Subgroup B (a,b) .00  .00  .00  .20  .00  .00  .80  1.44  0.96   1.25    0.68   0.82  1.10
Extreme Subgroup B (a,b)  .20  .00  .00  .00  .00  .00  .80  5.76  1.92   1.25    1.36   1.64  2.20
Triangular-Shaped         .06  .13  .19  .24  .19  .13  .06  2.50  1.26   1.25    0.89   1.08  1.44
Bell-Shaped               .02  .08  .20  .40  .20  .08  .02  1.40  0.84   1.41    0.59   0.72  0.96
Uniform (c)               .14  .14  .14  .14  .14  .14  .14  4.00  1.71   1.17    1.21   1.46  1.97

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
a. "Moderate" and "extreme" refer to the distance between subgroups.
b. "A" and "B" refer to the differential proportion of subgroup responses.
c. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.


one category. Based on a more stringent upper limit of AD, .33, they indicated that meaningful agreement could be defined as 79% agreement.⁶ They noted that this notion of 77% to 79% agreement being meaningful is consistent with many practical examples and problems relating to proportional agreement, such as 60% to 80% agreement being required for including critical incidents when creating behaviorally anchored rating scales (BARS; Cascio, 1998). Based on Burke and Dunlap's calculations for the AD index, as well as conventional interpretations of meaningful agreement in percentage or proportional terms, we adopted a starting value of 80% agreement.

We note that assumptions are necessary to the derivation process. By making ours explicit, readers can readily revise the starting value as needed; yet, we suggest that a starting value of 80% agreement, or 20% disagreement, relative to the AD index is likely to suit most readers' situations. Notably, the value of 20% disagreement is comparable to the upper limits of acceptable disagreement for the AD index for scales that range from 3 to 99 response options (see Burke & Dunlap, 2002); that is, they are comparable in the sense of being approximately equal to the maximum level of allowable disagreement. In other words, our use of the dichotomous case here does not limit the applicability of our derivations to dichotomies.

Given our intent of proposing interrater agreement cut-offs and null ranges for response distributions for Likert-type scales with markedly different dispersion, which have different numbers of response options and reflect a variety of response patterns including non-normal distributions, we next convert a proportion of .80 (or 80%) to a standardized effect size (a correlation coefficient, r) to work further with variances as indicators of dispersion. Since non-normal response distributions are expected in many cases for theoretical reasons, we initially employ an arcsine transformation to convert the proportion of .80 to a standardized effect size (i.e., a d-statistic; see Lipsey & Wilson, 2001) and then use a maximum likelihood transformation of this d-value to obtain a correlation coefficient.

Figure 2. Slight, moderate, and heavy skew distributions for a 5-point scale. [Line chart: proportion endorsing each value by response option, for the slight, moderate, and heavy skew distributions.]


Figure 3. Bimodal distributions for a 5-point scale. [Line chart: proportion endorsing each value by response option, for the moderate and extreme bimodal distributions.]

Figure 4. Subgroup distributions for a 5-point scale. [Line chart: proportion endorsing each value by response option, for the moderate and extreme Subgroup A and Subgroup B distributions.]

Smith-Crowe et al. 137


The d-value is computed as the difference between the arcsine of the proportion representing meaningful agreement (i.e., .80) and the arcsine of the proportion representing no agreement (.00) using Lipsey and Wilson's (2001) formula:

d = \left(2 \arcsin \sqrt{p_1}\right) - \left(2 \arcsin \sqrt{p_2}\right). \quad (2)

The resulting d-value is 2.214. Next, we transform the value of 2.214 to a correlation coefficient via the maximum likelihood formula (Hunter & Schmidt, 2004):⁷

r = \frac{d/2}{\left[1 + (d/2)^2\right]^{1/2}}. \quad (3)

Given that the proportions in the two groups are unequal (i.e., .80 and .20), the number "2" in Equation 3 is replaced by 1/(pq)^{1/2}, where p and q are the proportions in each group. The result is a correlation of approximately .66. By rounding this value up to .7, our derivations continue at the starting point for Burke and Dunlap's (2002) derivations for practical cut-offs for the AD index for the restricted case of the uniform distribution.
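To make this two-step conversion concrete, the following is a minimal sketch (ours, not the authors' software; the function name is our own) of the arcsine transformation in Equation 2 and the maximum likelihood d-to-r conversion in Equation 3 with the unequal-proportions adjustment described above:

```python
import math

def proportion_to_r(p1, p2=0.0):
    """Convert two proportions to a d-statistic via the arcsine transformation
    (Equation 2), then convert d to r via the maximum likelihood formula
    (Equation 3), replacing the constant 2 with 1/sqrt(p*q) because the group
    proportions are unequal."""
    d = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
    p, q = p1, 1 - p1
    a = 1 / math.sqrt(p * q)  # replaces 2 in Equation 3 for unequal proportions
    r = (d / a) / math.sqrt(1 + (d / a) ** 2)
    return d, r

d, r = proportion_to_r(0.80)
print(round(d, 3), round(r, 2))  # 2.214 0.66
```

With p1 = .80 this reproduces the d-value of 2.214 and the correlation of approximately .66 reported above.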

We note that arriving at approximately the same value for a correlation as Burke and Dunlap (2002) does not indicate circularity in our derivations, but it does reflect our explicit assumption that the underlying response distribution may meaningfully deviate from normality, thus calling for the arcsine transformation of percentage agreement to produce a correlation. As we discuss in the appendix, assuming that the underlying distribution of responses is normal would call for a probit transformation of the proportion to produce a correlation. The resulting value for the correlation would become approximately .8 (Lipsey & Wilson, 2001). Furthermore, Burke and Dunlap's starting point of .7 for a correlation was, in large part, based on empirical data relating to stability coefficients and correlations based on ratings of targets by alternate sources. Their judgment and ours that a correlation of .7 is a reasonably high correlation is consistent with J. Cohen (1977), who indicated that correlations greater than or equal to .5 can be considered large.

Figure 5. Triangular-shaped, bell-shaped, and uniform distributions for a 5-point scale. [Line chart: proportion endorsing each value by response option.]

Next, as is recognized in a number of quantitative fields (e.g., see Burke & Dunlap, 2002; Greene, 1997; Guion, 1998; McCall, 1970; Parsons, 1978), we define a correlation (r) in terms of variances as

r = \sqrt{1 - s_e^2 / s_T^2}, \quad (4)

where s_e^2 is the error variance, here representing disagreement, and s_T^2 is the total variance.

Given that the average deviation is a reasonable approximation to the standard deviation (we discuss the more specific relationship in what follows), and the square of the standard deviation is the variance, we can let s_e^2 equal AD^2. Consistent with James et al.'s (1984, 1993) work, we then set s_T^2 to be equal to the variance of the chance responding in the population (s_{crpop}^2). Then, setting r equal to .7, as a reasonable value for the correlation in Equation 4, we can rewrite Equation 4 as

.7 = \sqrt{1 - AD^2 / s_{crpop}^2}. \quad (5)

Squaring both sides and solving for the ratio of variances, we obtain

AD^2 / s_{crpop}^2 = 1 - .7^2 = .51. \quad (6)

Rounding .51 to .50 as did Burke and Dunlap (2002) and rewriting Equation 6, we get

AD^2 = s_{crpop}^2 / 2. \quad (7)

We used Equation 7 to calculate AD^2 for the different response distributions we identified. That is, we calculated the variance of each response distribution and then divided the variance by 2 in order to calculate AD^2. By then taking the square root of this resulting value, we calculated AD:

\sqrt{AD^2} = AD. \quad (8)

Recall that in Equation 5, AD^2 was substituted as an approximation for the observed variance; that is to say that AD approximates the standard deviation (s). In fact, the standard and average deviations vary by a constant that is dependent on the specified response distribution. As Burke and Dunlap (2002) noted, for the uniform distribution the s:AD ratio is 1.2. Thus, in order to calculate the upper limits, or critical values, for the AD index, assuming a uniform distribution, they divided AD (the result of Equation 8) by 1.2. That is, they corrected for the difference between s and AD introduced in Equation 5. The same adjustment is needed here. The resulting value of Equation 8 must be divided by the s:AD ratio relevant to a given response distribution. Thus, upper limits for acceptable interrater agreement for AD_M must be calculated separately for each null or theoretical response distribution using the following equation:

AD_{M_{UL}} = \frac{AD}{s / AD_M}, \quad (9)

where AD is calculated according to Equations 7 and 8 and AD_M is calculated according to Equation 1.
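The computation in Equations 7 through 9 can be sketched in a few lines. The following is our illustrative implementation (not the authors' agree.exe program; the function name is our own), which takes a null or theoretical distribution expressed as proportions endorsing each scale value and returns the critical value:

```python
import math

def critical_value(props):
    """Critical value AD_M_UL (Equations 7-9) for a response distribution
    given as proportions endorsing scale values 1..k."""
    values = range(1, len(props) + 1)
    mean = sum(p * v for p, v in zip(props, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(props, values))  # population variance
    ad_m = sum(p * abs(v - mean) for p, v in zip(props, values))   # expected AD_M
    ad = math.sqrt(var / 2)           # Equations 7 and 8
    s_to_adm = math.sqrt(var) / ad_m  # s:AD_M ratio for this distribution
    return ad / s_to_adm              # Equation 9

# Slight skew, 5-point scale (Table 2)
print(round(critical_value([.05, .15, .20, .35, .25]), 2))  # 0.69
```

Note that, algebraically, AD/(s/AD_M) reduces to AD_M/sqrt(2), which is why the tabled critical values track the expected AD_M values for each distribution.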

The resulting critical values are listed in Tables 2 and 3 along with the pattern of responses for each of the distributions identified and the relevant statistics. For use as decision heuristics, we have rounded the critical values in Tables 2 and 3 to two decimal places; they can be applied to individual items or to multi-item scales. For the purpose of assessing the level of agreement, the critical values should be used in the conventional way (Burke & Dunlap, 2002): An observed AD_M value equal to or less than the relevant critical value (AD_{M_{UL}}) indicates practically significant agreement. For example, referring to Table 2, under the condition of slight skew for a 5-point scale, AD_{M_{UL}} = .69. Thus, for researchers to infer a practically significant level of observed agreement, one's observed AD_M value must be less than or equal to .69. Based on such an indication of practically significant agreement, researchers would be justified in using the mean score as an indicator of the group's standing on a construct of interest and as a data point for further, multilevel analysis.
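In practice, the comparison is a one-line check. The sketch below is ours: the observed AD for a single item is computed as the mean absolute deviation of judges' ratings from the item mean (per the AD index of Burke and Dunlap, 2002; Equation 1 is not reproduced in this section), and the five ratings are hypothetical:

```python
def observed_ad_m(ratings):
    """Observed AD for one item: mean absolute deviation of the judges'
    ratings from the item mean."""
    m = sum(ratings) / len(ratings)
    return sum(abs(x - m) for x in ratings) / len(ratings)

ratings = [4, 4, 5, 3, 4]          # five hypothetical judges, 5-point item
ad = observed_ad_m(ratings)        # 0.4
agreement = ad <= 0.69             # slight-skew null, 5-point scale (Table 2)
print(ad, agreement)
```

Here the observed value of 0.4 falls at or below the tabled critical value of .69, so agreement would be judged practically significant under the slight-skew null.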

Null Ranges for Pattern of Dispersion

In addition to developing critical values, in response to recent advances in multilevel theory, we also

developed null ranges to facilitate researchers’ ability to assess how well the shape of an observed

distribution fits a theoretically specified distribution. This issue of comparing the pattern of observed

dispersion with a theoretical distribution is analogous to judging the goodness of fit between one’s

data and the theoretical response distribution. Cortina and Folger (1998) described tests of goodness

of fit as a matter of accepting the null hypothesis of no statistically significant difference between

observed data and theoretical models. Here, we are dealing with practical significance rather than

statistical significance meaning that goodness of fit in this context is a matter of concluding that

there is no meaningful difference between an observed distribution and the theoretically specified

response distribution.

The values for AD_M shown in Tables 2 and 3 quantify the dispersion of different distributions. Therefore, if an observed value is equal to the relevant tabled AD_M value, then the pattern of observed dispersion should fit perfectly with the pattern of theoretical dispersion. It is unlikely, however, that observed and tabled values will perfectly match; thus, the question becomes what is the "null range"? In other words, how far can an observed value be from the tabled value before researchers must conclude that their observed distribution has a poor fit with the theoretical distribution?

Analogous to Greenwald's (1975) discussion of how to accept a null hypothesis gracefully (also see discussions by Cashen & Geiger, 2004; Cortina & Folger, 1998), researchers would need to decide in advance of collecting data what magnitude of effect, in this case, the magnitude of AD_M, would be considered nontrivial. We suggest defining this magnitude as the difference between the expected AD_M value for a distribution and the respective upper limit for that distribution. While the decision to specify this magnitude is arguably somewhat arbitrary, it is nevertheless made in advance of collecting data and tied to our derivations for assessments of practical agreement. Consistent with Greenwald's arguments about establishing a null range for the formal test of a null hypothesis, this minimum magnitude of AD_M that the researcher is willing to consider nontrivial is then the boundary of the null range. That is, for observed AD_M values, this magnitude would be the difference between the tabled AD_M value and the relevant tabled critical value, AD_{M_{UL}}. The general equation for establishing a null range for a theoretical response distribution is as follows:

AD_{M\,null\,range} = AD_M \pm \left(AD_M - AD_{M_{UL}}\right)/w, \quad (10)

where w is used to define the width of the null range. Herein, following Greenwald's (1975, pp. 16-18) logic regarding establishing a "two-tailed" null range that is symmetric around the zero point of a test statistic, we define w as equal to 2:

AD_{M\,null\,range} = AD_M \pm \left(AD_M - AD_{M_{UL}}\right)/2. \quad (11)


The resulting values are presented in Tables 2 and 3. For use as decision heuristics, we have rounded

the lower (symbolized by a minus sign) and upper (symbolized by a plus sign) limits of the null

range to two decimal points.
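Equation 11's symmetric range can be computed directly. The small helper below is our sketch; the example uses the slight-skew, 5-point distribution with the unrounded critical value (which, under Equations 7 through 9, equals AD_M/sqrt(2)):

```python
def null_range(ad_m, ad_m_ul, w=2):
    """Null range around the expected AD_M value (Equations 10 and 11):
    AD_M +/- (AD_M - AD_M_UL) / w; w = 2 gives the symmetric ranges tabled."""
    half_width = (ad_m - ad_m_ul) / w
    return ad_m - half_width, ad_m + half_width

# Slight skew, 5-point scale: AD_M = 0.98, unrounded AD_M_UL = 0.98 / sqrt(2)
lo, hi = null_range(0.98, 0.98 / 2 ** 0.5)
print(round(lo, 2), round(hi, 2))  # 0.84 1.12
```

This reproduces the tabled null range of .84 to 1.12 for that distribution.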

Although we present null ranges that are symmetrical around the expected AD_M value, researchers can readily define w and the width of the null range relative to the purposes of their investigations. In these cases, larger values for w will result in smaller, more conservative null ranges than those reported in Tables 2 and 3 for the respective response distributions. In addition, researchers may desire to consider the construction of nonsymmetrical, one-tailed null ranges for some types of theoretical response distributions. As with the use of critical values, the researcher may desire to consider a priori several theoretical response distributions when making judgments about whether the observed and theoretical response distributions are meaningfully different.

Using the null ranges to gauge the goodness of fit between an observed and a theoretical distribution is straightforward. Using the previous example of a slightly skewed null distribution and a 5-point scale, an observed AD_M of .98 would suggest a perfect match between the observed and theoretical distributions (see Table 2). Yet, how can a researcher interpret an observed AD_M of .70? The relevant range is .84 to 1.12.⁸ Thus, a researcher who observes an AD_M of .70 would conclude that the observed distribution is meaningfully different from the theoretical distribution. That is, the observed AD_M of .70 falls outside of the null range.

What researchers would do after determining a lack of fit would depend upon the theoretical context. In some cases it may be that a lack of fit suggests that the phenomena researchers are attempting to study are not represented in the data. This eventuality is analogous to researchers who do multilevel research finding a lack of agreement such that aggregation to a higher level of analysis cannot be justified (e.g., Chan, 1998). Or, it may be the case that shapes of observed distributions are compared to multiple theoretically specified distributions. While .70 does not fall into the null range for slight skew, it does fall into the range for moderate skew. In this case, the researcher would be able to categorize the group as a "moderate skew" group and make theoretically based predictions accordingly. More broadly, researchers can use the null ranges provided in Tables 2 and 3 in order to classify groups according to the pattern of their distributions of scores, and then based on this classification, make theoretically derived predictions about group outcomes. These null ranges and those relevant to the other distributions discussed in the following can be used relative to a single item or a scale.
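Such a classification step can be sketched as below (ours; the labels and ranges are taken from the 5-point-scale rows of Table 2). An observed AD_M is checked against each null range in turn; note that tabled ranges can overlap, in which case more than one label is returned:

```python
# Null ranges for a 5-point scale, from Table 2
NULL_RANGES = {
    "slight skew": (0.84, 1.12),
    "moderate skew": (0.60, 0.80),
    "heavy skew": (0.51, 0.69),
    "uniform": (1.02, 1.38),
}

def classify(ad_m):
    """Return all distribution labels whose null range contains ad_m."""
    return [name for name, (lo, hi) in NULL_RANGES.items()
            if lo <= ad_m <= hi]

print(classify(0.70))  # ['moderate skew']
```

An observed AD_M of .70 falls only in the moderate-skew range, matching the worked example above.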

It is important to note that researchers must visually check the observed distribution of responses. For instance, the direction of skew may be of theoretical relevance. Because the AD index is calculated via absolute values, the direction of skew cannot be determined from AD values. Quantifying agreement as well as visually checking the direction of skew is necessary. This point holds for other distributions discussed as well.

Distribution Choice

In the following, we discuss examples of when these different response distributions might be relevant (see also Table 1). Note that we do not assume that only one distribution is relevant in any given research context; rather, as others have suggested (e.g., James et al., 1984), we think it is reasonable that multiple distributions may be appropriate. We organize our discussion by first considering the issue of level of agreement and then considering the issue of pattern of dispersion. Within these sections, we refer to distributions defined by skew (Table 2) and those defined by kurtosis and variance (Table 3).


Level of Agreement

A number of response biases suggest that the appropriate null distribution is a skewed distribution. James et al. (1984) and LeBreton and Senter (2008) have discussed the likelihood of leniency and social desirability in contexts of assessing interrater agreement. Leniency may apply, for instance, in the performance appraisal domain where subordinates tend to judge their supervisors in relatively positive terms (Schriesheim, 1981). Klein, Conn, Smith, and Sorra (2001) found social desirability to be applicable in a survey of organizational members' workplace perceptions. Agreement among members was related to the social desirability of the survey items (e.g., "The supervisor to whom I report praises me for excellent performance" and "My work here is enjoyable"; Klein et al., 2001, p. 11). To the extent that these biases are expected to be strong versus weak, and to the extent that multiple biases are expected to be relevant, researchers could utilize moderately to heavily skewed distributions as their null distributions.

Though skewed distributions have most often been suggested as alternatives to the uniform null distribution, other distributions are relevant as well. Likert-type response formats that convey or have different informational value may result in subgroups or small to moderate percentages of respondents using particular response options. For instance, Schwarz, Knauper, Hippler, Noelle-Neumann, and Clark (1991) showed that participants responded differently to the question "How successful would you say you have been in life?" when the 11-point scale ranged from –5 to 5 rather than 0 to 10, even though the anchors were identical (not at all successful to extremely successful). For the former scale, 34% endorsed –5 to 0; for the latter scale, 13% endorsed 0 to 5. Schwarz (1999) argued that the question of success is somewhat ambiguous in that success could be marked by the presence of positive features or the absence of negative features and that participants use the scale numbers as well as the anchors to interpret the items. In addition, Lindell and Brandt (1997) discussed the possibility of distinct factions among raters due to characteristics such as clinical orientations in assessments of psychotherapy, raters' academic disciplines in rating research proposals, raters' functional department in ratings of organizational climate, and so on. The possibility of such factions might call for the use of a bimodal response distribution as the baseline distribution for assessing level of agreement among a set of raters. We present four different subgroup distributions and two different bimodal distributions (see Table 3). In addition, triangular-shaped or bell-shaped distributions⁹ are applicable null distributions if one expects raters to succumb to the central tendency bias (e.g., James et al., 1984; LeBreton & Senter, 2008). For instance, James et al. (1984, p. 91) suggested that the central tendency bias may occur "when judges are purposefully cautious or evasive because responses to items are not collected on a confidential basis, and political reasons exist for not departing from the neutral alternatives on the scales." They also suggested that naïve and unmotivated participants may exhibit the central tendency bias when responding to ambiguous or complicated items.

Finally, while the uniform distribution has been described as an often inappropriate null distribution, there are circumstances under which it is the appropriate null distribution. It is applicable if no rater bias is expected. It may also be applicable if raters face conceptual ambiguity. For instance, Heidemeier and Moser (2009) found that raters demonstrated less agreement in job performance ratings when the work being evaluated was less straightforward; that is, there was less agreement regarding white-collar work and work high in job complexity compared to agreement regarding blue-collar work and work low in job complexity.

Pattern of Dispersion

There are also theoretical bases for modeling agreement on most of these distributions. DeRue et al. (2010) provided an example theoretical basis for choosing a bimodal distribution as a theoretical response distribution: Equally sized subgroups within teams that judge team efficacy differently will have mixed effects on team effectiveness by impairing social processes, but enhancing task processes. They went on to propose that the greater the divergence between the subgroups, the more negative the effect on team effectiveness will be. In discussing maximum separation diversity, such as diversity in team members' judgments of team efficacy, Harrison and Klein (2007) discussed an extreme bimodal distribution, where subgroups exist on opposite ends of a continuum. Consistent with DeRue et al., they argued that this extreme bimodal distribution would have negative outcomes: reduced cohesiveness, interpersonal conflict, distrust, and decreased task performance.

Related to bimodal distributions are unimodal distributions that have distinct subgroups. From a theoretical perspective, DeRue et al. (2010) discussed "minority belief" dispersion where one team member rates team efficacy differently than the other team members. We previously reported their proposition that when minority belief dispersion is characterized by one individual rating team efficacy lower than everyone else, the effect on team effectiveness will be negative. DeRue et al. also theorized about the opposite distribution: One individual rates team efficacy more highly than the other team members. They proposed that this pattern of dispersion, which is the mirror-image of the first scenario, will have mixed effects on team effectiveness because the dispersion is likely to impair social processes, but enhance task processes.

Finally, DeRue et al. (2010) provided an example of when the uniform distribution would be theoretically specified as the expected response distribution: Fragmentation, characterized by a uniform distribution of team efficacy beliefs, should augment team effectiveness by positively impacting social and task processes. Their argument is based on the idea that fragmented teams may communicate more effectively than other teams because they do not have subgroups, coalitions, and factions that can hinder effective communication in teams and they are motivated to create a shared understanding of team efficacy. These teams are likely to openly discuss issues like goals and expectations that can help in teams' task-related processes, as well as helping to establish a shared belief about team efficacy. Harrison and Klein (2007) proposed similarly positive effects for variety diversity, such as diversity in educational background, when it is at a maximum level, which is characterized by a uniform distribution: more creativity, greater innovation, higher decision quality, more task conflict, and increased unit flexibility.

Conclusion

Given numerous calls for researchers who use interrater agreement indices to stop their unconditional use of the uniform response distribution, a primary purpose of our study was to provide researchers with guidelines for using alternative null distributions and theoretical distributions to make judgments about practical significance, when addressing both methodological and theoretical issues. In doing so, we derived critical values for a variety of response distributions that vary in terms of skew, kurtosis, and variance. We also discussed how to use the critical values shown in Tables 2 and 3 differently depending on whether one seeks to ascertain the level of agreement or the pattern of dispersion. While the question of the level of agreement is familiar, the question of the pattern of dispersion is more novel, but likely to become more and more important with advances in multilevel theory and research. The current paper stands to promote such conceptual advances.

Although we focused the substantive discussion of interrater agreement problems on data aggregation in relation to the level of agreement and team efficacy dispersion and diversity in relation to the pattern of dispersion, the derived critical values and null ranges can be applied to numerous other research questions. For instance, the alternative null distributions and critical values could assist in addressing interrater agreement questions related to job analysts' ratings of task items for a job, or judges' ratings of critical or cut-off scores on the items of a test (e.g., using the Angoff method whereby cut-off scores are based on subject matter experts' estimates of the probability that a competent person will respond to an item accurately; e.g., see Hudson & Campion, 1994), just to name a few types of pertinent research questions. As another example, the notion of a theoretical distribution could be used to specify the demographic makeup of a community (e.g., racial/ethnic composition in terms of percentages within each category), thereby permitting the quantification of demographic similarity/dissimilarity between employees and residents (i.e., the difference between an observed AD_M value and the relevant tabled value for a theoretical distribution). Quantifying the effects of community demographic similarity in this manner may meaningfully extend the measurement and study of employee-community racial/ethnic similarity from the individual level of analysis (e.g., see Avery, McKay, & Wilson, 2008; Brief et al., 2005) to the organization or business unit levels of analysis. Importantly, irrespective of the group phenomena under study, the AD index itself and the derived null ranges provide a means for tracking or studying expected changes in group phenomena, possibly relative to stages of group development or shocks that the group might encounter. Further, practical significance critical values could be similarly developed for other interrater agreement indices.

Future research should also address the problem of assessing the statistical significance of AD values relative to a variety of null or theoretical distributions. As discussed earlier, the work to date in this area is limited. Burke and Dunlap (2002) and Dunlap et al. (2003) used an approximate randomization test to establish statistical significance cut-offs for AD for judges' ratings of a single item relative to the uniform distribution. Cohen et al. (2009) built upon this work to establish statistical significance cut-offs for AD for judges' agreement on multi-item scales relative to the uniform distribution and a slightly skewed distribution. In order to assist researchers in inferring whether levels of agreement (i.e., AD values) are due to chance, cut-off values for statistical significance should be established relative to more distributions, such as those identified in Tables 2 and 3. Without additional guidelines, researchers are likely to continue to over-rely on the uniform distribution when making inferences about their data.

In closing, we emphasize that the practical guidelines presented herein are just that: guidelines. As others have advised, it is important that researchers take a common sense approach to interpreting observed agreement. Speaking in terms of whether interrater agreement is sufficient to justify the aggregation of individual-level data, LeBreton and Senter (2008, p. 836) asserted that "the value used to justify aggregation ultimately should be based on a researcher's consideration of (a) the quality of the measures, (b) the seriousness of the consequences resulting from the use of aggregate scores, and (c) the particular composition model to be tested." James et al. (1984), in addressing the problem of uncertainty over which null distribution applies in a given situation, suggested interpreting observed agreement on the basis of several null distributions: "The rationale here is that even though we cannot pinpoint a particular null with a high degree of confidence, we can place bounds on the most likely types of nulls and thereby increase the likelihood that the true null lies somewhere in this range of distributions" (p. 95).

Similarly, we urge researchers to consider their particular circumstances when assessing interrater agreement and to consider the use of a range of critical values based on several different null or theoretical distributions. For instance, researchers should consider whether they have missing data (e.g., see Newman & Sin, 2009). Our guidelines do not account for systematically missing data and thus may be sensitive to this problem, particularly in the cases of certain distributions, such as the bimodal distribution, which may appear as a unimodal distribution if data are systematically missing from one of the two subgroups. In other cases, researchers may need to apply a null or theoretical distribution not included in Tables 2 and 3, or they may need to adjust the starting value for interrater agreement of 80% agreement used in the present derivations. Moving away from 80% agreement or considering another transformation of percentage agreement for derivational purposes, such as a probit transformation of a proportion, will result in more stringent or more lenient critical values and decisions concerning interrater agreement depending on whether one adjusts this value upward or downward, or whether one employs a more versus less conservative transformation of the proportion, such as arcsine versus probit transformations. Recognizing the possibility that the research context may dictate the consideration of other assumptions or response distributions than those used in the study, we present in the appendix a general procedure for researchers to use in establishing critical values and null ranges based on other assumptions not considered here. In this regard, our proposed guidelines offer a uniform and parsimonious means for studying interrater agreement given a variety of methodological and theoretical problems.

Appendix

Calculating Critical Values and Null Ranges for the AD Index

Although we presented a number of different response distributions in Tables 2 and 3, researchers may find that their methodologically or theoretically specified response distribution is not listed. In this case, researchers can follow our procedures to calculate the relevant critical values and null ranges. The first step is to express the response distribution in terms of the proportion of individuals endorsing each value of a scale, as we did in Tables 2 and 3. The second step is to calculate the variance for the specified distribution. Third, as per Equation 7, divide the variance by 2 to calculate AD^2; then, take the square root of the resulting value (i.e., AD; see Equation 8). Finally, as per Equation 9, divide AD by the s:AD_M ratio to calculate AD_M^UL; follow Equation 10 to calculate the null range (see also Equation 11). The agree.exe program available at http://www.tulane.edu/~dunlap/psylib.html can be used to calculate AD_M for items or multi-item scales. The calculations conducted by the software are based on Burke and Dunlap (2002) and Dunlap, Burke, and Smith-Crowe (2003). Note that the "actual variance" reported by the software is calculated for a sample, whereas our calculations are based on the variance calculated for a population.
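As a rough illustration of these steps, the following Python sketch (our own illustration, not the agree.exe program) computes the expected AD_M, the critical value AD_M^UL, and the null range for a distribution expressed as proportions. The divisor argument stands in for Equation 7's constant (2 for the r = .7 derivations; 2.78 for r = .8), and the distribution shown is hypothetical.

```python
import math

def ad_critical_values(proportions, divisor=2.0, w=2.0):
    # proportions[k] is the proportion endorsing scale point k+1 (step 1)
    values = range(1, len(proportions) + 1)
    mean = sum(p * v for p, v in zip(proportions, values))
    # population variance of the specified distribution (step 2)
    var = sum(p * (v - mean) ** 2 for p, v in zip(proportions, values))
    # expected AD_M for this distribution (mean absolute deviation)
    ad_m = sum(p * abs(v - mean) for p, v in zip(proportions, values))
    ad = math.sqrt(var / divisor)          # Equations 7-8
    s_to_ad = math.sqrt(var) / ad_m        # the s : AD_M ratio
    ad_m_ul = ad / s_to_ad                 # Equation 9
    half_width = (ad_m - ad_m_ul) / w      # Equations 10-11, w = 2
    return ad_m, ad_m_ul, (ad_m - half_width, ad_m + half_width)

# Uniform 5-point distribution, Equation 7's divisor of 2:
print(ad_critical_values([.20, .20, .20, .20, .20]))
```

With divisor = 2.78, the function reproduces the r = .8 rows of Tables A1 and A2.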

In addition, researchers may find that starting with a percentage agreement of 80% and using a correlation of .7 (within Equation 5) is either too lenient or too stringent given their particular circumstances. In such cases, researchers can derive their own critical values and null ranges based on different initial assumptions. Beginning with either a different percentage agreement or continuing the derivations with another value for the correlation will produce different cut-offs and null ranges. One can substitute another reasonable value in Equation 5 and follow the sequence through Equation 10 to arrive at new critical values and null ranges (see also Equation 11). For example, replacing .7 with .8 in Equations 5 and 6 results in 1 - .8^2 = .36. Thus, Equation 7 would be rewritten such that the variance would be divided by 2.78 (i.e., 1/.36). To calculate their own critical values, then, researchers using r = .8 would calculate the variance of a given distribution and divide that variance by 2.78 (as per the revised version of Equation 7). Then, they would follow Equations 8-9 to calculate AD_M^UL and Equation 10 to calculate the null range (see also Equation 11).
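As an illustrative check (our own sketch, not the authors' code), the revised divisor of 2.78 reproduces the uniform 5-point row of Table A1 (s^2 = 2.00, AD_M = 1.20, AD_M^UL = 0.72, null range 0.96 to 1.44):

```python
import math

props = [.20, .20, .20, .20, .20]                # proportion endorsing 1..5
mean = sum(p * v for v, p in enumerate(props, 1))
var = sum(p * (v - mean) ** 2 for v, p in enumerate(props, 1))   # 2.00
ad_m = sum(p * abs(v - mean) for v, p in enumerate(props, 1))    # 1.20
ad = math.sqrt(var / 2.78)                       # revised Equation 7 (r = .8), then Equation 8
ad_m_ul = ad / (math.sqrt(var) / ad_m)           # Equation 9
lo = ad_m - (ad_m - ad_m_ul) / 2                 # Equation 10 with w = 2
hi = ad_m + (ad_m - ad_m_ul) / 2
print(round(ad_m_ul, 2), round(lo, 2), round(hi, 2))  # 0.72 0.96 1.44
```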

There are several reasons for which researchers may want to use cut-offs associated with a correlation of .8. For instance, from our starting point of defining meaningful agreement as 80% agreement, one could use a probit transformation to convert this proportion to an effect size. A probit transformation of the proportion may be called for if the underlying distribution of scores is expected to be normally distributed. The probit transformation may also be a particularly good choice for estimating a standardized effect from proportions when the cut-point between the two groups is in the tail portion of a skewed distribution (Lipsey & Wilson, 2001). A probit transformation of a proportion of .8 will produce a correlation equal to .78501 (Lipsey & Wilson, 2001), which rounds to .8. Also, as discussed in Burke and Dunlap (2002), a correlation of .8 would correspond to a high level of stability in scores. Substituting .8 for .7 in Equation 6 and then following the remainder of the equations, one would arrive at more stringent critical values than those presented in Tables 2 and 3. We present this more stringent set of criteria in Tables A1 and A2.

Finally, some researchers may want to use AD calculated from the median (AD_Md) rather than the mean (AD_M). These different versions of the AD index are equal when the mean and median of a distribution are equal, and otherwise they tend to be highly correlated (Burke, Finkelstein, & Dusig, 1999). Though AD_M has been used more often by researchers, Burke et al. (1999) argued that AD_Md is more sensitive in detecting agreement because the median of a distribution is the point at which the sum of the absolute deviations is minimized, and smaller deviations indicate higher agreement. AD_Md for an item is calculated as follows:

AD_Md(j) = [ Σ_{k=1}^{N} |x_jk − Md_j| ] / N, (12)

where Md_j is equal to the median rating of item j and all other notations are consistent with those in Equation 1. The scale AD_Md(J) is the mean of AD_Md(j) for essentially parallel items. The upper limit for AD_Md would be calculated as follows:

AD_Md^UL = AD / (s/AD_Md), (13)

where AD is calculated according to Equations 5 through 8. Finally, the null range for AD_Md would be calculated as follows:

AD_Md null range = AD_Md ± (AD_Md − AD_Md^UL)/w, (14)

Table A1. Critical Values and Null Ranges for AD_M Given Distributions Defined by Skew and r = .8.

                      Proportion Endorsing Each Value                 Critical Values        Null Ranges
Distribution         1    2    3    4    5    6    7    s^2   AD_M  s:AD_M  AD_M^UL        -      +

5-Point Scale
Slight Skew         .05  .15  .20  .35  .25             1.34  0.98   1.18    0.59         0.78   1.18
Moderate Skew       .00  .10  .15  .40  .35             0.90  0.70   1.36    0.42         0.56   0.84
Heavy Skew          .00  .00  .10  .40  .50             0.44  0.60   1.11    0.36         0.48   0.72
Uniform             .20  .20  .20  .20  .20             2.00  1.20   1.18    0.72         0.96   1.44

7-Point Scale
Slight Skew         .05  .08  .12  .15  .20  .25  .15   2.92  1.44   1.19    0.86         1.15   1.72
Moderate Skew       .00  .06  .10  .14  .28  .22  .20   2.09  1.16   1.25    0.69         0.92   1.39
Heavy Skew          .00  .00  .05  .10  .15  .30  .40   1.39  0.94   1.25    0.56         0.75   1.13
Uniform^a           .14  .14  .14  .14  .14  .14  .14   4.00  1.71   1.17    1.03         1.37   2.06

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
^a The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.


Table A2. Critical Values and Null Ranges for AD_M Given Distributions Defined by Kurtosis and Variance and r = .8.

                            Proportion Endorsing Each Value                 Critical Values        Null Ranges
Distribution               1    2    3    4    5    6    7    s^2   AD_M  s:AD_M  AD_M^UL        -      +

5-Point Scale
Moderate Bimodal^a        .00  .50  .00  .50  .00             1.00  1.00   1.00    0.60         0.80   1.20
Extreme Bimodal^a         .50  .00  .00  .00  .50             4.00  2.00   1.00    1.20         1.60   2.40
Moderate Subgroup A^a,b   .00  .00  .10  .00  .90             0.36  0.36   1.67    0.22         0.29   0.43
Extreme Subgroup A^a,b    .10  .00  .00  .00  .90             1.44  0.72   1.67    0.43         0.58   0.86
Moderate Subgroup B^a,b   .00  .00  .20  .00  .80             0.64  0.64   1.25    0.38         0.51   0.77
Extreme Subgroup B^a,b    .20  .00  .00  .00  .80             2.56  1.28   1.25    0.77         1.02   1.54
Triangular-Shaped         .11  .22  .34  .22  .11             1.32  0.88   1.31    0.53         0.70   1.06
Bell-Shaped               .07  .24  .38  .24  .07             1.04  0.76   1.34    0.46         0.61   0.91
Uniform                   .20  .20  .20  .20  .20             2.00  1.20   1.18    0.72         0.96   1.44

7-Point Scale
Moderate Bimodal^a        .00  .50  .00  .00  .00  .50  .00   4.00  2.00   1.00    1.20         1.60   2.40
Extreme Bimodal^a         .50  .00  .00  .00  .00  .00  .50   9.00  3.00   1.00    1.80         2.40   3.60
Moderate Subgroup A^a,b   .00  .00  .00  .10  .00  .00  .90   0.81  0.54   1.67    0.32         0.43   0.65
Extreme Subgroup A^a,b    .10  .00  .00  .00  .00  .00  .90   3.24  1.08   1.67    0.65         0.86   1.30
Moderate Subgroup B^a,b   .00  .00  .00  .20  .00  .00  .80   1.44  0.96   1.25    0.58         0.77   1.15
Extreme Subgroup B^a,b    .20  .00  .00  .00  .00  .00  .80   5.76  1.92   1.25    1.15         1.54   2.30
Triangular-Shaped         .06  .13  .19  .24  .19  .13  .06   2.50  1.26   1.25    0.76         1.01   1.51
Bell-Shaped               .02  .08  .20  .40  .20  .08  .02   1.40  0.84   1.41    0.50         0.67   1.01
Uniform^c                 .14  .14  .14  .14  .14  .14  .14   4.00  1.71   1.17    1.03         1.37   2.06

Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was AD_M, which was restricted to two decimal places when inputted into the calculations.
^a "Moderate" and "extreme" refer to the distance between subgroups.
^b "A" and "B" refer to the differential proportion of subgroup responses.
^c The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.


where w is used to define the width of the null range. For the same reasons given previously in the discussion of Equation 11, we suggest defining w as equal to 2.
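A minimal Python sketch of Equation 12 and the scale-level AD_Md(J) follows (our own illustration; the ratings shown are hypothetical):

```python
import statistics

def ad_md_item(ratings):
    # Equation 12: mean absolute deviation of ratings from the item median Md_j
    md = statistics.median(ratings)
    return sum(abs(x - md) for x in ratings) / len(ratings)

def ad_md_scale(item_ratings):
    # AD_Md(J): the mean of AD_Md(j) across essentially parallel items
    return sum(ad_md_item(r) for r in item_ratings) / len(item_ratings)

judges_item1 = [4, 4, 5, 3, 4]   # hypothetical ratings of item 1 by five judges
judges_item2 = [5, 4, 4, 4, 3]
print(ad_md_item(judges_item1))                      # median = 4 -> (0+0+1+1+0)/5 = 0.4
print(ad_md_scale([judges_item1, judges_item2]))
```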

Acknowledgments

We would like to thank Greg Oldham and Isaac Smith for helpful comments on previous drafts of the article and

Julie Seidel and Teng Zhang for research assistance.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publi-

cation of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. Interrater agreement is distinct from interrater reliability (e.g., Wagner, Rau, & Lindermann, 2010). While

the former is concerned with the extent to which ratings are the same across raters, the latter is concerned

with consistency in the rank order of ratings across raters. Kozlowski and Hattrup (1992) helpfully distin-

guished ‘‘consensus’’ (agreement) from ‘‘consistency’’ (reliability).

2. While practical significance concerns whether agreement is meaningful, statistical significance concerns

whether it is due to chance (Dunlap et al., 2003; Smith-Crowe & Burke, 2003).

3. The average deviation (AD) can also be calculated from the median (AD_Md) rather than the mean (AD_M). These different versions of the AD index are equal when the mean and median of a distribution are equal, and otherwise they tend to be highly correlated (Burke, Finkelstein, & Dusig, 1999). Because AD_M is used more often by researchers than AD_Md, we focus our article on AD_M. The appendix, however, provides the information researchers would need to calculate critical values for AD_Md as needed.

4. We base our work in part on equations presented by Burke and Dunlap (2002).

5. Burke and Dunlap (2002) demonstrated in their Equation 12 (p. 165) how to calculate AD from a proportion in the case of a dichotomy: AD(2 categories) = 2p(1 − p).

6. For the assessment of interrater agreement where judges rate a single target with respect to only two categories (e.g., on a yes-no or agree-disagree dichotomous item format), Burke and Dunlap (2002, p. 164) obtained an upper limit value for AD of .35 using their Equation 9:

AD_UL = sqrt[(c^2 − 1)/24],

where c is equal to the number of categories. When c equals 2, AD_UL equals .35. Burke and Dunlap (p. 164) also presented an approximation or simplification of Equation 9, which was their Equation 10:

sqrt(c^2/25) = c/5.

Using this approximation or simplification to Equation 9 of their article (c/5) and dividing this quantity by 1.2, which is the constant by which AD and the standard deviation of responses on an item differ relative to the uniform distribution, yields the value of .33 as the upper limit of AD for a dichotomous item.
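The arithmetic in this note can be checked directly; the following sketch (our own illustration) evaluates Equation 9 of Burke and Dunlap (2002) at c = 2 and the c/5 simplification divided by 1.2:

```python
import math

def ad_upper_limit(c):
    # Burke & Dunlap (2002) Equation 9: AD_UL = sqrt((c^2 - 1)/24)
    return math.sqrt((c ** 2 - 1) / 24)

print(round(ad_upper_limit(2), 2))   # 0.35 for a dichotomous item
print(round((2 / 5) / 1.2, 2))       # 0.33 via the c/5 simplification
```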

7. When d is within the range of −.40 to .40, a close approximation of r is d divided by 2. Yet, when d falls outside of this range, the relationship between d and r becomes nonlinear. For the latter cases, an accurate conversion of d to r is obtained by the maximum likelihood estimate (see Hunter & Schmidt, 2004, pp. 277-279). Because our d of 2.214 is greater than .40, we used Equation 3 to obtain an accurate estimate of r.

8. Note that if readers were to plug the tabled values into Equation 9, they would arrive at a slightly different

range due to rounding error.


9. Here we are dealing with multinomial distributions, and as a result, we refer to them as triangular-shaped and

bell-shaped as they are, respectively, similar in a figurative sense to the normal probability distribution.

References

Avery, D. R., McKay, P. F., & Wilson, D. C. (2008). What are the odds? How demographic similarity affects

the prevalence of perceived employment discrimination. Journal of Applied Psychology,93, 235-249.

Bledow, R., & Frese, M. (2009). A situational judgment test of personal initiative and its relationship to

performance. Personnel Psychology,62, 229-258.

Borucki, C. C., & Burke, M. J. (1999). An examination of service-related antecedents to retail store perfor-

mance. Journal of Organizational Behavior,20, 943-962.

Brief, A. P., Umphress, E. E., Dietz, J., Burrows, J. W., Butz, R. M., & Scholten, L. (2005). Community matters:

Realistic group conflict theory and the impact of diversity. Academy of Management Journal,48, 830-844.

Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater agreement reconsidered: An alternative to the rwg

indices. Organizational Research Methods,8, 165-184.

Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A

user’s guide. Organizational Research Methods,5, 159-172.

Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater

agreement. Organizational Research Methods,2, 49-68.

Cascio, W. F. (1998). Applied psychology in human resource management. Upper Saddle River, NJ: Prentice

Hall.

Cashen, L. H., & Geiger, S. W. (2004). Statistical power and the testing of null hypotheses: A review of con-

temporary management research and recommendations for future studies. Organizational Research

Methods,7(2), 151-167.

Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of

analysis: A typology of composition models. Journal of Applied Psychology,83, 234-246.

Cohen, A., Doveh, E., & Nahum-Shani, I. (2009). Testing agreement for multi-item scales with the indices r_WG(J) and AD_M(J). Organizational Research Methods, 12, 148-164.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.

Cortina, J. M., & Folger, R. G. (1998). When is it acceptable to accept a null hypothesis: No way, Jose?

Organizational Research Methods,1, 334-350.

Dawson, J. F., Gonzalez-Roma, V., Davis, A., & West, M. A. (2008). Organizational climate and climate

strength in UK hospitals. European Journal of Work and Organizational Psychology,17, 89-111.

DeRue, D. S., Hollenbeck, J. R., Ilgen, D. R., & Feltz, D. (2010). Efficacy dispersion in teams: Moving beyond

agreement and aggregation. Personnel Psychology,63, 1-40.

Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance for r_WG and average deviation indexes. Journal of Applied Psychology, 88, 356-362.

Edwards, J. R., & Berry, J. W. (2010). The presence of something or the absence of nothing: Increasing theo-

retical precision in management research. Organizational Research Methods,13, 668-689.

Grant, A. M., & Mayer, D. M. (2009). Good soldiers and good actors: Prosocial and impression management

motives as interactive predictors of affiliate citizenship behaviors. Journal of Applied Psychology,94,

900-912.

Greene, W. H. (1997). Econometric analysis. Upper Saddle River, NJ: Prentice Hall.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin,82,

1-20.

Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ:

Lawrence Erlbaum.

Harrison, D. A., & Klein, K. J. (2007). What’s the difference? Diversity constructs as separation, variety, or

disparity in organizations. Academy of Management Review,32, 1199-1228.


Heidemeier, H., & Moser, K. (2009). Self-other agreement in job performance ratings: A meta-analytic test of a

process model. Journal of Applied Psychology,94, 353-370.

Hudson, J. P., & Campion, J. E. (1994). Hindsight bias in an application of the Angoff method for setting cutoff

scores. Journal of Applied Psychology,79, 860-865.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research find-

ings. Thousand Oaks, CA: Sage.

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and with-

out response bias. Journal of Applied Psychology,69, 85-98.

James, L. R., Demaree, R. G., & Wolf, G. (1993). r_wg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306-309.

Katz-Navon, T., Naveh, E., & Stern, Z. (2009). Active learning: When is more better? The case of resident

physicians’ medical errors. Journal of Applied Psychology,94, 1200-1209.

Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of

within-group agreement in employee perceptions of the work environment. Journal of Applied

Psychology,86, 3-16.

Klein, K. J., Dansereau, F., & Hall, R. J. (1994). Levels issues in theory development, data collection, and

analysis. Academy of Management Review,19, 195-229.

Kline, T. J. B., & Hambley, L. A. (2007). Four multi-item interrater agreement options: Comparisons and

outcomes. Psychological Reports,101, 1001-1010.

Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues

of consistency versus consensus. Journal of Applied Psychology,77, 161-167.

Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations:

Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory,

research, and methods in organizations (pp. 3-90). San Francisco, CA: Jossey-Bass.

Kreiner, G. E., Hollensbe, E. C., & Sheep, M. L. (2009). Balancing borders and bridges: Negotiating the work-

home interface via boundary work tactics. Academy of Management Journal,52, 704-730.

Lawrence, K. L., Lenk, P., & Quinn, R. E. (2009). Behavioral complexity in leadership: The psycho-

metric properties of a new instrument to measure behavioral repertoire. Leadership Quarterly,20,

87-102.

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agree-

ment. Organizational Research Methods,11, 815-852.

Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied

Psychological Measurement,21, 271-278.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Lüdtke, O., & Robitzsch, A. (2009). Assessing within-group agreement: A critical examination of a random-group resampling approach. Organizational Research Methods, 12, 461-487.

McCall, R. B. (1970). Fundamental statistics for psychology. New York, NY: Harcourt, Brace, & World, Inc.

Messick, D. M. (1982). Some cheap tricks for making inferences about distribution shapes from variances.

Educational and Psychological Measurement,42, 749-758.

Meyer, R. D., Mumford, T. V., & Campion, M. A. (2010, August). The practical consequences of null distri-

bution choice on rwg. Paper presented at the annual meeting of the Academy of Management, Montreal,

Canada.

Newman, D. A., & Sin, H.-P. (2009). How do missing data bias estimates of within-group agreement? Sensitivity of SD_WG, CV_WG, r_WG(J), r*_WG(J), and ICC to systematic nonresponses. Organizational Research Methods, 12, 113-147.

Nicklin, J. M., & Roch, S. G. (2009). Letters of recommendation: Controversy and consensus from expert

perspectives. International Journal of Selection and Assessment,17, 76-91.

Parsons, R. (1978). Statistical analysis. New York, NY: Harper & Row.


Roberson, Q. M., Sturman, M. C., & Simons, T. L. (2007). Does the measure of dispersion matter in multilevel

research? A comparison of the relative performance of dispersion indexes. Organizational Research

Methods,10, 564-588.

Rousseau, D. M. (1985). Issues of level in organizational research: Multi-level and cross-level perspectives.

Research in Organizational Behavior,7, 1-37.

Schriesheim, C. A. (1981). The effect of grouping or randomizing items on leniency response bias. Educational

and Psychological Measurement,41, 401-411.

Schwarz, N. (1999). Self-reports: How questions shape the answers. American Psychologist,54, 93-105.

Schwarz, N., Knauper, B., Hippler, H. J., Noelle-Neumann, E., & Clark, F. (1991). Rating scales: Numeric

values may change the meaning of scale labels. Public Opinion Quarterly,55, 570-582.

Smith-Crowe, K., & Burke, M. J. (2003). Interpreting the statistical significance of observed AD interrater agreement values: Corrections to Burke and Dunlap (2002). Organizational Research Methods, 6, 129-131.

Takeuchi, R., Chen, G., & Lepak, D. P. (2009). Through the looking glass of a social system: Cross-level effects

of high performance work systems on employees’ attitudes. Personnel Psychology,62, 1-29.

Trougakos, J. P., Beal, D. J., Green, S. G., & Weiss, H. M. (2008). Making the break count: An episodic exam-

ination of recovery activities, emotional experiences, and positive affective displays. Academy of

Management Journal,51, 131-146.

Van Kleef, G. A., Homan, A. C., Beersma, B., Van Knippenberg, D., Knippenberg, B. V., & Damen, F. (2009).

Searing sentiment or cold calculation? The effects of leader emotional displays on team performance depend

on follower epistemic motivation. Academy of Management Journal,52, 562-580.

Wagner, S. M., Rau, C., & Lindermann, E. (2010). Multiple informant methodology: A critical review and

recommendations. Sociological Methods and Research,38, 582-618.

Walumbwa, F. O., & Schaubroeck, J. (2009). Leader personality traits and employee voice behavior: Mediating

roles of ethical leadership and work group psychological safety. Journal of Applied Psychology,94,

1275-1286.

Author Biographies

Kristin Smith-Crowe is an associate professor of organizational behavior in the David Eccles School of Busi-

ness, University of Utah. Her research focuses on behavioral ethics, interrater agreement, and worker safety.

Michael J. Burke is the Lawrence Martin Chair in Business, Freeman School of Business, Tulane University.

His current research focuses on learning and the efficacy of workplace safety and health interventions as well as

the meaning of employee perceptions of work environment characteristics (psychological and organizational

climate).

Maryam Kouchaki received her PhD in organizational behavior from the David Eccles School of Business,

University of Utah, and is currently a postdoctoral fellow at the Edmond J. Safra Center for Ethics, Harvard

University. Her research focuses on the moral dimension of social life, in particular, ethical behavior in the

workplace.

Sloane M. Signal is a doctoral student in the College of Education and Human Development at Jackson State

University. Her research interests include communicating across cultures both inside and outside of the United

States, diversity and multiculturalism in the workplace, and the scholarship of teaching and learning.
