
Computing inter-rater reliability and its variance in the presence of high agreement

Kilem Li Gwet*

STATAXIS Consulting, Gaithersburg, USA

Pi ($\pi$) and kappa ($\kappa$) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the $AC_1$ coefficient. Also proposed are new variance estimators for the multiple-rater generalized $\pi$ and $AC_1$ statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte-Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of $AC_1$ as an improved alternative to existing inter-rater reliability statistics.

1. Introduction

Researchers in various fields often need to evaluate the quality of a data collection method. In many studies, a data collection tool, such as a survey questionnaire, a laboratory procedure or a classification system, is used by different people referred to as raters, observers or judges. In an effort to minimize the effect of the rater factor on data quality, investigators like to know whether all raters apply the data collection method in a consistent manner. Inter-rater reliability quantifies the closeness of scores assigned by a pool of raters to the same study participants. The closer the scores, the higher the reliability of the data collection method. Although reliability data can be discrete or continuous, the focus of this paper is on inter-rater reliability assessment on nominally scaled data. Such data are typically obtained from studies where raters must classify study participants into one category among a limited number of possible categories.

Banerjee, Capozzoli, McSweeney, and Sinha (1999) provide a good review of the techniques developed to date for analysing nominally scaled data. Two of the most influential papers in this area are those of Fleiss (1971) and Fleiss, Cohen, and Everitt

* Correspondence should be addressed to Dr Kilem Li Gwet, Statistical Consultants, STATAXIS Consulting, 15914B Shady Grove Road, No. 145, Gaithersburg, MD 20877-1322, USA (e-mail: gwet@stataxis.com).

The British Psychological Society
British Journal of Mathematical and Statistical Psychology (2008), 00, 1–21
© 2008 The British Psychological Society
www.bpsjournals.co.uk
DOI: 10.1348/000711006X126600
BJMSP 139—29/1/2008—ROBINSON—217240

(1969), which contain the most popular results in use today. Fleiss et al. provide large-sample approximations of the variances of the $\kappa$ and weighted $\kappa$ statistics suggested by Cohen (1960, 1968), respectively, in the case of two raters, while Fleiss extends the $\pi$-statistic to the case of multiple raters. Landis and Koch (1977) also give an instructive discussion of inter-rater agreement among multiple observers. Agresti (2002) presents several modelling techniques for analysing rating data in addition to presenting a short account of the state of the art. Light (1971) introduces measures of agreement conditionally upon a specific classification category, and proposes a generalization of Cohen's $\kappa$-coefficient to the case of multiple raters. Conger (1980) suggests an alternative multiple-rater agreement statistic obtained by averaging all pairwise overall and chance-corrected probabilities proposed by Cohen (1960). Conger (1980) also extends the notion of pairwise agreement to that of g-wise agreement, where agreement occurs if $g$ raters rather than two raters classify an object into the same category.

In section 2, I introduce the most commonly used pairwise indexes. Section 3 discusses a theoretical framework for analysing the origins of the kappa paradoxes. An alternative and more stable agreement coefficient referred to as the $AC_1$ statistic is introduced in section 4. Section 5 is devoted to the analysis of the bias associated with the various pairwise agreement coefficients under investigation. In section 6, a variance estimator for the generalized $\pi$-statistic is proposed, which is valid even under the assumption of dependence of ratings. Section 7 presents a variance estimator of the $AC_1$ statistic, which is always valid. The important special case of two raters is discussed in section 8, while section 9 describes a small simulation study aimed at verifying the validity of the variance estimators as well as the magnitude of the biases associated with the various indexes under investigation.

2. Cohen's $\kappa$, Scott's $\pi$, the $G$-index and Fleiss's generalized $\pi$

In a two-rater reliability study involving raters $A$ and $B$, the data will be reported in a two-way contingency table such as Table 1. Table 1 shows the distribution of $n$ study participants by rater and response category, where $n_{kl}$ indicates the number of participants that raters $A$ and $B$ classified into categories $k$ and $l$, respectively.

All inter-rater reliability coefficients discussed in this paper have two components: the overall agreement probability $p_a$, which is common to all coefficients, and the chance-agreement probability $p_e$, which is specific to each index.

Table 1. Distribution of $n$ participants by rater and response category

                          Rater B
    Rater A      1          2          ...   q          | Total
    1            $n_{11}$   $n_{12}$   ...   $n_{1q}$   | $n_{A1}$
    2            $n_{21}$   $n_{22}$   ...   $n_{2q}$   | $n_{A2}$
    ...
    q            $n_{q1}$   $n_{q2}$   ...   $n_{qq}$   | $n_{Aq}$
    Total        $n_{B1}$   $n_{B2}$   ...   $n_{Bq}$   | $n$

For the two-rater reliability data of Table 1, the overall agreement probability is given by:
$$p_a = \sum_{k=1}^{q} p_{kk}, \quad \text{where } p_{kk} = n_{kk}/n.$$

Let $p_{Ak} = n_{Ak}/n$, $p_{Bk} = n_{Bk}/n$, and $\hat{\pi}_k = (p_{Ak} + p_{Bk})/2$. Cohen's $\kappa$-statistic is given by:
$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}}, \quad \text{where } p_{e|\kappa} = \sum_{k=1}^{q} p_{Ak} p_{Bk}.$$
Scott (1955) proposed the $\pi$-statistic given by:
$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad \text{where } p_{e|\pi} = \sum_{k=1}^{q} \hat{\pi}_k^2.$$
The $G$-index of Holley and Guilford (1964) is given by:
$$\hat{\gamma}_G = \frac{p_a - p_{e|G}}{1 - p_{e|G}},$$
where $p_{e|G} = 1/q$, and $q$ represents the number of response categories. Note that the expression used for $\hat{\gamma}_G$ here is more general than the original Holley–Guilford formula, which was presented for the simpler situation of two raters and two response categories only.
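The three pairwise coefficients above share the same overall agreement probability and differ only in their chance-agreement term. The following minimal Python sketch (the function name is mine, not from the paper) evaluates all three from a $q \times q$ contingency table:

```python
import numpy as np

def two_rater_coefficients(table):
    """Chance-corrected agreement coefficients for a q x q contingency
    table (rows: rater A, columns: rater B), as defined above."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    q = t.shape[0]
    p_a = np.trace(t) / n                  # overall agreement probability
    p_Ak = t.sum(axis=1) / n               # rater A marginal proportions
    p_Bk = t.sum(axis=0) / n               # rater B marginal proportions
    pi_k = (p_Ak + p_Bk) / 2               # average classification proportions
    chance = {
        'kappa': np.sum(p_Ak * p_Bk),      # Cohen's chance agreement
        'pi': np.sum(pi_k ** 2),           # Scott's chance agreement
        'G': 1.0 / q,                      # G-index chance agreement
    }
    return {name: (p_a - pe) / (1 - pe) for name, pe in chance.items()}
```

When the marginal distributions of both raters are uniform over the $q$ categories, all three chance-agreement terms coincide and the three coefficients take the same value.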

If a reliability study involves an arbitrarily large number $r$ of raters, rating data are often reported in a frequency table showing the distribution of raters by participant and response category, as described in Table 2. For a given participant $i$ and category $k$, $r_{ik}$ represents the number of raters who classified participant $i$ into category $k$.

Fleiss (1971) extended Scott's $\pi$-statistic to the case of multiple raters ($r$) and proposed the following equation:
$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad \text{where} \quad p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q} \frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad p_{e|\pi} = \sum_{k=1}^{q} \hat{\pi}_k^2, \quad \text{and} \quad \hat{\pi}_k = \frac{1}{n}\sum_{i=1}^{n} \frac{r_{ik}}{r}. \qquad (1)$$

Table 2. Distribution of $r$ raters by participant and response category

                      Category
    Participant   1          2          ...   q          | Total
    1             $r_{11}$   $r_{12}$   ...   $r_{1q}$   | $r$
    2             $r_{21}$   $r_{22}$   ...   $r_{2q}$   | $r$
    ...
    n             $r_{n1}$   $r_{n2}$   ...   $r_{nq}$   | $r$
    Total         $r_{+1}$   $r_{+2}$   ...   $r_{+q}$   | $nr$
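Equation (1) can be computed directly from the $n \times q$ matrix of counts $r_{ik}$ described by Table 2. A minimal sketch (hypothetical helper name, not part of the paper):

```python
import numpy as np

def fleiss_pi(ratings):
    """Fleiss's generalized pi of equation (1) from an n x q matrix whose
    (i, k) entry r_ik counts the raters placing participant i in category k;
    every row must sum to the same number of raters r."""
    r_ik = np.asarray(ratings, dtype=float)
    n, q = r_ik.shape
    r = r_ik[0].sum()
    p_a = np.sum(r_ik * (r_ik - 1)) / (n * r * (r - 1))  # overall agreement
    pi_k = r_ik.sum(axis=0) / (n * r)                    # category proportions
    p_e = np.sum(pi_k ** 2)                              # chance agreement
    return (p_a - p_e) / (1 - p_e)
```

With unanimous ratings on every participant, $p_a = 1$ and the statistic equals 1, its maximum.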

The terms $p_a$ and $p_{e|\pi}$ are, respectively, the overall agreement probability and the probability of agreement due to chance. Conger (1980) suggested a generalized version of the $\kappa$-statistic that is obtained by averaging all $r(r-1)/2$ pairwise $\kappa$-statistics as defined by Cohen (1960). The $\kappa$-statistic can also be generalized as follows:
$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}},$$
where $p_a$ is defined as above and the chance-agreement probability $p_{e|\kappa}$ is given by:
$$p_{e|\kappa} = \sum_{k=1}^{q} \sum_{a=2}^{r} (-1)^a \Biggl(\sum_{i_1 < \cdots < i_a} \prod_{j=1}^{a} p_{k i_j}\Biggr). \qquad (2)$$
The term $p_{k i_j}$ $(j = 1, \ldots, a)$ represents the proportion of participants that rater $i_j$ classified into category $k$. It follows from equation (2) that if $r = 2$, then $p_{e|\kappa}$ reduces to the usual formula of chance-agreement probability for the $\kappa$-statistic. For $r = 3$ and $r = 4$ the chance-agreement probabilities are, respectively, given by:
$$p_{e|\kappa}(3) = \sum_{k=1}^{q} (p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k2}p_{k3} - p_{k1}p_{k2}p_{k3}),$$
$$p_{e|\kappa}(4) = \sum_{k=1}^{q} \bigl[(p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k1}p_{k4} + p_{k2}p_{k3} + p_{k2}p_{k4} + p_{k3}p_{k4}) - (p_{k1}p_{k2}p_{k3} + p_{k1}p_{k2}p_{k4} + p_{k1}p_{k3}p_{k4} + p_{k2}p_{k3}p_{k4}) + p_{k1}p_{k2}p_{k3}p_{k4}\bigr].$$
This general version of the $\kappa$-statistic has not been studied yet and no expression for its variance is available. There is no indication, however, that it has better statistical properties than Fleiss's generalized statistic. Nevertheless, a practitioner interested in using this estimator may still estimate its variance using the jackknife method described by equation (36) for the $\pi$-statistic. Hubert (1977) discusses other possible extensions of the $\kappa$-statistic to the case of multiple raters.
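The inclusion–exclusion sum in equation (2) can be evaluated mechanically over all rater subsets of each size. A sketch of that computation (function name hypothetical; it implements the formula as reconstructed above, not a separately validated procedure):

```python
from itertools import combinations
from math import prod

def generalized_kappa_chance(p):
    """Evaluate the chance-agreement sum of equation (2).
    p[i][k] is the proportion of participants rater i placed in category k."""
    r, q = len(p), len(p[0])
    pe = 0.0
    for k in range(q):                       # sum over categories
        for a in range(2, r + 1):            # subset sizes 2..r
            sign = (-1) ** a                 # alternating inclusion-exclusion sign
            for raters in combinations(range(r), a):
                pe += sign * prod(p[i][k] for i in raters)
    return pe
```

For $r = 2$ this collapses to $\sum_k p_{k1} p_{k2}$, and for $r = 3$ it reproduces the expanded expression for $p_{e|\kappa}(3)$ given above.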

3. Origin of the paradoxes

Table 3 contains an example of rating data. These illustrate the limitations of equation (1) as a measure of the extent of agreement between raters.

Table 3. Distribution of 125 participants by rater and response category

                 Rater B
    Rater A      +      −    | Total
    +            118    5    | 123
    −            2      0    | 2
    Total        120    5    | 125

For those data, $\hat{\gamma}_\pi = (0.9440 - 0.9456)/(1 - 0.9456) = -0.0288$, a negative value. This result is the opposite of what our intuition would suggest, and illustrates one of the paradoxes noted by Cicchetti and Feinstein (1990), where high agreement is coupled with low $\kappa$. In this example, raters $A$ and $B$ are expected to have a high inter-rater reliability.
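The paradoxical value for Table 3 is easy to reproduce numerically (the function name is mine; the computation follows Scott's $\pi$ as defined in section 2):

```python
import numpy as np

def scott_pi(table):
    """Scott's pi for a contingency table (rows: rater A, columns: rater B)."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_a = np.trace(t) / n                              # overall agreement (0.944 here)
    pi_k = (t.sum(axis=1) + t.sum(axis=0)) / (2 * n)   # average category proportions
    p_e = np.sum(pi_k ** 2)                            # chance agreement (0.9456 here)
    return (p_a - p_e) / (1 - p_e)

pi_hat = scott_pi([[118, 5], [2, 0]])   # Table 3 data -> about -0.0288
```

Despite 94.4% raw agreement, the near-degenerate marginals drive the chance-agreement term to 0.9456, and the corrected coefficient turns negative.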

To understand the nature and the causes of the paradoxical behaviour of the $\pi$- and $\kappa$-statistics, I will confine myself to the case of two raters, $A$ and $B$, who must identify the presence or absence of a trait on individuals of a given population of interest. These individuals will eventually be selected to participate in a study, and are therefore potential study participants. The two raters will classify participants into the '+' or '−' categories according to whether the trait is found or not. I will study how agreement indexes are affected by raters' sensitivity, specificity and the trait prevalence in the population. The rater's sensitivity is defined as the conditional probability of classifying a participant into the '+' category given that the trait is indeed present. The rater's specificity is the conditional probability of classifying a participant into the '−' category given that the trait is actually absent.

Let $a_A$ and $a_B$ denote, respectively, raters $A$ and $B$'s sensitivity values. Similarly, $b_A$ and $b_B$ will denote raters $A$ and $B$'s specificity values. It follows that the probabilities $P_{A+}$ and $P_{B+}$ for raters $A$ and $B$ to classify a participant into the '+' category are given by
$$P_{A+} = P_r a_A + (1 - P_r)(1 - b_A), \qquad (3)$$
$$P_{B+} = P_r a_B + (1 - P_r)(1 - b_B), \qquad (4)$$
where $P_r$ represents the population trait prevalence. Our objective is to study how trait prevalence, sensitivity and specificity affect inter-rater reliability. For the sake of simplicity I will make the following two assumptions:

(A1) Sensitivity and specificity are identical for both raters. That is, $a_A = b_A$ and $a_B = b_B$.

(A2) Correct classifications are independent. That is, if $a_{AB}$ denotes the probability that raters $A$ and $B$ correctly classify an individual into the '+' category, then $a_{AB} = a_A a_B$.

The probability $P_a$ that both raters agree is given by $P_a = p_{++} + p_{--}$, where $p_{++}$ and $p_{--}$ are obtained as follows:
$$p_{++} = a_A a_B P_r + (1 - P_r)(1 - b_A)(1 - b_B) = a_A a_B P_r + (1 - P_r)(1 - a_A)(1 - a_B),$$
and $p_{--} = 1 - (P_{A+} + P_{B+} - p_{++})$. The following important equation can be established:
$$P_a = (1 - a_A)(1 - a_B) + a_A a_B. \qquad (5)$$
This relation shows that the overall agreement probability between two raters $A$ and $B$ does not depend upon the trait prevalence. Rather, it depends upon the raters' sensitivity and specificity values.

The partial derivative with respect to $P_r$ of an inter-rater coefficient of the form $\gamma = (P_a - P_e)/(1 - P_e)$ is given by
$$\frac{\partial\gamma}{\partial P_r} = -\frac{1 - P_a}{(1 - P_e)^2}\,\frac{\partial P_e}{\partial P_r}, \qquad (6)$$
since, from equation (5), one can conclude that $\partial P_a/\partial P_r = 0$. For Scott's $\pi$- and Cohen's $\kappa$-statistics,
$$\frac{\partial\gamma_\pi}{\partial P_r} = \frac{2(1 - 2\lambda)^2(1 - P_a)(1 - 2P_r)}{(1 - P_{e|\pi})^2}, \qquad (7)$$
$$\frac{\partial\gamma_\kappa}{\partial P_r} = \frac{2(1 - 2a_A)(1 - 2a_B)(1 - P_a)(1 - 2P_r)}{(1 - P_{e|\kappa})^2}, \qquad (8)$$
where $\lambda = (a_A + a_B)/2$. Let $p_+$ be the probability that a randomly chosen rater classifies a randomly chosen participant into the '+' category. Then,
$$p_+ = (P_{A+} + P_{B+})/2 = \lambda P_r + (1 - \lambda)(1 - P_r). \qquad (9)$$
The two equations (7) and (8) are derived from the fact that $P_{e|\pi} = p_+^2 + (1 - p_+)^2$ and $P_{e|\kappa} = 1 - (P_{A+} + P_{B+}) + 2P_{A+}P_{B+}$. It follows from assumption A1 that $\partial\gamma_G/\partial P_r = 0$, since $\hat{\gamma}_G$ is solely a function of $P_a$. Under this assumption, the $G$-index takes a constant value of $2P_a - 1$ that depends on the raters' sensitivity. Equation (6) shows that chance-agreement probability plays a pivotal role in how inter-rater reliability relates to trait prevalence. Equation (7) indicates that Scott's $\pi$-statistic is an increasing function of $P_r$ for values of trait prevalence between 0 and 0.50, and becomes decreasing for $P_r > 0.50$, reaching its maximum value when $P_r = 0.50$. Because $0 \le P_r \le 1$, $\hat{\gamma}_\pi$ takes its smallest value at $P_r = 0$ and $P_r = 1$. Using equation (5) and the expression of $P_{e|\pi}$, one can show that:
$$\gamma_\pi = \frac{(2\lambda - 1)^2 P_r(1 - P_r) - (a_A - a_B)^2/4}{(2\lambda - 1)^2 P_r(1 - P_r) + \lambda(1 - \lambda)}. \qquad (10)$$
It follows that,
$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \gamma_\pi = -\frac{(a_A - a_B)^2}{4\lambda(1 - \lambda)}; \qquad (11)$$
$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\pi = 2P_a - 1 = 1 - 4\lambda + 4a_A a_B. \qquad (12)$$
Equations (10), (11) and (12) show very well how paradoxes often occur in practice. From equation (11) it appears that whenever a trait is very rare or omnipresent, Scott's $\pi$-statistic yields a negative inter-rater reliability regardless of the raters' sensitivity values. In other words, if prevalence is low or high, any large extent of agreement between raters will not be reflected in the $\pi$-statistic.

Equation (8), on the other hand, indicates that when trait prevalence is smaller than 0.5, Cohen's $\kappa$-statistic may be an increasing or a decreasing function of trait prevalence depending on raters $A$ and $B$'s sensitivity values. That is, if one rater has a sensitivity smaller than 0.5 and the other a sensitivity greater than 0.5, then the $\kappa$-statistic is a decreasing function of $P_r$; otherwise it is increasing. The situation is similar when trait prevalence is greater than 0.50. The maximum or minimum value of $\kappa$ is reached at $P_r = 0.50$. If one rater has a sensitivity of 0.50, then $\kappa = 0$ regardless of the trait prevalence. The general equation of Cohen's $\kappa$-statistic is given by:
$$\gamma_\kappa = \frac{(2a_A - 1)(2a_B - 1)P_r(1 - P_r)}{(2a_A - 1)(2a_B - 1)P_r(1 - P_r) + (1 - P_a)/2}. \qquad (13)$$
It follows that:
$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \hat{\gamma}_\kappa = 0; \qquad (14)$$
$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\kappa = 2P_a - 1 = 1 - 4\lambda + 4a_A a_B. \qquad (15)$$
Similar to Scott's $\pi$-statistic, $\kappa$ seems to yield reasonable values only when trait prevalence is close to 0.5. A value of trait prevalence that is either close to 0 or close to 1 will considerably reduce the ability of $\kappa$ to reflect any extent of agreement between raters.

Many inter-rater agreement coefficients proposed in the literature have been criticized on the grounds that they are dependent upon trait prevalence. Such a dependence is inevitable if raters' sensitivity levels are different from their specificity levels. In fact, without assumption A1, even the overall agreement probability $P_a$ is dependent upon trait prevalence $P_r$, due to the fact that $P_a$ can be expressed as follows:
$$P_a = \bigl(2(a_A a_B - b_A b_B) - [(a_A + a_B) - (b_A + b_B)]\bigr)P_r + \bigl(1 + 2b_A b_B - (b_A + b_B)\bigr).$$
However, the impact of prevalence on the overall agreement probability is small if sensitivity and specificity are reasonably close.

The previous analysis indicates that the $G$-index, $\pi$-statistic and $\kappa$-statistic all have the same reasonable behaviour when trait prevalence $P_r$ takes a value in the neighbourhood of 0.5. However, their behaviour becomes very erratic (with the exception of the $G$-index) as soon as trait prevalence goes to the extremes. I argue that the chance-agreement probability used in these statistics is ill-estimated when trait prevalence is in the neighbourhood of 0 or 1. I will now propose a new agreement coefficient that will share the common reasonable behaviour of its competitors in the neighbourhood of 0.5, but will outperform them when trait prevalence goes to the extremes.

4. An alternative agreement statistic

Before I introduce an improved alternative inter-rater reliability coefficient, it is necessary to develop a clear picture of the goal one normally attempts to achieve by correcting inter-rater reliability for chance agreement. My premises are the following:

(a) Chance agreement occurs when at least one rater rates an individual randomly.
(b) Only an unknown portion of the observed ratings is subject to randomness.

I will consider that a rater $A$ classifies an individual into one of two categories either randomly, when he or she does not know where it belongs, or with certainty, when he or she is certain about its 'true' membership. Rater $A$ performs a random rating not all the time, but with a probability $\theta_A$. That is, $\theta_A$ is the propensity for rater $A$ to perform a random rating. The participants not classified randomly are supposed to have been classified into the correct category. If the random portion of the study was identifiable, rating data of two raters $A$ and $B$ classifying $N$ individuals into categories '+' and '−' could be reported as shown in Table 4.

Note that $N_{+-\cdot RC}$, for example, represents the number of individuals that rater $A$ classified randomly into the '+' category and that rater $B$ classified with certainty into the '−' category. In general, for $(k, l) \in \{+, -\}$ and $(X, Y) \in \{R, C\}$, $N_{kl\cdot XY}$ represents the number of individuals that rater $A$ classified into category $k$ using classification method $X$ (random or certainty), and that rater $B$ classified into category $l$ using classification method $Y$ (random or certainty).

To evaluate the extent of agreement between raters $A$ and $B$ from Table 4, what is needed is the ability to remove from consideration all agreements that occurred by chance; that is, $N_{++\cdot RR} + N_{++\cdot CR} + N_{++\cdot RC} + N_{--\cdot RR} + N_{--\cdot CR} + N_{--\cdot RC}$. This yields the following 'true' inter-rater reliability:
$$\gamma = \frac{N_{++\cdot CC} + N_{--\cdot CC}}{N - \sum_{k \in \{+,-\}} (N_{kk\cdot RR} + N_{kk\cdot CR} + N_{kk\cdot RC})}. \qquad (16)$$
Equation (16) can also be written as:
$$\gamma = \frac{P_a - P_e}{1 - P_e}, \quad \text{where} \quad P_a = \sum_{k \in \{+,-\}} \sum_{X,Y \in \{C,R\}} \frac{N_{kk\cdot XY}}{N} \quad \text{and} \quad P_e = \sum_{k \in \{+,-\}} \sum_{\substack{X,Y \in \{C,R\} \\ (X,Y) \neq (C,C)}} \frac{N_{kk\cdot XY}}{N}. \qquad (17)$$
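The equivalence of the two expressions can be checked numerically. The counts below are purely hypothetical (they are not taken from the paper); the snippet evaluates both equation (16) and equation (17) on them:

```python
# Hypothetical diagonal counts for the '+' and '-' cells of Table 4,
# split by classification method (RR, CR, RC by chance; CC with certainty).
agree_pp = {'RR': 2, 'CR': 1, 'RC': 1, 'CC': 40}   # N_{++.XY}
agree_mm = {'RR': 1, 'CR': 1, 'RC': 0, 'CC': 30}   # N_{--.XY}
disagreements = 4                                  # off-diagonal total
N = sum(agree_pp.values()) + sum(agree_mm.values()) + disagreements

# Equation (16): certain agreements over N minus chance agreements.
chance = sum(v for m, v in agree_pp.items() if m != 'CC') + \
         sum(v for m, v in agree_mm.items() if m != 'CC')
gamma_16 = (agree_pp['CC'] + agree_mm['CC']) / (N - chance)

# Equation (17): the same value written as (Pa - Pe) / (1 - Pe).
Pa = (sum(agree_pp.values()) + sum(agree_mm.values())) / N
Pe = chance / N
gamma_17 = (Pa - Pe) / (1 - Pe)
```

Both routes give the same 'true' reliability, as equation (17) asserts.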

In a typical reliability study the two raters $A$ and $B$ would rate $n$ study participants, with rating data reported as shown in Table 1, with $q = 2$. The problem is to find a good statistic $\hat{\gamma}$ for estimating $\gamma$. A widely accepted statistic for estimating the overall agreement probability $P_a$ is given by:
$$p_a = (n_{++} + n_{--})/n. \qquad (18)$$
The estimation of $P_e$ represents a more difficult problem, since it requires one to be able to isolate ratings performed with certainty from random ratings. To get around this difficulty, I decided to approximate $P_e$ by a parameter that can be quantified more easily, and to evaluate the quality of the approximation in section 5.

Table 4. Distribution of $N$ participants by rater, randomness of classification and response category

                                        Rater B
                           Random (R)                        Certain (C)
    Rater A                +                 −                 +                 −
    Random (R)   +   $N_{++\cdot RR}$  $N_{+-\cdot RR}$  $N_{++\cdot RC}$  $N_{+-\cdot RC}$
                 −   $N_{-+\cdot RR}$  $N_{--\cdot RR}$  $N_{-+\cdot RC}$  $N_{--\cdot RC}$
    Certain (C)  +   $N_{++\cdot CR}$  $N_{+-\cdot CR}$  $N_{++\cdot CC}$  0
                 −   $N_{-+\cdot CR}$  $N_{--\cdot CR}$  0                 $N_{--\cdot CC}$

Suppose an individual is selected randomly from a pool of individuals and rated by raters $A$ and $B$. Let $G$ and $R$ be two events defined as follows:
$$G = \{\text{the two raters } A \text{ and } B \text{ agree}\}, \qquad (19)$$
$$R = \{\text{a rater } (A, \text{ or } B, \text{ or both}) \text{ performs a random rating}\}. \qquad (20)$$
It follows that $P_e = P(G \cap R) = P(G \mid R)P(R)$, where $P(G \mid R)$ is the conditional probability that $A$ and $B$ agree given that one of them (or both) has performed a random rating.

A random rating would normally lead to the classification of an individual into either category with the same probability 1/2, although this may not always be the case. Since agreement may occur on either category, it follows that $P(G \mid R) = 2 \times 1/2^2 = 1/2$. As for the estimation of the probability of random rating $P(R)$, one should note that when the trait prevalence $P_r$ is high or low (i.e. if $P_r(1 - P_r)$ is small), a uniform distribution of participants among categories is an indication of a high proportion of random ratings, hence of a high probability $P(R)$.

Let the random variable $X_+$ be defined as follows:
$$X_+ = \begin{cases} 1 & \text{if a rater classifies the participant into category } +, \\ 0 & \text{otherwise}. \end{cases}$$
I suggest approximating $P(R)$ with a normalized measure of randomness $C$ defined by the ratio of the variance $V(X_+)$ of $X_+$ to the maximum possible variance $V_{\mathrm{MAX}}$ for $X_+$, which is reached only when the rating is totally random. It follows that
$$C = V(X_+)/V_{\mathrm{MAX}} = \frac{p_+(1 - p_+)}{(1/2)(1 - 1/2)} = 4p_+(1 - p_+),$$
where $p_+$ represents the probability that a randomly chosen rater classifies a randomly chosen individual into the '+' category. This leads to the following formulation of chance agreement:
$$P_e^\ast = P(G \mid R)\,C = 2p_+(1 - p_+). \qquad (21)$$
This approximation leads to the following approximated 'true' inter-rater reliability:
$$\gamma^\ast = \frac{P_a - P_e^\ast}{1 - P_e^\ast}. \qquad (22)$$

The probability $p_+$ can be estimated from sample data by $\hat{\pi}_+ = (p_{A+} + p_{B+})/2$, where $p_{A+} = n_{A+}/n$ and $p_{B+} = n_{B+}/n$. This leads to a chance-agreement probability estimator $p_e^\ast = P(G \mid R)\hat{C}$, where $\hat{C} = 4\hat{\pi}_+(1 - \hat{\pi}_+)$. That is,
$$p_e^\ast = 2\hat{\pi}_+(1 - \hat{\pi}_+). \qquad (23)$$
Note that $\hat{\pi}_+(1 - \hat{\pi}_+) = \hat{\pi}_-(1 - \hat{\pi}_-)$. Therefore, $p_e^\ast$ can be rewritten as
$$p_e^\ast = \hat{\pi}_+(1 - \hat{\pi}_+) + \hat{\pi}_-(1 - \hat{\pi}_-).$$
The resulting agreement statistic is given by
$$\hat{\gamma}_1 = \frac{p_a - p_e^\ast}{1 - p_e^\ast},$$
with $p_a$ given by equation (18), and is shown mathematically in section 5 to have a smaller bias with respect to the 'true' agreement coefficient than all its competitors.
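Applied to the Table 3 data that defeated Scott's $\pi$, the calibrated statistic behaves as intuition suggests. A minimal sketch (function name hypothetical):

```python
import numpy as np

def ac1_two_raters(table):
    """AC1 statistic for a 2 x 2 contingency table
    (rows: rater A, columns: rater B), per equation (23)."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_a = np.trace(t) / n                              # overall agreement
    pi_plus = (t[0].sum() + t[:, 0].sum()) / (2 * n)   # average '+' proportion
    p_e = 2 * pi_plus * (1 - pi_plus)                  # calibrated chance agreement
    return (p_a - p_e) / (1 - p_e)

gamma1 = ac1_two_raters([[118, 5], [2, 0]])   # Table 3 data -> about 0.941
```

Where $\pi$ returned $-0.0288$ for these data, $AC_1$ returns roughly 0.94, in line with the 94.4% raw agreement.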

Unlike the $\kappa$- and $\pi$-statistics, this agreement coefficient uses a chance-agreement probability that is calibrated to be consistent with the propensity for random rating that is suggested by the observed ratings. I will refer to the calibrated statistic $\hat{\gamma}_1$ as the $AC_1$ estimator, where AC stands for agreement coefficient and the digit 1 indicates the first-order chance correction, which accounts for full agreement only, as opposed to full and partial agreement (second-order chance correction); the latter problem, which will be investigated elsewhere, will lead to the $AC_2$ statistic.

A legitimate question to be asked is whether the inter-rater reliability statistic $\hat{\gamma}_1$ estimates the 'true' inter-rater reliability of equation (16) at all, and under what circumstances. I will show in the next section that if trait prevalence is high or low, then $\hat{\gamma}_1$ does estimate the 'true' inter-rater reliability very well, whereas the $\pi$, $\kappa$ and $G$-index statistics are all biased for estimating the 'true' inter-rater reliability when trait prevalence is at the extremes.

5. Biases of inter-rater reliability statistics

Let us consider two raters, $A$ and $B$, who perform a random rating with probabilities $\theta_A$ and $\theta_B$, respectively. Each classification of a study participant by a random mechanism will lead either to a disagreement or to an agreement by chance. The raters' sensitivity values (which are assumed to be identical to their specificity values) are given by:
$$a_A = 1 - \theta_A/2 \quad \text{and} \quad a_B = 1 - \theta_B/2.$$
These equations are obtained under the assumption that any rating that is not random will automatically lead to a correct classification, while a random rating leads to a correct classification with probability 1/2. In fact, $a_A = (1 - \theta_A) + \theta_A/2 = 1 - \theta_A/2$. Under this simple rating model, and following equation (5), the overall agreement probability is given by $P_a = a_A a_B + (1 - a_A)(1 - a_B) = 1 - (\theta_A + \theta_B)/2 + \theta_A\theta_B/2$. As for the chance-agreement probability $P_e$, let $R_A$ and $R_B$ be two events defined as follows:

- $R_A$: rater $A$ performs a random rating.
- $R_B$: rater $B$ performs a random rating.

Then,
$$P_e = P(G \cap R) = P(G \cap R_A \cap \bar{R}_B) + P(G \cap R_A \cap R_B) + P(G \cap \bar{R}_A \cap R_B)$$
$$= \theta_A(1 - \theta_B)/2 + \theta_A\theta_B/2 + \theta_B(1 - \theta_A)/2 = (\theta_A + \theta_B - \theta_A\theta_B)/2.$$
The 'true' inter-rater reliability is then given by:
$$\gamma = \frac{2(1 - \theta_A)(1 - \theta_B)}{1 + (1 - \theta_A)(1 - \theta_B)}. \qquad (24)$$
The theoretical agreement coefficients will now be derived for the $AC_1$, $G$-index, $\kappa$ and $\pi$ statistics. Let $\lambda = 1 - (\theta_A + \theta_B)/4$.

For the $AC_1$ coefficient, it follows from equations (5) and (21) that the chance-agreement probability $P_e^\ast$ is obtained as follows:
$$P_e^\ast = 2p_+(1 - p_+) = 2[\lambda P_r + (1 - \lambda)(1 - P_r)][(1 - \lambda P_r) - (1 - \lambda)(1 - P_r)]$$
$$= 2\lambda(1 - \lambda) + 2(1 - 2\lambda)^2 P_r(1 - P_r) = P_e - (\theta_A - \theta_B)^2/8 + \Delta,$$
where $\Delta = 2(1 - 2\lambda)^2 P_r(1 - P_r)$. The theoretical $AC_1$ coefficient is given by:
$$\gamma_1 = \gamma - (1 - \gamma)\frac{\Delta - (\theta_A - \theta_B)^2/8}{(1 - P_e) + [(\theta_A - \theta_B)^2/8 - \Delta]}. \qquad (25)$$

For Scott's $\pi$-coefficient, one can establish that the chance-agreement probability $P_{e|\pi}$ is given by $P_{e|\pi} = P_e + (1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta$. This leads to a Scott's $\pi$-coefficient of
$$\gamma_\pi = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta}{(1 - P_e) - [(1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta]}. \qquad (26)$$
For the $G$-index, $P_{e|G} = 1/2 = P_e + (1 - \theta_A)(1 - \theta_B)/2$, and
$$\gamma_G = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B)/2}{(1 - P_e) - (1 - \theta_A)(1 - \theta_B)/2}. \qquad (27)$$
For Cohen's $\kappa$-coefficient, $P_{e|\kappa} = P_e + (1 - \theta_A)(1 - \theta_B) - \Delta_\kappa$, where $\Delta_\kappa = 2(1 - \theta_A)(1 - \theta_B)P_r(1 - P_r)$:
$$\gamma_\kappa = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B) - \Delta_\kappa}{(1 - P_e) - [(1 - \theta_A)(1 - \theta_B) - \Delta_\kappa]}. \qquad (28)$$

To gain further insight into the magnitude of the biases of these different inter-rater reliability statistics, let us consider the simpler case where raters $A$ and $B$ have the same propensity for random rating; that is, $\theta_A = \theta_B = \theta$. The 'true' inter-rater reliability is given by:
$$\gamma = \frac{2(1 - \theta)^2}{1 + (1 - \theta)^2}. \qquad (29)$$
I define the bias of an agreement coefficient $\gamma_X$ as $B_X(\theta) = \gamma_X - \gamma$, the difference between the agreement coefficient and the 'true' coefficient. The biases of the $AC_1$, $\pi$, $\kappa$ and $G$-index statistics, respectively denoted by $B_1(\theta)$, $B_\pi(\theta)$, $B_\kappa(\theta)$ and $B_G(\theta)$, satisfy the following relations:
$$B_G(\theta) = -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2}; \qquad -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2} \le B_1(\theta) \le 0;$$
$$-\frac{2(1 - \theta)^2}{1 + (1 - \theta)^2} \le B_\pi(\theta) \le -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2};$$
$$-\frac{2(1 - \theta)^2}{1 + (1 - \theta)^2} \le B_\kappa(\theta) \le -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2}.$$

Which way the bias will go depends on the magnitude of trait prevalence. It follows from these equations that the $G$-index consistently exhibits a negative bias, which takes a maximum absolute value of around 17% when the rater's propensity for random rating is around 35%, and gradually decreases as $\theta$ goes to 1. The $AC_1$ statistic, on the other hand, has a negative bias that ranges from $-\theta(1 - \theta)^2(2 - \theta)/[1 + (1 - \theta)^2]$ to 0, reaching its largest absolute value of $\theta(1 - \theta)^2(2 - \theta)/[1 + (1 - \theta)^2]$ only when the trait prevalence is around 50%. The remaining two statistics have some serious bias problems on the negative side. The $\pi$ and $\kappa$ statistics each have a bias whose lowest value is $-2(1 - \theta)^2/[1 + (1 - \theta)^2]$, which varies between $-1$ and 0. That means $\pi$ and $\kappa$ may underestimate the 'true' inter-rater reliability by as much as 100%.
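The "around 17% at around 35%" figure for the $G$-index can be checked by scanning the bias envelope over a grid of $\theta$ values (a quick numerical sketch, not part of the paper's derivation):

```python
# |B_G(theta)| = theta (1 - theta)^2 (2 - theta) / (1 + (1 - theta)^2),
# scanned over the propensity for random rating theta in [0, 1].
def g_index_bias_magnitude(theta):
    return theta * (1 - theta) ** 2 * (2 - theta) / (1 + (1 - theta) ** 2)

grid = [i / 1000 for i in range(1001)]
worst = max(grid, key=g_index_bias_magnitude)   # theta with largest |bias|
```

The scan places the worst case near $\theta \approx 0.36$, with $|B_G| \approx 0.17$, consistent with the statement above.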

The next two sections, 6 and 7, are devoted to variance estimation of the generalized $\pi$-statistic and the $AC_1$ statistic, respectively, in the context of multiple raters. For both sections, I will assume that the $n$ participants in the reliability study were randomly selected from a bigger population of $N$ potential participants. Likewise, the $r$ raters can be assumed to belong to a bigger universe of $R$ potential raters. This finite-population framework has not yet been considered in the study of inter-rater agreement assessment. For this paper, however, I will confine myself to the case where $r = R$; that is, the estimators are not subject to any variability due to the sampling of raters. Methods needed to extrapolate to a bigger universe of raters will be discussed in a different paper.

6. Variance of the generalized $\pi$-statistic

The $\pi$-statistic denoted by $\hat{\gamma}_\pi$ is defined as follows:
$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \qquad (30)$$
where $p_a$ and $p_{e|\pi}$ are defined as follows:
$$p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q} \frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad \text{and} \quad p_{e|\pi} = \sum_{k=1}^{q} \hat{\pi}_k^2, \quad \text{with} \quad \hat{\pi}_k = \frac{1}{n}\sum_{i=1}^{n} \frac{r_{ik}}{r}. \qquad (31)$$

Concerning the estimation of the variance of $\hat{\gamma}_\pi$, Fleiss (1971) suggested the following variance estimator under the hypothesis of no agreement between raters beyond chance:
$$v(\hat{\gamma}_\pi \mid \text{No agreement}) = \frac{2(1 - f)}{nr(r - 1)} \times \frac{p_{e|\pi} - (2r - 3)p_{e|\pi}^2 + 2(r - 2)\sum_{k=1}^{q} \hat{\pi}_k^3}{(1 - p_{e|\pi})^2}, \qquad (32)$$
where $f = n/N$ is the sampling fraction, which could be neglected if the population of potential participants is deemed very large. It should be noted that this variance estimator is invalid for confidence interval construction. The original expression proposed by Fleiss does not include the finite-population correction factor $1 - f$. Cochran (1977) is a good reference for readers interested in statistical methods in finite-population sampling.

I propose here a non-parametric variance estimator for $\hat{\gamma}_\pi$ that is valid for confidence interval construction, obtained using the linearization technique. Unlike $v(\hat{\gamma}_\pi \mid \text{No agreement})$, the validity of the non-parametric variance estimator does not depend on the extent of agreement between the raters. This variance estimator is given by
$$v(\hat{\gamma}_\pi) = \frac{1 - f}{n}\,\frac{1}{n - 1}\sum_{i=1}^{n} (\hat{\gamma}_{\pi i}^\star - \hat{\gamma}_\pi)^2, \qquad (33)$$
where $\hat{\gamma}_{\pi i}^\star$ is given by
$$\hat{\gamma}_{\pi i}^\star = \hat{\gamma}_{\pi i} - 2(1 - \hat{\gamma}_\pi)\frac{p_{e\pi|i} - p_{e|\pi}}{1 - p_{e|\pi}}, \qquad (34)$$
where $\hat{\gamma}_{\pi i} = (p_{a|i} - p_{e|\pi})/(1 - p_{e|\pi})$, and $p_{a|i}$ and $p_{e\pi|i}$ are given by:
$$p_{a|i} = \sum_{k=1}^{q} \frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad \text{and} \quad p_{e\pi|i} = \sum_{k=1}^{q} \frac{r_{ik}}{r}\hat{\pi}_k. \qquad (35)$$
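Equations (33)–(35) translate directly into a few array operations. A minimal sketch, assuming the $n \times q$ matrix of counts $r_{ik}$ from Table 2 (function name mine):

```python
import numpy as np

def pi_and_variance(ratings, f=0.0):
    """Fleiss's generalized pi with the linearization variance estimator of
    equations (33)-(35); `ratings` is the n x q matrix of r_ik counts and
    `f` the sampling fraction n/N (0 for a very large population)."""
    r_ik = np.asarray(ratings, dtype=float)
    n, q = r_ik.shape
    r = r_ik[0].sum()
    pi_k = r_ik.sum(axis=0) / (n * r)
    pa_i = np.sum(r_ik * (r_ik - 1), axis=1) / (r * (r - 1))  # per-participant agreement
    p_a = pa_i.mean()
    p_e = np.sum(pi_k ** 2)
    g = (p_a - p_e) / (1 - p_e)                  # the pi-statistic itself
    g_i = (pa_i - p_e) / (1 - p_e)               # first term of equation (34)
    pe_i = r_ik @ pi_k / r                       # per-participant chance term, eq. (35)
    g_star = g_i - 2 * (1 - g) * (pe_i - p_e) / (1 - p_e)
    v = (1 - f) / n * np.sum((g_star - g) ** 2) / (n - 1)
    return g, v
```

The returned variance can be plugged into a standard normal-theory confidence interval $\hat{\gamma}_\pi \pm z_{\alpha/2}\sqrt{v(\hat{\gamma}_\pi)}$.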

To see how equation (33) is derived, one should consider the standard approach that consists of deriving an approximation of the actual variance of the estimator and using a consistent estimator of that approximate variance as the variance estimator. Let us assume that as the sample size $n$ increases, the estimated chance-agreement probability $p_{e|\pi}$ converges to a value $P_{e|\pi}$ and that each $\hat{\pi}_k$ converges to $\pi_k$. If $\hat{\pi}$ and $\pi$ denote the vectors of the $\hat{\pi}_k$'s and $\pi_k$'s, respectively, it can be shown that
$$p_{e|\pi} - P_{e|\pi} = \frac{2}{n}\sum_{i=1}^{n} (p_{e\pi|i} - P_{e|\pi}) + O_p(\lVert \hat{\pi} - \pi \rVert^2),$$
and that if $\Gamma_\pi = (p_a - P_{e|\pi})/(1 - P_{e|\pi})$, then $\hat{\gamma}_\pi$ can be expressed as
$$\hat{\gamma}_\pi = \frac{(p_a - P_{e|\pi}) - (1 - \Gamma_\pi)(p_{e|\pi} - P_{e|\pi})}{1 - P_{e|\pi}} + O_p\bigl((p_{e|\pi} - P_{e|\pi})^2\bigr).$$
The combination of these two equations gives us an approximation of $\hat{\gamma}_\pi$ that is a linear function of the $r_{ik}$'s and that captures all terms except those with a stochastic order of magnitude of $1/n$, which can be neglected. Bishop, Fienberg, and Holland (1975, chapter 14) provide a detailed discussion of the concept of stochastic order of magnitude.

The variance estimator of equation (33) can be used for confidence interval construction as well as for hypothesis testing. Its validity is confirmed by the simulation study presented in section 9.

Alternatively, a jackknife variance estimator can be used to estimate the variance of the

p-statistic. The jackknife technique introduced by Quenouille (1949) and developed by

Tukey (1958), is a general purpose technique for estimating variances. It has wide

applicability although it is computation intensive. The jackknife variance of ^

gpis given by:

vJð^

gpÞ¼ð12fÞðn21Þ

nX

n

i¼1

ð^

gðiÞ

p2^

gð†Þ

pÞ2;ð36Þ

where ^

gðiÞ

pis the p-statistic obtained after removing participant ifrom the sample, while

Computing inter-rater reliability and its variance 13

BJMSP 139—29/1/2008—ROBINSON—217240

$\hat{\gamma}_\pi^{(\dagger)}$ represents the average of all $\hat{\gamma}_\pi^{(i)}$'s. Simulation results not reported in this paper show that this jackknife variance works well for estimating the variance of $\hat{\gamma}_\pi$. The idea of using the jackknife methodology for estimating the variance of an agreement coefficient was previously evoked by Kraemer (1980).
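As a concrete illustration of equation (36), the following Python sketch computes the multiple-rater $\pi$-statistic from an $n \times q$ table `r[i][k]` giving the number of raters who classified participant $i$ into category $k$, together with its jackknife variance. Function and variable names are mine, not the paper's, and the sampling fraction $f = n/N$ defaults to 0 (negligible):

```python
def pi_statistic(r):
    """Generalized pi-statistic from an n x q table r[i][k] = number of
    raters classifying participant i into category k."""
    n, q = len(r), len(r[0])
    m = sum(r[0])                                   # number of raters r
    # overall agreement probability p_a
    p_a = sum(sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in r) / n
    # classification probabilities pi_k and chance agreement p_{e|pi}
    pi_k = [sum(row[k] for row in r) / (n * m) for k in range(q)]
    p_e = sum(p ** 2 for p in pi_k)
    return (p_a - p_e) / (1 - p_e)

def jackknife_variance_pi(r, f=0.0):
    """Jackknife variance of the pi-statistic, equation (36)."""
    n = len(r)
    loo = [pi_statistic(r[:i] + r[i + 1:]) for i in range(n)]  # leave-one-out
    mean = sum(loo) / n
    return (1 - f) * (n - 1) / n * sum((g - mean) ** 2 for g in loo)
```

For example, with `ratings = [[2, 0], [2, 0], [0, 2], [1, 1]]` (two raters, two categories), `pi_statistic(ratings)` returns $7/15 \approx 0.467$.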

7. Variance of the generalized AC1 estimator

The AC1 statistic $\hat{\gamma}_1$ introduced in section 4 can be extended to the case of $r$ raters ($r > 2$) and $q$ response categories ($q > 2$) as follows:

$$\hat{\gamma}_1 = \frac{p_a - p_{e|\gamma}}{1 - p_{e|\gamma}}, \qquad (37)$$

where $p_a$ is defined in equation (1), and the chance-agreement probability $p_{e|\gamma}$ is defined as follows:

$$p_{e|\gamma} = \frac{1}{q-1}\sum_{k=1}^{q}\hat{\pi}_k\bigl(1 - \hat{\pi}_k\bigr), \qquad (38)$$

the $\hat{\pi}_k$'s being defined in equation (1).
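Equations (37) and (38) translate directly into code. The sketch below (an illustrative Python transcription; names are mine) works from the same $n \times q$ table of counts `r[i][k]`:

```python
def ac1_statistic(r):
    """Generalized AC1 (equations 37-38) from an n x q table r[i][k] =
    number of raters classifying participant i into category k."""
    n, q = len(r), len(r[0])
    m = sum(r[0])                                   # number of raters r
    p_a = sum(sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in r) / n
    pi_k = [sum(row[k] for row in r) / (n * m) for k in range(q)]
    # chance agreement p_{e|gamma} = (1/(q-1)) * sum_k pi_k * (1 - pi_k)
    p_e = sum(p * (1 - p) for p in pi_k) / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

For example, with ratings `[[2, 0], [2, 0], [0, 2], [1, 1]]` (two raters, two categories), this returns $9/17 \approx 0.529$.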

The estimator $\hat{\gamma}_1$ is a non-linear statistic of the $r_{ik}$'s. To derive its variance, I have used a linear approximation that includes all terms with a stochastic order of magnitude up to $n^{-1/2}$. This will yield a correct asymptotic variance that includes all terms with an order of magnitude up to $1/n$. Although a rigorous treatment of the asymptotics is not presented here, it is possible to establish that for large values of $n$, a consistent estimator for estimating the variance of $\hat{\gamma}_1$ is given by:

$$v(\hat{\gamma}_1) = \frac{1-f}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\bigl(\hat{\gamma}_{1|i}^{*} - \hat{\gamma}_1\bigr)^{2}, \qquad (39)$$

where $f = n/N$ is the sampling fraction,

$$\hat{\gamma}_{1|i}^{*} = \hat{\gamma}_{1|i} - 2\bigl(1 - \hat{\gamma}_1\bigr)\frac{p_{e\gamma|i} - p_{e|\gamma}}{1 - p_{e|\gamma}},$$

$\hat{\gamma}_{1|i} = (p_{a|i} - p_{e|\gamma})/(1 - p_{e|\gamma})$ is the agreement coefficient with respect to participant $i$, $p_{a|i}$ is given by,

$$p_{a|i} = \sum_{k=1}^{q}\frac{r_{ik}(r_{ik} - 1)}{r(r-1)},$$

and the chance-agreement probability with respect to unit $i$, $p_{e\gamma|i}$, is given by:

$$p_{e\gamma|i} = \frac{1}{q-1}\sum_{k=1}^{q}\frac{r_{ik}}{r}\bigl(1 - \hat{\pi}_k\bigr).$$

To obtain equation (39), one should first derive a large-sample approximation of the actual variance of $\hat{\gamma}_1$. This is achieved by considering that as the size $n$ of the participant sample increases, the chance-agreement probability $p_{e|\gamma}$ converges to a fixed probability $P_{e|\gamma}$ and each classification probability $\hat{\pi}_k$ converges to a constant $\pi_k$. Let us define the following two vectors: $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_q)'$ and $\pi = (\pi_1, \ldots, \pi_q)'$. One can establish that:

$$p_{e|\gamma} - P_{e|\gamma} = \frac{2}{n}\sum_{i=1}^{n}\bigl(p_{e\gamma|i} - P_{e|\gamma}\bigr) + O_p\bigl(\lVert\hat{\pi}-\pi\rVert^{2}\bigr),$$

$$\hat{\gamma}_1 = \frac{(p_a - P_{e|\gamma}) - (1 - G_\gamma)\bigl(p_{e|\gamma} - P_{e|\gamma}\bigr)}{1 - P_{e|\gamma}} + O_p\bigl((p_{e|\gamma} - P_{e|\gamma})^{2}\bigr),$$

where $G_\gamma = (p_a - P_{e|\gamma})/(1 - P_{e|\gamma})$. Combining these two expressions leads to a linear approximation of $\hat{\gamma}_1$, which can be used to approximate the asymptotic variance of $\hat{\gamma}_1$.
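Equation (39) and its components can be evaluated mechanically. The Python sketch below (illustrative names; the sampling fraction defaults to 0) recomputes the per-participant quantities defined above and returns the variance estimate:

```python
def ac1_variance(r, f=0.0):
    """Linearized variance estimator of the generalized AC1, equation (39)."""
    n, q = len(r), len(r[0])
    m = sum(r[0])                                   # number of raters r
    pi_k = [sum(row[k] for row in r) / (n * m) for k in range(q)]
    p_e = sum(p * (1 - p) for p in pi_k) / (q - 1)  # p_{e|gamma}, eq. (38)
    # per-participant agreement p_{a|i} and chance agreement p_{e,gamma|i}
    pa_i = [sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in r]
    peg_i = [sum(row[k] / m * (1 - pi_k[k]) for k in range(q)) / (q - 1)
             for row in r]
    g1 = (sum(pa_i) / n - p_e) / (1 - p_e)          # overall AC1
    # linearized per-participant values gamma*_{1|i}
    g_star = [(pa - p_e) / (1 - p_e) - 2 * (1 - g1) * (pe - p_e) / (1 - p_e)
              for pa, pe in zip(pa_i, peg_i)]
    return (1 - f) / n * sum((g - g1) ** 2 for g in g_star) / (n - 1)
```

Note that the correction term in $\hat{\gamma}_{1|i}^{*}$ averages to zero over the sample, so the $\hat{\gamma}_{1|i}^{*}$'s average back to $\hat{\gamma}_1$ and the sum of squares in (39) is a genuine centred sum.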

An alternative approach for estimating the variance of $\hat{\gamma}_1$ is the jackknife method. The jackknife variance estimator is given by:

$$v_J(\hat{\gamma}_1) = (1-f)\,\frac{n-1}{n}\sum_{i=1}^{n}\bigl(\hat{\gamma}_1^{(i)} - \hat{\gamma}_1^{(\dagger)}\bigr)^{2}, \qquad (40)$$

where $\hat{\gamma}_1^{(i)}$ represents the estimator $\hat{\gamma}_1$ computed after removing participant $i$ from the participant sample, and $\hat{\gamma}_1^{(\dagger)}$ the average of all the $\hat{\gamma}_1^{(i)}$'s.

8. Special case of two raters

Two-rater reliability studies are of special interest. Rating data in this case are often conveniently reported using the distribution of participants by rater and response category, as shown in Table 1. Therefore, the inter-rater reliability coefficient and its associated variance must be expressed as functions of the $n_{kl}$'s.

For two raters classifying $n$ participants into $q$ response categories, Fleiss et al. (1969) proposed an estimator $v(\hat{\gamma}_\kappa\,|\,\text{No agreement})$ for estimating the variance of Cohen's $\kappa$-statistic under the hypothesis of no agreement between the raters. If there exists an agreement between the two raters, Fleiss et al. recommended another variance estimator $v(\hat{\gamma}_\kappa\,|\,\text{Agreement})$. These estimators are given by:

$$v(\hat{\gamma}_\kappa\,|\,\text{No agreement}) = \frac{1-f}{n(1-p_{e|\kappa})^{2}}\Biggl\{\sum_{k=1}^{q}p_{Bk}\,p_{Ak}\bigl[1 - (p_{Bk} + p_{Ak})\bigr]^{2} + \sum_{k=1}^{q}\sum_{\substack{l=1\\ l\neq k}}^{q}p_{Bk}\,p_{Al}\bigl(p_{Bk} + p_{Al}\bigr)^{2} - p_{e|\kappa}^{2}\Biggr\} \qquad (41)$$

and

$$v(\hat{\gamma}_\kappa\,|\,\text{Agreement}) = \frac{1-f}{n(1-p_{e|\kappa})^{2}}\Biggl\{\sum_{k=1}^{q}p_{kk}\bigl[1 - (p_{Ak} + p_{Bk})(1 - \hat{\gamma}_\kappa)\bigr]^{2} + (1 - \hat{\gamma}_\kappa)^{2}\sum_{k=1}^{q}\sum_{\substack{l=1\\ l\neq k}}^{q}p_{kl}\bigl(p_{Bk} + p_{Al}\bigr)^{2} - \bigl[\hat{\gamma}_\kappa - p_{e|\kappa}(1 - \hat{\gamma}_\kappa)\bigr]^{2}\Biggr\}. \qquad (42)$$

It can be shown that $v(\hat{\gamma}_\kappa\,|\,\text{Agreement})$ captures all terms of magnitude order up to $n^{-1}$, is consistent for estimating the true population variance, and provides valid normality-based confidence intervals when the number of participants is reasonably large.

When $r = 2$, the variance of the AC1 statistic given in equation (39) reduces to the following estimator:

$$v(\hat{\gamma}_1) = \frac{1-f}{n(1-p_{e|\gamma})^{2}}\Biggl\{p_a(1-p_a) - 4(1 - \hat{\gamma}_1)\Biggl(\frac{1}{q-1}\sum_{k=1}^{q}p_{kk}\bigl(1 - \hat{\pi}_k\bigr) - p_a\,p_{e|\gamma}\Biggr) + 4(1 - \hat{\gamma}_1)^{2}\Biggl(\frac{1}{(q-1)^{2}}\sum_{k=1}^{q}\sum_{l=1}^{q}p_{kl}\bigl[1 - (\hat{\pi}_k + \hat{\pi}_l)/2\bigr]^{2} - p_{e|\gamma}^{2}\Biggr)\Biggr\}. \qquad (43)$$

As for Scott's $\pi$-estimator, its correct variance is given by:

$$v(\hat{\gamma}_\pi) = \frac{1-f}{n(1-p_{e|\pi})^{2}}\Biggl\{p_a(1-p_a) - 4(1 - \hat{\gamma}_\pi)\Biggl(\sum_{k=1}^{q}p_{kk}\,\hat{\pi}_k - p_a\,p_{e|\pi}\Biggr) + 4(1 - \hat{\gamma}_\pi)^{2}\Biggl(\sum_{k=1}^{q}\sum_{l=1}^{q}p_{kl}\bigl[(\hat{\pi}_k + \hat{\pi}_l)/2\bigr]^{2} - p_{e|\pi}^{2}\Biggr)\Biggr\} \qquad (44)$$

For the sake of comparability, one should note that the correct variance of kappa can be rewritten as follows:

$$v(\hat{\gamma}_\kappa) = \frac{1-f}{n(1-p_{e|\kappa})^{2}}\Biggl\{p_a(1-p_a) - 4(1 - \hat{\gamma}_\kappa)\Biggl(\sum_{k=1}^{q}p_{kk}\,\hat{\pi}_k - p_a\,p_{e|\kappa}\Biggr) + 4(1 - \hat{\gamma}_\kappa)^{2}\Biggl(\sum_{k=1}^{q}\sum_{l=1}^{q}p_{kl}\bigl[(p_{Ak} + p_{Bl})/2\bigr]^{2} - p_{e|\kappa}^{2}\Biggr)\Biggr\}. \qquad (45)$$

The variance of the G-index is given by:

$$v(\hat{\gamma}_G) = 4\,\frac{1-f}{n}\,p_a(1-p_a). \qquad (46)$$
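For the two-rater case, equation (43) can be evaluated directly from the $q \times q$ table of joint classification proportions $p_{kl}$. The Python sketch below is illustrative (the function name and table layout, with rows indexing rater A's category and columns rater B's, are my choices):

```python
def ac1_two_raters(p, n, f=0.0):
    """Two-rater AC1 and its variance (equation 43), from the q x q table
    p[k][l] of joint classification proportions (rows: rater A, cols: rater B)."""
    q = len(p)
    pA = [sum(p[k]) for k in range(q)]                       # rater A marginals
    pB = [sum(p[k][l] for k in range(q)) for l in range(q)]  # rater B marginals
    pi_k = [(pA[k] + pB[k]) / 2 for k in range(q)]  # average classification prob.
    p_a = sum(p[k][k] for k in range(q))            # observed agreement
    p_e = sum(pk * (1 - pk) for pk in pi_k) / (q - 1)
    g1 = (p_a - p_e) / (1 - p_e)
    term1 = p_a * (1 - p_a)
    term2 = sum(p[k][k] * (1 - pi_k[k]) for k in range(q)) / (q - 1) - p_a * p_e
    term3 = (sum(p[k][l] * (1 - (pi_k[k] + pi_k[l]) / 2) ** 2
                 for k in range(q) for l in range(q)) / (q - 1) ** 2 - p_e ** 2)
    var = ((1 - f) / (n * (1 - p_e) ** 2)
           * (term1 - 4 * (1 - g1) * term2 + 4 * (1 - g1) ** 2 * term3))
    return g1, var
```

For instance, with `p = [[0.45, 0.05], [0.05, 0.45]]` and `n = 100`, both marginals are 0.5, $p_a = 0.9$, $p_{e|\gamma} = 0.5$, and the function returns $\hat{\gamma}_1 = 0.8$ with variance 0.0036 (standard error 6%).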

Using the rating data of Table 3, I obtained the following inter-rater reliability estimates and variance estimates:

Statistic    Estimate (%)    Standard error (%)
AC1              94.08             2.30
Kappa            -2.34             1.23
Pi               -2.88             1.09
G-Index          88.80             4.11

Because the percentage agreement $p_a$ equals 94.4%, it appears that AC1 and the G-index are more consistent with the observed extent of agreement. The $\kappa$ and $\pi$ statistics have low values that are very inconsistent with the data configuration and would be difficult to justify. If the standard error is compared with the inter-rater reliability estimate, AC1 appears to be the most accurate of all agreement coefficients.

9. Monte-Carlo simulation

In order to compare the biases of the various inter-rater reliability coefficients under investigation and to verify the validity of the different variance estimators discussed in the previous sections, I have conducted a small Monte-Carlo experiment. This experiment involves two raters, A and B, who must classify $n$ (for $n = 20, 60, 80, 100$) participants into one of two possible categories, '+' and '-'.

All the Monte-Carlo experiments are based upon the assumption of a prevalence rate of $P_r = 95\%$. A propensity for random rating $\theta_A$ is set for rater A and another one $\theta_B$ for rater B at the beginning of each experiment. These parameters allow us to use equation (19) to determine the 'true' inter-rater reliability to be estimated. Each Monte-Carlo experiment is conducted as follows:

- The $n$ participants are first randomly classified into the two categories '+' and '-' in such a way that a participant falls into category '+' with probability $P_r$.
- If a rater performs a random rating (with probabilities $\theta_A$ for rater A and $\theta_B$ for rater B), then the participant to be rated is randomly classified into one of the two categories with the same probability 1/2. A non-random rating is supposed to lead to a correct classification.
- The number of replicate samples drawn in this simulation is 500.
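The steps above can be sketched as follows (a minimal Python illustration; the function name, the optional seeding, and the choice of returning a 2 x 2 count table are mine, not the paper's):

```python
import random

def simulate_replicate(n, prevalence=0.95, theta_a=0.05, theta_b=0.05, seed=None):
    """One Monte-Carlo replicate: returns the 2 x 2 table of counts,
    rows = rater A's category, columns = rater B's (0 = '+', 1 = '-')."""
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]
    for _ in range(n):
        true_cat = 0 if rng.random() < prevalence else 1
        # a random rating picks either category with probability 1/2;
        # a non-random rating reproduces the true category
        a = rng.randrange(2) if rng.random() < theta_a else true_cat
        b = rng.randrange(2) if rng.random() < theta_b else true_cat
        counts[a][b] += 1
    return counts
```

Repeating this 500 times and applying each agreement statistic to the resulting tables reproduces the design behind Tables 5 and 6.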

Each Monte-Carlo experiment has two specific objectives, which are to evaluate the magnitude of the biases associated with the agreement coefficients and to verify the validity of their variance estimators.

The bias of an estimator is measured by the difference between its Monte-Carlo expectation and the 'true' inter-rater reliability. The bias of a variance estimator, on the other hand, is obtained by comparing its Monte-Carlo expectation with the Monte-Carlo variance of the agreement coefficient. A small bias is desirable, as it indicates that a given estimator or variance estimator has neither a tendency to overestimate the true population parameter nor a tendency to underestimate it.

In the simulation programmes, the calculation of the $\pi$-statistic and that of the $\kappa$-statistic were modified slightly in order to avoid the difficulty posed by undefined estimates. When $p_{e|\pi} = 1$ or $p_{e|\kappa} = 1$, these chance-agreement probabilities were replaced with 0.99999 so that the agreement coefficient can be defined.


Table 5 contains the relative bias of the agreement coefficients $\hat{\gamma}_\pi$, $\hat{\gamma}_\kappa$, $\hat{\gamma}_G$, and $\hat{\gamma}_1$. A total of 500 replicate samples were selected and for each sample $s$ an estimate $\hat{\gamma}_s$ was calculated. The relative bias is obtained as follows:

$$\text{RelBias}(\hat{\gamma}) = \Biggl(\frac{1}{500}\sum_{s=1}^{500}\hat{\gamma}_s - \gamma\Biggr)\Big/\,\gamma,$$

where $\gamma$ is the 'true' inter-rater reliability obtained with equation (19). It follows from Table 5 that the relative bias of the AC1 estimator, which varies from -0.8% to 0.0% when $\theta_A = \theta_B = 5\%$, and from -2.1% to -1.3% when $\theta_A = 20\%$ and $\theta_B = 5\%$, is consistently smaller than the relative bias of the other inter-rater reliability statistics. The $\pi$ and $\kappa$ statistics generally exhibit a very large negative bias under current conditions, ranging from -32.8% to -62.5%. The main advantage of the AC1 statistic over the G-index stems from the fact that when the rater's propensity for random rating is large (i.e. around 35%), the bias of the G-index is at its highest, while that of the AC1 will decrease as the trait prevalence increases.
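Given the 500 replicate estimates, the relative bias computation above is a one-liner (an illustrative Python sketch; names are mine):

```python
def relative_bias(estimates, true_gamma):
    """RelBias = (mean of the replicate estimates - true value) / true value."""
    return (sum(estimates) / len(estimates) - true_gamma) / true_gamma
```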

Table 6 shows the Monte-Carlo variances of the four agreement statistics under investigation, as well as the Monte-Carlo expectations of the associated variance estimators. The Monte-Carlo expectation of a variance estimator $v$ is obtained by averaging all 500 variance estimates $v_s$ obtained from each replicate sample $s$. The Monte-Carlo variance of an agreement coefficient $\hat{\gamma}$, on the other hand, is obtained by averaging all 500 squared differences between the estimates $\hat{\gamma}_s$ and their average. More formally, the Monte-Carlo expectation $E(v)$ of a variance estimator $v$ is defined as follows:

$$E(v) = \frac{1}{500}\sum_{s=1}^{500}v_s,$$

while the Monte-Carlo variance $V(\hat{\gamma})$ of an agreement statistic $\hat{\gamma}$ is given by:

$$V(\hat{\gamma}) = \frac{1}{500}\sum_{s=1}^{500}\bigl[\hat{\gamma}_s - \text{average}(\hat{\gamma})\bigr]^{2}.$$

Table 5. Relative bias of agreement coefficients for $P_r = 0.95$ based on 500 replicate samples

θ_A, θ_B                    n     B(γ̂_π)%   B(γ̂_κ)%   B(γ̂_G)%   B(γ̂_1)%
θ_A(*) = θ_B(*) = 5%        20     -32.8      -32.0      -3.6       0.0
                            60     -39.5      -39.3      -5.1      -0.7
                            80     -36.5      -36.4      -4.9      -0.6
                           100     -35.1      -35.0      -5.2      -0.8
θ_A = 20%, θ_B = 5%         20     -62.5      -59.9     -11.9      -2.1
                            60     -58.4      -57.0     -11.7      -1.4
                            80     -58.2      -56.9     -12.1      -1.6
                           100     -57.4      -56.3     -11.6      -1.3

(*) θ_A and θ_B represent the propensity for random rating of raters A and B, respectively.

Table 6. Monte-Carlo variances and Monte-Carlo expectations of variance estimates for $P_r = 0.95$, θ_A(*) = θ_B(*) = 0.05

  n    V(γ̂_π)%  E[v(γ̂_π)]%  V(γ̂_κ)%  E[v(γ̂_κ)]%  V(γ̂_G)%  E[v(γ̂_G)]%  V(γ̂_1)%  E[v(γ̂_1)]%
 20     15.8       3.3        15.0       3.13       0.79      0.78       0.32      0.33
 60      6.0       3.9         5.9       3.83       0.28      0.31       0.10      0.12
 80      3.9       3.0         3.8       3.00       0.24      0.23       0.09      0.09
100      2.5       2.4         2.5       2.39       0.17      0.19       0.07      0.07

(*) θ_A and θ_B represent the propensity for random rating of raters A and B, respectively.

It follows from Table 6 that the variance of the AC1 statistic is smaller than that of the other statistics. In fact, $V(\hat{\gamma}_1)$ varies from 0.07% when the sample size is 100 to 0.33% when the sample size is 20. The second smallest variance is that of the G-index, which varies from 0.17% to 0.79%. The $\kappa$ and $\pi$ statistics generally have larger variances, which range from 2% to about 15%. An examination of the Monte-Carlo expectations of the various variance estimators indicates that the proposed variance estimators for AC1 and the G-index work very well. Even for a small sample size, these expectations are very close to the Monte-Carlo approximations. The variance estimators of the $\kappa$ and $\pi$ statistics also work well except for small sample sizes, for which they underestimate the 'true' variance.

10. Concluding remarks

In this paper, I have explored the problem of inter-rater reliability estimation when the extent of agreement between raters is high. The paradox of the $\kappa$ and $\pi$ statistics has been investigated and an alternative agreement coefficient proposed. I have proposed new variance estimators for the $\kappa$, $\pi$ and AC1 statistics using the linearization and jackknife methods. The validity of these variance estimators does not depend upon the assumption of independence. The absence of such variance estimators has prevented practitioners from constructing confidence intervals for multiple-rater agreement coefficients.

I have introduced the AC1 statistic, which is shown to have better statistical properties than its $\kappa$, $\pi$ and G-index competitors. The $\kappa$ and $\pi$ estimators became well known for their supposed ability to correct the percentage agreement for chance agreement. However, this paper argues that not all observed ratings would lead to agreement by chance. This will particularly be the case if the extent of agreement is high in a situation of high trait prevalence. Kappa and pi evaluate the chance-agreement probability as if all observed ratings might yield an agreement by chance. This may lead to unpredictable results with rating data that suggest a rather small propensity for chance agreement. The AC1 statistic was developed in such a way that the propensity for chance agreement is proportional to the proportion of ratings that may lead to an agreement by chance, reducing the overall chance agreement to the right magnitude.

The simulation results tend to indicate that the AC1 and G-index statistics have reasonably small biases for estimating the 'true' inter-rater reliability, while the $\kappa$ and $\pi$ statistics tend to underestimate it. The AC1 outperforms the G-index when the trait prevalence is high or low. If the trait prevalence is around 50%, all agreement statistics perform alike. The absolute bias in this case increases with the raters' propensity for random rating, which can be reduced by giving extra training to the raters. The proposed variance estimators work well according to our simulations. For small sample sizes, the variance estimators proposed for the $\kappa$ and $\pi$ statistics tend to underestimate the true variances.

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley.

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.

Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press.

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558.

Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.

Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323–327.

Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24, 749–753.

Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.

Kraemer, H. C. (1980). Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika, 44, 461–472.

Landis, R. J., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33, 363–374.

Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365–377.

Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B, 11, 68–84.

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.

Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical Statistics, 29, 614.

Received 6 January 2006; revised version received 14 June 2006
