Computing inter-rater reliability and its variance
in the presence of high agreement
Kilem Li Gwet*
STATAXIS Consulting, Gaithersburg, USA

*Correspondence should be addressed to Dr Kilem Li Gwet, STATAXIS Consulting, 15914B Shady Grove Road, No. 145, Gaithersburg, MD 20877-1322, USA (e-mail: gwet@stataxis.com).
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the $AC_1$ coefficient. Also proposed are new variance estimators for the multiple-rater generalized π and $AC_1$ statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte-Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of $AC_1$ as an improved alternative to existing inter-rater reliability statistics.
1. Introduction
Researchers in various fields often need to evaluate the quality of a data collection
method. In many studies, a data collection tool, such as a survey questionnaire, a
laboratory procedure or a classification system, is used by different people referred to as
raters, observers or judges. In an effort to minimize the effect of the rater factor on data
quality, investigators like to know whether all raters apply the data collection method in
a consistent manner. Inter-rater reliability quantifies the closeness of scores assigned by
a pool of raters to the same study participants. The closer the scores, the higher the
reliability of the data collection method. Although reliability data can be discrete or
continuous, the focus of this paper is on inter-rater reliability assessment on nominally
scaled data. Such data are typically obtained from studies where raters must classify
study participants into one category among a limited number of possible categories.
Banerjee, Capozzoli, McSweeney, and Sinha (1999) provide a good review of the
techniques developed to date for analysing nominally scaled data. Two of the most
influential papers in this area are those of Fleiss (1971) and Fleiss, Cohen, and Everitt
(1969), which contain the most popular results in use today. Fleiss et al. provide large sample approximations of the variances of the κ and weighted κ statistics suggested by Cohen (1960, 1968), respectively, in the case of two raters, while Fleiss extends the π-statistic to the case of multiple raters. Landis and Koch (1977) also give an instructive discussion of inter-rater agreement among multiple observers. Agresti (2002) presents several modelling techniques for analysing rating data in addition to presenting a short account of the state of the art. Light (1971) introduces measures of agreement conditionally upon a specific classification category, and proposes a generalization of Cohen's κ-coefficient to the case of multiple raters. Conger (1980) suggests an alternative multiple-rater agreement statistic obtained by averaging all pairwise overall and chance-corrected probabilities proposed by Cohen (1960). Conger (1980) also extends the notion of pairwise agreement to that of g-wise agreement, where agreement occurs if g raters rather than two raters classify an object into the same category.
In section 2, I introduce the most commonly used pairwise indexes. Section 3 discusses a theoretical framework for analysing the origins of the kappa paradox. An alternative and more stable agreement coefficient, referred to as the $AC_1$ statistic, is introduced in section 4. Section 5 is devoted to the analysis of the bias associated with the various pairwise agreement coefficients under investigation. In section 6, a variance estimator for the generalized π-statistic is proposed, which is valid even under the assumption of dependence of ratings. Section 7 presents a variance estimator of the $AC_1$ statistic, which is always valid. The important special case of two raters is discussed in section 8, while section 9 describes a small simulation study aimed at verifying the validity of the variance estimators as well as the magnitude of the biases associated with the various indexes under investigation.
2. Cohen's κ, Scott's π, the G-index and Fleiss's generalized π
In a two-rater reliability study involving raters A and B, the data will be reported in a two-way contingency table such as Table 1. Table 1 shows the distribution of n study participants by rater and response category, where $n_{kl}$ indicates the number of participants that raters A and B classified into categories k and l, respectively.

All inter-rater reliability coefficients discussed in this paper have two components: the overall agreement probability $p_a$, which is common to all coefficients, and the chance-agreement probability $p_e$, which is specific to each index.

Table 1. Distribution of n participants by rater and response category

                Rater B
Rater A         1        2        ...      q        Total
1               n_11     n_12     ...      n_1q     n_A1
2               n_21     n_22     ...      n_2q     n_A2
...             ...      ...      ...      ...      ...
q               n_q1     n_q2     ...      n_qq     n_Aq
Total           n_B1     n_B2     ...      n_Bq     n
For the two-rater reliability data of Table 1, the overall agreement probability is given by:

$$p_a = \sum_{k=1}^{q} p_{kk}, \quad \text{where } p_{kk} = n_{kk}/n.$$

Let $p_{Ak} = n_{Ak}/n$, $p_{Bk} = n_{Bk}/n$, and $\hat{p}_k = (p_{Ak} + p_{Bk})/2$. Cohen's κ-statistic is given by:

$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}}, \quad \text{where } p_{e|\kappa} = \sum_{k=1}^{q} p_{Ak}\, p_{Bk}.$$

Scott (1955) proposed the π-statistic given by:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad \text{where } p_{e|\pi} = \sum_{k=1}^{q} \hat{p}_k^2.$$
The G-index of Holley and Guilford (1964) is given by:

$$\hat{\gamma}_G = \frac{p_a - p_{e|G}}{1 - p_{e|G}},$$

where $p_{e|G} = 1/q$, and q represents the number of response categories. Note that the expression used for $\hat{\gamma}_G$ here is more general than the original Holley–Guilford formula, which was presented for the simpler situation of two raters and two response categories only.
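As a concrete illustration, the sketch below computes $p_a$ and the three pairwise coefficients directly from a two-rater contingency table of the kind shown in Table 1. It is a minimal implementation of the formulas above, not code from the original study; the function and variable names are my own.

```python
import numpy as np

def pairwise_coefficients(counts):
    """Compute pa, kappa, pi and the G-index from a q x q contingency
    table `counts`, where counts[k, l] is the number of participants
    rater A placed in category k and rater B placed in category l."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    q = counts.shape[0]
    p = counts / n                      # cell proportions p_kl
    p_a = np.trace(p)                   # overall agreement probability
    p_A = p.sum(axis=1)                 # rater A marginals p_Ak
    p_B = p.sum(axis=0)                 # rater B marginals p_Bk
    p_hat = (p_A + p_B) / 2             # average marginals
    pe_kappa = np.sum(p_A * p_B)        # chance agreement, Cohen's kappa
    pe_pi = np.sum(p_hat ** 2)          # chance agreement, Scott's pi
    pe_G = 1 / q                        # chance agreement, G-index
    chance_correct = lambda pe: (p_a - pe) / (1 - pe)
    return {"pa": p_a,
            "kappa": chance_correct(pe_kappa),
            "pi": chance_correct(pe_pi),
            "G": chance_correct(pe_G)}
```

Applied to the data of Table 3 in section 3, this returns $p_a = 0.944$ with slightly negative values for κ and π, which is precisely the paradoxical behaviour examined there.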
If a reliability study involves an arbitrarily large number r of raters, rating data are often reported in a frequency table showing the distribution of raters by participant and response category, as described in Table 2. For a given participant i and category k, $r_{ik}$ represents the number of raters who classified participant i into category k.

Fleiss (1971) extended Scott's π-statistic to the case of multiple raters (r) and proposed the following equation:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad\text{where}\quad
p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)}, \quad
p_{e|\pi} = \sum_{k=1}^{q}\hat{p}_k^2, \quad\text{and}\quad
\hat{p}_k = \frac{1}{n}\sum_{i=1}^{n}\frac{r_{ik}}{r}. \qquad (1)$$
Table 2. Distribution of r raters by participant and response category

                    Category
Participant         1        2        ...      q        Total
1                   r_11     r_12     ...      r_1q     r
2                   r_21     r_22     ...      r_2q     r
...                 ...      ...      ...      ...      ...
n                   r_n1     r_n2     ...      r_nq     r
Total               r_+1     r_+2     ...      r_+q     nr
The terms $p_a$ and $p_{e|\pi}$ are, respectively, the overall agreement probability and the probability of agreement due to chance. Conger (1980) suggested a generalized version of the κ-statistic that is obtained by averaging all r(r−1)/2 pairwise κ-statistics as defined by Cohen (1960). The κ-statistic can also be generalized as follows:

$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}},$$

where $p_a$ is defined as above and the chance-agreement probability $p_{e|\kappa}$ is given by:

$$p_{e|\kappa} = \sum_{k=1}^{q}\sum_{a=2}^{r}(-1)^a\left(\sum_{i_1<\cdots<i_a}\;\prod_{j=1}^{a} p_{k i_j}\right). \qquad (2)$$

The term $p_{k i_j}$ ($j = 1, \ldots, a$) represents the proportion of participants that rater $i_j$ classified into category k. It follows from equation (2) that if r = 2, then $p_{e|\kappa}$ reduces to the usual formula of chance-agreement probability for the κ-statistic. For r = 3 and r = 4 the chance-agreement probabilities are, respectively, given by:

$$p_{e|\kappa}(3) = \sum_{k=1}^{q}\left(p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k2}p_{k3} - p_{k1}p_{k2}p_{k3}\right),$$

$$\begin{aligned}
p_{e|\kappa}(4) = \sum_{k=1}^{q}\Big[&\left(p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k1}p_{k4} + p_{k2}p_{k3} + p_{k2}p_{k4} + p_{k3}p_{k4}\right) \\
&- \left(p_{k1}p_{k2}p_{k3} + p_{k1}p_{k2}p_{k4} + p_{k1}p_{k3}p_{k4} + p_{k2}p_{k3}p_{k4}\right) + p_{k1}p_{k2}p_{k3}p_{k4}\Big].
\end{aligned}$$

This general version of the κ-statistic has not been studied yet and no expression for its variance is available. There is no indication, however, that it has better statistical properties than Fleiss's generalized statistic. Nevertheless, a practitioner interested in using this estimator may still estimate its variance using the jackknife method described by equation (36) for the π-statistic. Hubert (1977) discusses other possible extensions of the κ-statistic to the case of multiple raters.
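To make equations (1) and (2) concrete, the following sketch (my own illustration, not code from the paper) computes Fleiss's generalized π and the inclusion–exclusion chance-agreement probability of equation (2) from an n × r matrix of category labels; the generalized κ is then obtained as $(p_a - p_{e|\kappa})/(1 - p_{e|\kappa})$ with $p_a$ computed as in `fleiss_pi`.

```python
import numpy as np
from itertools import combinations

def fleiss_pi(ratings, q):
    """Fleiss's generalized pi (equation 1). `ratings` is an (n, r) array of
    category labels in {0, ..., q-1}; every participant is rated by all r raters."""
    n, r = ratings.shape
    # r_ik: number of raters placing participant i in category k
    r_ik = np.stack([(ratings == k).sum(axis=1) for k in range(q)], axis=1)
    p_a = (r_ik * (r_ik - 1)).sum() / (n * r * (r - 1))
    p_hat = r_ik.sum(axis=0) / (n * r)
    p_e = np.sum(p_hat ** 2)
    return (p_a - p_e) / (1 - p_e)

def conger_chance_agreement(ratings, q):
    """Chance-agreement probability of the generalized kappa (equation 2),
    computed by inclusion-exclusion over all subsets of a >= 2 raters."""
    n, r = ratings.shape
    # p[k, j]: proportion of participants that rater j classified into category k
    p = np.stack([(ratings == k).mean(axis=0) for k in range(q)], axis=0)
    p_e = 0.0
    for a in range(2, r + 1):
        for subset in combinations(range(r), a):
            p_e += (-1) ** a * np.prod(p[:, subset], axis=1).sum()
    return p_e
```

For r = 2 the inclusion–exclusion sum reduces to the single pairwise term $\sum_k p_{k1}p_{k2}$, in line with the remark above.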
3. Paradox's origin

Table 3 contains an example of rating data. These data illustrate the limitations of equation (1) as a measure of the extent of agreement between raters.

Table 3. Distribution of 125 participants by rater and response category

                Rater B
Rater A         +        −        Total
+               118      5        123
−               2        0        2
Total           120      5        125
For those data, $\hat{\gamma}_\pi = (0.9440 - 0.9456)/(1 - 0.9456) = -0.0288$, a negative value. This result is the opposite of what our intuition would suggest and illustrates one of the paradoxes noted by Cicchetti and Feinstein (1990), where high agreement is coupled with low κ. In this example, raters A and B would be expected to have high inter-rater reliability.

To understand the nature and the causes of the paradoxical behaviour of the π- and κ-statistics, I will confine myself to the case of two raters, A and B, who must identify the presence or absence of a trait on individuals of a given population of interest. These individuals will eventually be selected to participate in a study, and are therefore potential study participants. The two raters will classify participants into the '+' or '−' categories according to whether the trait is found or not. I will study how agreement indexes are affected by the raters' sensitivity, specificity and the trait prevalence in the population. A rater's sensitivity is defined as the conditional probability of classifying a participant into the '+' category given that the trait is indeed present. A rater's specificity is the conditional probability of classifying a participant into the '−' category given that the trait is actually absent.
Let $\alpha_A$ and $\alpha_B$ denote, respectively, rater A's and rater B's sensitivity values. Similarly, $\beta_A$ and $\beta_B$ will denote rater A's and rater B's specificity values. It follows that the probabilities $P_{A+}$ and $P_{B+}$ for raters A and B to classify a participant into the '+' category are given by

$$P_{A+} = P_r\alpha_A + (1-P_r)(1-\beta_A), \qquad (3)$$

$$P_{B+} = P_r\alpha_B + (1-P_r)(1-\beta_B), \qquad (4)$$

where $P_r$ represents the population trait prevalence. Our objective is to study how trait prevalence, sensitivity and specificity affect inter-rater reliability. For the sake of simplicity I will make the following two assumptions:

(A1) Sensitivity and specificity are identical for both raters. That is, $\alpha_A = \beta_A$ and $\alpha_B = \beta_B$.

(A2) Correct classifications are independent. That is, if $\alpha_{AB}$ denotes the probability that raters A and B correctly classify an individual into the '+' category, then $\alpha_{AB} = \alpha_A\alpha_B$.
The probability $P_a$ that both raters agree is given by $P_a = p_{++} + p_{--}$, where $p_{++}$ and $p_{--}$ are obtained as follows:

$$p_{++} = \alpha_A\alpha_B P_r + (1-P_r)(1-\beta_A)(1-\beta_B) = \alpha_A\alpha_B P_r + (1-P_r)(1-\alpha_A)(1-\alpha_B),$$

and $p_{--} = 1 - (P_{A+} + P_{B+} - p_{++})$. The following important equation can be established:

$$P_a = (1-\alpha_A)(1-\alpha_B) + \alpha_A\alpha_B. \qquad (5)$$

This relation shows that the overall agreement probability between two raters A and B does not depend upon the trait prevalence. Rather, it depends upon the raters' sensitivity and specificity values.
The partial derivative with respect to $P_r$ of an inter-rater coefficient of the form $\gamma = (P_a - P_e)/(1-P_e)$ is given by

$$\frac{\partial\gamma}{\partial P_r} = -\frac{1-P_a}{(1-P_e)^2}\,\frac{\partial P_e}{\partial P_r}, \qquad (6)$$

since, from equation (5), one can conclude that $\partial P_a/\partial P_r = 0$. For Scott's π- and Cohen's κ-statistics,

$$\frac{\partial\gamma_\pi}{\partial P_r} = 2(1-2\lambda)^2\,\frac{(1-P_a)(1-2P_r)}{(1-P_{e|\pi})^2}, \qquad (7)$$

$$\frac{\partial\gamma_\kappa}{\partial P_r} = 2(1-2\alpha_A)(1-2\alpha_B)\,\frac{(1-P_a)(1-2P_r)}{(1-P_{e|\kappa})^2}, \qquad (8)$$

where $\lambda = (\alpha_A+\alpha_B)/2$. Let $p_+$ be the probability that a randomly chosen rater classifies a randomly chosen participant into the '+' category. Then,

$$p_+ = (P_{A+} + P_{B+})/2 = \lambda P_r + (1-\lambda)(1-P_r). \qquad (9)$$

The two equations (7) and (8) are derived from the fact that $P_{e|\pi} = p_+^2 + (1-p_+)^2$ and $P_{e|\kappa} = 1 - (P_{A+}+P_{B+}) + 2P_{A+}P_{B+}$. It follows from assumption A1 that $\partial\gamma_G/\partial P_r = 0$, since $\hat{\gamma}_G$ is solely a function of $P_a$. Under this assumption, the G-index takes a constant value of $2P_a - 1$ that depends only on the raters' sensitivity. Equation (6) shows that the chance-agreement probability plays a pivotal role in how inter-rater reliability relates to trait prevalence. Equation (7) indicates that Scott's π-statistic is an increasing function of $P_r$ for values of trait prevalence between 0 and 0.50, and becomes decreasing for $P_r > 0.50$, reaching its maximum value when $P_r = 0.50$. Because $0 \le P_r \le 1$, $\hat{\gamma}_\pi$ takes its smallest value at $P_r = 0$ and $P_r = 1$. Using equation (5) and the expression of $P_{e|\pi}$, one can show that:

$$\gamma_\pi = \frac{(2\lambda-1)^2 P_r(1-P_r) - (\alpha_A-\alpha_B)^2/4}{(2\lambda-1)^2 P_r(1-P_r) + \lambda(1-\lambda)}. \qquad (10)$$

It follows that

$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \gamma_\pi = -\frac{(\alpha_A-\alpha_B)^2}{4\lambda(1-\lambda)}; \qquad (11)$$

$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\pi = 2P_a - 1 = 1 - 4\lambda + 4\alpha_A\alpha_B. \qquad (12)$$
Equations (10), (11) and (12) show very well how paradoxes often occur in practice. From equation (11) it appears that whenever a trait is very rare or omnipresent, Scott's π-statistic yields a negative inter-rater reliability regardless of the raters' sensitivity values. In other words, if prevalence is low or high, any large extent of agreement between raters will not be reflected in the π-statistic.
Equation (8), on the other hand, indicates that when trait prevalence is smaller than 0.5, Cohen's κ-statistic may be an increasing or a decreasing function of trait prevalence, depending on raters A's and B's sensitivity values. That is, if one rater has a sensitivity smaller than 0.5 and the other a sensitivity greater than 0.5, then the κ-statistic is a decreasing function of $P_r$; otherwise it is increasing. The situation is similar when trait prevalence is greater than 0.50. The maximum or minimum value of κ is reached at $P_r = 0.50$. If one rater has a sensitivity of 0.50, then κ = 0 regardless of the trait prevalence. The general equation of Cohen's κ-statistic is given by:

$$\gamma_\kappa = \frac{(2\alpha_A-1)(2\alpha_B-1)P_r(1-P_r)}{(2\alpha_A-1)(2\alpha_B-1)P_r(1-P_r) + (1-P_a)/2}. \qquad (13)$$

It follows that:

$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \hat{\gamma}_\kappa = 0; \qquad (14)$$

$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\kappa = 2P_a - 1 = 1 - 4\lambda + 4\alpha_A\alpha_B. \qquad (15)$$

Similar to Scott's π-statistic, κ seems to yield reasonable values only when trait prevalence is close to 0.5. A value of trait prevalence that is either close to 0 or close to 1 will considerably reduce the ability of κ to reflect any extent of agreement between raters.
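The following short sketch (my own illustration under assumptions A1 and A2, using equations (5), (10) and (13)) makes the prevalence effect concrete: with both sensitivities fixed at 0.9, the overall agreement probability stays at $P_a = 0.82$ for every prevalence, while π and κ collapse towards 0 as $P_r$ approaches 0 or 1.

```python
import numpy as np

def pi_kappa_vs_prevalence(alpha_A, alpha_B, prevalences):
    """Population values of Scott's pi (eq. 10) and Cohen's kappa (eq. 13)
    under assumptions A1 and A2, for given sensitivities and trait prevalences."""
    lam = (alpha_A + alpha_B) / 2
    P_a = (1 - alpha_A) * (1 - alpha_B) + alpha_A * alpha_B    # equation (5)
    Pr = np.asarray(prevalences, dtype=float)
    v = Pr * (1 - Pr)
    gamma_pi = ((2 * lam - 1) ** 2 * v - (alpha_A - alpha_B) ** 2 / 4) / \
               ((2 * lam - 1) ** 2 * v + lam * (1 - lam))       # equation (10)
    c = (2 * alpha_A - 1) * (2 * alpha_B - 1)
    gamma_kappa = c * v / (c * v + (1 - P_a) / 2)               # equation (13)
    return P_a, gamma_pi, gamma_kappa

P_a, g_pi, g_k = pi_kappa_vs_prevalence(0.9, 0.9, [0.05, 0.25, 0.50, 0.75, 0.95])
# P_a stays at 0.82; pi and kappa equal 0.64 at Pr = 0.50
# and shrink towards 0 as Pr approaches 0 or 1.
```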
Many inter-rater agreement coefficients proposed in the literature have been criticized on the grounds that they are dependent upon trait prevalence. Such a dependence is inevitable if the raters' sensitivity levels differ from their specificity levels. In fact, without assumption A1, even the overall agreement probability $P_a$ depends upon the trait prevalence $P_r$, since $P_a$ can be expressed as follows:

$$P_a = \big\{2(\alpha_A\alpha_B - \beta_A\beta_B) - [(\alpha_A+\alpha_B) - (\beta_A+\beta_B)]\big\}P_r + \big[1 + 2\beta_A\beta_B - (\beta_A+\beta_B)\big].$$

However, the impact of prevalence on the overall agreement probability is small if sensitivity and specificity are reasonably close.
The previous analysis indicates that the G-index, the π-statistic and the κ-statistic all have the same reasonable behaviour when trait prevalence $P_r$ takes a value in the neighbourhood of 0.5. However, their behaviour becomes very erratic (with the exception of the G-index) as soon as trait prevalence goes to the extremes. I argue that the chance-agreement probability used in these statistics is ill-estimated when trait prevalence is in the neighbourhood of 0 or 1. I will now propose a new agreement coefficient that shares the reasonable behaviour of its competitors in the neighbourhood of 0.5, but outperforms them when trait prevalence goes to the extremes.
4. An alternative agreement statistic

Before I introduce an improved alternative inter-rater reliability coefficient, it is necessary to develop a clear picture of the goal one normally attempts to achieve by correcting inter-rater reliability for chance agreement. My premises are the following:

(a) Chance agreement occurs when at least one rater rates an individual randomly.
(b) Only an unknown portion of the observed ratings is subject to randomness.

I will consider that a rater A classifies an individual into one of two categories either randomly, when he or she does not know where it belongs, or with certainty, when he or she is certain about its 'true' membership. Rater A performs a random rating not all the time, but with a probability $\theta_A$. That is, $\theta_A$ is the propensity for rater A to perform a random rating. The participants not classified randomly are assumed to have been classified into the correct category. If the random portion of the study were identifiable, the rating data of two raters A and B classifying N individuals into categories '+' and '−' could be reported as shown in Table 4.
Note that $N_{+-\cdot RC}$, for example, represents the number of individuals that rater A classified randomly into the '+' category and that rater B classified with certainty into the '−' category. In general, for $(k,l) \in \{+,-\}$ and $(X,Y) \in \{R,C\}$, $N_{kl\cdot XY}$ represents the number of individuals that rater A classified into category k using classification method X (random or certain), and that rater B classified into category l using classification method Y (random or certain).

To evaluate the extent of agreement between raters A and B from Table 4, what is needed is the ability to remove from consideration all agreements that occurred by chance, that is, $N_{++\cdot RR} + N_{++\cdot CR} + N_{++\cdot RC} + N_{--\cdot RR} + N_{--\cdot CR} + N_{--\cdot RC}$. This yields the following 'true' inter-rater reliability:

$$\gamma = \frac{N_{++\cdot CC} + N_{--\cdot CC}}{N - \sum_{k\in\{+,-\}}\left(N_{kk\cdot RR} + N_{kk\cdot CR} + N_{kk\cdot RC}\right)}. \qquad (16)$$

Equation (16) can also be written as:

$$\gamma = \frac{P_a - P_e}{1 - P_e}, \quad\text{where}\quad
P_a = \sum_{k\in\{+,-\}}\;\sum_{X,Y\in\{C,R\}} \frac{N_{kk\cdot XY}}{N}
\quad\text{and}\quad
P_e = \sum_{k\in\{+,-\}}\;\sum_{\substack{X,Y\in\{C,R\}\\ (X,Y)\neq(C,C)}} \frac{N_{kk\cdot XY}}{N}. \qquad (17)$$
In a typical reliability study the two raters A and B would rate n study participants, and the rating data would be reported as shown in Table 1, with q = 2. The problem is to find a good statistic $\hat{\gamma}$ for estimating γ. A widely accepted statistic for estimating the overall agreement probability $P_a$ is given by:

$$p_a = (n_{++} + n_{--})/n. \qquad (18)$$

The estimation of $P_e$ represents a more difficult problem, since it requires one to be able to isolate ratings performed with certainty from random ratings. To get around this difficulty, I decided to approximate $P_e$ by a parameter that can be quantified more easily, and to evaluate the quality of the approximation in section 5.
Table 4. Distribution of N participants by rater, randomness of classification and response category

                                    Rater B
                            Random (R)                Certain (C)
                            +            −            +            −
Rater A  Random (R)   +     N_{++·RR}    N_{+−·RR}    N_{++·RC}    N_{+−·RC}
                      −     N_{−+·RR}    N_{−−·RR}    N_{−+·RC}    N_{−−·RC}
         Certain (C)  +     N_{++·CR}    N_{+−·CR}    N_{++·CC}    0
                      −     N_{−+·CR}    N_{−−·CR}    0            N_{−−·CC}
Suppose an individual is selected randomly from a pool of individuals and rated by raters A and B. Let G and R be two events defined as follows:

$$G = \{\text{the two raters } A \text{ and } B \text{ agree}\}, \qquad (19)$$

$$R = \{\text{a rater } (A, \text{ or } B, \text{ or both}) \text{ performs a random rating}\}. \qquad (20)$$

It follows that $P_e = P(G \cap R) = P(G\,|\,R)P(R)$, where $P(G\,|\,R)$ is the conditional probability that A and B agree given that one of them (or both) has performed a random rating.

A random rating would normally lead to the classification of an individual into either category with the same probability 1/2, although this may not always be the case. Since agreement may occur on either category, it follows that $P(G\,|\,R) = 2 \times 1/2^2 = 1/2$. As for the estimation of the probability of random rating P(R), one should note that when the trait prevalence $P_r$ is high or low (i.e. if $P_r(1-P_r)$ is small), a uniform distribution of participants among categories is an indication of a high proportion of random ratings, hence of a high probability P(R).
Let the random variable $X_+$ be defined as follows:

$$X_+ = \begin{cases} 1 & \text{if a rater classifies the participant into category } +,\\ 0 & \text{otherwise.}\end{cases}$$

I suggest approximating P(R) with a normalized measure of randomness C defined by the ratio of the variance $V(X_+)$ of $X_+$ to the maximum possible variance $V_{\text{MAX}}$ for $X_+$, which is reached only when the rating is totally random. It follows that

$$C = V(X_+)/V_{\text{MAX}} = \frac{p_+(1-p_+)}{\tfrac{1}{2}\left(1-\tfrac{1}{2}\right)} = 4p_+(1-p_+),$$

where $p_+$ represents the probability that a randomly chosen rater classifies a randomly chosen individual into the '+' category. This leads to the following formulation of chance agreement:

$$P_{e|\gamma} = P(G\,|\,R)\,C = 2p_+(1-p_+). \qquad (21)$$

This approximation leads to the following approximated 'true' inter-rater reliability:

$$\gamma_1 = \frac{P_a - P_{e|\gamma}}{1 - P_{e|\gamma}}. \qquad (22)$$
The probability $p_+$ can be estimated from sample data by $\hat{p}_+ = (p_{A+} + p_{B+})/2$, where $p_{A+} = n_{A+}/n$ and $p_{B+} = n_{B+}/n$. This leads to the chance-agreement probability estimator $p_{e|\gamma} = P(G\,|\,R)\,\hat{C}$, where $\hat{C} = 4\hat{p}_+(1-\hat{p}_+)$. That is,

$$p_{e|\gamma} = 2\hat{p}_+(1-\hat{p}_+). \qquad (23)$$

Note that $\hat{p}_+(1-\hat{p}_+) = \hat{p}_-(1-\hat{p}_-)$. Therefore, $p_{e|\gamma}$ can be rewritten as

$$p_{e|\gamma} = \hat{p}_+(1-\hat{p}_+) + \hat{p}_-(1-\hat{p}_-).$$

The resulting agreement statistic is given by

$$\hat{\gamma}_1 = (p_a - p_{e|\gamma})/(1 - p_{e|\gamma}),$$

with $p_a$ given by equation (18); it is shown mathematically in section 5 to have a smaller bias with respect to the 'true' agreement coefficient than all its competitors.
Unlike the κ- and π-statistics, this agreement coefficient uses a chance-agreement probability that is calibrated to be consistent with the propensity for random rating suggested by the observed ratings. I will refer to the calibrated statistic $\hat{\gamma}_1$ as the $AC_1$ estimator, where AC stands for agreement coefficient and the digit 1 indicates the first-order chance correction, which accounts for full agreement only, as opposed to full and partial agreement (second-order chance correction); the latter problem, which will be investigated elsewhere, will lead to the $AC_2$ statistic.

A legitimate question to ask is whether the inter-rater reliability statistic $\hat{\gamma}_1$ estimates the 'true' inter-rater reliability of equation (16) at all, and under what circumstances. I will show in the next section that if trait prevalence is high or low, then $\hat{\gamma}_1$ does estimate the 'true' inter-rater reliability very well. However, with trait prevalence at the extremes, the π, κ and G-index statistics all remain biased for estimating the 'true' inter-rater reliability.
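As a concrete sketch (my own code, with hypothetical function names), the $AC_1$ estimator for a two-rater table can be computed directly from the cell counts; applied to the data of Table 3 it gives $\hat{\gamma}_1 \approx 0.94$, in line with the 94.4% observed agreement, whereas κ and π are slightly negative.

```python
import numpy as np

def ac1_two_raters(counts):
    """AC1 statistic for a two-rater q x q contingency table `counts`,
    where counts[k, l] counts participants rated k by A and l by B."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    q = counts.shape[0]
    p_a = np.trace(counts) / n                       # equation (18) when q = 2
    p_hat = (counts.sum(axis=1) + counts.sum(axis=0)) / (2 * n)
    p_e = np.sum(p_hat * (1 - p_hat)) / (q - 1)      # equations (23) and (38)
    return (p_a - p_e) / (1 - p_e)

# Table 3 data: rows are rater A (+, -), columns are rater B (+, -)
table3 = [[118, 5],
          [2, 0]]
print(round(ac1_two_raters(table3), 4))   # about 0.9408
```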
5. Biases of inter-rater reliability statistics

Let us consider two raters, A and B, who would perform a random rating with probabilities $\theta_A$ and $\theta_B$, respectively. Each classification of a study participant by a random mechanism will lead either to a disagreement or to an agreement by chance. The raters' sensitivity values (which are assumed to be identical to their specificity values) are given by:

$$\alpha_A = 1 - \theta_A/2 \quad\text{and}\quad \alpha_B = 1 - \theta_B/2.$$

These equations are obtained under the assumption that any rating that is not random will automatically lead to a correct classification, while a random rating leads to a correct classification with probability 1/2. In fact, $\alpha_A = (1-\theta_A) + \theta_A/2 = 1 - \theta_A/2$. Under this simple rating model, and following equation (5), the overall agreement probability is given by $P_a = \alpha_A\alpha_B + (1-\alpha_A)(1-\alpha_B) = 1 - (\theta_A+\theta_B)/2 + \theta_A\theta_B/2$. As for the chance-agreement probability $P_e$, let $R_A$ denote the event that rater A performs a random rating, and $R_B$ the event that rater B performs a random rating. Then,

$$P_e = P(G \cap R) = P(G \cap R_A \cap \bar{R}_B) + P(G \cap R_A \cap R_B) + P(G \cap \bar{R}_A \cap R_B)
= \theta_A(1-\theta_B)/2 + \theta_A\theta_B/2 + \theta_B(1-\theta_A)/2 = (\theta_A + \theta_B - \theta_A\theta_B)/2.$$

The 'true' inter-rater reliability is then given by:

$$\gamma = \frac{2(1-\theta_A)(1-\theta_B)}{1 + (1-\theta_A)(1-\theta_B)}. \qquad (24)$$

The theoretical agreement coefficients will now be derived for the $AC_1$, G-index, κ and π statistics. Let $\lambda = 1 - (\theta_A+\theta_B)/4$.
For the $AC_1$ coefficient, it follows from equations (5) and (21) that the chance-agreement probability $P_{e|\gamma}$ is obtained as follows:

$$P_{e|\gamma} = 2p_+(1-p_+) = 2\left[\lambda P_r + (1-\lambda)(1-P_r)\right]\left[\lambda(1-P_r) + (1-\lambda)P_r\right]
= 2\lambda(1-\lambda) + 2(1-2\lambda)^2 P_r(1-P_r) = P_e - (\theta_A-\theta_B)^2/8 + \Delta,$$

where $\Delta = 2(1-2\lambda)^2 P_r(1-P_r)$. The theoretical $AC_1$ coefficient is given by:

$$\gamma_1 = \gamma - (1-\gamma)\,\frac{\Delta - (\theta_A-\theta_B)^2/8}{(1-P_e) + \left[(\theta_A-\theta_B)^2/8 - \Delta\right]}. \qquad (25)$$

For Scott's π-coefficient, one can establish that the chance-agreement probability $P_{e|\pi}$ is given by $P_{e|\pi} = P_e + (1-\theta_A)(1-\theta_B) + (\theta_A-\theta_B)^2/8 - \Delta$. This leads to a Scott's π-coefficient of

$$\gamma_\pi = \gamma - (1-\gamma)\,\frac{(1-\theta_A)(1-\theta_B) + (\theta_A-\theta_B)^2/8 - \Delta}{(1-P_e) - \left[(1-\theta_A)(1-\theta_B) + (\theta_A-\theta_B)^2/8\right] + \Delta}. \qquad (26)$$

For the G-index, $P_{e|G} = 1/2 = P_e + (1-\theta_A)(1-\theta_B)/2$, so that

$$\gamma_G = \gamma - (1-\gamma)\,\frac{(1-\theta_A)(1-\theta_B)/2}{(1-P_e) - (1-\theta_A)(1-\theta_B)/2}. \qquad (27)$$

For Cohen's κ-coefficient, $P_{e|\kappa} = P_e + (1-\theta_A)(1-\theta_B) - \Delta_\kappa$, where $\Delta_\kappa = 2(1-\theta_A)(1-\theta_B)P_r(1-P_r)$:

$$\gamma_\kappa = \gamma - (1-\gamma)\,\frac{(1-\theta_A)(1-\theta_B) - \Delta_\kappa}{(1-P_e) - \left[(1-\theta_A)(1-\theta_B) - \Delta_\kappa\right]}. \qquad (28)$$
To gain further insight into the magnitude of the biases of these different inter-rater reliability statistics, let us consider the simpler case where raters A and B have the same propensity for random rating; that is, $\theta_A = \theta_B = \theta$. The 'true' inter-rater reliability is given by:

$$\gamma = \frac{2(1-\theta)^2}{1 + (1-\theta)^2}. \qquad (29)$$

I define the bias of an agreement coefficient $\gamma_X$ as $B_X(\theta) = \gamma_X - \gamma$, the difference between the agreement coefficient and the 'true' coefficient. The biases of the $AC_1$, π, κ and G-index statistics, respectively denoted by $B_1(\theta)$, $B_\pi(\theta)$, $B_\kappa(\theta)$ and $B_G(\theta)$, satisfy the following relations:

$$B_G(\theta) = -\frac{\theta(1-\theta)^2(2-\theta)}{1+(1-\theta)^2}, \qquad
-\frac{\theta(1-\theta)^2(2-\theta)}{1+(1-\theta)^2} \le B_1(\theta) \le 0,$$

$$-\frac{2(1-\theta)^2}{1+(1-\theta)^2} \le B_\pi(\theta) \le -\frac{\theta(1-\theta)^2(2-\theta)}{1+(1-\theta)^2},$$

$$-\frac{2(1-\theta)^2}{1+(1-\theta)^2} \le B_\kappa(\theta) \le -\frac{\theta(1-\theta)^2(2-\theta)}{1+(1-\theta)^2}.$$

Which way the bias will go depends on the magnitude of the trait prevalence. It follows from these equations that the G-index consistently exhibits a negative bias, which may take a maximum absolute value of around 17% when the raters' propensity for random rating is around 35%, and which gradually decreases as θ goes to 1. The $AC_1$ statistic, on the other hand, has a negative bias that ranges from $-\theta(1-\theta)^2(2-\theta)/[1+(1-\theta)^2]$ to 0, reaching its largest absolute value of $\theta(1-\theta)^2(2-\theta)/[1+(1-\theta)^2]$ only when the trait prevalence is around 50%. The remaining two statistics have more serious bias problems on the negative side. The π and κ statistics each have a bias whose lowest value is $-2(1-\theta)^2/[1+(1-\theta)^2]$, which varies between −1 and 0. That means π and κ may underestimate the 'true' inter-rater reliability by as much as 100%.
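A quick numerical check of these bounds (my own sketch, not from the paper) confirms the stated magnitudes: the G-index bias bound reaches roughly −0.17 near θ = 0.35, while the common lower bound for π and κ equals −1 at θ = 0.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 201)
denom = 1 + (1 - theta) ** 2
b_G = -theta * (1 - theta) ** 2 * (2 - theta) / denom     # B_G(theta), also the lower bound for B_1
lower_pi_kappa = -2 * (1 - theta) ** 2 / denom            # lower bound shared by B_pi and B_kappa

i = np.argmin(b_G)
print(theta[i], b_G[i])          # peak absolute bias of roughly 0.17 near theta = 0.35
print(lower_pi_kappa[0])         # -1.0: pi and kappa can underestimate gamma by up to 100%
```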
The next two sections, 6 and 7, are devoted to variance estimation of the generalized π-statistic and the $AC_1$ statistic, respectively, in the context of multiple raters. For both sections, I will assume that the n participants in the reliability study were randomly selected from a larger population of N potential participants. Likewise, the r raters can be assumed to belong to a larger universe of R potential raters. This finite-population framework has not yet been considered in the study of inter-rater agreement assessment. For this paper, however, I will confine myself to the case where r = R, that is, the estimators are not subject to any variability due to the sampling of raters. The methods needed to extrapolate to a larger universe of raters will be discussed in a different paper.
6. Variance of the generalized π-statistic

The π-statistic, denoted by $\hat{\gamma}_\pi$, is defined as follows:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \qquad (30)$$

where $p_a$ and $p_{e|\pi}$ are defined as follows:

$$p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)}, \quad
p_{e|\pi} = \sum_{k=1}^{q}\hat{p}_k^2, \quad\text{with}\quad \hat{p}_k = \frac{1}{n}\sum_{i=1}^{n}\frac{r_{ik}}{r}. \qquad (31)$$

Concerning the estimation of the variance of $\hat{\gamma}_\pi$, Fleiss (1971) suggested the following variance estimator under the hypothesis of no agreement between raters beyond chance:

$$v(\hat{\gamma}_\pi\,|\,\text{no agreement}) = \frac{2(1-f)}{nr(r-1)} \times \frac{p_{e|\pi} - (2r-3)p_{e|\pi}^2 + 2(r-2)\sum_{k=1}^{q}\hat{p}_k^3}{(1-p_{e|\pi})^2}, \qquad (32)$$

where $f = n/N$ is the sampling fraction, which can be neglected if the population of potential participants is deemed very large. It should be noted that this variance estimator is valid only under the hypothesis of no agreement, and is therefore unsuitable for confidence interval construction. The original expression proposed by Fleiss does not include the finite-population correction factor $1-f$. Cochran (1977) is a good reference for readers interested in statistical methods for finite-population sampling.
I propose here a non-parametric variance estimator for $\hat{\gamma}_\pi$ that is valid for confidence interval construction, obtained using the linearization technique. Unlike $v(\hat{\gamma}_\pi\,|\,\text{no agreement})$, the validity of this non-parametric variance estimator does not depend on the extent of agreement between the raters. It is given by

$$v(\hat{\gamma}_\pi) = \frac{1-f}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\gamma}^{\star}_{\pi|i} - \hat{\gamma}_\pi\right)^2, \qquad (33)$$

where $\hat{\gamma}^{\star}_{\pi|i}$ is given by

$$\hat{\gamma}^{\star}_{\pi|i} = \hat{\gamma}_{\pi|i} - 2(1-\hat{\gamma}_\pi)\,\frac{p_{e\pi|i} - p_{e|\pi}}{1 - p_{e|\pi}}, \qquad (34)$$

with $\hat{\gamma}_{\pi|i} = (p_{a|i} - p_{e|\pi})/(1 - p_{e|\pi})$, and where $p_{a|i}$ and $p_{e\pi|i}$ are given by:

$$p_{a|i} = \sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)}, \quad\text{and}\quad
p_{e\pi|i} = \sum_{k=1}^{q}\frac{r_{ik}}{r}\,\hat{p}_k. \qquad (35)$$
To see how equation (33) is derived, one should consider the standard approach that consists of deriving an approximation of the actual variance of the estimator and using a consistent estimator of that approximate variance as the variance estimator. Let us assume that as the sample size n increases, the estimated chance-agreement probability $p_{e|\pi}$ converges to a value $P_{e|\pi}$ and that each $\hat{p}_k$ converges to $p_k$. If $\hat{p}$ and $p$ denote the vectors of the $\hat{p}_k$'s and $p_k$'s, respectively, it can be shown that

$$p_{e|\pi} - P_{e|\pi} = \frac{2}{n}\sum_{i=1}^{n}\left(p_{e\pi|i} - P_{e|\pi}\right) + O_p\!\left(\lVert\hat{p} - p\rVert^2\right),$$

and that if $G_\pi = (p_a - P_{e|\pi})/(1 - P_{e|\pi})$, then $\hat{\gamma}_\pi$ can be expressed as

$$\hat{\gamma}_\pi = \frac{(p_a - P_{e|\pi}) - (1 - G_\pi)(p_{e|\pi} - P_{e|\pi})}{1 - P_{e|\pi}} + O_p\!\left((p_{e|\pi} - P_{e|\pi})^2\right).$$

The combination of these two equations gives an approximation of $\hat{\gamma}_\pi$ that is a linear function of the $r_{ik}$'s and that captures all terms except those with a stochastic order of magnitude of 1/n, which can be neglected. Bishop, Fienberg, and Holland (1975, chapter 14) provide a detailed discussion of the concept of stochastic order of magnitude.

The variance estimator of equation (33) can be used for confidence interval construction as well as for hypothesis testing. Its validity is confirmed by the simulation study presented in section 9.
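A minimal sketch of equations (30)–(35) is given below (my own implementation, assuming a negligible sampling fraction f unless supplied); it returns the generalized π together with its linearized variance.

```python
import numpy as np

def pi_and_variance(r_ik, f=0.0):
    """Generalized pi (eq. 30-31) and its linearized variance (eq. 33-35).
    `r_ik` is an (n, q) array: r_ik[i, k] = number of raters who put
    participant i in category k; each row sums to r."""
    r_ik = np.asarray(r_ik, dtype=float)
    n, q = r_ik.shape
    r = r_ik[0].sum()
    p_a_i = (r_ik * (r_ik - 1)).sum(axis=1) / (r * (r - 1))   # p_{a|i}
    p_hat = r_ik.sum(axis=0) / (n * r)                        # category probabilities
    p_a = p_a_i.mean()
    p_e = np.sum(p_hat ** 2)
    gamma = (p_a - p_e) / (1 - p_e)
    p_e_i = (r_ik / r) @ p_hat                                # p_{e pi|i}
    gamma_i = (p_a_i - p_e) / (1 - p_e)
    gamma_star = gamma_i - 2 * (1 - gamma) * (p_e_i - p_e) / (1 - p_e)
    var = (1 - f) / n * np.sum((gamma_star - gamma) ** 2) / (n - 1)
    return gamma, var
```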
Alternatively, a jackknife variance estimator can be used to estimate the variance of the π-statistic. The jackknife technique, introduced by Quenouille (1949) and developed by Tukey (1958), is a general-purpose technique for estimating variances. It has wide applicability, although it is computation intensive. The jackknife variance of $\hat{\gamma}_\pi$ is given by:

$$v_J(\hat{\gamma}_\pi) = (1-f)\,\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\gamma}_\pi^{(i)} - \hat{\gamma}_\pi^{(\cdot)}\right)^2, \qquad (36)$$

where $\hat{\gamma}_\pi^{(i)}$ is the π-statistic obtained after removing participant i from the sample, and $\hat{\gamma}_\pi^{(\cdot)}$ represents the average of all the $\hat{\gamma}_\pi^{(i)}$'s. Simulation results not reported in this paper show that this jackknife variance works well for estimating the variance of $\hat{\gamma}_\pi$. The idea of using the jackknife methodology for estimating the variance of an agreement coefficient was previously raised by Kraemer (1980).
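The jackknife estimator of equation (36) can be sketched as follows (again my own code, reusing the hypothetical pi_and_variance helper above for the point estimate; f is taken as negligible by default).

```python
import numpy as np

def jackknife_variance(r_ik, statistic, f=0.0):
    """Jackknife variance (eq. 36) of any agreement `statistic` that maps an
    (n, q) table of r_ik counts to a scalar, e.g. the generalized pi or AC1."""
    r_ik = np.asarray(r_ik, dtype=float)
    n = r_ik.shape[0]
    leave_one_out = np.array([statistic(np.delete(r_ik, i, axis=0)) for i in range(n)])
    return (1 - f) * (n - 1) / n * np.sum((leave_one_out - leave_one_out.mean()) ** 2)

# example use with the pi statistic from the previous sketch:
# v_J = jackknife_variance(r_ik, lambda x: pi_and_variance(x)[0])
```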
7. Variance of the generalized $AC_1$ estimator

The $AC_1$ statistic $\hat{\gamma}_1$ introduced in section 4 can be extended to the case of r raters (r > 2) and q response categories (q > 2) as follows:

$$\hat{\gamma}_1 = \frac{p_a - p_{e|\gamma}}{1 - p_{e|\gamma}}, \qquad (37)$$

where $p_a$ is defined in equation (1), and the chance-agreement probability $p_{e|\gamma}$ is defined as follows:

$$p_{e|\gamma} = \frac{1}{q-1}\sum_{k=1}^{q}\hat{p}_k(1-\hat{p}_k), \qquad (38)$$

the $\hat{p}_k$'s being defined in equation (1).

The estimator $\hat{\gamma}_1$ is a non-linear statistic of the $r_{ik}$'s. To derive its variance, I have used a linear approximation that includes all terms with a stochastic order of magnitude up to $n^{-1/2}$. This yields a correct asymptotic variance that includes all terms with an order of magnitude up to 1/n. Although a rigorous treatment of the asymptotics is not presented here, it is possible to establish that for large values of n, a consistent estimator of the variance of $\hat{\gamma}_1$ is given by:

$$v(\hat{\gamma}_1) = \frac{1-f}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\gamma}^{\star}_{1|i} - \hat{\gamma}_1\right)^2, \qquad (39)$$

where $f = n/N$ is the sampling fraction,

$$\hat{\gamma}^{\star}_{1|i} = \hat{\gamma}_{1|i} - 2(1-\hat{\gamma}_1)\,\frac{p_{e\gamma|i} - p_{e|\gamma}}{1 - p_{e|\gamma}},$$

$\hat{\gamma}_{1|i} = (p_{a|i} - p_{e|\gamma})/(1 - p_{e|\gamma})$ is the agreement coefficient with respect to participant i, $p_{a|i}$ is given by

$$p_{a|i} = \sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)},$$

and the chance-agreement probability with respect to unit i, $p_{e\gamma|i}$, is given by:

$$p_{e\gamma|i} = \frac{1}{q-1}\sum_{k=1}^{q}\frac{r_{ik}}{r}(1-\hat{p}_k).$$
To obtain equation (39), one should first derive a large-sample approximation of the actual variance of $\hat{\gamma}_1$. This is achieved by considering that, as the size n of the participant sample increases, the chance-agreement probability $p_{e|\gamma}$ converges to a fixed probability $P_{e|\gamma}$ and each classification probability $\hat{p}_k$ converges to a constant $p_k$. Let us define the following two vectors: $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_q)'$ and $p = (p_1, \ldots, p_q)'$. One can establish that:

$$p_{e|\gamma} - P_{e|\gamma} = \frac{2}{n}\sum_{i=1}^{n}\left(p_{e\gamma|i} - P_{e|\gamma}\right) + O_p\!\left(\lVert\hat{p} - p\rVert^2\right),$$

$$\hat{\gamma}_1 = \frac{(p_a - P_{e|\gamma}) - (1 - G_\gamma)(p_{e|\gamma} - P_{e|\gamma})}{1 - P_{e|\gamma}} + O_p\!\left((p_{e|\gamma} - P_{e|\gamma})^2\right),$$

where $G_\gamma = (p_a - P_{e|\gamma})/(1 - P_{e|\gamma})$. Combining these two expressions leads to a linear approximation of $\hat{\gamma}_1$, which can be used to approximate the asymptotic variance of $\hat{\gamma}_1$.
An alternative approach for estimating the variance of $\hat{\gamma}_1$ is the jackknife method. The jackknife variance estimator is given by:

$$v_J(\hat{\gamma}_1) = (1-f)\,\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\gamma}_1^{(i)} - \hat{\gamma}_1^{(\cdot)}\right)^2, \qquad (40)$$

where $\hat{\gamma}_1^{(i)}$ represents the estimator $\hat{\gamma}_1$ computed after removing participant i from the participant sample, and $\hat{\gamma}_1^{(\cdot)}$ is the average of all the $\hat{\gamma}_1^{(i)}$'s.
8. Special case of two raters

Two-rater reliability studies are of special interest. Rating data in this case are often conveniently reported using the distribution of participants by rater and response category, as shown in Table 1. Therefore, the inter-rater reliability coefficient and its associated variance must be expressed as functions of the $n_{kl}$'s.

For two raters classifying n participants into q response categories, Fleiss et al. (1969) proposed an estimator $v(\hat{\gamma}_\kappa\,|\,\text{no agreement})$ for estimating the variance of Cohen's κ-statistic under the hypothesis of no agreement between the raters. If there is agreement between the two raters, Fleiss et al. recommended another variance estimator, $v(\hat{\gamma}_\kappa\,|\,\text{agreement})$. These estimators are given by:

$$v(\hat{\gamma}_\kappa\,|\,\text{no agreement}) = \frac{1-f}{n(1-p_{e|\kappa})^2}\left\{\sum_{k=1}^{q} p_{Bk}p_{Ak}\left[1-(p_{Bk}+p_{Ak})\right]^2
+ \sum_{k=1}^{q}\sum_{\substack{l=1\\ l\neq k}}^{q} p_{Bk}p_{Al}\left(p_{Bk}+p_{Al}\right)^2 - p_{e|\kappa}^2\right\} \qquad (41)$$

and

$$v(\hat{\gamma}_\kappa\,|\,\text{agreement}) = \frac{1-f}{n(1-p_{e|\kappa})^2}\left\{\sum_{k=1}^{q} p_{kk}\left[1-(p_{Ak}+p_{Bk})(1-\hat{\gamma}_\kappa)\right]^2
+ (1-\hat{\gamma}_\kappa)^2\sum_{k=1}^{q}\sum_{\substack{l=1\\ l\neq k}}^{q} p_{kl}\left(p_{Bk}+p_{Al}\right)^2 - \left[\hat{\gamma}_\kappa - p_{e|\kappa}(1-\hat{\gamma}_\kappa)\right]^2\right\}. \qquad (42)$$

It can be shown that $v(\hat{\gamma}_\kappa\,|\,\text{agreement})$ captures all terms of order of magnitude up to $n^{-1}$, is consistent for estimating the true population variance, and provides valid normality-based confidence intervals when the number of participants is reasonably large.
When r = 2, the variance of the $AC_1$ statistic given in equation (39) reduces to the following estimator:

$$v(\hat{\gamma}_1) = \frac{1-f}{n(1-p_{e|\gamma})^2}\left\{p_a(1-p_a)
- 4(1-\hat{\gamma}_1)\left(\frac{1}{q-1}\sum_{k=1}^{q} p_{kk}(1-\hat{p}_k) - p_a p_{e|\gamma}\right)
+ 4(1-\hat{\gamma}_1)^2\left(\frac{1}{(q-1)^2}\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[1-(\hat{p}_k+\hat{p}_l)/2\right]^2 - p_{e|\gamma}^2\right)\right\}. \qquad (43)$$

As for Scott's π-estimator, its correct variance estimator is given by:

$$v(\hat{\gamma}_\pi) = \frac{1-f}{n(1-p_{e|\pi})^2}\left\{p_a(1-p_a)
- 4(1-\hat{\gamma}_\pi)\left(\sum_{k=1}^{q} p_{kk}\hat{p}_k - p_a p_{e|\pi}\right)
+ 4(1-\hat{\gamma}_\pi)^2\left(\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[(\hat{p}_k+\hat{p}_l)/2\right]^2 - p_{e|\pi}^2\right)\right\}. \qquad (44)$$

For the sake of comparability, one should note that the correct variance estimator of kappa, equation (42), can be rewritten as follows:

$$v(\hat{\gamma}_\kappa) = \frac{1-f}{n(1-p_{e|\kappa})^2}\left\{p_a(1-p_a)
- 4(1-\hat{\gamma}_\kappa)\left(\sum_{k=1}^{q} p_{kk}\hat{p}_k - p_a p_{e|\kappa}\right)
+ 4(1-\hat{\gamma}_\kappa)^2\left(\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[(p_{Bk}+p_{Al})/2\right]^2 - p_{e|\kappa}^2\right)\right\}. \qquad (45)$$

The variance of the G-index is given by:

$$v(\hat{\gamma}_G) = 4\,\frac{1-f}{n}\,p_a(1-p_a). \qquad (46)$$
Using the rating data of Table 3, I obtained the following inter-rater reliability estimates and variance estimates:

Statistic       Estimate (%)     Standard error (%)
AC1             94.08            2.30
Kappa           −2.34            1.23
Pi              −2.88            1.09
G-index         88.80            4.11

Because the percentage agreement $p_a$ equals 94.4%, it appears that $AC_1$ and the G-index are more consistent with the observed extent of agreement. The κ and π statistics have low values that are very inconsistent with the data configuration and would be difficult to justify. If the standard error is compared with the inter-rater reliability estimate, the $AC_1$ appears to be the most accurate of all the agreement coefficients.
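The figures in this small table can be reproduced from equations (43)–(46); the sketch below (my own code, with the sampling fraction f neglected) returns estimates and standard errors that match the values above to two decimals.

```python
import numpy as np

def two_rater_summary(counts):
    """AC1, kappa, pi and G-index with standard errors from a two-rater
    q x q table, following equations (43)-(46); f is taken as 0."""
    p = np.asarray(counts, dtype=float)
    n, q = p.sum(), p.shape[0]
    p = p / n
    p_a = np.trace(p)
    pA, pB = p.sum(axis=1), p.sum(axis=0)
    p_hat = (pA + pB) / 2
    pkk = np.diag(p)

    def se(pe, gamma, diag_term, cross_weights):
        braces = (p_a * (1 - p_a)
                  - 4 * (1 - gamma) * (diag_term - p_a * pe)
                  + 4 * (1 - gamma) ** 2 * (np.sum(p * cross_weights) - pe ** 2))
        return np.sqrt(braces / (n * (1 - pe) ** 2))

    out = {}
    # AC1, equation (43)
    pe = np.sum(p_hat * (1 - p_hat)) / (q - 1)
    g = (p_a - pe) / (1 - pe)
    w = (1 - (p_hat[:, None] + p_hat[None, :]) / 2) ** 2 / (q - 1) ** 2
    out["AC1"] = (g, se(pe, g, np.sum(pkk * (1 - p_hat)) / (q - 1), w))
    # Scott's pi, equation (44)
    pe = np.sum(p_hat ** 2)
    g = (p_a - pe) / (1 - pe)
    w = ((p_hat[:, None] + p_hat[None, :]) / 2) ** 2
    out["Pi"] = (g, se(pe, g, np.sum(pkk * p_hat), w))
    # Cohen's kappa, equation (45)
    pe = np.sum(pA * pB)
    g = (p_a - pe) / (1 - pe)
    w = ((pB[:, None] + pA[None, :]) / 2) ** 2
    out["Kappa"] = (g, se(pe, g, np.sum(pkk * p_hat), w))
    # G-index, equation (46)
    pe = 1 / q
    out["G-index"] = ((p_a - pe) / (1 - pe), np.sqrt(4 * p_a * (1 - p_a) / n))
    return out

for name, (est, s) in two_rater_summary([[118, 5], [2, 0]]).items():
    print(f"{name:8s} {100*est:7.2f} {100*s:6.2f}")
```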
9. Monte-Carlo simulation

In order to compare the biases of the various inter-rater reliability coefficients under investigation, and to verify the validity of the different variance estimators discussed in the previous sections, I conducted a small Monte-Carlo experiment. This experiment involves two raters, A and B, who must classify n (for n = 20, 60, 80, 100) participants into one of two possible categories, '+' and '−'.

All the Monte-Carlo experiments are based upon the assumption of a prevalence rate of $P_r = 95\%$. A propensity for random rating $\theta_A$ is set for rater A and another one, $\theta_B$, for rater B at the beginning of each experiment. These parameters allow us to use equation (24) to determine the 'true' inter-rater reliability to be estimated. Each Monte-Carlo experiment is conducted as follows (a code sketch is given after the description below):

- The n participants are first randomly classified into the two categories '+' and '−' in such a way that a participant falls into category '+' with probability $P_r$.
- If a rater performs a random rating (with probability $\theta_A$ for rater A and $\theta_B$ for rater B), then the participant to be rated is randomly classified into one of the two categories with the same probability 1/2. A non-random rating is assumed to lead to a correct classification.
- The number of replicate samples drawn in this simulation is 500.
Each Monte-Carlo experiment has two specific objectives: to evaluate the magnitude of the biases associated with the agreement coefficients, and to verify the validity of their variance estimators.

The bias of an estimator is measured by the difference between its Monte-Carlo expectation and the 'true' inter-rater reliability. The bias of a variance estimator, on the other hand, is obtained by comparing its Monte-Carlo expectation with the Monte-Carlo variance of the agreement coefficient. A small bias is desirable, as it indicates that a given estimator or variance estimator has neither a tendency to overestimate the true population parameter nor a tendency to underestimate it.

In the simulation programmes, the calculation of the π-statistic and that of the κ-statistic were modified slightly in order to avoid the difficulty posed by undefined estimates. When $p_{e|\pi} = 1$ or $p_{e|\kappa} = 1$, these chance-agreement probabilities were replaced with 0.99999 so that the agreement coefficient remains defined.
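The experiment just described can be sketched as follows (my own reimplementation of the stated design, not the author's original programme; it reuses the hypothetical pairwise_coefficients and ac1_two_raters helpers defined earlier, and omits the guard for the degenerate case $p_e = 1$ mentioned above for brevity).

```python
import numpy as np

def simulate(n, theta_A, theta_B, prevalence=0.95, n_reps=500, seed=1):
    """Monte-Carlo experiment of section 9: returns the replicate estimates of
    AC1, kappa, pi and the G-index for two raters and two categories."""
    rng = np.random.default_rng(seed)
    estimates = {"AC1": [], "kappa": [], "pi": [], "G": []}
    for _ in range(n_reps):
        truth = rng.random(n) < prevalence               # true '+' status of each participant
        ratings = []
        for theta in (theta_A, theta_B):
            random_call = rng.random(n) < theta          # does this rater guess?
            guess = rng.random(n) < 0.5
            ratings.append(np.where(random_call, guess, truth))
        a, b = ratings
        counts = np.array([[np.sum(a & b), np.sum(a & ~b)],
                           [np.sum(~a & b), np.sum(~a & ~b)]], dtype=float)
        res = pairwise_coefficients(counts)              # kappa, pi, G (earlier sketch)
        estimates["kappa"].append(res["kappa"])
        estimates["pi"].append(res["pi"])
        estimates["G"].append(res["G"])
        estimates["AC1"].append(ac1_two_raters(counts))  # earlier sketch
    return {k: np.array(v) for k, v in estimates.items()}

# relative bias against the 'true' reliability of equation (24); compare with Table 5
theta_A, theta_B = 0.05, 0.05
gamma_true = 2 * (1 - theta_A) * (1 - theta_B) / (1 + (1 - theta_A) * (1 - theta_B))
est = simulate(n=60, theta_A=theta_A, theta_B=theta_B)
print({k: (v.mean() - gamma_true) / gamma_true for k, v in est.items()})
```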
Table 5 contains the relative biases of the agreement coefficients $\hat{\gamma}_\pi$, $\hat{\gamma}_\kappa$, $\hat{\gamma}_G$ and $\hat{\gamma}_1$. A total of 500 replicate samples were selected and, for each sample s, an estimate $\hat{\gamma}_s$ was calculated. The relative bias is obtained as follows:

$$\text{RelBias}(\hat{\gamma}) = \left(\frac{1}{500}\sum_{s=1}^{500}\hat{\gamma}_s - \gamma\right)\bigg/\gamma,$$

where γ is the 'true' inter-rater reliability obtained with equation (24). It follows from Table 5 that the relative bias of the $AC_1$ estimator, which varies from −0.8% to 0.0% when $\theta_A = \theta_B = 5\%$, and from −2.1% to −1.3% when $\theta_A = 20\%$ and $\theta_B = 5\%$, is consistently smaller than the relative bias of the other inter-rater reliability statistics. The π and κ statistics generally exhibit a very large negative bias under these conditions, ranging from −32.8% to −62.5%. The main advantage of the $AC_1$ statistic over the G-index stems from the fact that when the raters' propensity for random rating is large (i.e. around 35%), the bias of the G-index is at its highest, while that of the $AC_1$ will decrease as the trait prevalence increases.

Table 5. Relative bias of agreement coefficients for $P_r = 0.95$, based on 500 replicate samples

θ_A, θ_B                   n      B(γ̂_π)%   B(γ̂_κ)%   B(γ̂_G)%   B(γ̂_1)%
θ_A* = θ_B* = 5%           20     −32.8      −32.0      −3.6       0.0
                           60     −39.5      −39.3      −5.1       −0.7
                           80     −36.5      −36.4      −4.9       −0.6
                           100    −35.1      −35.0      −5.2       −0.8
θ_A = 20%, θ_B = 5%        20     −62.5      −59.9      −11.9      −2.1
                           60     −58.4      −57.0      −11.7      −1.4
                           80     −58.2      −56.9      −12.1      −1.6
                           100    −57.4      −56.3      −11.6      −1.3

(*) θ_A and θ_B represent the propensity for random rating of raters A and B, respectively.
Table 6 shows the Monte-Carlo variances of the four agreement statistics under investigation, as well as the Monte-Carlo expectations of the associated variance estimators. The Monte-Carlo expectation of a variance estimator v is obtained by averaging the 500 variance estimates $v_s$ obtained from the replicate samples. The Monte-Carlo variance of an agreement coefficient $\hat{\gamma}$, on the other hand, is obtained by averaging the 500 squared differences between the estimates $\hat{\gamma}_s$ and their average. More formally, the Monte-Carlo expectation E(v) of a variance estimator v is defined as follows:

$$E(v) = \frac{1}{500}\sum_{s=1}^{500} v_s,$$

while the Monte-Carlo variance $V(\hat{\gamma})$ of an agreement statistic $\hat{\gamma}$ is given by:

$$V(\hat{\gamma}) = \frac{1}{500}\sum_{s=1}^{500}\left[\hat{\gamma}_s - \overline{\hat{\gamma}}\right]^2,
\quad\text{where}\quad \overline{\hat{\gamma}} = \frac{1}{500}\sum_{s=1}^{500}\hat{\gamma}_s.$$

Table 6. Monte-Carlo variances and Monte-Carlo expectations of variance estimates for $P_r = 0.95$, θ_A* = θ_B* = 0.05

n      V(γ̂_π)%   E[v(γ̂_π)]%   V(γ̂_κ)%   E[v(γ̂_κ)]%   V(γ̂_G)%   E[v(γ̂_G)]%   V(γ̂_1)%   E[v(γ̂_1)]%
20     15.8       3.3           15.0       3.13          0.79       0.78          0.32       0.33
60     6.0        3.9           5.9        3.83          0.28       0.31          0.10       0.12
80     3.9        3.0           3.8        3.00          0.24       0.23          0.09       0.09
100    2.5        2.4           2.5        2.39          0.17       0.19          0.07       0.07

(*) θ_A and θ_B represent the propensity for random rating of raters A and B, respectively.
It follows from Table 6 that the variance of the $AC_1$ statistic is smaller than that of the other statistics. In fact, $V(\hat{\gamma}_1)$ varies from 0.07% when the sample size is 100 to 0.33% when the sample size is 20. The second smallest variance is that of the G-index, which varies from 0.17% to 0.79%. The κ and π statistics generally have larger variances, which range from about 2% to about 15%. An examination of the Monte-Carlo expectations of the various variance estimators indicates that the proposed variance estimators for $AC_1$ and the G-index work very well. Even for a small sample size, these expectations are very close to the Monte-Carlo approximations. The variance estimators of the κ and π statistics also work well, except for small sample sizes, for which they underestimate the 'true' variance.
10. Concluding remarks

In this paper, I have explored the problem of inter-rater reliability estimation when the extent of agreement between raters is high. The paradox of the κ and π statistics has been investigated and an alternative agreement coefficient proposed. I have proposed new variance estimators for the κ, π and $AC_1$ statistics using the linearization and jackknife methods. The validity of these variance estimators does not depend upon the assumption of independence between raters. The absence of such variance estimators has prevented practitioners from constructing confidence intervals for multiple-rater agreement coefficients.

I have introduced the $AC_1$ statistic, which is shown to have better statistical properties than its κ, π and G-index competitors. The κ and π estimators became well known for their supposed ability to correct the percentage agreement for chance agreement. However, this paper argues that not all observed ratings would lead to agreement by chance. This is particularly the case if the extent of agreement is high in a situation of high trait prevalence. Kappa and pi evaluate the chance-agreement probability as if all observed ratings might yield an agreement by chance. This may lead to unpredictable results with rating data that suggest a rather small propensity for chance agreement. The $AC_1$ statistic was developed in such a way that the propensity for chance agreement is proportional to the portion of ratings that may lead to an agreement by chance, reducing the overall agreement by chance to the right magnitude.

The simulation results tend to indicate that the $AC_1$ and G-index statistics have reasonably small biases for estimating the 'true' inter-rater reliability, while the κ and π statistics tend to underestimate it. The $AC_1$ outperforms the G-index when the trait prevalence is high or low. If the trait prevalence is around 50%, all agreement statistics perform alike; the absolute bias in this case increases with the raters' propensity for random rating, which can be reduced by giving extra training to the raters. The proposed variance estimators work well according to our simulations. For small sample sizes, however, the variance estimators proposed for the κ and π statistics tend to underestimate the true variances.
References

Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press.
Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323–327.
Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24, 749–753.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.
Kraemer, H. C. (1980). Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44, 461–472.
Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33, 363–374.
Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365–377.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series. Journal of the Royal Statistical Society, Series B, 11, 68–84.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical Statistics, 29, 614.

Received 6 January 2006; revised version received 14 June 2006