# Computing inter-rater reliability and its variance in the presence of high agreement

Kilem Li Gwet*
STATAXIS Consulting, Gaithersburg, USA
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple-rater generalized π and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter-rater reliability statistics.
1. Introduction
Researchers in various ﬁelds often need to evaluate the quality of a data collection
method. In many studies, a data collection tool, such as a survey questionnaire, a
laboratory procedure or a classiﬁcation system, is used by different people referred to as
raters, observers or judges. In an effort to minimize the effect of the rater factor on data
quality, investigators like to know whether all raters apply the data collection method in
a consistent manner. Inter-rater reliability quantiﬁes the closeness of scores assigned by
a pool of raters to the same study participants. The closer the scores, the higher the
reliability of the data collection method. Although reliability data can be discrete or
continuous, the focus of this paper is on inter-rater reliability assessment on nominally
scaled data. Such data are typically obtained from studies where raters must classify
study participants into one category among a limited number of possible categories.
* Correspondence should be addressed to Dr Kilem Li Gwet, Statistical Consultants, STATAXIS Consulting, 15914B Shady Grove Road, No. 145, Gaithersburg, MD 20877-1322, USA (e-mail: gwet@stataxis.com).

British Journal of Mathematical and Statistical Psychology (2008), 00, 1–21. © 2008 The British Psychological Society. www.bpsjournals.co.uk DOI: 10.1348/000711006X126600

Banerjee, Capozzoli, McSweeney, and Sinha (1999) provide a good review of the techniques developed to date for analysing nominally scaled data. Two of the most influential papers in this area are those of Fleiss (1971) and Fleiss, Cohen, and Everitt (1969), which contain the most popular results in use today. Fleiss et al. provide large-sample approximations of the variances of the κ and weighted κ statistics suggested by Cohen (1960, 1968), respectively, in the case of two raters, while Fleiss extends the π-statistic to the case of multiple raters. Landis and Koch (1977) also give an instructive
discussion of inter-rater agreement among multiple observers. Agresti (2002) presents
several modelling techniques for analysing rating data in addition to presenting a short
account of the state of the art. Light (1971) introduces measures of agreement
conditionally upon a speciﬁc classiﬁcation category, and proposes a generalization of
Cohen's κ-coefficient to the case of multiple raters. Conger (1980) suggests an
alternative multiple-rater agreement statistic obtained by averaging all pairwise overall
and chance-corrected probabilities proposed by Cohen (1960). Conger (1980) also
extends the notion of pairwise agreement to that of g-wise agreement, where agreement occurs if g raters rather than two classify an object into the same category.
In section 2, I introduce the most commonly used pairwise indexes. Section 3 discusses a theoretical framework for analysing the origins of the kappa paradoxes. An alternative and more stable agreement coefficient, referred to as the AC1 statistic, is introduced in section 4. Section 5 is devoted to the analysis of the bias associated with the various pairwise agreement coefficients under investigation. In section 6, a variance estimator for the generalized π-statistic is proposed, which is valid even under the assumption of dependence of ratings. Section 7 presents a variance estimator of the AC1 statistic, which is always valid. The important special case of two raters is discussed in section 8, while section 9 describes a small simulation study aimed at verifying the validity of the variance estimators as well as the magnitude of the biases associated with the various indexes under investigation.
2. Cohen's κ, Scott's π, the G-index and Fleiss's generalized π
In a two-rater reliability study involving raters A and B, the data will be reported in a two-way contingency table such as Table 1. Table 1 shows the distribution of n study participants by rater and response category, where $n_{kl}$ indicates the number of participants that raters A and B classified into categories k and l, respectively.
All inter-rater reliability coefficients discussed in this paper have two components: the overall agreement probability $p_a$, which is common to all coefficients, and the chance-agreement probability $p_e$, which is specific to each index.

Table 1. Distribution of n participants by rater and response category

| Rater A \ Rater B | 1 | 2 | ··· | q | Total |
|---|---|---|---|---|---|
| 1 | $n_{11}$ | $n_{12}$ | ··· | $n_{1q}$ | $n_{A1}$ |
| 2 | $n_{21}$ | $n_{22}$ | ··· | $n_{2q}$ | $n_{A2}$ |
| ⋮ | ⋮ | ⋮ |  | ⋮ | ⋮ |
| q | $n_{q1}$ | $n_{q2}$ | ··· | $n_{qq}$ | $n_{Aq}$ |
| Total | $n_{B1}$ | $n_{B2}$ | ··· | $n_{Bq}$ | $n$ |

For the two-rater reliability data of Table 1, the overall agreement probability is given by:

$$p_a = \sum_{k=1}^{q} p_{kk}, \quad \text{where } p_{kk} = n_{kk}/n.$$
Let $p_{Ak} = n_{Ak}/n$, $p_{Bk} = n_{Bk}/n$, and $\hat{p}_k = (p_{Ak} + p_{Bk})/2$. Cohen's κ-statistic is given by:

$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}}, \quad \text{where } p_{e|\kappa} = \sum_{k=1}^{q} p_{Ak}\,p_{Bk}.$$

Scott (1955) proposed the π-statistic given by:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad \text{where } p_{e|\pi} = \sum_{k=1}^{q} \hat{p}_k^2.$$

The G-index of Holley and Guilford (1964) is given by:

$$\hat{\gamma}_G = \frac{p_a - p_{e|G}}{1 - p_{e|G}},$$

where $p_{e|G} = 1/q$, and q represents the number of response categories. Note that the expression used for $\hat{\gamma}_G$ here is more general than the original Holley–Guilford formula, which was presented for the simpler situation of two raters and two response categories only.
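To make these definitions concrete, the three pairwise coefficients can be computed directly from the cell counts of a table such as Table 1. The following Python sketch is illustrative (the function name and the example data are mine, not from the paper):

```python
def pairwise_coefficients(table):
    """Compute Cohen's kappa, Scott's pi and the G-index from a
    q x q contingency table of counts (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    q = len(table)
    p_a = sum(table[k][k] for k in range(q)) / n               # overall agreement
    pA = [sum(table[k]) / n for k in range(q)]                 # rater A marginals
    pB = [sum(row[k] for row in table) / n for k in range(q)]  # rater B marginals
    pe_kappa = sum(pA[k] * pB[k] for k in range(q))            # chance agreement, kappa
    pe_pi = sum(((pA[k] + pB[k]) / 2) ** 2 for k in range(q))  # chance agreement, pi
    pe_G = 1 / q                                               # chance agreement, G-index
    corrected = lambda pe: (p_a - pe) / (1 - pe)
    return {"kappa": corrected(pe_kappa),
            "pi": corrected(pe_pi),
            "G": corrected(pe_G)}

# Hypothetical two-category data for 100 participants
coeffs = pairwise_coefficients([[40, 5], [10, 45]])
```

With balanced marginals like these, the three coefficients are close to one another; the paradoxes discussed below arise only when the marginals become extreme.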
If a reliability study involves an arbitrarily large number r of raters, rating data are often reported in a frequency table showing the distribution of raters by participant and response category, as described in Table 2. For a given participant i and category k, $r_{ik}$ represents the number of raters who classified participant i into category k.
Fleiss (1971) extended Scott's π-statistic to the case of multiple raters (r) and proposed the following equation:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \quad \text{where} \quad p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad p_{e|\pi} = \sum_{k=1}^{q}\hat{p}_k^2, \quad \text{and} \quad \hat{p}_k = \frac{1}{n}\sum_{i=1}^{n}\frac{r_{ik}}{r}. \tag{1}$$

Table 2. Distribution of r raters by participant and response category

| Participant \ Category | 1 | 2 | ··· | q | Total |
|---|---|---|---|---|---|
| 1 | $r_{11}$ | $r_{12}$ | ··· | $r_{1q}$ | $r$ |
| 2 | $r_{21}$ | $r_{22}$ | ··· | $r_{2q}$ | $r$ |
| ⋮ | ⋮ | ⋮ |  | ⋮ | ⋮ |
| n | $r_{n1}$ | $r_{n2}$ | ··· | $r_{nq}$ | $r$ |
| Total | $r_{+1}$ | $r_{+2}$ | ··· | $r_{+q}$ | $nr$ |
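Equation (1) maps directly onto the counts of Table 2; a minimal sketch (the function name is mine, not from the paper):

```python
def fleiss_pi(ratings):
    """Fleiss's generalized pi (equation 1) from an n x q table of counts,
    where ratings[i][k] is the number of raters who put participant i
    in category k; every row must sum to the same rater count r."""
    n, q = len(ratings), len(ratings[0])
    r = sum(ratings[0])                                   # raters per participant
    p_a = sum(r_ik * (r_ik - 1) for row in ratings for r_ik in row) / (n * r * (r - 1))
    p_hat = [sum(row[k] for row in ratings) / (n * r) for k in range(q)]
    pe = sum(pk ** 2 for pk in p_hat)                     # chance agreement
    return (p_a - pe) / (1 - pe)
```

Perfect agreement among the raters yields 1, while systematic half-half splits drive the statistic negative.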
The terms $p_a$ and $p_{e|\pi}$ are, respectively, the overall agreement probability and the probability of agreement due to chance. Conger (1980) suggested a generalized version of the κ-statistic that is obtained by averaging all r(r − 1)/2 pairwise κ-statistics as defined by Cohen (1960). The κ-statistic can also be generalized as follows:
$$\hat{\gamma}_\kappa = \frac{p_a - p_{e|\kappa}}{1 - p_{e|\kappa}},$$

where $p_a$ is defined as above and the chance-agreement probability $p_{e|\kappa}$ is given by:

$$p_{e|\kappa} = \sum_{k=1}^{q}\sum_{a=2}^{r}(-1)^a \sum_{i_1 < \cdots < i_a}\left(\prod_{j=1}^{a} p_{k i_j}\right). \tag{2}$$

The term $p_{k i_j}$ ($j = 1, \ldots, a$) represents the proportion of participants that rater $i_j$ classified into category k. It follows from equation (2) that if r = 2, then $p_{e|\kappa}$ reduces to the usual formula of chance-agreement probability for the κ-statistic. For r = 3 and r = 4 the chance-agreement probabilities are, respectively, given by:

$$p_{e|\kappa}(3) = \sum_{k=1}^{q}(p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k2}p_{k3} - p_{k1}p_{k2}p_{k3}),$$

$$p_{e|\kappa}(4) = \sum_{k=1}^{q}\big[(p_{k1}p_{k2} + p_{k1}p_{k3} + p_{k1}p_{k4} + p_{k2}p_{k3} + p_{k2}p_{k4} + p_{k3}p_{k4}) - (p_{k1}p_{k2}p_{k3} + p_{k1}p_{k2}p_{k4} + p_{k1}p_{k3}p_{k4} + p_{k2}p_{k3}p_{k4}) + p_{k1}p_{k2}p_{k3}p_{k4}\big].$$

This general version of the κ-statistic has not been studied yet, and no expression for its variance is available. There is no indication, however, that it has better statistical properties than Fleiss's generalized statistic. Nevertheless, a practitioner interested in using this estimator may still estimate its variance using the jackknife method described by equation (36) for the π-statistic. Hubert (1977) discusses other possible extensions of the κ-statistic to the case of multiple raters.
Table 3 contains an example of rating data that illustrates the limitations of equation (1) as a measure of the extent of agreement between raters.

Table 3. Distribution of 125 participants by rater and response category

| Rater A \ Rater B | + | − | Total |
|---|---|---|---|
| + | 118 | 5 | 123 |
| − | 2 | 0 | 2 |
| Total | 120 | 5 | 125 |

For those data, $\hat{\gamma}_\pi = (0.9440 - 0.9456)/(1 - 0.9456) = -0.0288$, a negative value. This result is the opposite of what our intuition would suggest, and illustrates one of the paradoxes noted by Cicchetti and Feinstein (1990), where high agreement is coupled with low κ. In this example, raters A and B would be expected to have high inter-rater reliability.
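The computation behind this example takes only a few lines; a sketch that reproduces the figures quoted above:

```python
# Table 3: observed agreement is (118 + 0)/125 = 0.9440, yet Scott's pi is negative.
n11, n12, n21, n22 = 118, 5, 2, 0
n = n11 + n12 + n21 + n22
p_a = (n11 + n22) / n                              # overall agreement, 0.9440
p_plus = ((n11 + n12) + (n11 + n21)) / (2 * n)     # average '+' marginal, 0.972
pe_pi = p_plus ** 2 + (1 - p_plus) ** 2            # chance agreement, 0.9456 to 4 d.p.
pi = (p_a - pe_pi) / (1 - pe_pi)                   # about -0.0288
```

Because both raters almost always choose '+', the chance-agreement term swallows nearly all of the observed agreement.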
3. The origins of the paradoxes of the π- and κ-statistics

To understand the nature and the causes of the paradoxical behaviour of the π- and κ-statistics, I will confine myself to the case of two raters, A and B, who must identify the presence or absence of a trait on individuals of a given population of interest. These individuals will eventually be selected to participate in a study, and are therefore potential study participants. The two raters will classify participants into the '+' or '−' categories according to whether the trait is found or not. I will study how agreement indexes are affected by the raters' sensitivity, specificity and the trait prevalence in the population. A rater's sensitivity is defined as the conditional probability of classifying a participant into the '+' category given that the trait is indeed present. A rater's specificity is the conditional probability of classifying a participant into the '−' category given that the trait is actually absent.
Let $a_A$ and $a_B$ denote, respectively, raters A and B's sensitivity values. Similarly, $b_A$ and $b_B$ will denote raters A and B's specificity values. It follows that the probabilities $P_{A+}$ and $P_{B+}$ for raters A and B to classify a participant into the '+' category are given by

$$P_{A+} = P_r a_A + (1 - P_r)(1 - b_A), \tag{3}$$

$$P_{B+} = P_r a_B + (1 - P_r)(1 - b_B), \tag{4}$$

where $P_r$ represents the population trait prevalence. Our objective is to study how trait prevalence, sensitivity and specificity affect inter-rater reliability. For the sake of simplicity I will make the following two assumptions:

(A1) Sensitivity and specificity are identical for both raters. That is, $a_A = b_A$ and $a_B = b_B$.

(A2) Correct classifications are independent. That is, if $a_{AB}$ denotes the probability that raters A and B correctly classify an individual into the '+' category, then $a_{AB} = a_A a_B$.

The probability $P_a$ that both raters agree is given by $P_a = p_{++} + p_{--}$, where $p_{++}$ and $p_{--}$ are obtained as follows:

$$p_{++} = a_A a_B P_r + (1 - P_r)(1 - b_A)(1 - b_B) = a_A a_B P_r + (1 - P_r)(1 - a_A)(1 - a_B),$$

and $p_{--} = 1 - (P_{A+} + P_{B+} - p_{++})$. The following important equation can be established:

$$P_a = (1 - a_A)(1 - a_B) + a_A a_B. \tag{5}$$

This relation shows that the overall agreement probability between two raters A and B does not depend upon the trait prevalence. Rather, it depends upon the raters' sensitivity and specificity values.
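Equation (5) can be checked numerically under assumptions A1 and A2; a sketch (the sensitivity and specificity values are illustrative, not from the paper):

```python
def overall_agreement(prevalence, aA, aB, bA, bB):
    """P_a = p_{++} + p_{--} for two raters with sensitivities (aA, aB)
    and specificities (bA, bB), assuming independent correct
    classifications (assumption A2)."""
    p_pp = prevalence * aA * aB + (1 - prevalence) * (1 - bA) * (1 - bB)
    pA_plus = prevalence * aA + (1 - prevalence) * (1 - bA)   # equation (3)
    pB_plus = prevalence * aB + (1 - prevalence) * (1 - bB)   # equation (4)
    p_mm = 1 - (pA_plus + pB_plus - p_pp)
    return p_pp + p_mm

# Under A1 (sensitivity = specificity for each rater), P_a is the same for
# every prevalence value and equals (1 - aA)(1 - aB) + aA*aB, as in equation (5).
```

Setting $b_A \ne a_A$ breaks this invariance, which is the point made at the end of this section.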
The partial derivative with respect to $P_r$ of an inter-rater coefficient of the form $\gamma = (P_a - P_e)/(1 - P_e)$ is given by

$$\frac{\partial\gamma}{\partial P_r} = -\frac{1 - P_a}{(1 - P_e)^2}\frac{\partial P_e}{\partial P_r}, \tag{6}$$

since, from equation (5), one can conclude that $\partial P_a/\partial P_r = 0$. For Scott's π- and Cohen's κ-statistics,

$$\frac{\partial\gamma_\pi}{\partial P_r} = \frac{2(1 - 2\lambda)^2(1 - P_a)(1 - 2P_r)}{(1 - P_{e|\pi})^2}, \tag{7}$$

$$\frac{\partial\gamma_\kappa}{\partial P_r} = \frac{2(1 - 2a_A)(1 - 2a_B)(1 - P_a)(1 - 2P_r)}{(1 - P_{e|\kappa})^2}, \tag{8}$$

where $\lambda = (a_A + a_B)/2$. Let $p_+$ be the probability that a randomly chosen rater classifies a randomly chosen participant into the '+' category. Then,

$$p_+ = (P_{A+} + P_{B+})/2 = \lambda P_r + (1 - \lambda)(1 - P_r). \tag{9}$$

The two equations (7) and (8) are derived from the fact that $P_{e|\pi} = p_+^2 + (1 - p_+)^2$ and $P_{e|\kappa} = 1 - (P_{A+} + P_{B+}) + 2P_{A+}P_{B+}$. It follows from assumption A1 that $\partial\gamma_G/\partial P_r = 0$, since $\hat{\gamma}_G$ is solely a function of $P_a$. Under this assumption, the G-index takes a constant value of $2P_a - 1$ that depends on the raters' sensitivity. Equation (6) shows that the chance-agreement probability plays a pivotal role in how inter-rater reliability relates to trait prevalence. Equation (7) indicates that Scott's π-statistic is an increasing function of $P_r$ for values of trait prevalence between 0 and 0.50, and becomes decreasing for $P_r > 0.50$, reaching its maximum value when $P_r = 0.50$. Because $0 \le P_r \le 1$, $\hat{\gamma}_\pi$ takes its smallest value at $P_r = 0$ and $P_r = 1$. Using equation (5) and the expression of $P_{e|\pi}$, one can show that:

$$\gamma_\pi = \frac{(2\lambda - 1)^2 P_r(1 - P_r) - (a_A - a_B)^2/4}{(2\lambda - 1)^2 P_r(1 - P_r) + \lambda(1 - \lambda)}. \tag{10}$$

It follows that

$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \gamma_\pi = -\frac{(a_A - a_B)^2}{4\lambda(1 - \lambda)}; \tag{11}$$

$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\pi = 2P_a - 1 = 1 - 4\lambda + 4a_Aa_B. \tag{12}$$

Equations (10), (11) and (12) show very well how the paradoxes occur in practice. From equation (11) it appears that whenever a trait is very rare or omnipresent, Scott's π-statistic yields a non-positive inter-rater reliability regardless of the raters' sensitivity values. In other words, if prevalence is low or high, any large extent of agreement between raters will not be reflected in the π-statistic.

Equation (8), on the other hand, indicates that when trait prevalence is smaller than 0.5, Cohen's κ-statistic may be an increasing or a decreasing function of trait prevalence depending on raters A and B's sensitivity values. That is, if one rater has a sensitivity smaller than 0.5 and the other a sensitivity greater than 0.5, then the κ-statistic is a decreasing function of $P_r$; otherwise it is increasing. The situation is similar when trait prevalence is greater than 0.50. The maximum or minimum value of κ is reached at $P_r = 0.50$. If one rater has a sensitivity of 0.50, then κ = 0 regardless of the trait prevalence. The general equation of Cohen's κ-statistic is given by:

$$\gamma_\kappa = \frac{(2a_A - 1)(2a_B - 1)P_r(1 - P_r)}{(2a_A - 1)(2a_B - 1)P_r(1 - P_r) + (1 - P_a)/2}. \tag{13}$$

It follows that

$$\text{if } P_r = 0 \text{ or } P_r = 1, \text{ then } \hat{\gamma}_\kappa = 0; \tag{14}$$

$$\text{if } P_r = 0.50, \text{ then } \hat{\gamma}_\kappa = 2P_a - 1 = 1 - 4\lambda + 4a_Aa_B. \tag{15}$$

Similar to Scott's π-statistic, κ seems to yield reasonable values only when trait prevalence is close to 0.5. A value of trait prevalence that is either close to 0 or close to 1 will considerably reduce the ability of κ to reflect any extent of agreement between raters.
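Equations (10) and (13) are straightforward to evaluate; a sketch (the sensitivity values are illustrative, not from the paper):

```python
def theoretical_pi_kappa(prevalence, aA, aB):
    """Theoretical pi (equation 10) and kappa (equation 13) under A1 and A2."""
    lam = (aA + aB) / 2
    P_a = (1 - aA) * (1 - aB) + aA * aB                   # equation (5)
    s = prevalence * (1 - prevalence)
    pi = (((2 * lam - 1) ** 2 * s - (aA - aB) ** 2 / 4)
          / ((2 * lam - 1) ** 2 * s + lam * (1 - lam)))
    kap_num = (2 * aA - 1) * (2 * aB - 1) * s
    kappa = kap_num / (kap_num + (1 - P_a) / 2)
    return pi, kappa

# Both statistics collapse as prevalence approaches 0 or 1, even though
# the raters' accuracy, and hence P_a, is unchanged by prevalence.
```

For example, with two equally sensitive raters both coefficients peak at $P_r = 0.50$ and fall to zero at the extremes.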
Many inter-rater agreement coefficients proposed in the literature have been criticized on the grounds that they are dependent upon trait prevalence. Such a dependence is inevitable if the raters' sensitivity levels differ from their specificity levels. In fact, without assumption A1, even the overall agreement probability $P_a$ is dependent upon trait prevalence $P_r$, since $P_a$ can be expressed as follows:

$$P_a = \big\{2(a_Aa_B - b_Ab_B) - [(a_A + a_B) - (b_A + b_B)]\big\}P_r + \big(1 + 2b_Ab_B - (b_A + b_B)\big).$$

However, the impact of prevalence on the overall agreement probability is small if sensitivity and specificity are reasonably close.
The previous analysis indicates that the G-index, the π-statistic and the κ-statistic all have the same reasonable behaviour when trait prevalence $P_r$ takes a value in the neighbourhood of 0.5. However, their behaviour (with the exception of the G-index) becomes very erratic as soon as trait prevalence goes to the extremes. I argue that the chance-agreement probability used in these statistics is ill-estimated when trait prevalence is in the neighbourhood of 0 or 1. I will now propose a new agreement coefficient that shares the common reasonable behaviour of its competitors in the neighbourhood of 0.5, but outperforms them when trait prevalence goes to the extremes.
4. An alternative agreement statistic

Before I introduce an improved alternative inter-rater reliability coefficient, it is necessary to develop a clear picture of the goal one normally attempts to achieve by correcting inter-rater reliability for chance agreement. My premises are the following:

(a) Chance agreement occurs when at least one rater rates an individual randomly.

(b) Only an unknown portion of the observed ratings is subject to randomness.

I will consider that a rater A classifies an individual into one of two categories either randomly, when he or she does not know where it belongs, or with certainty, when he or she is certain about its 'true' membership. Rater A performs a random rating not all the time, but with a probability $\theta_A$. That is, $\theta_A$ is the propensity for rater A to perform a random rating. The participants not classified randomly are supposed to have been classified into the correct category. If the random portion of the study were identifiable, the rating data of two raters A and B classifying N individuals into categories '+' and '−' could be reported as shown in Table 4.
Note that $N_{+-\cdot RC}$, for example, represents the number of individuals that rater A classified randomly into the '+' category and that rater B classified with certainty into the '−' category. In general, for $(k, l) \in \{+, -\}$ and $(X, Y) \in \{R, C\}$, $N_{kl\cdot XY}$ represents the number of individuals that rater A classified into category k using classification method X (random or certain), and that rater B classified into category l using classification method Y (random or certain).

To evaluate the extent of agreement between raters A and B from Table 4, what is needed is the ability to remove from consideration all agreements that occurred by chance; that is, $N_{++\cdot RR} + N_{++\cdot CR} + N_{++\cdot RC} + N_{--\cdot RR} + N_{--\cdot CR} + N_{--\cdot RC}$. This yields the following 'true' inter-rater reliability:

$$\gamma = \frac{N_{++\cdot CC} + N_{--\cdot CC}}{N - \sum_{k\in\{+,-\}}\big(N_{kk\cdot RR} + N_{kk\cdot CR} + N_{kk\cdot RC}\big)}. \tag{16}$$

Equation (16) can also be written as:

$$\gamma = \frac{P_a - P_e}{1 - P_e}, \quad \text{where} \quad P_a = \sum_{k\in\{+,-\}}\sum_{X,Y\in\{C,R\}}\frac{N_{kk\cdot XY}}{N} \quad \text{and} \quad P_e = \sum_{k\in\{+,-\}}\sum_{\substack{X,Y\in\{C,R\}\\(X,Y)\ne(C,C)}}\frac{N_{kk\cdot XY}}{N}. \tag{17}$$
In a typical reliability study the two raters A and B would rate n study participants, and the rating data would be reported as shown in Table 1, with q = 2. The problem is to find a good statistic $\hat{\gamma}$ for estimating γ. A widely accepted statistic for estimating the overall agreement probability $P_a$ is given by:

$$p_a = (n_{++} + n_{--})/n. \tag{18}$$

The estimation of $P_e$ represents a more difficult problem, since it requires one to be able to isolate ratings performed with certainty from random ratings. To get around this difficulty, I decided to approximate $P_e$ by a parameter that can be quantified more easily, and to evaluate the quality of the approximation in section 5.
Table 4. Distribution of N participants by rater, randomness of classification and response category

| Rater A |  | Rater B, Random (R): + | − | Rater B, Certain (C): + | − |
|---|---|---|---|---|---|
| Random (R) | + | $N_{++\cdot RR}$ | $N_{+-\cdot RR}$ | $N_{++\cdot RC}$ | $N_{+-\cdot RC}$ |
|  | − | $N_{-+\cdot RR}$ | $N_{--\cdot RR}$ | $N_{-+\cdot RC}$ | $N_{--\cdot RC}$ |
| Certain (C) | + | $N_{++\cdot CR}$ | $N_{+-\cdot CR}$ | $N_{++\cdot CC}$ | 0 |
|  | − | $N_{-+\cdot CR}$ | $N_{--\cdot CR}$ | 0 | $N_{--\cdot CC}$ |
Suppose an individual is selected randomly from a pool of individuals and rated by raters A and B. Let G and R be two events defined as follows:

$$G = \{\text{the two raters A and B agree}\}; \tag{19}$$

$$R = \{\text{a rater (A, or B, or both) performs a random rating}\}. \tag{20}$$

It follows that $P_e = P(G \cap R) = P(G|R)P(R)$, where P(G|R) is the conditional probability that A and B agree given that one of them (or both) has performed a random rating.

A random rating would normally lead to the classification of an individual into either category with the same probability 1/2, although this may not always be the case. Since agreement may occur on either category, it follows that $P(G|R) = 2 \times 1/2^2 = 1/2$. As for the estimation of the probability of random rating P(R), one should note that when the trait prevalence $P_r$ is high or low (i.e. if $P_r(1 - P_r)$ is small), a uniform distribution of participants among categories is an indication of a high proportion of random ratings, hence of a high probability P(R).

Let the random variable $X_+$ be defined as follows:

$$X_+ = \begin{cases} 1 & \text{if a rater classifies the participant into category '+',} \\ 0 & \text{otherwise.} \end{cases}$$

I suggest approximating P(R) with a normalized measure of randomness C defined by the ratio of the variance $V(X_+)$ of $X_+$ to the maximum possible variance $V_{MAX}$ for $X_+$, which is reached only when the rating is totally random. It follows that

$$C = V(X_+)/V_{MAX} = \frac{p_+(1 - p_+)}{(1/2)(1 - 1/2)} = 4p_+(1 - p_+),$$

where $p_+$ represents the probability that a randomly chosen rater classifies a randomly chosen individual into the '+' category. This leads to the following formulation of chance agreement:

$$P_e^* = P(G|R)\,C = 2p_+(1 - p_+). \tag{21}$$
This approximation leads to the following approximated 'true' inter-rater reliability:

$$\gamma_1 = \frac{P_a - P_e^*}{1 - P_e^*}. \tag{22}$$

The probability $p_+$ can be estimated from sample data by $\hat{p}_+ = (p_{A+} + p_{B+})/2$, where $p_{A+} = n_{A+}/n$ and $p_{B+} = n_{B+}/n$. This leads to the chance-agreement probability estimator $p_e^* = P(G|R)\hat{C}$, where $\hat{C} = 4\hat{p}_+(1 - \hat{p}_+)$. That is,

$$p_e^* = 2\hat{p}_+(1 - \hat{p}_+). \tag{23}$$

Note that $\hat{p}_+(1 - \hat{p}_+) = \hat{p}_-(1 - \hat{p}_-)$. Therefore, $p_e^*$ can be rewritten as

$$p_e^* = \hat{p}_+(1 - \hat{p}_+) + \hat{p}_-(1 - \hat{p}_-).$$

The resulting agreement statistic is given by

$$\hat{\gamma}_1 = (p_a - p_e^*)/(1 - p_e^*),$$

with $p_a$ given by equation (18); it is shown mathematically in section 5 to have a smaller bias with respect to the 'true' agreement coefficient than all its competitors.
Unlike the κ- and π-statistics, this agreement coefficient uses a chance-agreement probability that is calibrated to be consistent with the propensity of random rating suggested by the observed ratings. I will refer to the calibrated statistic $\hat{\gamma}_1$ as the AC1 estimator, where AC stands for agreement coefficient and the digit 1 indicates the first-order chance correction, which accounts for full agreement only, as opposed to full and partial agreement (second-order chance correction); the latter problem, which will be investigated elsewhere, will lead to the AC2 statistic.

A legitimate question to ask is whether the inter-rater reliability statistic $\hat{\gamma}_1$ estimates the 'true' inter-rater reliability of equation (16) at all, and under what circumstances. I will show in the next section that if trait prevalence is high or low, then $\hat{\gamma}_1$ does estimate the 'true' inter-rater reliability very well. The π, κ and G-index statistics, however, are biased for estimating the 'true' inter-rater reliability under any circumstances.
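On the Table 3 data, the calibrated chance correction of equation (23) behaves as intended where π fails; a sketch:

```python
# Two-rater AC1 on the Table 3 data: 94.4% observed agreement.
n11, n12, n21, n22 = 118, 5, 2, 0
n = n11 + n12 + n21 + n22
p_a = (n11 + n22) / n                              # overall agreement
p_plus = ((n11 + n12) + (n11 + n21)) / (2 * n)     # average '+' marginal, 0.972

pe_star = 2 * p_plus * (1 - p_plus)                # equation (23), small here
ac1 = (p_a - pe_star) / (1 - pe_star)              # about 0.94

pe_pi = p_plus ** 2 + (1 - p_plus) ** 2            # pi's chance term, huge here
pi = (p_a - pe_pi) / (1 - pe_pi)                   # about -0.03
```

Because the marginals are extreme, the calibrated chance term is near 0 rather than near 1, so AC1 stays close to the observed agreement.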
5. Biases of inter-rater reliability statistics

Let us consider two raters, A and B, who would perform a random rating with probabilities $\theta_A$ and $\theta_B$, respectively. Each classification of a study participant by a random mechanism will either lead to a disagreement or to an agreement by chance. The raters' sensitivity values (which are assumed to be identical to their specificity values) are given by:

$$a_A = 1 - \theta_A/2 \quad \text{and} \quad a_B = 1 - \theta_B/2.$$

These equations are obtained under the assumption that any rating that is not random will automatically lead to a correct classification, while a random rating leads to a correct classification with probability 1/2. In fact, $a_A = (1 - \theta_A) + \theta_A/2 = 1 - \theta_A/2$. Under this simple rating model, and following equation (5), the overall agreement probability is given by $P_a = a_Aa_B + (1 - a_A)(1 - a_B) = 1 - (\theta_A + \theta_B)/2 + \theta_A\theta_B/2$. As for the chance-agreement probability $P_e$, let $R_A$ and $R_B$ be two events defined as follows:

- $R_A$: rater A performs a random rating;
- $R_B$: rater B performs a random rating.

Then,

$$P_e = P(G \cap R) = P(G \cap R_A \cap \bar{R}_B) + P(G \cap R_A \cap R_B) + P(G \cap \bar{R}_A \cap R_B) = \theta_A(1 - \theta_B)/2 + \theta_A\theta_B/2 + \theta_B(1 - \theta_A)/2 = (\theta_A + \theta_B - \theta_A\theta_B)/2.$$

The 'true' inter-rater reliability is then given by:

$$\gamma = \frac{2(1 - \theta_A)(1 - \theta_B)}{1 + (1 - \theta_A)(1 - \theta_B)}. \tag{24}$$

The theoretical agreement coefficients will now be derived for the AC1, G-index, κ and π statistics. Let $\lambda = 1 - (\theta_A + \theta_B)/4$.
For the AC1 coefficient, it follows from equations (5) and (21) that the chance-agreement probability $P_e^*$ is obtained as follows:

$$P_e^* = 2p_+(1 - p_+) = 2\big[\lambda P_r + (1 - \lambda)(1 - P_r)\big]\big[1 - \lambda P_r - (1 - \lambda)(1 - P_r)\big] = 2\lambda(1 - \lambda) + 2(1 - 2\lambda)^2P_r(1 - P_r) = P_e - (\theta_A - \theta_B)^2/8 + \Delta,$$

where $\Delta = 2(1 - 2\lambda)^2P_r(1 - P_r)$. The theoretical AC1 coefficient is given by:

$$\gamma_1 = \gamma + (1 - \gamma)\frac{(\theta_A - \theta_B)^2/8 - \Delta}{(1 - P_e) + \big[(\theta_A - \theta_B)^2/8 - \Delta\big]}. \tag{25}$$

For Scott's π-coefficient, one can establish that the chance-agreement probability $P_{e|\pi}$ is given by $P_{e|\pi} = P_e + (1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta$. This leads to a Scott's π-coefficient of

$$\gamma_\pi = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta}{(1 - P_e) - \big[(1 - \theta_A)(1 - \theta_B) + (\theta_A - \theta_B)^2/8 - \Delta\big]}. \tag{26}$$

For the G-index, $P_{e|G} = 1/2 = P_e + (1 - \theta_A)(1 - \theta_B)/2$, so that

$$\gamma_G = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B)/2}{(1 - P_e) - (1 - \theta_A)(1 - \theta_B)/2}. \tag{27}$$

For Cohen's κ-coefficient, $P_{e|\kappa} = P_e + (1 - \theta_A)(1 - \theta_B) - \Delta_\kappa$, where $\Delta_\kappa = 2(1 - \theta_A)(1 - \theta_B)P_r(1 - P_r)$:

$$\gamma_\kappa = \gamma - (1 - \gamma)\frac{(1 - \theta_A)(1 - \theta_B) - \Delta_\kappa}{(1 - P_e) - \big[(1 - \theta_A)(1 - \theta_B) - \Delta_\kappa\big]}. \tag{28}$$
To gain further insight into the magnitude of the biases of these different inter-rater reliability statistics, let us consider the simpler case where raters A and B have the same propensity for random rating; that is, $\theta_A = \theta_B = \theta$. The 'true' inter-rater reliability is given by:

$$\gamma = \frac{2(1 - \theta)^2}{1 + (1 - \theta)^2}. \tag{29}$$

I define the bias of an agreement coefficient $\gamma_X$ as $B_X(\theta) = \gamma_X - \gamma$, the difference between the agreement coefficient and the 'true' coefficient. The biases of the AC1, π, κ and G-index statistics, respectively denoted by $B_1(\theta)$, $B_\pi(\theta)$, $B_\kappa(\theta)$ and $B_G(\theta)$, satisfy the following relations:

$$B_G(\theta) = -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2}; \quad -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2} \le B_1(\theta) \le 0;$$

$$-\frac{2(1 - \theta)^2}{1 + (1 - \theta)^2} \le B_\pi(\theta) \le -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2};$$

$$-\frac{2(1 - \theta)^2}{1 + (1 - \theta)^2} \le B_\kappa(\theta) \le -\frac{\theta(1 - \theta)^2(2 - \theta)}{1 + (1 - \theta)^2}.$$

Which way the bias will go depends on the magnitude of trait prevalence. It follows from these equations that the G-index consistently exhibits a negative bias, which takes a maximum absolute value of around 17% when the raters' propensity for random rating is around 35%, and which gradually decreases as θ goes to 1. The AC1 statistic, on the other hand, has a negative bias that ranges from $-\theta(1 - \theta)^2(2 - \theta)/(1 + (1 - \theta)^2)$ to 0, reaching its largest absolute value of $\theta(1 - \theta)^2(2 - \theta)/(1 + (1 - \theta)^2)$ only when the trait prevalence is around 50%. The remaining two statistics have some serious bias problems on the negative side. The π and κ statistics each have a bias whose lowest value is $-2(1 - \theta)^2/[1 + (1 - \theta)^2]$, which varies from 0 to −1. That means π and κ may underestimate the 'true' inter-rater reliability by up to 100%.
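The bounds above can be illustrated by simulating the rating model of this section; a Monte Carlo sketch (the sample size, seed and parameter values are mine, not from the paper's simulation study):

```python
import random

def simulate_ratings(n, theta_A, theta_B, prevalence, seed=0):
    """Simulate the section-5 rating model: each rater classifies a
    participant correctly unless rating at random (probability theta),
    in which case '+' or '-' is chosen with probability 1/2 each."""
    rng = random.Random(seed)
    counts = {("+", "+"): 0, ("+", "-"): 0, ("-", "+"): 0, ("-", "-"): 0}
    for _ in range(n):
        truth = "+" if rng.random() < prevalence else "-"
        a = truth if rng.random() >= theta_A else rng.choice("+-")
        b = truth if rng.random() >= theta_B else rng.choice("+-")
        counts[(a, b)] += 1
    return counts

theta = 0.3
gamma_true = 2 * (1 - theta) ** 2 / (1 + (1 - theta) ** 2)   # equation (29)
c = simulate_ratings(200_000, theta, theta, prevalence=0.05)
n = sum(c.values())
p_a = (c[("+", "+")] + c[("-", "-")]) / n
p_plus = (2 * c[("+", "+")] + c[("+", "-")] + c[("-", "+")]) / (2 * n)
ac1 = (p_a - 2 * p_plus * (1 - p_plus)) / (1 - 2 * p_plus * (1 - p_plus))
pe_pi = p_plus ** 2 + (1 - p_plus) ** 2
pi = (p_a - pe_pi) / (1 - pe_pi)
# At 5% prevalence, ac1 stays near gamma_true while pi falls far below it.
```

Rerunning at `prevalence=0.5` brings π back up towards the 'true' value, consistent with the bias ranges stated above.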
The next two sections, 6 and 7, are devoted to variance estimation of the generalized π-statistic and the AC1 statistic, respectively, in the context of multiple raters. For these two sections, I will assume that the n participants in the reliability study were randomly selected from a bigger population of N potential participants. Likewise, the r raters can be assumed to belong to a bigger universe of R potential raters. This finite-population framework has not yet been considered in the study of inter-rater agreement assessment. For this paper, however, I will confine myself to the case where r = R; that is, the estimators are not subject to any variability due to the sampling of raters. Methods needed to extrapolate to a bigger universe of raters will be discussed in a different paper.
6. Variance of the generalized π-statistic

The π-statistic, denoted by $\hat{\gamma}_\pi$, is defined as follows:

$$\hat{\gamma}_\pi = \frac{p_a - p_{e|\pi}}{1 - p_{e|\pi}}, \tag{30}$$

where $p_a$ and $p_{e|\pi}$ are defined as follows:

$$p_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad \text{and} \quad p_{e|\pi} = \sum_{k=1}^{q}\hat{p}_k^2, \quad \text{with } \hat{p}_k = \frac{1}{n}\sum_{i=1}^{n}\frac{r_{ik}}{r}. \tag{31}$$

Concerning the estimation of the variance of $\hat{\gamma}_\pi$, Fleiss (1971) suggested the following variance estimator under the hypothesis of no agreement between raters beyond chance:

$$v(\hat{\gamma}_\pi \mid \text{no agreement}) = \frac{2(1 - f)}{nr(r - 1)} \times \frac{p_{e|\pi} - (2r - 3)p_{e|\pi}^2 + 2(r - 2)\sum_{k=1}^{q}\hat{p}_k^3}{(1 - p_{e|\pi})^2}, \tag{32}$$

where $f = n/N$ is the sampling fraction, which can be neglected if the population of potential participants is deemed very large. It should be noted that this variance estimator is invalid for confidence interval construction. The original expression proposed by Fleiss does not include the finite-population correction factor $1 - f$. Cochran (1977) is a good reference for readers interested in statistical methods in finite-population sampling.
I propose here a non-parametric variance estimator for $\hat{\gamma}_\pi$ that is valid for confidence interval construction, obtained using the linearization technique. Unlike $v(\hat{\gamma}_\pi \mid \text{no agreement})$, the validity of the non-parametric variance estimator does not depend on the extent of agreement between the raters. This variance estimator is given by

$$v(\hat{\gamma}_\pi) = \frac{1 - f}{n}\,\frac{1}{n - 1}\sum_{i=1}^{n}\big(\hat{\gamma}_{\pi i}^* - \hat{\gamma}_\pi\big)^2, \tag{33}$$

where $\hat{\gamma}_{\pi i}^*$ is given by

$$\hat{\gamma}_{\pi i}^* = \hat{\gamma}_{\pi i} - 2(1 - \hat{\gamma}_\pi)\frac{p_{e\pi|i} - p_{e|\pi}}{1 - p_{e|\pi}}, \tag{34}$$

where $\hat{\gamma}_{\pi i} = (p_{a|i} - p_{e|\pi})/(1 - p_{e|\pi})$, and $p_{a|i}$ and $p_{e\pi|i}$ are given by:

$$p_{a|i} = \sum_{k=1}^{q}\frac{r_{ik}(r_{ik} - 1)}{r(r - 1)}, \quad \text{and} \quad p_{e\pi|i} = \sum_{k=1}^{q}\frac{r_{ik}}{r}\,\hat{p}_k. \tag{35}$$
To see how equation (33) is derived, one should consider the standard approach that consists of deriving an approximation of the actual variance of the estimator and using a consistent estimator of that approximate variance as the variance estimator. Let us assume that as the sample size n increases, the estimated chance-agreement probability $p_{e|\pi}$ converges to a value $P_{e|\pi}$ and that each $\hat{p}_k$ converges to $p_k$. If $\hat{p}$ and $p$ denote the vectors of the $\hat{p}_k$'s and $p_k$'s, respectively, it can be shown that

$$p_{e|\pi} - P_{e|\pi} = \frac{2}{n}\sum_{i=1}^{n}(p_{e\pi|i} - P_{e|\pi}) + O_p\big(\|\hat{p} - p\|^2\big),$$

and that if $G_\pi = (p_a - P_{e|\pi})/(1 - P_{e|\pi})$, then $\hat{\gamma}_\pi$ can be expressed as

$$\hat{\gamma}_\pi = \frac{(p_a - P_{e|\pi}) - (1 - G_\pi)(p_{e|\pi} - P_{e|\pi})}{1 - P_{e|\pi}} + O_p\big((p_{e|\pi} - P_{e|\pi})^2\big).$$

The combination of these two equations gives an approximation of $\hat{\gamma}_\pi$ that is a linear function of the $r_{ik}$'s and that captures all terms except those with a stochastic order of magnitude of 1/n, which can be neglected. Bishop, Fienberg, and Holland (1975, chapter 14) provide a detailed discussion of the concept of stochastic order of magnitude.
The variance estimator of equation (33) can be used for conﬁdence interval
construction as well as for hypothesis testing. Its validity is conﬁrmed by the simulation
study presented in section 9.
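Equations (33)-(35) can be implemented in a few lines; a sketch (the function name is mine, and the n × q count layout of Table 2 is assumed):

```python
def pi_with_variance(ratings, N=float("inf")):
    """Generalized pi (equation 30) and its linearized variance
    (equations 33-35) from an n x q table r[i][k] of rater counts."""
    n, q = len(ratings), len(ratings[0])
    r = sum(ratings[0])                                   # raters per participant
    f = 0.0 if N == float("inf") else n / N               # sampling fraction
    p_hat = [sum(row[k] for row in ratings) / (n * r) for k in range(q)]
    pe = sum(pk ** 2 for pk in p_hat)                     # chance agreement
    pa_i = [sum(rik * (rik - 1) for rik in row) / (r * (r - 1)) for row in ratings]
    p_a = sum(pa_i) / n                                   # overall agreement
    gamma = (p_a - pe) / (1 - pe)
    total = 0.0
    for i, row in enumerate(ratings):
        pe_i = sum((row[k] / r) * p_hat[k] for k in range(q))      # equation (35)
        g_i = (pa_i[i] - pe) / (1 - pe)
        g_star = g_i - 2 * (1 - gamma) * (pe_i - pe) / (1 - pe)    # equation (34)
        total += (g_star - gamma) ** 2
    return gamma, (1 - f) / n * total / (n - 1)           # equation (33)
```

With perfect agreement, every linearized term equals the estimate itself and the variance collapses to zero, as expected.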
Alternatively, a jackknife variance estimator can be used to estimate the variance of the π-statistic. The jackknife technique, introduced by Quenouille (1949) and developed by Tukey (1958), is a general-purpose technique for estimating variances. It has wide applicability, although it is computation intensive. The jackknife variance of $\hat{\gamma}_\pi$ is given by:

$$v_J(\hat{\gamma}_\pi) = (1 - f)\frac{n - 1}{n}\sum_{i=1}^{n}\big(\hat{\gamma}_\pi^{(i)} - \hat{\gamma}_\pi^{(\cdot)}\big)^2, \tag{36}$$

where $\hat{\gamma}_\pi^{(i)}$ is the π-statistic obtained after removing participant i from the sample, while $\hat{\gamma}_\pi^{(\cdot)}$ represents the average of all the $\hat{\gamma}_\pi^{(i)}$'s. Simulation results not reported in this paper show that this jackknife variance works well for estimating the variance of $\hat{\gamma}_\pi$. The idea of using the jackknife methodology for estimating the variance of an agreement coefficient was previously evoked by Kraemer (1980).
7. Variance of the generalized AC1 estimator

The AC1 statistic $\hat{\gamma}_1$ introduced in section 4 can be extended to the case of $r$ raters ($r \geq 2$) and $q$ response categories ($q \geq 2$) as follows:

$$ \hat{\gamma}_1 = \frac{p_a - p_{e|\gamma}}{1 - p_{e|\gamma}}, \qquad (37) $$

where $p_a$ is defined in equation (1), and the chance-agreement probability $p_{e|\gamma}$ is defined as follows:

$$ p_{e|\gamma} = \frac{1}{q-1}\sum_{k=1}^{q} \hat{p}_k\left(1 - \hat{p}_k\right), \qquad (38) $$

the $\hat{p}_k$'s being defined in equation (1).
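As a concrete illustration of equations (37) and (38), the generalized AC1 can be computed directly from the $n \times q$ matrix of rater counts $r_{ik}$. The sketch below is my own illustration, not code from the paper, and it assumes every participant is rated by the same number $r$ of raters:

```python
import numpy as np

def gwet_ac1(counts):
    """Generalized AC1 of equation (37), sketched for an (n, q) matrix
    where counts[i, k] is the number of raters who classified
    participant i into category k (each row summing to r raters)."""
    counts = np.asarray(counts, dtype=float)
    n, q = counts.shape
    r = counts.sum(axis=1)[0]                     # raters per participant
    # observed agreement p_a: average pairwise agreement per participant
    p_a = ((counts * (counts - 1)).sum(axis=1) / (r * (r - 1))).mean()
    # chance-agreement probability p_{e|gamma} of equation (38)
    p_k = counts.mean(axis=0) / r                 # classification probabilities
    p_e = (p_k * (1 - p_k)).sum() / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

With three raters and two categories, for instance, `gwet_ac1([[3, 0], [3, 0], [3, 0], [2, 1]])` returns 49/61, roughly 0.803.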
The estimator $\hat{\gamma}_1$ is a non-linear statistic of the $r_{ik}$'s. To derive its variance, I have used a linear approximation that includes all terms with a stochastic order of magnitude up to $n^{-1/2}$. This will yield a correct asymptotic variance that includes all terms with an order of magnitude up to $1/n$. Although a rigorous treatment of the asymptotics is not presented here, it is possible to establish that for large values of $n$, a consistent estimator for estimating the variance of $\hat{\gamma}_1$ is given by:

$$ v(\hat{\gamma}_1) = \frac{1-f}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\gamma}_{1|i}^{\ast} - \hat{\gamma}_1\right)^2, \qquad (39) $$

where $f = n/N$ is the sampling fraction,

$$ \hat{\gamma}_{1|i}^{\ast} = \hat{\gamma}_{1|i} - 2\left(1 - \hat{\gamma}_1\right)\frac{p_{e\gamma|i} - p_{e|\gamma}}{1 - p_{e|\gamma}}, $$

$\hat{\gamma}_{1|i} = (p_{a|i} - p_{e|\gamma})/(1 - p_{e|\gamma})$ is the agreement coefficient with respect to participant $i$, $p_{a|i}$ is given by,

$$ p_{a|i} = \sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)}, $$

and the chance-agreement probability with respect to unit $i$, $p_{e\gamma|i}$, is given by:

$$ p_{e\gamma|i} = \frac{1}{q-1}\sum_{k=1}^{q}\frac{r_{ik}}{r}\left(1 - \hat{p}_k\right). $$
To obtain equation (39), one should first derive a large-sample approximation of the actual variance of $\hat{\gamma}_1$. This is achieved by considering that as the size $n$ of the participant sample increases, the chance-agreement probability $p_{e|\gamma}$ converges to a fixed probability $P_{e|\gamma}$ and each classification probability $\hat{p}_k$ converges to a constant $p_k$. Let us define the following two vectors: $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_q)'$ and $p = (p_1, \ldots, p_q)'$. One can establish that:

$$ p_{e|\gamma} - P_{e|\gamma} = \frac{2}{n}\sum_{i=1}^{n}\left(p_{e\gamma|i} - P_{e|\gamma}\right) + O_p\!\left(\lVert \hat{p} - p \rVert^2\right), $$

$$ \hat{\gamma}_1 = \frac{(p_a - P_{e|\gamma}) - (1 - \Gamma_\gamma)(p_{e|\gamma} - P_{e|\gamma})}{1 - P_{e|\gamma}} + O_p\!\left((p_{e|\gamma} - P_{e|\gamma})^2\right), $$

where $\Gamma_\gamma = (p_a - P_{e|\gamma})/(1 - P_{e|\gamma})$. Combining these two expressions leads to a linear approximation of $\hat{\gamma}_1$, which can be used to approximate the asymptotic variance of $\hat{\gamma}_1$.
An alternative approach for estimating the variance of $\hat{\gamma}_1$ is the jackknife method. The jackknife variance estimator is given by:

$$ v_J(\hat{\gamma}_1) = (1-f)\,\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\gamma}_1^{(i)} - \hat{\gamma}_1^{(\cdot)}\right)^2, \qquad (40) $$

where $\hat{\gamma}_1^{(i)}$ represents the estimator $\hat{\gamma}_1$ computed after removing participant $i$ from the participant sample, and $\hat{\gamma}_1^{(\cdot)}$ the average of all the $\hat{\gamma}_1^{(i)}$'s.
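Since equations (36) and (40) share the same leave-one-out structure, a single routine can jackknife any agreement coefficient. This is a minimal sketch under my own naming, not code from the paper:

```python
import numpy as np

def jackknife_variance(counts, estimator, f=0.0):
    """Jackknife variance of equations (36)/(40): recompute the
    coefficient with each participant removed, then sum the squared
    deviations from the leave-one-out average."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    loo = np.array([estimator(np.delete(counts, i, axis=0))
                    for i in range(n)])            # gamma-hat^(i)
    return (1 - f) * (n - 1) / n * ((loo - loo.mean()) ** 2).sum()
```

Passing any agreement estimator that accepts the rater-count matrix yields the corresponding jackknife variance; the sampling fraction $f = n/N$ defaults to zero for an effectively infinite population.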
8. Special case of two raters

Two-rater reliability studies are of special interest. Rating data in this case are often conveniently reported using the distribution of participants by rater and response category as shown in Table 1. Therefore, the inter-rater reliability coefficient and its associated variance must be expressed as functions of the $n_{kl}$'s.

For two raters classifying $n$ participants into $q$ response categories, Fleiss et al. (1969) proposed an estimator $v(\hat{\gamma}_\kappa\,|\,\text{No agreement})$ for estimating the variance of Cohen's κ-statistic under the hypothesis of no agreement between the raters. If there exists an agreement between the two raters, Fleiss et al. recommended another variance estimator $v(\hat{\gamma}_\kappa\,|\,\text{Agreement})$. These estimators are given by:
$$ v(\hat{\gamma}_\kappa\,|\,\text{No agreement}) = \frac{1-f}{n\left(1 - p_{e|\kappa}\right)^2}\Bigg\{ \sum_{k=1}^{q} p_{Bk}\,p_{Ak}\left[1 - (p_{Bk} + p_{Ak})\right]^2 + \sum_{k=1}^{q}\sum_{\substack{l=1\\ l \neq k}}^{q} p_{Bk}\,p_{Al}\left(p_{Bk} + p_{Al}\right)^2 - p_{e|\kappa}^2 \Bigg\} \qquad (41) $$
and
$$ v(\hat{\gamma}_\kappa\,|\,\text{Agreement}) = \frac{1-f}{n\left(1 - p_{e|\kappa}\right)^2}\Bigg\{ \sum_{k=1}^{q} p_{kk}\left[1 - (p_{Ak} + p_{Bk})(1 - \hat{\gamma}_\kappa)\right]^2 + \left(1 - \hat{\gamma}_\kappa\right)^2 \sum_{k=1}^{q}\sum_{\substack{l=1\\ l \neq k}}^{q} p_{kl}\left(p_{Bk} + p_{Al}\right)^2 - \left[\hat{\gamma}_\kappa - p_{e|\kappa}(1 - \hat{\gamma}_\kappa)\right]^2 \Bigg\}. \qquad (42) $$
It can be shown that $v(\hat{\gamma}_\kappa\,|\,\text{Agreement})$ captures all terms of magnitude order up to $n^{-1}$, is consistent for estimating the true population variance, and provides valid normality-based confidence intervals when the number of participants is reasonably large.
When $r = 2$, the variance of the AC1 statistic given in equation (39) reduces to the following estimator:

$$ v(\hat{\gamma}_1) = \frac{1-f}{n\left(1 - p_{e|\gamma}\right)^2}\Bigg\{ p_a(1 - p_a) - 4\left(1 - \hat{\gamma}_1\right)\Bigg(\frac{1}{q-1}\sum_{k=1}^{q} p_{kk}\left(1 - \hat{p}_k\right) - p_a\,p_{e|\gamma}\Bigg) + 4\left(1 - \hat{\gamma}_1\right)^2\Bigg(\frac{1}{(q-1)^2}\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[1 - \frac{\hat{p}_k + \hat{p}_l}{2}\right]^2 - p_{e|\gamma}^2\Bigg)\Bigg\}. \qquad (43) $$
As for Scott's π-estimator, its correct variance is given by:

$$ v(\hat{\gamma}_\pi) = \frac{1-f}{n\left(1 - p_{e|\pi}\right)^2}\Bigg\{ p_a(1 - p_a) - 4\left(1 - \hat{\gamma}_\pi\right)\Bigg(\sum_{k=1}^{q} p_{kk}\,\hat{p}_k - p_a\,p_{e|\pi}\Bigg) + 4\left(1 - \hat{\gamma}_\pi\right)^2\Bigg(\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[\frac{\hat{p}_k + \hat{p}_l}{2}\right]^2 - p_{e|\pi}^2\Bigg)\Bigg\}. \qquad (44) $$
For the sake of comparability, one should note that the correct variance of kappa can be rewritten as follows:

$$ v(\hat{\gamma}_\kappa) = \frac{1-f}{n\left(1 - p_{e|\kappa}\right)^2}\Bigg\{ p_a(1 - p_a) - 4\left(1 - \hat{\gamma}_\kappa\right)\Bigg(\sum_{k=1}^{q} p_{kk}\,\hat{p}_k - p_a\,p_{e|\kappa}\Bigg) + 4\left(1 - \hat{\gamma}_\kappa\right)^2\Bigg(\sum_{k=1}^{q}\sum_{l=1}^{q} p_{kl}\left[\frac{p_{Ak} + p_{Bl}}{2}\right]^2 - p_{e|\kappa}^2\Bigg)\Bigg\}. \qquad (45) $$
The variance of the G-index is given by:

$$ v(\hat{\gamma}_G) = 4\,\frac{1-f}{n}\,p_a\left(1 - p_a\right). \qquad (46) $$
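For two raters, equation (43) can be evaluated directly from the $q \times q$ contingency table of joint classification counts. This is a sketch under my own naming, not the paper's code:

```python
import numpy as np

def ac1_two_raters(table, f=0.0):
    """Two-rater AC1 estimate and its variance, equation (43).

    table: (q, q) matrix; table[k, l] = number of participants put in
    category k by rater A and category l by rater B.
    Returns (ac1, variance); f = n/N is the sampling fraction."""
    table = np.asarray(table, dtype=float)
    q = table.shape[0]
    n = table.sum()
    p = table / n                             # joint proportions p_kl
    pA, pB = p.sum(axis=1), p.sum(axis=0)     # marginal proportions
    p_k = (pA + pB) / 2                       # p-hat_k
    p_a = np.trace(p)                         # observed agreement
    p_e = (p_k * (1 - p_k)).sum() / (q - 1)   # chance agreement
    g1 = (p_a - p_e) / (1 - p_e)
    # the three bracketed terms of equation (43)
    term1 = p_a * (1 - p_a)
    term2 = -4 * (1 - g1) * ((np.diag(p) * (1 - p_k)).sum() / (q - 1)
                             - p_a * p_e)
    cross = (p * (1 - (p_k[:, None] + p_k[None, :]) / 2) ** 2).sum()
    term3 = 4 * (1 - g1) ** 2 * (cross / (q - 1) ** 2 - p_e ** 2)
    var = (1 - f) / (n * (1 - p_e) ** 2) * (term1 + term2 + term3)
    return g1, var
```

The square root of the returned variance is the standard error of the AC1 estimate.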
Using the rating data of Table 3, I obtained the following inter-rater reliability estimates and variance estimates:

| Statistic | Estimate (%) | Standard error (%) |
|---|---|---|
| AC1 | 94.08 | 2.30 |
| Kappa | −2.34 | 1.23 |
| Pi | −2.88 | 1.09 |
| G-index | 88.80 | 4.11 |

Because the percentage agreement $p_a$ equals 94.4%, it appears that AC1 and the G-index are more consistent with the observed extent of agreement. The κ and π statistics have low values that are very inconsistent with the data configuration and would be difficult to justify. If the standard error is compared with the inter-rater reliability estimate, AC1 appears to be the most accurate of all the agreement coefficients.
9. Monte-Carlo simulation

In order to compare the biases of the various inter-rater reliability coefficients under investigation and to verify the validity of the different variance estimators discussed in the previous sections, I have conducted a small Monte-Carlo experiment. This experiment involves two raters, A and B, who must classify $n$ (for $n = 20, 60, 80, 100$) participants into one of two possible categories, '+' and '−'.

All the Monte-Carlo experiments are based upon the assumption of a prevalence rate of $P_r = 95\%$. A propensity for random rating $\theta_A$ is set for rater A and another one, $\theta_B$, for rater B at the beginning of each experiment. These parameters allow us to use equation (19) to determine the 'true' inter-rater reliability to be estimated. Each Monte-Carlo experiment is conducted as follows:

- The $n$ participants are first randomly classified into the two categories '+' and '−' in such a way that a participant falls into category '+' with probability $P_r$.
- If a rater performs a random rating (with probabilities $\theta_A$ for rater A and $\theta_B$ for rater B), then the participant to be rated is randomly classified into one of the two categories with the same probability 1/2. A non-random rating is supposed to lead to a correct classification.
- The number of replicate samples drawn in this simulation is 500.
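The three steps above can be sketched as follows; the seed, function names and parameter names are mine, and the 'true' reliability of equation (19) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2006)

def simulate_ratings(n, prevalence=0.95, theta_a=0.05, theta_b=0.05):
    """One replicate of the design above: each participant is '+' with
    probability `prevalence`; a rater answers at random (1/2 per
    category) with its propensity theta, and correctly otherwise."""
    truth = rng.random(n) < prevalence            # True means '+'
    def rate(theta):
        random_mask = rng.random(n) < theta
        guesses = rng.random(n) < 0.5
        return np.where(random_mask, guesses, truth)
    return rate(theta_a), rate(theta_b)

def ac1_binary(a, b):
    """Two-rater, two-category AC1 (equations 37-38 with q = r = 2)."""
    p_a = np.mean(a == b)
    pi_plus = (np.mean(a) + np.mean(b)) / 2
    p_e = 2 * pi_plus * (1 - pi_plus)             # sum of p_k(1 - p_k), q = 2
    return (p_a - p_e) / (1 - p_e)

# 500 replicate samples of n = 100 participants
estimates = [ac1_binary(*simulate_ratings(100)) for _ in range(500)]
```

Repeating the loop for each $n$ and each $(\theta_A, \theta_B)$ pair, and for the other three coefficients, produces replicate estimates from which the bias and variance summaries below are computed.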
Each Monte-Carlo experiment has two speciﬁc objectives, which are to evaluate the
magnitude of the biases associated with the agreement coefﬁcients and to verify the
validity of their variance estimators.
The bias of an estimator is measured by the difference between its Monte-Carlo expectation and the 'true' inter-rater reliability. The bias of a variance estimator, on the other hand, is
obtained by comparing its Monte-Carlo expectation with the Monte-Carlo variance of
the agreement coefﬁcient. A small bias is desirable as it indicates that a given estimator
or variance estimator has neither a tendency to overestimate the true population
parameter nor a tendency to underestimate it.
In the simulation programmes, the calculation of the π-statistic and that of the κ-statistic were modified slightly in order to avoid the difficulty posed by undefined estimates. When $p_{e|\pi} = 1$ or $p_{e|\kappa} = 1$, these chance-agreement probabilities were replaced with 0.99999 so that the agreement coefficient can be defined.
Table 5 contains the relative bias of the agreement coefficients $\hat{\gamma}_\pi$, $\hat{\gamma}_\kappa$, $\hat{\gamma}_G$, and $\hat{\gamma}_1$. A total of 500 replicate samples were selected and for each sample $s$ an estimate $\hat{\gamma}_s$ was calculated. The relative bias is obtained as follows:

$$ \mathrm{RelBias}(\hat{\gamma}) = \left(\frac{1}{500}\sum_{s=1}^{500}\hat{\gamma}_s - \gamma\right)\bigg/\,\gamma, $$

where $\gamma$ is the 'true' inter-rater reliability obtained with equation (19). It follows from Table 5 that the relative bias of the AC1 estimator, which varies from −0.8 to 0.0% when $\theta_A = \theta_B = 5\%$, and from −2.1 to −1.3% when $\theta_A = 20\%$ and $\theta_B = 5\%$, is consistently smaller than the relative bias of the other inter-rater reliability statistics. The π and κ statistics generally exhibit a very large negative bias under current conditions, ranging from −32.8 to −62.5%. The main advantage of the AC1 statistic over the G-index stems from the fact that when the rater's propensity for random rating is large (i.e. around 35%), the bias of the G-index is at its highest, while that of the AC1 will decrease as the trait prevalence increases.
Table 6 shows the Monte-Carlo variances of the four agreement statistics under investigation, as well as the Monte-Carlo expectations of the associated variance estimators. The Monte-Carlo expectation of a variance estimator $v$ is obtained by averaging all 500 variance estimates $v_s$ obtained from each replicate sample $s$. The Monte-Carlo variance of an agreement coefficient $\hat{\gamma}$, on the other hand, is obtained by averaging all 500 squared differences between the estimates $\hat{\gamma}_s$ and their average. More formally, the Monte-Carlo expectation $E(v)$ of a variance estimator $v$ is defined as follows:

$$ E(v) = \frac{1}{500}\sum_{s=1}^{500} v_s, $$

while the Monte-Carlo variance $V(\hat{\gamma})$ of an agreement statistic $\hat{\gamma}$ is given by:

$$ V(\hat{\gamma}) = \frac{1}{500}\sum_{s=1}^{500}\left[\hat{\gamma}_s - \operatorname{average}(\hat{\gamma})\right]^2. $$

Table 5. Relative bias of agreement coefficients for $P_r = 0.95$, based on 500 replicate samples

| $\theta_A$, $\theta_B$ (*) | $n$ | $B(\hat{\gamma}_\pi)$ (%) | $B(\hat{\gamma}_\kappa)$ (%) | $B(\hat{\gamma}_G)$ (%) | $B(\hat{\gamma}_1)$ (%) |
|---|---|---|---|---|---|
| $\theta_A = \theta_B = 5\%$ | 20 | −32.8 | −32.0 | −3.6 | 0.0 |
| | 60 | −39.5 | −39.3 | −5.1 | −0.7 |
| | 80 | −36.5 | −36.4 | −4.9 | −0.6 |
| | 100 | −35.1 | −35.0 | −5.2 | −0.8 |
| $\theta_A = 20\%$, $\theta_B = 5\%$ | 20 | −62.5 | −59.9 | −11.9 | −2.1 |
| | 60 | −58.4 | −57.0 | −11.7 | −1.4 |
| | 80 | −58.2 | −56.9 | −12.1 | −1.6 |
| | 100 | −57.4 | −56.3 | −11.6 | −1.3 |

(*) $\theta_A$ and $\theta_B$ represent the propensity for random rating of raters A and B, respectively.

Table 6. Monte-Carlo variances and Monte-Carlo expectations of variance estimates for $P_r = 0.95$ and $\theta_A = \theta_B = 0.05$

| $n$ | $V(\hat{\gamma}_\pi)$ (%) | $E[v(\hat{\gamma}_\pi)]$ (%) | $V(\hat{\gamma}_\kappa)$ (%) | $E[v(\hat{\gamma}_\kappa)]$ (%) | $V(\hat{\gamma}_G)$ (%) | $E[v(\hat{\gamma}_G)]$ (%) | $V(\hat{\gamma}_1)$ (%) | $E[v(\hat{\gamma}_1)]$ (%) |
|---|---|---|---|---|---|---|---|---|
| 20 | 15.8 | 3.3 | 15.0 | 3.13 | 0.79 | 0.78 | 0.32 | 0.33 |
| 60 | 6.0 | 3.9 | 5.9 | 3.83 | 0.28 | 0.31 | 0.10 | 0.12 |
| 80 | 3.9 | 3.0 | 3.8 | 3.00 | 0.24 | 0.23 | 0.09 | 0.09 |
| 100 | 2.5 | 2.4 | 2.5 | 2.39 | 0.17 | 0.19 | 0.07 | 0.07 |
It follows from Table 6 that the variance of the AC1 statistic is smaller than that of the other statistics. In fact, $V(\hat{\gamma}_1)$ varies from 0.07% when the sample size is 100 to 0.33% when the sample size is 20. The second smallest variance is that of the G-index, which varies from 0.17 to 0.79%. The κ and π statistics generally have larger variances, which range from 2% to about 15%. An examination of the Monte-Carlo expectations of the various variance estimators indicates that the proposed variance estimators for AC1 and the G-index work very well. Even for a small sample size, these expectations are very close to the Monte-Carlo approximations. The variance estimators of the κ and π statistics also work well, except for small sample sizes, for which they underestimate the 'true' variance.
10. Concluding remarks

In this paper, I have explored the problem of inter-rater reliability estimation when the extent of agreement between raters is high. The paradox of the κ and π statistics has been investigated and an alternative agreement coefficient proposed. I have proposed new variance estimators for the κ, π and AC1 statistics using the linearization and jackknife methods. The validity of these variance estimators does not depend upon the assumption of independence. The absence of such variance estimators has prevented practitioners from constructing confidence intervals for multiple-rater agreement coefficients.

I have introduced the AC1 statistic, which is shown to have better statistical properties than its κ, π and G-index competitors. The κ and π estimators became well known for their supposed ability to correct the percentage agreement for chance agreement. However, this paper argues that not all observed ratings would lead to agreement by chance. This will particularly be the case if the extent of agreement is high in a situation of high trait prevalence. Kappa and pi evaluate the chance-agreement probability as if all observed ratings may yield an agreement by chance. This may lead to unpredictable results with rating data that suggest a rather small propensity for chance agreement. The AC1 statistic was developed in such a way that the propensity for chance agreement is proportional to the portion of ratings that may lead to an agreement by chance, reducing the overall agreement by chance to the right magnitude.
The simulation results tend to indicate that the AC1 and G-index statistics have reasonably small biases for estimating the 'true' inter-rater reliability, while the κ and π statistics tend to underestimate it. The AC1 outperforms the G-index when the trait prevalence is high or low. If the trait prevalence is around 50%, all agreement statistics perform alike. The absolute bias in this case increases with the raters' propensity for random rating, which can be reduced by giving extra training to the raters. The proposed variance estimators work well according to our simulations. For small sample sizes, the variance estimators proposed for the κ and π statistics tend to underestimate the true variances.
References

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.

Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press.

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558.

Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.

Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323–327.

Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24, 749–753.

Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.

Kraemer, H. C. (1980). Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44, 461–472.

Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33, 363–374.

Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365–377.

Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal Statistical Society, B, 11, 68–84.

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.

Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical Statistics, 29, 614.

Received 6 January 2006; revised version received 14 June 2006