Double Entropy Inter-rater Agreement Indices
Andriy Olenko^1 and Vitaliy Tsyganok^2
Abstract
The proper application of the most frequently used inter-rater agreement indices can be problematic in the case of a single target, for example, a psychotherapy patient, a student's thesis, a grant proposal, or the lifestyle in a country. The majority of indices that can handle this case either assess the deviation of ranks from some central/average value or the pattern of the ranks' distribution. Contrary to other approaches, this article defines disagreement in rating results using the unpredictability/complexity of scores. The article discusses alternative entropy methods for measuring inter-rater agreement or consensus in survey responses for the case of a single target. A new inter-rater agreement index is proposed. Comparisons between this index and known inter-rater agreement measures show some limitations of the most frequently used indices. Various important methodological issues, such as disagreement assumptions, average sensitivity, and adjustments to deal with outliers and missing or incorrectly recorded data, are discussed. Examples of applications to actual data are presented.
Keywords
psychological statistics, agreement index, similarity measure, average sensitivity
^1 La Trobe University, Australia
^2 Institute for Information Recording, National Academy of Sciences of Ukraine, Ukraine
Corresponding Author:
A. Olenko, Department of Mathematics and Statistics, La Trobe University, Victoria, 3086, Australia
Email: a.olenko@latrobe.edu.au
The authors are grateful to Drs. C.L. Ambrey, J. Dawes, and C.M. Fleming for providing their data.
1 INTRODUCTION
Assessments of inter-rater agreement appear in numerous applications in laboratory and field studies, various types of data analysis problems, organizational and applied psychology, etc. The publications (Lin et al., 2002; Shoukri, 2004; von Eye & Mun, 2005; LeBreton & Senter, 2008; Wagner et al., 2010) give excellent surveys of the field. More recent results and bibliography can be found in (LeBreton et al., 2005; Shah, 2011; Gwet, 2012; Lin et al., 2013; Lohse-Bossenz et al., 2014).
Inter-rater agreement indices have been used and studied extensively in a number of contexts. For example, the article by Smith-Crowe et al. (2013) reports that almost half of the articles published in various applied psychology journals used inter-rater agreement indices. Recently such indices have been frequently applied to justify aggregating lower-level data in multilevel and hierarchical models. Some statistical characteristics of inter-rater agreement indices were obtained and discussed for specific scenarios; see, for example, (Fleiss, 1971; James et al., 1984; Burke et al., 2002; Brown & Hauenstein, 2005; Klemens, 2012).
Despite recent progress and results for the case of multiple targets, there has been remarkably little study of agreement estimates that can also be applied to a single target; see (Lindell & Brandt, 1997; Cicchetti et al., 1997, 2009; Lindell et al., 1999; Baca-García et al., 2001). Most of the known indices do not allow for the estimation of chance agreement on a single target. However, there has been significant interest in single-target problems in applications where a group of raters is formed only for a single evaluation.
Firstly, various on-line ratings that are based on situational random responses require agreement estimates. For example, the Faculty of 1000 (http://f1000.com) evaluates the quality of scientific articles based on the opinion of scientific experts. The Faculty of 1000 now has more than 5000 evaluators spread across 44 subject-specific faculties. A few other examples: big on-line movie databases, for instance, Rotten Tomatoes, IMDb, and Netflix, collect and provide information about user rankings of movies; Amazon's users rank books and music; Google asks customers to rate businesses on a 30-point scale. The number of responses in these ratings varies from a few dozen to hundreds of thousands.
The second example situation is when users select raters of interest to perform an evaluation. For instance, patients can be examined by independent psychiatric practitioners. Knowing the degree of agreement might help in making decisions in the most difficult or unprecedented clinical cases.
Inter-rater agreement indices can also be used to examine consensus in survey responses. They can measure whether several respondents, who evaluate the same target, produce similar scores. Especially interesting cases are multilevel and hierarchical models, where each final score is the result of various aggregations of responses, see Example 1 in Section 6. In such situations the methods generating the final scores can have too complex a structure to be investigated directly.
Another potentially interesting area is when human experts are replaced by some automatic labelling processes, for example, data mining classifiers, which are selected from a large number of potential methods. The reader can find more details about classical and new applications of agreement indices in (von Eye & Mun, 2005; Shah, 2011; Smith-Crowe et al., 2013).
Agreement indices are also widely applied in constructed-response assessments, especially when different raters can use partially different criteria. In such cases data about scores of constructed-response items may have a hierarchical structure, see DeCarlo et al. (2011). Another important essay-grading application is measuring the increase in consensus by comparing raters' scores before and after an initial training.
Contrary to other approaches, this article defines disagreement in rating results for the case of a single target using the unpredictability/complexity of scores. Our focus in this paper is on a new entropy-based agreement index and its properties. Most of the known agreement indices either assess the deviation of scores from some central (average) value or the pattern of the scores' distribution. To overcome this limitation we assess both the spread and the variability of distributions. In terms of the underlying rationale for this approach, we measure the unpredictability/uncertainty of scores by applying the Shannon entropy twice. First, we assess the degree of uncertainty for the frequencies of selected scores, then we compute the degree of uncertainty in their concentration. Each time the Shannon entropy is computed and then normalized to the range [0, 1]. The double entropy index will be introduced using weighted averages of these normalized entropies.
The new index aims to overcome some limitations of other well-known similarity/dissimilarity measures. For example:
• Many classical measures are not normalized and, therefore, their results are not directly comparable;
• Indices similar to χ² use goodness-of-fit principles and depend only on the frequencies of the selected scores, not on the scores' actual values;
• Measures that refer to a uniform distribution as the most dissimilar rating often cannot distinguish between distributions with equidistant equal frequencies;
• Many similarity measures do not change gradually in discrete cases and can produce quite different values for similar rating results.
The article establishes properties of the new index and compares it with various known indices. Two modifications of the index designed to be insensitive to outliers and missing data are introduced and discussed. The article also gives unified representations of the known indices, which might be useful for other research problems. Potential practical applications are illustrated by examples. The study shows that the new indices perform well overall compared to other well-known indices.
The rest of the paper is organized as follows. Section 2 introduces the main notation and assumptions. A novel measure of inter-rater agreement is defined and discussed in Section 3. Section 4 gives unified representations of some well-known agreement/disagreement measures. Then it provides an insight into the empirical behavior of the proposed index and its advantages over the related metrics. Section 5 proposes two adjustments to deal with outliers and missing or incorrectly recorded data. Examples of applications to real data are presented in Section 6. Finally, conclusions and discussions are given in Section 7.
2 FRAMEWORK AND NOTATION
In this section we introduce underlying assumptions, some basic notations, and a conve-
nient way to represent rating results.
All information about rating results can be presented by four characteristics: the number of categories in the response scale, the number of raters, the selected score values, and the number of raters who assigned each selected score. Below we formalise this information for the subsequent computations.
Let n ≥ 2 be the number of possible scores (the number of categories into which assignments are made) and I := {i_1, ..., i_k} be a set of score values selected by raters. Without loss of generality we assume that i_1 < i_2 < ... < i_k.
Let m ≥ 2 be the total number of raters and r_i be the number of those who gave the score i. Denote R := (r_1, ..., r_n). If the score i was not given, then r_i = 0. Thus, m = \sum_{i \in I} r_i.
To visualize these notations one can use a bar plot as shown in Figure 1. The horizontal axis of the plot shows the scores and the vertical axis gives the frequencies r_i.
Figure 1: Case of n = 10 and R = (4, 5, 1, 1, 4, 0, 1, 2).
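The notation can be summarised in a few lines of R (a hypothetical example; the frequencies below are assumed for illustration and are not the data of Figure 1):

n <- 10                                   # number of possible scores
freq <- c(4, 5, 1, 1, 4, 0, 1, 2, 0, 2)   # assumed frequency vector R = (r_1, ..., r_n)
I <- which(freq > 0)                      # selected score values i_1 < ... < i_k
k <- length(I)                            # number of distinct selected scores
m <- sum(freq)                            # total number of raters, m = sum of r_i over i in I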
Agreement indices can be derived from various statistical and epistemological assumptions and yield differing results. The problem with most inter-rater agreement indices is that they either assess the deviation of the scores in I from some central value, e.g., the mean or the median, or the deviation of the score distribution R from a specific pattern, e.g., a uniform distribution. We will define a new inter-rater agreement index that takes into account both deviations.
Firstly, it is important to formally specify the most disagreement rating result. In this article we adopt the approach which defines disagreement in rating results using the unpredictability of scores. Thus, under maximal disagreement it is impossible to conclude that some raters agree more about a target and prefer a particular score to the other scores. Note that some popular indices use a different approach. Namely, they define the most disagreement rating as the one where half of the raters select the lowest score and the other half select the highest one. This article considers this scenario only as a partial disagreement because both marginal scores receive high frequencies but are diametrically opposite (directly opposite preferences are usually highly correlated and can be treated as a partial agreement). In addition, in this situation there is a high level of agreement within each diametrical group. For example, all raters consider a target as out of the ordinary, but rank it very low or very high because of their training or "like"/"dislike" preferences.
For given k, m, and n the most disagreement rating result is the one where
• the k elements of I are equidistantly located in the set {1, ..., n} with i_1 = 1 and i_k = n (or the distances between consecutive selected scores vary at most by 1 if k < n and equidistant placement is impossible, i.e. k − 1 is not a divisor of n − 1);
• the values r_{i_1}, ..., r_{i_k} are the same (or vary at most by 1 if k is not a divisor of m).
This means that the same number of raters selects each of the k scores and these k scores are spread in equidistant steps from 1 to n.
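One possible R construction of such a pattern is sketched below (an illustration under the stated conditions, not the authors' code; the function name most_disagreement is introduced only for this sketch):

# Construct a most disagreement rating result for given n, m and k:
# k (near-)equidistant positions from 1 to n and (near-)equal frequencies summing to m.
most_disagreement <- function(n, m, k) {
  pos <- if (k == 1) 1 else 1 + floor((0:(k - 1)) * (n - 1) / (k - 1))  # gaps differ by at most 1
  freq <- rep(floor(m / k), k)
  if (m %% k > 0) freq[1:(m %% k)] <- freq[1] + 1                       # frequencies differ by at most 1
  R <- integer(n)
  R[pos] <- freq
  R                                                                     # frequency vector (r_1, ..., r_n)
}
most_disagreement(n = 10, m = 20, k = 4)   # 5 raters at each of the scores 1, 4, 7, 10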
Figure 2 shows some examples of the most disagreement rating results for n = 10, m = 20, and k = 10, 4, and 2, respectively.
Secondly, the situation of complete disagreement of m raters is referred to as an occurrence of the two above conditions together with k = min(m, n). For example, for n = 10 and m = 20 it is shown in the first plot of Figure 2.
Figure 2: Examples of most disagreement rating results (bar plots of frequencies r_i against scores).
Note that the above definition describes the probability mass function of the uniform distribution on the set {1, ..., n} if n is a divisor of m. However, it differs from a uniform distribution in those cases where the m raters cannot be equally split into n groups.
3 DOUBLE ENTROPY INDEX
In this section we use the entropy measure to introduce a new index of inter-rater agreement/disagreement. We discuss the motivation for the index's definition and some marginal cases, and estimate the sensitivity of the index.
The new index describes the complexity of the scores. A natural candidate for this purpose is the entropy, a classical measure of complexity. However, it is not enough to use the entropy once, as the problem involves two levels of complexity/uncertainty. The first one pertains to the relative positions of the selected scores across the response scale. An additional level appears because some selected scores are more, and some less, preferred by the experts.
To assess agreement among multiple scores for a subject we use
1. the distribution of the selected scores i_1, ..., i_k in the set {1, ..., n};
2. the distribution of the frequencies r_i, i ∈ I, of the selected scores.
For each of the above distributions the Shannon entropy will be computed and then normalized to the range [0, 1]. The double entropy index will be defined as an average of the normalized entropies. In Section 5 the adjusted index will be introduced using weighted averages of these entropies.
Step 1. To describe the spread of the scores {i_1, ..., i_k} we use the distances

d_j := i_{j+1} - i_j, \quad j = 1, \ldots, k - 1,    (1)

between consecutive selected scores. We also introduce an additional distance between i_k and i_1 by

d_k := \begin{cases} (n - i_k) + (i_1 - 1) + \left[ \frac{n-1}{k-1} \right], & \text{if } k > 1, \\ n, & \text{if } k = 1, \end{cases}    (2)

where [x] is the largest integer less than or equal to x.
Note that all distances are positive integers and their sum is the same for each k regardless of the actual values of {i_1, ..., i_k}. This sum equals

d := \begin{cases} n - 1 + \left[ \frac{n-1}{k-1} \right], & \text{if } k > 1, \\ n, & \text{if } k = 1. \end{cases}    (3)
Using the set of distances {d_1, ..., d_k} one can normalize them to sum to one and introduce the corresponding fractions P := {p_1, ..., p_k} by the formula p_j := d_j / d, j = 1, ..., k.
To characterise disagreement among the k selected scores we use the entropy

H(P) = -\sum_{j=1}^{k} p_j \ln p_j.    (4)
Step 2. To describe the distribution of the frequencies r_i, i ∈ I, we normalise them and introduce the corresponding fractions Q := {q_i, i ∈ I} by the formula

q_i = \frac{r_i}{\sum_{i \in I} r_i}, \quad i \in I.    (5)

If the score i was not given by raters, then r_i = 0 and we obtain q_i = 0, i ∉ I.
To characterise the degree of unpredictability of Q we use the entropy

H(Q) = -\sum_{i \in I} q_i \ln q_i.    (6)

Note that all q_i > 0 if i ∈ I, because the corresponding r_i ≥ 1.
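As a minimal illustration, the two entropies can be computed in R as follows (a sketch only; the authors' own R implementation is referenced in the Supplementary Materials, and the helper names entropy, H_P and H_Q are introduced here just for this example):

entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

# H(P): entropy of the normalized distances between the k selected scores, eqs. (1)-(4)
H_P <- function(I, n) {
  k <- length(I); I <- sort(I)
  if (k == 1) return(0)                              # a single selected score gives d_1 = n and p_1 = 1
  extra <- floor((n - 1) / (k - 1))                  # the [(n - 1)/(k - 1)] term in eq. (2)
  d <- c(diff(I), (n - I[k]) + (I[1] - 1) + extra)   # distances d_1, ..., d_k
  entropy(d / sum(d))                                # eq. (4) with p_j = d_j / d
}

# H(Q): entropy of the frequency proportions of the selected scores, eqs. (5)-(6)
H_Q <- function(freq) {
  q <- freq[freq > 0] / sum(freq)
  entropy(q)
}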
Step 3. To get a normalized index taking values between 0 and 1, we first normalize each of the entropies H(P) and H(Q) to the range [0, 1]:

H^*(P) := \begin{cases} \dfrac{H(P) - \min_P H(P)}{\max_P H(P) - \min_P H(P)} = \dfrac{H(P) - B_{n,k}}{A_{n,k} - B_{n,k}}, & \text{if } k < n, \\ 1, & \text{if } k = n, \end{cases}    (7)

H^*(Q) := \frac{H(Q)}{\max_{k,Q} H(Q)} = \frac{H(Q)}{C_{n,m}},    (8)

where \max_P H(P) = A_{n,k}, \min_P H(P) = B_{n,k}, and \max_{k,Q} H(Q) = C_{n,m}. The details of the computation of the values A_{n,k}, B_{n,k}, and C_{n,m} and implementations in R code can be found in the Supplementary Materials.
Secondly, we define the normalized index of disagreement κ′(P, Q) as the average of the normalized entropies

κ′(P, Q) := \frac{H^*(P) + H^*(Q)}{2}.    (9)
The normalized index κ′(P, Q) takes values in the interval [0, 1]. There exist score distributions for which κ′(P, Q) reaches both endpoints 0 and 1. Namely, if all raters give the same score, which indicates perfect agreement, then p_1 = q_1 = 1 and κ′(P, Q) = 0; if each score is selected by [m/n] or [m/n] + 1 raters (so that the frequencies differ by at most 1), then κ′(P, Q) = 1.
Finally, the corresponding index of inter-rater agreement κ(P, Q) is defined by

κ(P, Q) := 1 - κ′(P, Q).    (10)

It follows from (9) and (10) that κ(P, Q) increases when H*(P) or H*(Q) decreases. Both H*(P) and H*(Q) make equal contributions to κ(P, Q).
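A sketch of the full index for a vector of raw scores is given below. It reuses H_P() and H_Q() from the sketch above; the constants A_{n,k} and B_{n,k} are obtained by a brute-force search over all placements of k scores, and C_{n,m} is taken as H(Q) of the complete-disagreement pattern of Section 2. These shortcuts are assumptions made for this illustration; the exact derivations are in the Supplementary Materials.

kappa_index <- function(scores, n) {
  freq <- tabulate(scores, nbins = n)                # frequencies r_1, ..., r_n
  I <- which(freq > 0); k <- length(I); m <- sum(freq)
  if (k == n) {
    HPstar <- 1                                      # eq. (7), case k = n
  } else if (k == 1) {
    HPstar <- 0                                      # single selected score: no spread
  } else {
    HPs <- apply(combn(n, k), 2, H_P, n = n)         # H(P) over all placements of k scores
    HPstar <- (H_P(I, n) - min(HPs)) / (max(HPs) - min(HPs))   # eq. (7)
  }
  k0 <- min(m, n)                                    # complete-disagreement pattern (Section 2)
  r0 <- rep(floor(m / k0), k0)
  if (m %% k0 > 0) r0[1:(m %% k0)] <- r0[1] + 1
  HQstar <- H_Q(freq) / H_Q(r0)                      # eq. (8) with C_{n,m} = H(Q) of that pattern
  1 - (HPstar + HQstar) / 2                          # eqs. (9)-(10)
}
kappa_index(c(7, 7, 8, 8, 8, 9), n = 10)             # e.g., 6 raters on a 10-point scale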
Similar to r_wg and a_wg (see the discussion and references in Brown & Hauenstein (2005)), there are no sampling distributions that completely specify agreement or disagreement and can be used to test hypotheses about the statistical significance of κ. Simulation studies analogous to those in Section 4 suggest values in the range 0.6-0.7 as a reasonable cut-off for κ, while values from 0 to 0.6 should be considered unacceptable levels of agreement. The article by Cicchetti et al. (2009) presents a more detailed discussion of the rationale underlying the use of 0.7 as the expected level of chance agreement. If κ is frequently used in the same circumstances, then the expected level of chance agreement can be adjusted using the results of an initial trial period. A new level can be easily implemented in the developed R code.
The sensitivity of the index to changes in scores can be used to investigate its performance and reliability (it also characterises its sensitivity to outliers). Namely, the following two-step procedure is used to compute an average sensitivity of the index (a simulation sketch in R is given after the list):
• calculate the maximum absolute deviation of κ for each admissible list of scores of m raters when one of the scores has been changed;
• calculate the mean of the maximum absolute deviations from step 1 over all admissible scores of m raters.
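A Monte Carlo sketch of this procedure is given below. It reuses the illustrative kappa_index() from Section 3; the uniform random generation of admissible scores and the number of replications are assumptions of the sketch, not the exact simulation design used for Figure 3.

average_sensitivity <- function(n, m, n_rep = 1000) {
  one_rep <- function() {
    scores <- sample(1:n, m, replace = TRUE)         # one simulated list of m scores
    k0 <- kappa_index(scores, n)
    # step 1: maximum absolute change of kappa when a single score is altered
    deltas <- sapply(1:m, function(j) {
      max(abs(sapply(setdiff(1:n, scores[j]), function(v) {
        s <- scores; s[j] <- v
        kappa_index(s, n) - k0
      })))
    })
    max(deltas)
  }
  mean(replicate(n_rep, one_rep()))                  # step 2: average over replications
}
# e.g., average_sensitivity(n = 10, m = 6, n_rep = 200)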
Figure 3 demonstrates the average sensitivity of the index as a function of the number of raters m and the number of categories n. To obtain this figure the index κ was computed for m Monte Carlo simulated scores and recalculated for all possible changes in one score. Then the maximum absolute differences of κ were averaged over all replications. Monte Carlo simulations with 10000 replications for each group size m = 2, ..., 30 and the numbers of categories 5, 10, and 15 were performed.
Figure 3 demonstrates that the average sensitivity decreases rather quickly as the number of raters m increases. On average, for a large number of raters the distribution of selected scores is closer to a uniform distribution when there are fewer categories. Therefore, the rightmost part of Figure 3 exhibits a slightly smaller average sensitivity if the number of categories is smaller.
Figure 3: Average sensitivity for 5-, 10- and 15-level response scales (plotted against the number of raters m).
Using these plots one can compute an average level of sensitivity for a given number of raters, and vice versa. For example, based on the Monte Carlo simulation results, to get an average level of the index's sensitivity of at most 0.05 the recommended number of raters is 15 or more. However, 15 raters might not be realistic, and such a low level of sensitivity is not required in many applications. At the same time, for typical applications with 6 raters, Figure 3 indicates levels of sensitivity in the range 0.08-0.15, depending on the response scale. These levels are quite acceptable in practice.
Figure 3 also demonstrates that more raters are needed to make the index less sensitive to changes in individual scores. This is true not only for κ, but for most inter-rater agreement indices.
There are various situations when a large number of raters is appropriate, for instance, in examining raters' consensus during a training period. DeCarlo et al. (2011) gave an example of a large language assessment where 34 raters scored the first item and 20 raters scored the second item. More examples can be found in (Kim, 2009; Xi & Mollaun, 2009). However, in many performance assessments (e.g., essay exams) a large number of raters is not realistic. Thus, some adjustments of the index that are robust to outliers and missing data are required. They are introduced in Section 5.
4 COMPARISON OF INTER-RATER AGREEMENT
MEASURES
In this section we compare the double entropy index with the most frequently used indices of inter-rater agreement. Most of these indices involve determining the extent to which either each rater's score differs from the mean or the median item rating, or the score distribution differs from the uniform distribution.
The sounder theoretical basis introduced in Section 3 improves the validity of inter-rater comparisons. In this section we give a set of simple examples, backed by our intuition, about what sorts of raters' results should correspond to higher or lower inter-rater agreement.
We start by unifying definitions of these indices using the notations in this paper.
1. Sample standard deviation S. The sample standard deviation of ratings was
proposed and studied as an agreement index by Schmidt & Hunter (1989). It was
introduced to measure the dispersion from the mean rating. Its low values indicate
that the ranks tend to be close to the mean.
In our notations it can be computed by the formula

S = \sqrt{ \frac{\sum_{i \in I} r_i (i - M)^2}{m - 1} },    (11)

where M = \sum_{i \in I} i \, r_i / m is the mean rating. The standard error of the mean SEM = S / \sqrt{m} was employed to construct confidence intervals around M to assess significant levels of agreement among the raters.
2. Coefficient of variation CV. The coefficient of variation is a normalized measure of dispersion obtained from the sample standard deviation S by CV = S / M.
3. Adjusted average deviation index ADM_adj. Burke et al. (1999) studied the average deviation around the mean or the median as an index of inter-rater agreement. It is based on the same principle as S, but uses the absolute differences which usually produce more robust results.
Using the notations of this paper, AD_μ for a single subject is computed as follows:

AD_μ := \frac{\sum_{i \in I} r_i |i - μ|}{m},    (12)

where μ is the mean M or the median M_d rating.
Compared to the original AD_M index, the adjusted ADM_adj index introduced by Lohse-Bossenz et al. (2014) gives less biased estimates within smaller groups. It can be computed as follows:

ADM_{adj} := \frac{2m - 1}{2m(m - 1)} \sum_{i \in I} r_i |i - M| = \frac{2m - 1}{2(m - 1)} AD_M.    (13)

As ADM_adj is not a normalized index, a basic rule for interpreting significant levels of agreement is whether AD_μ is smaller than n/6. The further discussion of properties of this index and additional references can be found in Smith-Crowe et al. (2013).
4. r_wg and r*_wg indices. The index r_wg was introduced by James et al. (1984). The main rationale for it was to define no agreement as random responses over all categories.
Using our notations, r_wg may be represented as r_wg := 1 - S^2 / σ^2_{EU}, where σ^2_{EU} is the expected variance. The value σ^2_{EU} = (n^2 - 1)/12 is mostly used, which corresponds to the discrete uniform distribution. If S^2 > σ^2_{EU}, then S^2 is truncated to σ^2_{EU}, which results in r_wg = 0.
The modified index r*_{wg} := 1 - S^2 / σ^2_{MV}, where σ^2_{MV} := 0.5 (n^2 + 1) - 0.25 (n + 1)^2 is the maximum variance, was recommended and discussed by Lindell et al. (1999).
5. Index a_wg. This index was introduced by Brown & Hauenstein (2005) to address some limitations of the r_wg and r*_wg indices. It has a similar interpretation to r_wg. The index estimates agreement by the proportion of the observed agreement between raters to the maximum possible disagreement, assuming that the mean rating is given.
Using our notations, a_wg is defined by

a_{wg} := 1 - \frac{2(m - 1) S^2}{m \left( (i_1 + i_k) M - M^2 - i_1 i_k \right)}.    (14)

The further discussion of a_wg can be found in Wagner et al. (2010).
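For reference, the indices reviewed so far can be computed from a frequency vector in a few lines of R. The sketch below follows equations (11)-(14); the truncation conventions (here only r_wg is truncated at 0) and small-sample details may differ slightly from those behind Table 1, so it is an illustration rather than a reproduction of the authors' computations.

classical_indices <- function(freq) {
  n <- length(freq); m <- sum(freq); i <- 1:n
  sel <- which(freq > 0); i1 <- min(sel); ik <- max(sel)
  M  <- sum(i * freq) / m                            # mean rating
  S2 <- sum(freq * (i - M)^2) / (m - 1)              # squared S, eq. (11)
  ADM    <- sum(freq * abs(i - M)) / m               # eq. (12) with mu = M
  ADMadj <- (2 * m - 1) / (2 * (m - 1)) * ADM        # eq. (13)
  s2_EU <- (n^2 - 1) / 12                            # expected (discrete uniform) variance
  s2_MV <- 0.5 * (n^2 + 1) - 0.25 * (n + 1)^2        # maximum variance
  awg <- 1 - 2 * (m - 1) * S2 / (m * ((i1 + ik) * M - M^2 - i1 * ik))   # eq. (14)
  c(S = sqrt(S2), CV = sqrt(S2) / M, ADMadj = ADMadj,
    rwg = max(0, 1 - S2 / s2_EU), rwg.star = 1 - S2 / s2_MV, awg = awg)
}
# Example: 20 raters split evenly between the scores 1 and 10 on a 10-point scale
classical_indices(c(10, 0, 0, 0, 0, 0, 0, 0, 0, 10))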
6. Spectral consistency factor K. The weighted spectrum of the set I was used to assess inter-rater agreement by Zgurovsky et al. (2004). The index defined the no-agreement case as a distribution of ranks with the minimal sum of the variance and the deviation from a uniform distribution.
In our notations the index K can be expressed as

K := \left( 1 - \frac{AD_M + H(Q)}{G \sum_{i=1}^{n} |i - (n+1)/2| + \ln(n)} \right) (1 - z),    (15)

where G = m / (n \ln(m) \ln(n)) is a scaling factor,

z = I(i_1 = 1) \cdot I(i_k = n) \cdot \prod_{i=1}^{k-1} I(r_i = r_{i+1}) \cdot \prod_{j=2}^{k-1} I(i_{j+1} - i_j = i_2 - i_1),    (16)

and the indicator function I(·) is defined by

I(x) := \begin{cases} 1, & \text{if } x \text{ is true}, \\ 0, & \text{if } x \text{ is false}. \end{cases}    (17)

The multiplier 1 - z was introduced to truncate K to 0 when equal numbers of raters select equidistant scores (the case of the uniform discrete distribution).
Unfortunately, the normalization above gives negative values of K for some ratings. Therefore, we propose to use the following correction:

K′ := \left( 1 - \frac{AD_M + H(Q)}{(n - 1)/2 + \ln(n)} \right) (1 - z),    (18)

which ensures that K′ ∈ [0, 1].
7. Weighted pairing index MR. This index was introduced by Cicchetti et al. (1997) and can be computed as

MR := \frac{\sum_{i,l \in I} r_i r_l (1 - |i - l|/n) - \sum_{i \in I} r_i}{m(m - 1)}.    (19)

The main rationale for the index is to determine whether the ratings of experts are too far from the group average.
8. Pearson's statistic χ². The statistic χ² is used to test goodness of fit and establishes whether or not an observed rating distribution differs from a theoretical distribution. It can be computed by

χ^2 = \frac{n}{m} \sum_{i=1}^{n} (r_i - m/n)^2,

where m/n is the expected frequency under complete disagreement (a uniform distribution).
Figure 4: Example patterns (Cases 1-9; bar plots of frequencies r_i against scores on a 10-point scale).
For the number of raters m = 20, Figure 4 shows some example patterns for which the above indices were compared. The corresponding values of the indices are shown in Table 1. Similar results were also obtained for AD_M, AD_Md, and K, but are not shown in the paper. The cases in Figure 4 are numbered from left to right, continuing in the next row. It follows from the assumptions that the degree of agreement should increase from case 1 to case 9.
Note that the indices S, ADM_adj, a_wg, MR, and χ² are not normalized to [0, 1]. The indices S, CV, and ADM_adj are disagreement measures, i.e. their values should decrease from case 1 to case 9.

Table 1: Values of indices for example patterns

Case    S     CV    ADM_adj  r_wg  r*_wg  a_wg    K′    MR    χ²    κ
1       2.95  0.54  2.57     0.00  0.57    0.19   0.00  0.65    0   0.00
2       2.91  0.49  2.52     0.00  0.58    0.20   0.30  0.66    1   0.01
3       3.44  0.63  3.08     0.00  0.42   -0.11   0.00  0.61   30   0.20
4       4.62  0.84  4.62     0.00  0.00   -1.00   0.00  0.53   80   0.35
5       2.56  0.47  2.57     0.20  0.68    0.38   0.53  0.74   80   0.46
6       0.97  0.19  0.75     0.89  0.95    0.91   0.69  0.89   39   0.71
7       0.51  0.21  0.51     0.97  0.99    0.96   0.82  0.95   80   0.85
8       0.31  0.05  0.18     0.99  1.00    0.99   0.93  0.98  144   0.93
9       0.00  0.00  0.00     1.00  1.00    1.00   1.00  1.00  180   1.00
The test examples provide important marginal cases that help to compare the various indices. The respective contributions of H*(P) and H*(Q) to κ(P, Q) can be easily seen from (7) and (8). The normalised entropy H*(P) equals 1 for cases 1-4 with equidistant selected scores, and then decreases to 0 for the cases when the scores form a group. At the same time the normalised entropy H*(Q) gradually decreases from 1 to 0.
Table 1 demonstrates that the index κ performs well on the test examples. Its values gradually increase with increasing agreement between raters' scores. In addition, it follows from its definition that κ has various properties required of inter-rater agreement indices. Namely, it is invariant under score shifts, i.e. κ is the same for I and I′ := {i′_1, ..., i′_k} if i′_j − i_j = const for all j = 1, ..., k. The index is monotonically decreasing when the distance i_{j+1} − i_j increases. Finally, κ decreases if one of the m raters changes his score to a new one in such a way that the new set I′ consists of more elements than the original I, i.e. I′ = I ∪ {j}, where j ∈ {1, ..., n} and j ∉ I.
All other indices have various limitations. In the discussion below, we refer to Figure 4 and the corresponding results in Table 1 to list some of these limitations:
– the indices r_wg and K′ cannot distinguish between distributions with equidistant equal frequencies, e.g., cases 1, 3, and 4. Thus, these indices might give rather similar results in two different cases: when all raters chose ranks randomly and when the raters formed several groups (schools) that shared quite different opinions;
– the indices S, CV, ADM_adj, r_wg, r*_wg, a_wg, K′, and MR specify complete disagreement for case 4, i.e. these indices indicate complete disagreement when two equal groups of raters have opposite opinions (half of the raters assign the score 1 and the other half the score n), rather than under unpredictability of scores;
– the indices S, CV, ADM_adj, r*_wg, a_wg, and MR do not suggest complete disagreement for case 1 of complete unpredictability (the uniform distribution on all scores). They will demonstrate various levels of agreement even if raters assign ranks at random;
– the index K′ is "discontinuous", i.e. it varies substantially between similar patterns (cases 1 and 2). This means that K′ is not robust to outliers. It may produce misleading results when a single rank impacts the overall agreement;
– the indices r_wg, r*_wg, K′, and a_wg do not change gradually, having big gaps between their values. This may cause problems in selecting an appropriate critical region/level to test an agreement hypothesis;
– Pearson's statistic χ² does not work well for nonuniform distributions which demonstrate high degrees of agreement (case 6 versus cases 4 and 5). Also, it depends only on the frequencies of the selected scores, not on their actual values. Thus, the quite different cases 4, 5, and 7 have the same χ² value.
5 ADJUSTED DOUBLE ENTROPY INDICES
It is well known that the values of inter-rater agreement indices may change significantly if a new expert assigns a score which is quite different from the existing scores. In particular, this is a serious problem in applications with a small number of raters and a relatively large response scale.
In this section we introduce two adjusted indices that can handle outliers and missing or incorrectly recorded data.
In situations where the identification and removal of outliers are controversial issues,
reporting both the unadjusted and adjusted values is methodologically sound.
Adjusted index κ*. Firstly, we consider situations where the developed double entropy index κ might be sensitive to outliers. Namely, values of κ can change significantly when a new rater gives a score that is too far from the other scores, see, e.g., Figure 5 and the corresponding Table 2. Some other indices in Section 4 demonstrate similar sensitivity. For κ this sensitivity is caused by significant changes in the component H*(P) for small k.
Figure 5: Case of outliers (two bar plots of frequencies r_i against scores).
Table 2: Values of indices for the case of outliers

S     CV    ADM_adj  r_wg  r*_wg  a_wg  K′    MR    χ²     κ     κ*
0.51  0.21  0.51     0.97  0.99   0.96  0.82  0.95  76.26  0.85  0.85
1.75  0.62  0.87     0.63  0.85   0.56  0.75  0.88  71     0.49  0.81
To overcome this limitation, we propose the adjusted index κ*. To define it we use the normalized entropy H*(Q) as before and an adjusted H*(P) which is computed after censoring small r_i. For instance, if all r_i less than 0.2 M_r (where M_r = \sum_{i \in I} r_i / k) are truncated, then I′ = {i_l ∈ I : r_{i_l} ≥ 0.2 M_r} is used instead of I to compute H*(P).
For example, for the scores in Figure 5 the values of κ* are given in Table 2. They demonstrate that the adjusted κ* is less sensitive to marginal scores than the index κ. Note that κ and κ* give the same results for all cases in Table 1.
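The censoring step can be sketched in R as follows (an illustration under the 0.2 M_r rule above; the function name censor_scores and the example frequencies are assumptions of this sketch). The returned set I′ is then used in place of I when recomputing H*(P), while H*(Q) is left unchanged.

censor_scores <- function(freq, threshold = 0.2) {
  I  <- which(freq > 0)
  Mr <- sum(freq[I]) / length(I)            # mean frequency of the selected scores
  I[freq[I] >= threshold * Mr]              # I': selected scores surviving the censoring
}
censor_scores(c(1, 0, 0, 0, 0, 0, 7, 8, 5, 0))   # drops the isolated score 1, keeps 7, 8 and 9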
Adjusted index κ*. Equation (9) defines κ′ as the weighted sum of H*(P) and H*(Q) with equal weights of 0.5. It can be generalised to the weighted sum (1 − α)H*(P) + αH*(Q), where α ∈ [0, 1]. For example, if α = k/n we obtain

˜κ′(P, Q) := \frac{n - k}{n} H^*(P) + \frac{k}{n} H^*(Q).    (20)

The index ˜κ′ gives higher weight to H*(Q) and smaller weight to H*(P) for those k which are closer to n. Therefore, ˜κ′ is less sensitive to changes in the spread of a distribution and more sensitive to changes in the distribution's shape than κ′ when k ≈ n. Thus, its application is preferable for large values of k, which indicate a significant divergence in raters' assessments. The adjustment is useful if some of the k scores were missing or incorrectly recorded.
The adjusted index κ* is defined by κ*(P, Q) := 1 − ˜κ′(P, Q).
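In code the weighted variant is a one-line change relative to equation (9); the sketch below assumes H*(P) and H*(Q) have already been computed as in the Section 3 sketch.

kappa_weighted <- function(HPstar, HQstar, k, n) {
  1 - ((n - k) / n * HPstar + k / n * HQstar)   # 1 minus the weighted sum in eq. (20)
}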
For example, the second subplot of Figure 6 shows a case where one of the scores 8 was mistakenly recorded as 2. The numerical values of the indices are given in Table 3. It is clear that the adjusted κ* is less sensitive to incorrect records.
Figure 6: Case of incorrect records (two bar plots of frequencies r_i against scores).
Table 3: Values of indices for the case of incorrect records

S     CV    ADM_adj  r_wg  r*_wg  a_wg  K′    MR    χ²  κ     κ*
2.11  0.28  1.77     0.46  0.78   0.46  0.47  0.76  14  0.59  0.34
2.45  0.34  2.09     0.27  0.70   0.33  0.40  0.72  10  0.06  0.11
Figure 7 shows the respective contributions of H*(P) and H*(Q) to the adjusted double entropy index κ*(P, Q). Monte Carlo simulations were used to generate random responses of 6 raters over a 10-level scale and to compute κ*. For each random response the figure shows the value of κ* versus the corresponding values of H*(P) and H*(Q). Similarly to κ, the adjusted double entropy index increases when H*(P) and H*(Q) decrease.
Figure 7: κ* versus H*(P) and H*(Q).
6 ILLUSTRATIVE EXAMPLES
The examples in this section present typical applications of the indices for a situational
selection of raters from a larger population. The agreement of inter-rater judgements is
investigated. For the sake of space, tables with the detailed values of all of the other
agreement indices have not been included.
Example 1: Assessment of life satisfaction. This example illustrates possible ap-
plications of the double entropy inter-rater agreement indices to examine consensus in
survey responses. Life satisfaction is a complex index and individual scores can vary
substantially.
In this example we analyse life satisfaction data from the Household, Income and Labour Dynamics in Australia survey and Statistics Canada's General Social Survey. We use the distributions of individuals' responses in the 2010 Australian and 2011 Canadian surveys presented in Ambrey & Fleming (2014), Figure 1, and Bonikowska et al. (2014), Table 3, respectively. The 11-point response scale with 0 meaning "very dissatisfied" and 10 meaning "very satisfied" was used. Figure 8 provides a graphical representation of the percentage distribution of scores.
Figure 8: Distributions of life satisfaction scores (0-10) in the Australian and Canadian surveys.
Note that in both surveys there are a number of categories with rather low percentage frequencies, which bias the index estimates. In this case the adjusted index κ* provides a better solution and effectively corrects for the low remote frequencies. Indeed, the adjusted κ* equals 0.65 and 0.64 for the Australian and Canadian surveys respectively, whereas the unadjusted indices are 0.15 and 0.14. The analysis demonstrates moderate levels of agreement in individuals' responses. The levels of respondents' agreement are roughly the same for both countries.
Example 2: Prompt Difficulty. In this example the double entropy index is applied
to estimate degrees of agreement for constructed-response assessments.
We analyse data from Table 5.14 presented in Lim (2009). The University of Michi-
gan offers an advanced-level English proficiency test which is similar to the IELTS and
TOEFL. Examinees respond to prompts that are drawn from a larger pool of prompts.
Among other problems, Lim (2009) studied raters' perceptions of prompts. For each prompt, ten raters answered the following question: "Compared to the average prompt in the pool of writing prompts, is this prompt easier, about average, or more difficult to get a high score on?" The following five-level Likert scale was used:
(1) Clearly Easier, (2) Somewhat Easier, (3) About Average, (4) Somewhat More Difficult, (5) Clearly More Difficult.
Figure 9 provides a graphical representation of the percentage distribution of scores for the two prompts.
Figure 9: Distributions of scores for prompts 12 and 46.
As expected from the data, the value κ = 0.04 for prompt 12 shows little agreement among raters. On the contrary, we get κ = 0.71 for prompt 46, which demonstrates moderate agreement in individuals' responses. Notice that the two cases are easily distinguished by κ despite the rather low level of sensitivity for n = 5 and m = 10 indicated by Figure 3. The example shows that the same group of raters can have very different levels of agreement for distinct single targets. The values of κ match well the observed distribution patterns of the prompts.
Overall, the observed values of the indices in the examples above appear to be consistent with visual and other assessments of the degree of inter-rater agreement.
7 DIRECTIONS FOR FUTURE RESEARCH
This article develops novel agreement indices and shows some applications of a new assessment procedure for chance agreement, assuming no prior standard rating. It demonstrates that the new indices are superior and more closely capture consensus/disagreement in comparison with other inter-rater agreement indices for a single target.
Although we have focused the discussion on definitions and properties of the indices in relation to single-target data, the derived results can be extended and applied to other research questions. Future work includes extending the proposed statistics to various testing scenarios. Some of these directions are listed below.
– It is typically assumed that a lack of agreement is generated by a uniform distribu-
tion. For the illustrations in this article we followed this approach too. Replacing
the entropy by the relative entropy allows using other distributions, such as those
that would be caused by response biases (LeBreton & Senter, 2008).
– The article mainly deals with introducing and discussing properties and applica-
tions of the new indices for single target cases. It would be interesting to extend
the results and statistical analysis to multiple target cases using various aggrega-
tion methods, for example, the unweighted group mean, the response data-based
weighted mean, and the confidence-based weighted mean considered by Wagner et
al. (2010).
– It would be interesting to generalize the results to scenarios which take into account the relative competence of raters within a group.
– The indices proposed in the paper were developed for discrete quantitative data
and can also be used for ordinal data. An important problem is the investigation
of the impact of aggregating/rounding real-valued measurements to a finite set of
equidistant scores on changes in inter-rater agreement indices (see, e.g., Zgurovsky
et al. 2004). Various applied problems may also require generalizations to a finite
set of non-equidistant categories.
– It would be interesting to extend the approach based on additive functions in (9)
and (20) to other transformations. For example, multiplicative functions of H∗(P)
and H∗(Q) can be studied. It would be also important to further clarify respective
contributions of H∗(P) and H∗(Q) to novel indices (see, e.g., Figure 7).
SUPPLEMENTARY MATERIALS
Additional examples and materials on testing hypotheses about inter-rater agreement are
available in the subsection Agreement Coefficient of Research Materials on the website
https://sites.google.com/site/olenkoandriy/. It also contains the mathematical
derivations, data and R code used in this paper.
Acknowledgements
The authors are grateful for the referees’ careful reading of the paper and many detailed
comments and suggestions, which helped to improve an earlier version of this article.
References
Ambrey, C. L., and Fleming, C. M. (2014). Life Satisfaction in Australia: Evidence from
Ten Years of the HILDA Survey, Social Indicators Research, 115(2), 691–714.
Baca-García, E., Blanco, C., Sáiz-Ruiz, J., Rico, F., Diaz-Sastre, C., and Cicchetti, D. V.
(2001). Assessment of Reliability in the Clinical Evaluation of Depressive Symptoms
Among Multiple Investigators in a Multicenter Clinical Trial, Psychiatry Research,
102(2), 163–173.
Bonikowska, A., Helliwell, J. F., Feng Hou, and Schellenberg, G. (2014). An Assessment
of Life Satisfaction on Recent Statistics Canada Surveys, Social Indicators Research,
118(2), 617–643.
Brown, R. D., and Hauenstein, N. M. A. (2005). Interrater Agreement Reconsidered: An
Alternative to the rwg Indices, Organizational Research Methods, 8, 165–184.
Burke, M. J., Finkelstein, L. M., and Dusig, M. S. (1999). On Average Deviation Indices
for Estimating Interrater Agreement, Organizational Research Methods, 2, 49–68.
Burke, M., and Dunlap, W. (2002). Estimating Interrater Agreement with the Average
Deviation Index: A User’s Guide, Organizational Research Methods, 5(2), 159–172.
Cicchetti, D. V., Showalter, D., and Rosenheck R. (1997). A New Method for Assessing
Interexaminer Agreement when Multiple Ratings are Made on a Single Subject: Ap-
plications to the Assessment of Neuropsychiatric Symptomatology, Psychiatry Research, 72(1), 51–63.
Cicchetti, D., Fontana, A., and Showalter, D. (2009). Evaluating the Reliability of Mul-
tiple Assessments of PTSD Symptomatology: Multiple Examiners, One Patient, Psy-
chiatry Research, 166(2-3), 269–280.
DeCarlo, L. T., Kim, Y. K., and Johnson, M. S. (2011). A hierarchical rater model for
constructed responses, with a signal detection rater model, Journal of Educational
Measurement, 48, 333–356.
von Eye, A., and Mun, Y. E. (2005). Rater Agreement: Manifest Variable Methods,
Mahwah, NJ: Lawrence Erlbaum.
Fleiss, J. L. (1971). Measuring Nominal Scale Agreement Among Many Raters, Psycho-
logical Bulletin, 76(5), 378–382.
Gwet, K. L. (2012). Handbook of Inter-Rater Reliability, Gaithersburg: Advanced Ana-
lytics.
James, L. R., Demaree, R. G., and Wolf, G. (1984). Estimating Within-Group Interrater
Reliability With and Without Response Bias, Journal of Applied Psychology, 69, 85–98.
Kim, Y. H. (2009). An Investigation Into Native and Non-Native Teachers Judgments
of Oral English Performance: A Mixed Methods Approach, Language Testing, 26(2),
187–217.
Klemens, B. (2012). Mutual Information as a Measure of Intercoder Agreement, Journal
of Official Statistics, 28(3), 395–412.
LeBreton, J. M., James, L. R., and Lindell, M. K. (2005). Recent Issues Regarding r_WG, r*_WG, r_WG(J), and r*_WG(J), Organizational Research Methods, 8(1), 128–138.
LeBreton, J. M., and Senter, J. L. (2008). Answers to 20 Questions About Interrater Re-
liability and Interrater Agreement, Organizational Research Methods, 11(4), 815–852.
Lim, G. S. (2009). Prompt and Rater Effects in Second Language Writing Performance Assessment, Doctoral dissertation, retrieved from http://hdl.handle.net/2027.42/64665.
Lin, L., Hedayat, A. S., Sinha, B., and Min Yang. (2002). Statistical Methods in Assessing
Agreement, Journal of the American Statistical Association, 97(457), 257–270.
Lin, L., Hedayat, A. S., and Yuqing Tang. (2013). A Comparison Model for Measuring
Individual Agreement, Journal of Biopharmaceutical Statistics, 23(2), 322–345.
Lindell, M. K., and Brandt, C. J. (1997). Measuring Interrater Agreement for Ratings of
a Single Target, Applied Psychological Measurement, 21, 271–278.
Lindell, M. K., Brandt, C. J., and Whitney, D. J. (1999). A Revised Index of Agreement
for Multi-Item Ratings of a Single Target, Applied Psychological Measurement, 23,
127–135.
Lohse-Bossenz, H., Kunina-Habenicht, O., and Kunter, M. (2014). Estimating Within-
Group Agreement in Small Groups: A Proposed Adjustment for the Average Deviation
Index, European Journal for Work and Organizational Psychology, 23(3), 456–468.
Shah, M. (2011). Generalized Agreement Statistics over Fixed Group of Experts, in
Machine Learning and Knowledge Discovery in Databases, Volume 6913 of Springer
Lecture Notes in Computer Science, pp. 191–206.
Schmidt, F. L., and Hunter, J. E. (1989). Interrater Reliability Coefficients Cannot be
Computed When Only One Stimulus is Rated, Journal of Applied Psychology, 74,
368–370.
Smith-Crowe, K., Burke, M. J., Kouchaki, M., and Signal, S. M. (2013). Assessing Inter-
rater Agreement via the Average Deviation Index Given a Variety of Theoretical and
Methodological Problems, Organizational Research Methods, 16(1), 127–151.
Shoukri, M. M. (2004). Measures of Interobserver Agreement, Boca Raton, FL: Chapman
& Hall/CRC.
Zgurovsky, M. Z., Totsenko, V. G., and Tsyganok, V. V. (2004). Group Incomplete
Paired Comparisons with Account of Expert Competence, Mathematical and Computer
Modelling, 39(4-5), 349–361.
Wagner, S. M., Rau, C., and Lindemann, E. (2010). Multiple Informant Methodol-
ogy: A Critical Review and Recommendations, Sociological Methods and Research,
38, 582-618.
Xi, X. and Mollaun, P. (2009). How Do Raters From India Perform in Scoring
the TOEFL iBT Speaking Section and What Kind of Training Helps? TOEFL
iBT Research Report. No. TOEFLiBT-11. Princeton, NJ: ETS, retrieved from
http://files.eric.ed.gov/fulltext/ED507804.pdf