Double Entropy Inter-rater Agreement Indices

Andriy Olenko^1 and Vitaliy Tsyganok^2
Abstract
The proper application of the most frequently used inter-rater agreement indices can be problematic for the case of a single target, for example, a psychotherapy patient, a student's thesis, a grant proposal, and the lifestyle in a country. The majority of indices that can handle this case assess either the deviation of ranks from some central/average value or the pattern of the ranks' distribution. Contrary to other approaches, this article defines disagreement rating results using the unpredictability/complexity of scores. The article discusses alternative entropy methods for measuring inter-rater agreement or consensus in survey responses for the case of a single target. A new inter-rater agreement index is proposed. Comparisons between this index and the known inter-rater agreement measures show some limitations of the most frequently used indices. Various important methodological issues, such as disagreement assumptions, average sensitivity, and adjustments to deal with outliers and missing or incorrectly recorded data, are discussed. Examples of applications to actual data are presented.
Keywords
psychological statistics, agreement index, similarity measure, average sensitivity
^1 La Trobe University, Australia
^2 Institute for Information Recording, National Academy of Sciences of Ukraine, Ukraine
Corresponding Author:
A. Olenko, Department of Mathematics and Statistics, La Trobe University, Victoria, 3086, Australia
Email: a.olenko@latrobe.edu.au
The authors are grateful to Drs. C.L. Ambrey, J. Dawes, and C.M. Fleming for providing their data.
1 INTRODUCTION
Assessments of inter-rater agreement appear in numerous applications in laboratory and field studies, various types of data analysis problems, organizational and applied psychology, etc. The publications (Lin et al., 2002; Shoukri, 2004; von Eye & Mun, 2005; LeBreton & Senter, 2008; Wagner et al., 2010) give excellent surveys of the field. The most recent results and bibliography can be found in (LeBreton et al., 2005; Shah, 2011; Gwet, 2012; Lin et al., 2013; Lohse-Bossenz et al., 2014).
Inter-rater agreement indices have been used and studied extensively in a number of contexts. For example, the article by Smith-Crowe et al. (2013) reports that almost half of the articles published in various applied psychology journals used inter-rater agreement indices. Recently such indices have frequently been applied to justify aggregating lower-level data in multilevel and hierarchical models. Some statistical characteristics of inter-rater agreement indices were obtained and discussed for specific scenarios; consult, for example, (Fleiss, 1971; James et al., 1984; Burke et al., 2002; Brown & Hauenstein, 2005; Klemens, 2012).
Despite recent progress and results for the case of multiple targets, there has been remarkably little study of agreement estimates that can also be applied to a single target, see (Lindell & Brandt, 1997; Cicchetti et al., 1997, 2009; Lindell et al., 1999; Baca-García et al., 2001). Most of the known indices do not allow for the estimation of chance agreement on a single target. However, there has been significant interest in single-target problems in applications where a group of raters is formed only for a single evaluation.
Firstly, various on-line ratings that are based on situational random responses require agreement estimates. For example, the Faculty of 1000 (http://f1000.com) evaluates the quality of scientific articles based on the opinion of scientific experts. The Faculty of 1000 now has more than 5000 evaluators spread across 44 subject-specific faculties. A few other examples: big on-line movie databases, for instance Rotten Tomatoes, IMDb, and Netflix, collect and provide information about user rankings of movies; Amazon's users rank books and music; Google asks customers to rate businesses on a 30-point scale. The number of responses in these ratings varies from a few dozen to hundreds of thousands.
The second example situation is when users select raters of interest to perform an evaluation. For instance, patients can be examined by independent psychiatric practitioners. Knowing the degree of agreement might help in making decisions in the most difficult or unprecedented clinical cases.
Inter-rater agreement indices can also be used to examine consensus in survey responses. They can measure whether several respondents, who evaluate the same target, produce similar scores. Especially interesting cases are multilevel and hierarchical models, where each final score is the result of various aggregations of responses, see Example 1 in Section 6. In such situations, the methods generating the final scores can have too complex a structure to be investigated directly.
Another potentially interesting area is when human experts are replaced by some automatic labelling processes, for example, data mining classifiers, which are selected from a large number of potential methods. The reader can find more details about classical and new applications of agreement indices in (von Eye & Mun, 2005; Shah, 2011; Smith-Crowe et al., 2013).
Agreement indices are also widely applied in constructed-response assessments, especially when different raters can use partially different criteria. In such cases, data about scores of constructed-response items may have a hierarchical structure, see DeCarlo et al. (2011). Another important essay-grading application is measuring the increase of consensus using the scores of raters before and after an initial training.
Contrary to other approaches, this article defines disagreement rating results for the case of a single target using the unpredictability/complexity of scores. Our focus in this paper is on a new entropy-based agreement index and its properties. Most of the known agreement indices either assess the deviation of scores from some central (average) value or the pattern of the scores' distribution. To overcome this limitation we assess both the spread and the variability of distributions. In terms of the underlying rationale for this approach, we measure the unpredictability/uncertainty of scores by applying the Shannon entropy twice. First, we assess the degree of uncertainty for the frequencies of selected scores; then we compute the degree of uncertainty in their concentration. Each time the Shannon entropy is computed and then normalized to the range [0, 1]. The double entropy index will be introduced using weighted averages of these normalized entropies.
The new index is aimed at overcoming some limitations of other well-known similarity/dissimilarity measures. For example:
• many classical measures are not normalized and, therefore, their results are not directly comparable;
• indices similar to χ² use goodness-of-fit principles and depend only on the frequencies of the selected scores, but not on the scores' actual values;
• measures that refer to a uniform distribution as the most dissimilar rating often cannot distinguish between distributions with equidistant equal frequencies;
• many similarity measures do not change gradually in discrete cases and can produce quite different values for similar rating results.
The article establishes properties of the new index and compares it with various known indices. Two modifications of the index, designed to be insensitive to outliers and missing data, are introduced and discussed. The article also gives unified representations of the known indices, which might be useful for other research problems. Potential practical applications are illustrated by examples. The study shows that the new indices perform well overall compared to other well-known indices.
The rest of the paper is organized as follows. Section 2 introduces the main notations and assumptions. A novel measure of inter-rater agreement is defined and discussed in Section 3. Section 4 gives unified representations of some well-known agreement/disagreement measures. It then provides an insight into the empirical behavior of the proposed index and its advantages over the related metrics. Section 5 proposes two adjustments to deal with outliers and missing or incorrectly recorded data. Examples of applications to real data are presented in Section 6. Finally, conclusions and discussions are given in Section 7.
2 FRAMEWORK AND NOTATION
In this section we introduce underlying assumptions, some basic notations, and a conve-
nient way to represent rating results.
All information about rating results can be presented by four characteristics: the number of categories in the response scale, the number of raters, the selected score values, and the number of raters who assigned each selected score. Below we formalise this information for the subsequent computations.
Let n ≥ 2 be the number of possible scores (the number of categories into which assignments are made) and I := {i_1, ..., i_k} be the set of score values selected by raters. Without loss of generality we assume that i_1 < i_2 < ... < i_k.

Let m ≥ 2 be the total number of raters and r_i be the number of those who gave the score i. Denote R := (r_1, ..., r_n). If the score i was not given, then r_i = 0. Thus, $m = \sum_{i \in I} r_i$.
To visualize these notations one can use a bar plot as shown in Figure 1. The horizontal axis of the plot shows the scores and the vertical axis gives the frequencies r_i.

Figure 1: Case of n = 10 and R = (4, 5, 1, 1, 4, 0, 1, 2). [Bar plot: scores on the horizontal axis, frequencies r_i on the vertical axis.]
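To make the notation concrete, the following minimal R sketch converts a hypothetical vector of raw scores into the quantities n, R, I, k, and m used throughout the paper (the authors' own code is in the Supplementary Materials; this is only an illustration).

```r
# A hypothetical set of raw scores from m raters on an n-point scale.
raw_scores <- c(2, 2, 3, 3, 3, 5, 7, 7, 9)
n <- 10

R <- tabulate(raw_scores, nbins = n)  # frequency vector R = (r_1, ..., r_n)
I <- which(R > 0)                     # selected score values i_1 < ... < i_k
k <- length(I)                        # number of distinct selected scores
m <- sum(R)                           # total number of raters

R  # 0 2 3 0 1 0 2 0 1 0
```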
Agreement indices can be derived from various statistical and epistemological assumptions and yield differing results. The problem with most inter-rater agreement indices is that they either assess the deviation of the scores in I from some central value, e.g., the mean or the median, or the deviation of the score distribution R from a specific pattern, e.g., a uniform distribution. We will define a new inter-rater agreement index by taking into account both deviations.
Firstly, it is important to formally specify the most disagreement rating result. In this article we adopt the approach which defines disagreement rating results using the unpredictability of scores. Thus, under complete disagreement it is impossible to conclude that some raters agree more about a target and prefer a particular score over the other scores.
Note that some popular indices use a different approach. Namely, they define the most disagreement rating as the one where half of the raters select the lowest score and the other half select the highest one. This article considers this scenario only as partial disagreement because both marginal scores receive high frequencies but are diametrically opposite (directly opposite preferences are usually highly correlated and can be treated as partial agreement). In addition, in this situation there is a high level of agreement inside each diametrical group. For example, all raters may consider a target as out of the ordinary, but rank it very low or very high depending on their training or "like"/"dislike" preferences.
For given k, m, and n the most disagreement rating result is when
• the k elements of I are equidistantly located in the set {1, ..., n} with i_1 = 1 and i_k = n (or the distances between consecutive selected scores vary at most by 1 if k < n and equidistant placement is impossible, i.e. k − 1 is not a divisor of n − 1);
• the values {r_{i_1}, ..., r_{i_k}} are the same (vary at most by 1 if k is not a divisor of m).
It means that the same number of raters selects each of the k scores and these k scores are swept in equidistant steps from 1 to n.
Figure 2 shows some examples of the most disagreement rating results for n = 10, m = 20, and k = 10, 4, and 2 respectively.

Secondly, the situation of complete disagreement of m raters is referred to as an occurrence of the two above conditions and k = min(m, n). For example, for n = 10 and m = 20 it is shown in the first plot of Figure 2.
Figure 2: Examples of most disagreement rating results. [Three bar plots for n = 10 and m = 20: scores 1–10 on the horizontal axis, frequencies r_i on the vertical axis.]
Note that the above definition describes the probability mass function of the uniform distribution on the set {1, ..., n} if n is a divisor of m. However, it differs from a uniform distribution in those cases where m raters cannot be equally split into n groups.
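The following R sketch (not from the paper) constructs one such most disagreement rating result for given n, m, and k, following the two conditions above.

```r
# Constructs a most disagreement rating result: (nearly) equidistant selected
# scores from 1 to n and frequencies that differ by at most 1.
most_disagreement <- function(n, m, k) {
  stopifnot(k >= 1, k <= n, k <= m)
  I <- if (k == 1) 1 else round(seq(1, n, length.out = k))
  sizes <- rep(floor(m / k), k)
  sizes[seq_len(m %% k)] <- sizes[seq_len(m %% k)] + 1  # spread the remainder
  R <- integer(n)
  R[I] <- sizes
  R
}

most_disagreement(n = 10, m = 20, k = 4)  # 5 0 0 5 0 0 5 0 0 5
```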
3 DOUBLE ENTROPY INDEX
In this section we use the entropy measure to introduce a new index of inter-rater agreement/disagreement. We discuss the motivation for the index's definition and some marginal cases, and estimate the sensitivity of the index.
The new index describes the complexity of the scores. A natural candidate for this would be the entropy, a classical measure of complexity. However, it is not enough to use the entropy once, as the problem involves two levels of complexity/uncertainty. The first one pertains to the relative positions of the selected scores across the response scale. An additional level appears because some selected scores are preferred by experts more than others.
To assess agreement among multiple scores for a subject we use
1. the distribution of the selected scores i_1, ..., i_k in the set {1, ..., n};
2. the distribution of the frequencies r_i, i ∈ I, of the selected scores.
For each of the above distributions the Shannon entropy will be computed and then normalized to the range [0, 1]. The double entropy index will be defined as an average of the normalized entropies. In Section 5 an adjusted index will be introduced using weighted averages of these entropies.
Step 1. To describe the spread of the scores {i_1, ..., i_k} we use the distances
$$d_j := i_{j+1} - i_j, \quad j = 1, \ldots, k-1, \qquad (1)$$
between consecutive selected scores. We also introduce an additional distance between i_k and i_1 by
$$d_k := \begin{cases} (n - i_k) + (i_1 - 1) + \left[\dfrac{n-1}{k-1}\right], & \text{if } k > 1,\\[4pt] n, & \text{if } k = 1, \end{cases} \qquad (2)$$
where [x] is the largest integer less than or equal to x.

Note that all distances are positive integers and their sum is the same for each k regardless of the actual values of {i_1, ..., i_k}. This sum equals
$$d := \begin{cases} n - 1 + \left[\dfrac{n-1}{k-1}\right], & \text{if } k > 1,\\[4pt] n, & \text{if } k = 1. \end{cases} \qquad (3)$$

Using the set of distances {d_1, ..., d_k} one can normalize them to sum to one and introduce the corresponding fractions P := {p_1, ..., p_k} by the formula p_j := d_j/d, j = 1, ..., k.

To characterise disagreement among the k selected scores we use the entropy
$$H(P) = -\sum_{j=1}^{k} p_j \ln p_j. \qquad (4)$$
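A direct R transcription of equations (1)–(4) is sketched below (the authors' own implementation is in the Supplementary Materials).

```r
# Step 1: the entropy H(P) of the normalized distances between selected scores.
entropy_P <- function(I, n) {
  k <- length(I)
  if (k > 1) {
    d <- c(diff(I),                                        # d_1, ..., d_{k-1}, eq. (1)
           (n - I[k]) + (I[1] - 1) + (n - 1) %/% (k - 1))  # d_k, eq. (2)
  } else {
    d <- n                                                 # k = 1 case of eq. (2)
  }
  p <- d / sum(d)       # fractions p_j = d_j / d
  -sum(p * log(p))      # entropy H(P), eq. (4)
}

entropy_P(I = c(2, 3, 5, 7, 9), n = 10)
```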
Step 2. To describe the distribution of the frequencies r_i, i ∈ I, we normalise them and introduce the corresponding fractions Q := {q_i, i ∈ I} by the formula
$$q_i = \frac{r_i}{\sum_{i \in I} r_i}, \quad i \in I. \qquad (5)$$
If the score i was not given by raters, then r_i = 0 and we obtain q_i = 0, i ∉ I.

To characterise the degree of unpredictability for Q we use the entropy
$$H(Q) = -\sum_{i \in I} q_i \ln q_i. \qquad (6)$$
Note that all q_i > 0 if i ∈ I, because the corresponding r_i ≥ 1.
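The corresponding R sketch for equations (5)–(6) is equally short.

```r
# Step 2: the entropy H(Q) of the normalized frequencies of the selected scores.
entropy_Q <- function(R) {
  q <- R[R > 0] / sum(R)   # fractions q_i over the selected scores, eq. (5)
  -sum(q * log(q))         # entropy H(Q), eq. (6)
}

entropy_Q(c(0, 2, 3, 0, 1, 0, 2, 0, 1, 0))
```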
Step 3. To get a normalized index taking values between 0 and 1, we first normalize each of the entropies H(P) and H(Q) to the range [0, 1]:
$$H^*(P) := \begin{cases} \dfrac{H(P) - \min_P H(P)}{\max_P H(P) - \min_P H(P)} = \dfrac{H(P) - B_{n,k}}{A_{n,k} - B_{n,k}}, & \text{if } k < n,\\[6pt] 1, & \text{if } k = n, \end{cases} \qquad (7)$$
$$H^*(Q) := \frac{H(Q)}{\max_{k,Q} H(Q)} = \frac{H(Q)}{C_{n,m}}, \qquad (8)$$
where max_P H(P) = A_{n,k}, min_P H(P) = B_{n,k}, and max_{k,Q} H(Q) = C_{n,m}. The details of the computations of the values A_{n,k}, B_{n,k}, and C_{n,m} and implementations in R code can be found in the Supplementary Materials.

Secondly, we define the normalized index of disagreement κ′(P, Q) as the average of the normalized entropies
$$\kappa'(P, Q) := \frac{H^*(P) + H^*(Q)}{2}. \qquad (9)$$
The normalized index κ′(P, Q) takes values in the interval [0, 1]. There exist score distributions for which κ′(P, Q) reaches both endpoints 0 and 1. Namely, if all raters give the same score, which indicates perfect agreement, then p_1 = q_1 = 1 and κ′(P, Q) = 0; if each score is selected by [m/n] or [m/n] + 1 raters, then κ′(P, Q) = 1.

Finally, the corresponding index of inter-rater agreement κ(P, Q) is defined by
$$\kappa(P, Q) := 1 - \kappa'(P, Q). \qquad (10)$$
It follows from (9) and (10) that κ(P, Q) increases when H*(P) or H*(Q) decreases. Both H*(P) and H*(Q) make equal contributions to κ(P, Q).
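A sketch of Step 3 in R is given below. It is not the authors' implementation: the constants A_{n,k} and B_{n,k} are found by exhaustive search over all score sets of size k (practical only for small n), and C_{n,m} is taken, per the complete disagreement definition of Section 2, as H(Q) of the configuration in which min(m, n) scores receive frequencies differing by at most 1; closed forms and the authors' own R code are in the Supplementary Materials. It reuses entropy_P() and entropy_Q() from the sketches above.

```r
HP_star <- function(I, n) {              # normalized H(P), eq. (7)
  k <- length(I)
  if (k == 1) return(0)                  # one selected score: the perfect agreement case
  if (k == n) return(1)
  HP_all <- apply(combn(n, k), 2, entropy_P, n = n)
  (entropy_P(I, n) - min(HP_all)) / (max(HP_all) - min(HP_all))
}

HQ_star <- function(R) {                 # normalized H(Q), eq. (8)
  m <- sum(R)
  k_max <- min(m, length(R))
  r_max <- rep(m %/% k_max, k_max)
  r_max[seq_len(m %% k_max)] <- r_max[seq_len(m %% k_max)] + 1
  entropy_Q(R) / entropy_Q(r_max)        # C_{n,m} = H(Q) of complete disagreement
}

double_entropy_kappa <- function(R, n = length(R)) {
  kappa_prime <- (HP_star(which(R > 0), n) + HQ_star(R)) / 2   # eq. (9)
  1 - kappa_prime                                              # eq. (10)
}

double_entropy_kappa(c(0, 2, 3, 0, 1, 0, 2, 0, 1, 0))
```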
Similar to r_wg and a_wg, see the discussion and references in Brown & Hauenstein (2005), there are no sampling distributions that completely specify agreement or disagreement and can be used to test hypotheses about the statistical significance of κ. Simulation studies analogous to Section 4 suggest values in the range 0.6–0.7 as a reasonable cut-off for κ, while values from 0 to 0.6 should be considered as unacceptable levels of agreement. The article by Cicchetti et al. (2009) presents a more detailed discussion of the rationale underlying the use of 0.7 as the expected level of chance agreement. If κ is frequently used in the same circumstances, then the expected level of chance agreement can be adjusted using the results of an initial trial period. A new level can be easily implemented in the developed R code.
The sensitivity of the index to changes in scores can be used to investigate its performance and reliability (it also characterises its sensitivity to outliers). Namely, the following two-step procedure is used to compute the average sensitivity of the index (an R sketch of the procedure is given after the list):
• calculate the maximum absolute deviation of κ for each admissible list of scores of m raters when one of the scores has been changed;
• calculate the mean of the maximum absolute deviations from step 1 over all admissible scores of m raters.
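The following is a Monte Carlo sketch of this procedure (an approximation of the authors' simulation, not their code); it reuses double_entropy_kappa() from the Section 3 sketch.

```r
average_sensitivity <- function(n, m, reps = 200) {
  max_dev <- numeric(reps)
  for (b in seq_len(reps)) {
    scores <- sample.int(n, m, replace = TRUE)      # one random list of m scores
    kappa0 <- double_entropy_kappa(tabulate(scores, nbins = n))
    dev <- 0
    for (j in seq_len(m)) {                         # change one score at a time
      for (new in setdiff(seq_len(n), scores[j])) {
        perturbed <- scores
        perturbed[j] <- new
        kappa1 <- double_entropy_kappa(tabulate(perturbed, nbins = n))
        dev <- max(dev, abs(kappa1 - kappa0))
      }
    }
    max_dev[b] <- dev                               # maximum absolute deviation, step 1
  }
  mean(max_dev)                                     # average over replications, step 2
}

set.seed(1)
average_sensitivity(n = 5, m = 6)
```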
Figure 3 demonstrates the average sensitivity of the index as a function of the number of raters m and the number of categories n. To obtain this figure the index κ was computed for m Monte Carlo simulated scores and recalculated for all possible changes in one score. Then the maximum absolute differences of κ were averaged over all replications. Monte Carlo simulations with 10000 replications for each group size m = 2, ..., 30 and the numbers of categories 5, 10, and 15 were performed.
Figure 3 demonstrates that the average sensitivity decreases rather quickly as the number of raters m increases. On average, for a large number of raters, the distribution of selected scores is closer to a uniform distribution when there are fewer categories. Therefore, the rightmost part of Figure 3 exhibits a slightly smaller average sensitivity when the number of categories is smaller.
Figure 3: Average sensitivity for 5-, 10- and 15-level response scales. [Curves of the average sensitivity against the number of raters m for n = 5, n = 10, and n = 15.]
Using these plots one can compute the average level of sensitivity for a given number of raters, and vice versa. For example, based on the Monte Carlo simulation results, to get an average level of the index's sensitivity of at most 0.05 the recommended number of raters is 15 or more. However, 15 raters might not be realistic, and very high sensitivity is not required in many applications. At the same time, for typical applications with 6 raters, Figure 3 indicates levels of sensitivity in the range 0.08–0.15, depending on the response scale. These levels are quite acceptable in practice.
Figure 3 demonstrates that more raters are needed to obtain a more stable (less sensitive) index. This is true not only for κ, but for most inter-rater agreement indices. There are various situations where a large number of raters is appropriate, for instance, in examining raters' consensus during a training period. DeCarlo et al. (2011) gave an example of a large language assessment, where 34 raters scored the first item and 20 raters scored the second item. More examples can be found in (Kim, 2009; Xi & Mollaun, 2009). However, in many performance assessments (e.g., essay exams), a large number of raters is not realistic. Thus, some adjustments of the index that are robust to outliers and missing data are required. They are introduced in Section 5.
4 COMPARISON OF INTER-RATER AGREEMENT MEASURES

In this section we compare the double entropy index with the most frequently used indices of inter-rater agreement. Most of these indices involve determining the extent to which either each rater's score differs from the mean or the median item rating, or the score distribution differs from the uniform distribution.

The sounder theoretical basis introduced in Section 3 improves the validity of inter-rater comparisons. In this section we give a set of simple examples, backed by our intuition, about what sorts of raters' results should correspond to higher or lower inter-rater agreement.

We start by unifying the definitions of these indices using the notations of this paper; an R sketch computing most of them is given after the list of definitions below.
1. Sample standard deviation S. The sample standard deviation of ratings was proposed and studied as an agreement index by Schmidt & Hunter (1989). It was introduced to measure the dispersion from the mean rating. Its low values indicate that the ranks tend to be close to the mean.
In our notations it can be computed by the formula
$$S = \sqrt{\frac{\sum_{i \in I} r_i (i - M)^2}{m - 1}}, \qquad (11)$$
where $M = \sum_{i \in I} i \cdot r_i / m$ is the mean rating. The standard error of the mean $SEM = S/\sqrt{m}$ was employed to construct confidence intervals around M to assess significant levels of agreement among the raters.
2. Coefficient of variation CV. The coefficient of variation is a normalized measure of dispersion obtained from the sample standard deviation S by CV = S/M.
3. Adjusted average deviation index AD_M^adj. Burke et al. (1999) studied the average deviation around the mean or the median as an index of inter-rater agreement. It is based on the same principle as S, but uses absolute differences, which usually produce more robust results.
Using the notations of this paper, AD_μ for a single subject is computed as follows:
$$AD_\mu := \frac{\sum_{i \in I} r_i |i - \mu|}{m}, \qquad (12)$$
where μ is the mean M or the median M_d rating.
Compared to the original AD_M index, the adjusted AD_M^adj index introduced by Lohse-Bossenz et al. (2014) gives less biased estimates within smaller groups. It can be computed as follows:
$$AD_M^{adj} := \frac{2m-1}{2m(m-1)} \sum_{i \in I} r_i |i - M| = \frac{2m-1}{2(m-1)} AD_M. \qquad (13)$$
As AD_M^adj is not a normalized index, a basic rule for interpreting significant levels of agreement is whether AD_μ is smaller than n/6. A further discussion of the properties of this index and additional references can be found in Smith-Crowe et al. (2013).
4. r_wg and r*_wg indices. The index r_wg was introduced by James et al. (1984). The main rationale for it was to define no agreement as random responses over all categories.
Using our notations, r_wg may be represented as $r_{wg} := 1 - S^2/\sigma^2_{EU}$, where $\sigma^2_{EU}$ is the expected variance. The value $\sigma^2_{EU} = (n^2 - 1)/12$ is mostly used, which corresponds to the discrete uniform distribution. If $S^2 > \sigma^2_{EU}$ then the truncation of $S^2$ to $\sigma^2_{EU}$ is applied, which results in r_wg = 0.
The modified index $r^*_{wg} := 1 - S^2/\sigma^2_{MV}$, where $\sigma^2_{MV} := 0.5 \cdot (n^2 + 1) - 0.25 \cdot (n+1)^2$ is the maximum variance, was recommended and discussed by Lindell et al. (1999).
5. Index a_wg. This index was introduced by Brown & Hauenstein (2005) to address some limitations of the r_wg and r*_wg indices. It has a similar interpretation to r_wg. The index estimates agreement by the proportion of the observed agreement between raters to the maximum possible disagreement, assuming that the mean rating is given.
Using our notations, a_wg is defined by
$$a_{wg} := 1 - \frac{2(m-1)S^2}{m\left((i_1 + i_k)M - M^2 - i_1 \cdot i_k\right)}. \qquad (14)$$
A further discussion of a_wg can be found in Wagner et al. (2010).
6. Spectral consistency factor K. The weighted spectrum of the set I was used to assess inter-rater agreement by Zgurovsky et al. (2004). The index defined the no-agreement case as a distribution of ranks with the minimal sum of the variance and the deviation from a uniform distribution.
In our notations the index K can be expressed as
$$K := \left(1 - \frac{AD_M + H(Q)}{G \sum_{i=1}^{n} |i - (n+1)/2| + \ln(n)}\right) \cdot (1 - z), \qquad (15)$$
where $G = m/(n \ln(m) \ln(n))$ is a scaling factor,
$$z = I(i_1 = 1) \cdot I(i_k = n) \cdot \prod_{j=1}^{k-1} I(r_{i_j} = r_{i_{j+1}}) \cdot \prod_{j=2}^{k-1} I(i_{j+1} - i_j = i_2 - i_1), \qquad (16)$$
and the indicator function I(·) is defined by
$$I(x) := \begin{cases} 1, & \text{if } x \text{ is true},\\ 0, & \text{if } x \text{ is false}. \end{cases} \qquad (17)$$
The multiplier 1 − z was introduced to truncate K to 0 when equal numbers of raters select equidistant scores (the case of the uniform discrete distribution).
Unfortunately, the normalization above gives negative values of K for some ratings. Therefore, we propose to use the following correction
$$K' := \left(1 - \frac{AD_M + H(Q)}{(n-1)/2 + \ln(n)}\right) \cdot (1 - z), \qquad (18)$$
which ensures that $K' \in [0, 1]$.
7. Weighted pairing index MR. This index was introduced by Cicchetti et al. (1997) and can be computed as
$$MR := \frac{\sum_{i,l \in I} r_i r_l (1 - |i - l|/n) - \sum_{i \in I} r_i}{m(m-1)}. \qquad (19)$$
The main rationale for the index is to determine whether the ratings of experts are too far from the group average.
8. Pearson's statistic χ². The statistic χ² is used to test goodness of fit and establishes whether or not an observed rating distribution differs from a theoretical distribution. It can be computed by
$$\chi^2 = \frac{n}{m} \sum_{i=1}^{n} \left(r_i - \frac{m}{n}\right)^2,$$
where m/n is the expected rating asserting complete disagreement (a uniform distribution).
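The sketch below transcribes formulas (11)–(14), (19) and Pearson's χ² into R for side-by-side comparisons. It is not the authors' implementation: K and K′ are omitted, truncation rules beyond those stated in the text are not applied (so individual entries may differ slightly from Table 1), and degenerate cases, e.g. all raters giving one score, need separate handling.

```r
classical_indices <- function(R) {
  n <- length(R); m <- sum(R)
  I <- which(R > 0); i1 <- min(I); ik <- max(I)
  M <- sum(seq_len(n) * R) / m                         # mean rating

  S2 <- sum(R * (seq_len(n) - M)^2) / (m - 1)          # square of eq. (11)
  S  <- sqrt(S2)
  CV <- S / M
  AD_M <- sum(R * abs(seq_len(n) - M)) / m             # eq. (12) with mu = M
  AD_Madj <- (2 * m - 1) / (2 * (m - 1)) * AD_M        # eq. (13)

  r_wg <- max(0, 1 - S2 / ((n^2 - 1) / 12))            # truncated at 0, as in the text
  r_wg_star <- 1 - S2 / (0.5 * (n^2 + 1) - 0.25 * (n + 1)^2)
  a_wg <- 1 - 2 * (m - 1) * S2 /
    (m * ((i1 + ik) * M - M^2 - i1 * ik))              # eq. (14)

  MR <- (sum(outer(R[I], R[I]) * (1 - abs(outer(I, I, "-")) / n)) - m) /
    (m * (m - 1))                                      # eq. (19)
  chi2 <- (n / m) * sum((R - m / n)^2)                 # Pearson's statistic

  c(S = S, CV = CV, AD_Madj = AD_Madj, r_wg = r_wg, r_wg_star = r_wg_star,
    a_wg = a_wg, MR = MR, chi2 = chi2)
}

# Case 4 of Figure 4: half of 20 raters give the score 1, the other half 10.
classical_indices(c(10, 0, 0, 0, 0, 0, 0, 0, 0, 10))
```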
Figure 4: Example patterns. [Nine bar plots, Cases 1–9, of frequencies r_i over a 10-point score scale for m = 20 raters.]
For the number of raters m = 20, Figure 4 shows some example patterns for which the above indices were compared. The corresponding values of the indices are shown in Table 1. Similar results were also obtained for AD_M, AD_Md, and K but are not shown in the paper. The cases in Figure 4 are numbered from left to right, continuing in the next row. It follows from the assumptions that the degree of agreement should increase from case 1 to case 9.

Note that the indices S, AD_M^adj, a_wg, MR, and χ² are not normalized to [0, 1].
Table 1: Values of indices for example patterns

Case    S     CV   AD_M^adj  r_wg  r*_wg   a_wg    K'    MR    χ²     κ
  1   2.95  0.54    2.57     0.00   0.57   0.19   0.00  0.65    0   0.00
  2   2.91  0.49    2.52     0.00   0.58   0.20   0.30  0.66    1   0.01
  3   3.44  0.63    3.08     0.00   0.42  -0.11   0.00  0.61   30   0.20
  4   4.62  0.84    4.62     0.00   0.00  -1.00   0.00  0.53   80   0.35
  5   2.56  0.47    2.57     0.20   0.68   0.38   0.53  0.74   80   0.46
  6   0.97  0.19    0.75     0.89   0.95   0.91   0.69  0.89   39   0.71
  7   0.51  0.21    0.51     0.97   0.99   0.96   0.82  0.95   80   0.85
  8   0.31  0.05    0.18     0.99   1.00   0.99   0.93  0.98  144   0.93
  9   0.00  0.00    0.00     1.00   1.00   1.00   1.00  1.00  180   1.00
The indices S, CV, and AD_M^adj are disagreement measures, i.e., their values should decrease from case 1 to case 9.
The test examples provide important marginal cases that help to compare the various indices. The respective contributions of H*(P) and H*(Q) to κ(P, Q) can easily be seen from (7) and (8). The normalised entropy H*(P) equals 1 for cases 1–4 with equidistant selected scores, and then decreases to 0 for the cases where the scores form a single group. At the same time, the normalised entropy H*(Q) gradually decreases from 1 to 0.
Table 1 demonstrates that the index κ performs well on the test examples. Its values gradually increase with increasing agreement between the raters' scores. In addition, it follows from its definition that κ has various properties required of inter-rater agreement indices. Namely, it is invariant under score shifts, i.e. κ is the same for I and I′ := {i′_1, ..., i′_k} if i′_j − i_j = const for all j = 1, ..., k. The index is monotonically decreasing when the distance i_{j+1} − i_j increases. Finally, κ decreases if one of the m raters changes a score in such a way that the new set I′ consists of more elements than the original I, i.e. I′ = I ∪ {j}, where j ∈ {1, ..., n} and j ∉ I.
All other indices have various limitations. In the discussion below, we refer to Figure 4 and the corresponding results in Table 1 to list some of these limitations:
• the indices r_wg and K′ cannot distinguish between distributions with equidistant equal frequencies, e.g., cases 1, 3, and 4. Thus, these indices might give rather similar results for two different situations: when all raters chose ranks randomly and when the raters formed several groups (schools) that shared quite different opinions;
• the indices S, CV, AD_M^adj, r_wg, r*_wg, a_wg, K′, and MR specify complete disagreement for case 4, i.e. these indices indicate complete disagreement when two equal groups of raters have opposite opinions (half of the raters assign the score 1 and the other half the score n) instead of for unpredictability of scores;
• the indices S, CV, AD_M^adj, r*_wg, a_wg, and MR do not suggest complete disagreement for case 1 of complete unpredictability (the uniform distribution on all scores). They will demonstrate various levels of agreement even if raters assign ranks at random;
• the index K′ is "discontinuous", i.e. it varies substantially between similar patterns (cases 1 and 2). This means K′ is not robust to outliers. It may produce misleading results when a single rank impacts the overall agreement;
• the indices r_wg, r*_wg, K′, and a_wg do not change gradually, having big gaps between their values. This may cause problems in selecting an appropriate critical region/level to test an agreement hypothesis;
• Pearson's statistic χ² does not work well for nonuniform distributions which demonstrate high degrees of agreement (case 6 versus cases 4 and 5). Also, it depends only on the frequencies of the selected scores, but not on their actual values. Thus, the quite different cases 4, 5, and 7 have the same χ² value.
5 ADJUSTED DOUBLE ENTROPY INDICES
It is well known that the values of inter-rater agreement indices may change significantly if a new expert assigns a score which is quite different from the existing scores. In particular, this is a serious problem in applications with a small number of raters and a relatively large response scale.

In this section, we introduce two adjusted indices that can handle outliers and missing or incorrectly recorded data.
In situations where the identification and removal of outliers are controversial issues,
reporting both the unadjusted and adjusted values is methodologically sound.
Adjusted index κ*. Firstly, we consider situations when the developed double entropy index κ might be sensitive to outliers. Namely, values of κ can change significantly when a new rater gives a score that is far from the other scores, see, e.g., Figure 5 and the corresponding Table 2. Some other indices in Section 4 demonstrate similar sensitivity too. For κ this sensitivity is caused by significant changes in the component H*(P) for small k.
Figure 5: Case of outliers. [Two bar plots of frequencies r_i over scores 1–10: a rating pattern without and with an outlying score.]
Table 2: Values of indices for the case of outliers

  S     CV   AD_M^adj  r_wg  r*_wg  a_wg    K'    MR     χ²      κ    κ*
 0.51  0.21    0.51    0.97   0.99  0.96   0.82  0.95  76.26   0.85  0.85
 1.75  0.62    0.87    0.63   0.85  0.56   0.75  0.88  71      0.49  0.81
To overcome this limitation, we propose the adjusted index κ*. To define it we use the normalized entropy H*(Q) as before and an adjusted H*(P) which is computed after censoring small r_i. For instance, if all r_i less than 0.2 M_r (where $M_r = \sum_{j=1}^{k} r_{i_j}/k$ is the mean frequency of the selected scores) are truncated, then $I' = \{i_l \in I : r_{i_l} \ge 0.2 M_r\}$ is used instead of I to compute H*(P).

For example, for the scores in Figure 5 the values of κ* are given in Table 2. It demonstrates that the adjusted κ* is less sensitive to marginal scores than the index κ. Note that κ and κ* give the same results for all cases in Table 1.
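A sketch of κ* in R follows; it reuses HP_star() and HQ_star() from the Section 3 sketch and is not the authors' code.

```r
# Adjusted index kappa*: the H*(P) component is computed from the censored
# score set I' (selected scores with frequencies below 0.2 M_r are dropped),
# while H*(Q) is left unchanged.
kappa_adj_star <- function(R, n = length(R), threshold = 0.2) {
  I <- which(R > 0)
  M_r <- mean(R[I])                           # mean frequency of the selected scores
  I_censored <- I[R[I] >= threshold * M_r]    # censored score set I'
  1 - (HP_star(I_censored, n) + HQ_star(R)) / 2
}

# Hypothetical pattern with one remote low-frequency score:
kappa_adj_star(c(1, 0, 0, 0, 0, 0, 0, 8, 10, 6))
```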
Adjusted index κ**. Equation (9) defines κ′ as the weighted sum of H*(P) and H*(Q) with equal weights of 0.5. It can be generalised to the weighted sum (1 − α)H*(P) + αH*(Q), where α ∈ [0, 1]. For example, if α = k/n we obtain
$$\tilde{\kappa}'(P, Q) := \frac{n-k}{n} H^*(P) + \frac{k}{n} H^*(Q). \qquad (20)$$
The index defined by (20) gives higher weight to H*(Q) and smaller weight to H*(P) for those k which are closer to n. Therefore, it is less sensitive to changes in the spread of a distribution and more sensitive to changes in the distribution's shape than κ′ when k is close to n. Thus, its application is preferable for large values of k, which indicate a significant divergence in the raters' assessments. The adjustment is useful if some of the k scores were missing or incorrectly recorded.

The adjusted index κ** is defined by $\kappa^{**}(P, Q) := 1 - \tilde{\kappa}'(P, Q)$.
For example, the second subplot of Figure 6 shows a case where one of the scores of 8 was mistakenly recorded as 2. The numerical values of the indices are given in Table 3. It is clear that the adjusted κ** is less sensitive to incorrect records.
Figure 6: Case of incorrect records. [Two bar plots of frequencies r_i over scores 1–10: the original ratings and the ratings with one score of 8 recorded as 2.]
Table 3: Values of indices for the case of incorrect records

  S     CV   AD_M^adj  r_wg  r*_wg  a_wg    K'    MR    χ²     κ    κ**
 2.11  0.28    1.77    0.46   0.78  0.46   0.47  0.76   14   0.59  0.34
 2.45  0.34    2.09    0.27   0.70  0.33   0.40  0.72   10   0.06  0.11
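A sketch of κ** based on equation (20) is given below; it again reuses HP_star() and HQ_star() from the Section 3 sketch and is not the authors' code.

```r
# Adjusted index kappa**: H*(P) and H*(Q) are weighted by (n - k)/n and k/n
# instead of 1/2.
kappa_adj_2star <- function(R, n = length(R)) {
  I <- which(R > 0); k <- length(I)
  kappa_tilde_prime <- (n - k) / n * HP_star(I, n) + k / n * HQ_star(R)  # eq. (20)
  1 - kappa_tilde_prime
}

# Hypothetical pattern with many distinct selected scores (large k):
kappa_adj_2star(c(2, 1, 3, 4, 3, 2, 0, 1, 2, 2))
```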
Figure 7 shows the respective contributions of H*(P) and H*(Q) to the adjusted double entropy index κ*(P, Q). Monte Carlo simulations were used to generate random responses of 6 raters over a 10-level scale and to compute κ*. For each random response the figure shows the value of κ* versus the corresponding values of H*(P) and H*(Q). Similarly to κ, the adjusted double entropy index increases when H*(P) and H*(Q) decrease.
Figure 7: κ* versus H*(P) and H*(Q). [Scatter plot of simulated values of κ*(P, Q) against H*(P) and H*(Q); all axes range over [0, 1].]
6 ILLUSTRATIVE EXAMPLES
The examples in this section present typical applications of the indices for a situational
selection of raters from a larger population. The agreement of inter-rater judgements is
investigated. For the sake of space, tables with the detailed values of all of the other
agreement indices have not been included.
Example 1: Assessment of life satisfaction. This example illustrates possible applications of the double entropy inter-rater agreement indices to examining consensus in survey responses. Life satisfaction is a complex index and individual scores can vary substantially.

In this example we analyse life satisfaction data from the Household, Income and Labour Dynamics in Australia survey and Statistics Canada's General Social Survey. We use the distributions of individuals' responses in the 2010 Australian and 2011 Canadian surveys presented in Ambrey & Fleming (2014), Figure 1, and Bonikowska et al. (2014), Table 3, respectively. The 11-point response scale with 0 meaning "very dissatisfied" and 10 meaning "very satisfied" was used. Figure 8 provides a graphical representation of the percentage distribution of scores.

Note that in both surveys there are a number of categories with rather low percentage frequencies, which bias index estimates.
Figure 8: Distributions of life satisfaction scores. [Bar plot of the percentage frequencies of scores 0–10 for the Canadian and Australian surveys.]
In this case, the adjusted index κ* provides a better solution and effectively corrects for the low remote frequencies. Indeed, the adjusted κ* equals 0.65 and 0.64 for the Australian and Canadian surveys respectively, while the unadjusted indices are 0.15 and 0.14. The analysis demonstrates moderate levels of agreement in individuals' responses. The levels of respondents' agreement are roughly the same for both countries.
Example 2: Prompt Difficulty. In this example the double entropy index is applied to estimate degrees of agreement for constructed-response assessments.

We analyse data from Table 5.14 presented in Lim (2009). The University of Michigan offers an advanced-level English proficiency test which is similar to the IELTS and TOEFL. Examinees respond to prompts that are drawn from a larger pool of prompts. Among other problems, Lim (2009) studied raters' perceptions of prompts. For each prompt, ten raters answered the following question: "Compared to the average prompt in the pool of writing prompts, is this prompt easier, about average, or more difficult to get a high score on?" The following five-level Likert scale was used:
(1) Clearly Easier, (2) Somewhat Easier, (3) About Average,
(4) Somewhat More Difficult, (5) Clearly More Difficult.

Figure 9 provides a graphical representation of the percentage distribution of scores for two prompts.

As expected from the data, the value κ = 0.04 for prompt 12 shows little agreement among raters. In contrast, we get κ = 0.71 for prompt 46, which demonstrates moderate agreement in individuals' responses. Notice that the two cases are easily distinguished by κ despite the rather low level of sensitivity for n = 5 and m = 10 indicated by Figure 3.
Figure 9: Distributions of scores for two prompts. [Bar plot of the percentage frequencies of scores 1–5 for prompts 12 and 46.]
The example shows that the same group of raters can have very different levels of agreement for distinct single targets. The values of κ match the observed distribution patterns of the prompts well.
Overall, the observed values of the indices in all three examples appear to be consistent
with visual and other assessments of the degree of inter-rater agreement.
7 DIRECTIONS FOR FUTURE RESEARCH
This article develops novel agreement indices and shows some applications of a new procedure for assessing chance agreement, assuming no prior standard rating. It demonstrates that the new indices are superior in that they more closely capture consensus/disagreement than other inter-rater agreement indices for a single target.
Although we focused the discussion on the definitions and properties of the indices in relation to single-target data, the derived results can be extended and applied to other research questions. Future work includes extending the proposed statistics to various testing scenarios. Some of these are listed below.
• It is typically assumed that a lack of agreement is generated by a uniform distribution. For the illustrations in this article we followed this approach too. Replacing the entropy by the relative entropy allows using other distributions, such as those that would be caused by response biases (LeBreton & Senter, 2008).
• The article mainly deals with introducing and discussing the properties and applications of the new indices for single-target cases. It would be interesting to extend the results and statistical analysis to multiple-target cases using various aggregation methods, for example, the unweighted group mean, the response data-based weighted mean, and the confidence-based weighted mean considered by Wagner et al. (2010).
• It would be interesting to generalize the results to scenarios which take into account the relative competence of raters in a group.
• The indices proposed in the paper were developed for discrete quantitative data and can also be used for ordinal data. An important problem is the investigation of the impact of aggregating/rounding real-valued measurements to a finite set of equidistant scores on changes in inter-rater agreement indices (see, e.g., Zgurovsky et al. 2004). Various applied problems may also require generalizations to a finite set of non-equidistant categories.
• It would be interesting to extend the approach based on the additive functions in (9) and (20) to other transformations. For example, multiplicative functions of H*(P) and H*(Q) can be studied. It would also be important to further clarify the respective contributions of H*(P) and H*(Q) to the novel indices (see, e.g., Figure 7).
SUPPLEMENTARY MATERIALS
Additional examples and materials on testing hypotheses about inter-rater agreement are
available in the subsection Agreement Coefficient of Research Materials on the website
https://sites.google.com/site/olenkoandriy/. It also contains the mathematical
derivations, data and R code used in this paper.
Acknowledgements
The authors are grateful for the referees’ careful reading of the paper and many detailed
comments and suggestions, which helped to improve an earlier version of this article.
References
Ambrey, C. L., and Fleming, C. M. (2014). Life Satisfaction in Australia: Evidence from
Ten Years of the HILDA Survey, Social Indicators Research, 115(2), 691–714.
Baca-García, E., Blanco, C., Sáiz-Ruiz, J., Rico, F., Diaz-Sastre, C., and Cicchetti, D. V. (2001). Assessment of Reliability in the Clinical Evaluation of Depressive Symptoms Among Multiple Investigators in a Multicenter Clinical Trial, Psychiatry Research, 102(2), 163–173.
Bonikowska, A., Helliwell, J. F., Hou, F., and Schellenberg, G. (2014). An Assessment of Life Satisfaction on Recent Statistics Canada Surveys, Social Indicators Research, 118(2), 617–643.
Brown, R. D., and Hauenstein, N. M. A. (2005). Interrater Agreement Reconsidered: An
Alternative to the rwg Indices, Organizational Research Methods, 8, 165–184.
Burke, M. J., Finkelstein, L. M., and Dusig, M. S. (1999). On Average Deviation Indices
for Estimating Interrater Agreement, Organizational Research Methods, 2, 49–68.
Burke, M., and Dunlap, W. (2002). Estimating Interrater Agreement with the Average
Deviation Index: A User’s Guide, Organizational Research Methods, 5(2), 159–172.
Cicchetti, D. V., Showalter, D., and Rosenheck, R. (1997). A New Method for Assessing Interexaminer Agreement when Multiple Ratings are Made on a Single Subject: Applications to the Assessment of Neuropsychiatric Symptomatology, Psychiatry Research, 72(1), 51–63.
Cicchetti, D., Fontana, A., and Showalter, D. (2009). Evaluating the Reliability of Multiple Assessments of PTSD Symptomatology: Multiple Examiners, One Patient, Psychiatry Research, 166(2-3), 269–280.
DeCarlo, L. T., Kim, Y. K., and Johnson, M. S. (2011). A hierarchical rater model for
constructed responses, with a signal detection rater model, Journal of Educational
Measurement, 48, 333–356.
von Eye, A., and Mun, Y. E. (2005). Rater Agreement: Manifest Variable Methods,
Mahwah, NJ: Lawrence Erlbaum.
Fleiss, J. L. (1971). Measuring Nominal Scale Agreement Among Many Raters, Psychological Bulletin, 76(5), 378–382.
Gwet, K. L. (2012). Handbook of Inter-Rater Reliability, Gaithersburg: Advanced Analytics.
James, L. R., Demaree, R. G., and Wolf, G. (1984). Estimating Within-Group Interrater
Reliability With and Without Response Bias, Journal of Applied Psychology, 69, 85–98.
Kim, Y. H. (2009). An Investigation Into Native and Non-Native Teachers' Judgments of Oral English Performance: A Mixed Methods Approach, Language Testing, 26(2), 187–217.
Klemens, B. (2012). Mutual Information as a Measure of Intercoder Agreement, Journal
of Official Statistics, 28(3), 395–412.
LeBreton, J. M., James, L. R., and Lindell, M. K. (2005). Recent Issues Regarding r_WG, r*_WG, r_WG(J), and r*_WG(J), Organizational Research Methods, 8(1), 128–138.
LeBreton, J. M., and Senter, J. L. (2008). Answers to 20 Questions About Interrater Reliability and Interrater Agreement, Organizational Research Methods, 11(4), 815–852.
Lim, G. S. (2009). Prompt and Rater Effects in Second Language Writing Performance Assessment, Doctoral dissertation, retrieved from http://hdl.handle.net/2027.42/64665.
Lin, L., Hedayat, A. S., Sinha, B., and Yang, M. (2002). Statistical Methods in Assessing Agreement, Journal of the American Statistical Association, 97(457), 257–270.
Lin, L., Hedayat, A. S., and Tang, Y. (2013). A Comparison Model for Measuring Individual Agreement, Journal of Biopharmaceutical Statistics, 23(2), 322–345.
Lindell, M. K., and Brandt, C. J. (1997). Measuring Interrater Agreement for Ratings of
a Single Target, Applied Psychological Measurement, 21, 271–278.
Lindell, M. K., Brandt, C. J., and Whitney, D. J. (1999). A Revised Index of Agreement
for Multi-Item Ratings of a Single Target, Applied Psychological Measurement, 23,
127–135.
Lohse-Bossenz, H., Kunina-Habenicht, O., and Kunter, M. (2014). Estimating Within-Group Agreement in Small Groups: A Proposed Adjustment for the Average Deviation Index, European Journal of Work and Organizational Psychology, 23(3), 456–468.
Shah, M. (2011). Generalized Agreement Statistics over Fixed Group of Experts, in
Machine Learning and Knowledge Discovery in Databases, Volume 6913 of Springer
Lecture Notes in Computer Science, pp. 191–206.
Schmidt, F. L., and Hunter, J. E. (1989). Interrater Reliability Coefficients Cannot be
Computed When Only One Stimulus is Rated, Journal of Applied Psychology, 74,
368–370.
Smith-Crowe, K., Burke, M. J., Kouchaki, M., and Signal, S. M. (2013). Assessing Interrater Agreement via the Average Deviation Index Given a Variety of Theoretical and Methodological Problems, Organizational Research Methods, 16(1), 127–151.
Shoukri, M. M. (2004). Measures of Interobserver Agreement, Boca Raton, FL: Chapman
& Hall/CRC.
Zgurovsky, M. Z., Totsenko, V. G., and Tsyganok, V. V. (2004). Group Incomplete
Paired Comparisons with Account of Expert Competence, Mathematical and Computer
Modelling, 39(4-5), 349–361.
Wagner, S. M., Rau, C., and Lindemann, E. (2010). Multiple Informant Methodology: A Critical Review and Recommendations, Sociological Methods and Research, 38, 582–618.
Xi, X., and Mollaun, P. (2009). How Do Raters From India Perform in Scoring the TOEFL iBT Speaking Section and What Kind of Training Helps? TOEFL iBT Research Report No. TOEFLiBT-11. Princeton, NJ: ETS, retrieved from http://files.eric.ed.gov/fulltext/ED507804.pdf
... As an agreement measure, we suggest using the spectral double entropy index (Olenko & Tsyganok, 2016). It is, in a way, the extension of spectral consistency coefficient suggested by Totsenko (1996). ...
... In Tsyganok et al. (2015) it was recommended to use spectral consistency coefficient , suggested in Totsenko, 1996 as agreement measure. Olenko & Tsyganok (2016) demonstrated this coefficient's drawbacks. For example, originally was intended to lie within the range between 0 and 1, while in fact it could assume negative values (Olenko & Tsyganok, 2016). ...
... Olenko & Tsyganok (2016) demonstrated this coefficient's drawbacks. For example, originally was intended to lie within the range between 0 and 1, while in fact it could assume negative values (Olenko & Tsyganok, 2016). Additionally, introduction of was an attempt to unite two independent agreement measures: entropy and dispersion. ...
Article
Full-text available
Abstract Paper aims the paper aims to demonstrate the advantages of several modifications of combinatorial method of expert judgment aggregation in AHP. Modifications are based on 1) weighting of spanning trees; 2) sorting of spanning trees by graph diameter during aggregation. Originality Both the method and its modifications are developed and improved by the authors. We propose to 1) weight spanning trees, based on quality of respective expert estimates, and 2) sort them by diameter in order to reduce the impact of expert errors and the method’s computational complexity. Research method we focus on theoretical and empirical studies of several modifications of combinatorial method of aggregation of expert judgments within AHP. Main findings modified combinatorial method has several conceptual advantages over ordinary method. It is also less sensitive to perturbations of initial data. Additionally, selection of spanning trees with smaller diameters allows us to reduce computational complexity of the method and minimize the impact of expert errors. Implications for theory and practice Combinatorial method is a universal instrument of expert judgment aggregation, applicable to additive/multiplicative, complete/incomplete, individual/group pair-wise comparisons, provided in different scales. It is used in the original strategic planning technology, which has recently found several important applications.
... Застосування зазначеного підходу для визначення узгодженості з подвійним використанням формули ентропії запропоновано, як і деякі інші підходи, у праці [3], у якій подано також коригування індексу, наведеного у праці [1], з метою обмеження його області значень діапазоном [0,1] і неможливості потрапляння в область від'ємних значень. ...
... 9. Індекс подвійної ентропії (double entropy іndex) [3]. Цей індекс враховує міру інформації оцінок і їх частоти. ...
... Це може бути підтвердженням більшої адаптованості функції до умов реальних статисти-Рис. 3. Приклад зображення закону розподілу оцінок для функції 2 x f  чних досліджень, а тому свідчить про кращу придатність функції для використання в межах поставлених задач. ...
Article
Full-text available
Розглянуто проблему визначення рівня узгодженості оцінок під час групової експертизи. Завданням дослідження є розроблення методу визначення узгодженості експертних оцінок, позбавленого ряду ключових недоліків, притаманних наявним методам. Запропоновано індекс узгодженості визначати з використанням спектрального підходу, відповідно до якого оцінки експертів відображуються у вигляді спектра на обмеженій, безперервній або дискретній шкалі. Індекс обчислено як нормоване значення суми відстаней між оцінками експертів для всіх можливих пар оцінок. Індекс узгодженості досліджено також для функції квадрата попарних різниць у парах оцінок. Проведений аналіз засвідчив, що функція відстані більш придатна для ґрунтовного практичного визначення узгодженості експертних оцінок. Проведено імітаційне моделювання та запропоновано визначення порогового значення узгодженості, вище якого стає допустимою агрегація експертних оцінок. Для підвищення рівня узгодженості запропоновано процедуру зворотного зв’язку з експертом за умови неспричинення будь-якого тиску на нього.
... Almost every known method of determining consistency [1,2] is based on a specific methodology of determining the threshold. Most of them, apart from original approaches such as [3], are based on simulation modeling of expert estimates, for example, the consistency determination by the Analytical hierarchy process method by Thomas Saaty [4]. ...
... Also, it should be noted that using this approach, it is appropriate to determine the dependence of the consistency (inconsistency) indices on the requirements for the reliability of the obtained expert estimate results for various expert assessment methods for the corresponding consistency indices are used. For example, the consistency ratio for the eigenvector method used in the classical method of Analytical Hierarchy Process [12], Double Entropy Agreement Indices [2], and others. ...
Preprint
Full-text available
To obtain reliable results of expertise, which usually use individual and group expert pairwise comparisons, it is important to summarize (aggregate) expert estimates provided that they are sufficiently consistent. There are several ways to determine the threshold level of consistency sufficient for aggregation of estimates. They can be used for different consistency indices, but none of them relates the threshold value to the requirements for the reliability of the expertise's results. Therefore, a new approach to determining this consistency threshold is required. The proposed approach is based on simulation modeling of expert pairwise comparisons and a targeted search for the most inconsistent among the modeled pairwise comparison matrices. Thus, the search for the least consistent matrix is carried out for a given perturbation of the perfectly consistent matrix. This allows for determining the consistency threshold corresponding to a given permissible relative deviation of the resulting weight of an alternative from its hypothetical reference value.
... , w n ) ⊤ ∈ R n such that the pairwise ratios of the weights, w i /w j , are as close as possible to the matrix elements a ij . Several methods have been suggested for this weighting problem, e.g., the eigenvector method [32], the least squares method [5,9,21,23], the logarithmic least squares method [11,12,13], the spanning tree approach [7,26,30,33,34,36,37] besides many other proposals discussed and compared by Golany and Kress [22], Choo and Wedley [8], Lin [25], Fedrizzi and Brunelli [18]. Bajwa, Choo and Wedley [3] not only compare seven weighting methods with respect to four criteria, but provide a detailed list of nine earlier comparative studies, too. ...
Preprint
Pairwise comparison matrices are frequently applied in multi-criteria decision making. A weight vector is called efficient if no other weight vector is at least as good in approximating the elements of the pairwise comparison matrix, and strictly better in at least one position. A weight vector is weakly efficient if the pairwise ratios cannot be improved in all non-diagonal positions. We show that the principal eigenvector is always weakly efficient, but numerical examples show that it can be inefficient. The linear programs proposed test whether a given weight vector is (weakly) efficient, and in case of (strong) inefficiency, an efficient (strongly) dominating weight vector is calculated. The proposed algorithms are implemented in Pairwise Comparison Matrix Calculator, available at pcmc.online.
... Certain applications apply the spanning trees enumeration, but not necessarily together with the aggregation by the geometric mean. The approach of spanning trees enumeration is used in determining the consistency to build the distribution of expert estimates based on the matrix [41]. Such problems offer further research possibilities. ...
Preprint
Complete and incomplete additive/multiplicative pairwise comparison matrices are applied in preference modelling, multi-attribute decision making and ranking. The equivalence of two well known methods is proved in this paper. The arithmetic (geometric) mean of weight vectors, calculated from all spanning trees, is proved to be optimal to the (logarithmic) least squares problem, not only for complete, as it was recently shown in Lundy, M., Siraj, S., Greco, S. (2017): The mathematical equivalence of the "spanning tree" and row geometric mean preference vectors and its implications for preference analysis, European Journal of Operational Research 257(1) 197-208, but for incomplete matrices as well. Unlike the complete case, where an explicit formula, namely the row arithmetic/geometric mean of matrix elements, exists for the (logarithmic) least squares problem, the incomplete case requires a completely different and new proof. Finally, Kirchhoff's laws for the calculation of potentials in electric circuits is connected to our results.
... Studies with a large number of explanatory parameters can also result in a large set of candidate models. In such studies it would be interesting to quantitatively investigate agreement levels between top selected models by using various agreement coefficients, see (Olenko & Tsyganok , 2016). ...
Preprint
Determining the relationship between the electrical resistivity of soil and its geotechnical properties is an important engineering problem. This study aims to develop methodology for finding the best model that can be used to predict the electrical resistivity of soil, based on knowing its geotechnical properties. The research develops several linear models, three non-linear models, and three artificial neural network models (ANN). These models are applied to the experimental data set comprises 864 observations and five variables. The results show that there are significant exponential negative relationships between the electrical resistivity of soil and its geotechnical properties. The most accurate prediction values are obtained using the ANN model. The cross-validation analysis confirms the high precision of the selected predictive model. This research is the first rigorous systematic analysis and comparison of difference methodologies in ground electrical resistivity studies. It provides practical guidelines and examples of design, development and testing non-linear relationships in engineering intelligent systems and applications.
... Studies with a large number of explanatory parameters can also result in a large set of candidate models. In such studies it would be interesting to quantitatively investigate agreement levels between top selected models by using various agreement coefficients, see (Olenko & Tsyganok , 2016). ...
Article
Full-text available
Determining the relationship between the electrical resistivity of soil and its geotechnical properties is an important engineering problem. This study aims to develop methodology for finding the best model that can be used to predict the electrical resistivity of soil, based on knowing its geotechnical properties. The research develops several linear models, three non-linear models, and three artificial neural network models (ANN). These models are applied to the experimental data set comprises 864 observations and five variables. The results show that there are significant exponential negative relationships between the electrical resistivity of soil and its geotechnical properties. The most accurate prediction values are obtained using the ANN model. The cross-validation analysis confirms the high precision of the selected predictive model. This research is the first rigorous systematic analysis and comparison of difference methodologies in ground electrical resistivity studies. It provides practical guidelines and examples of design, development and testing non-linear relationships in engineering intelligent systems and applications.
Article
Full-text available
The article is devoted to the problem of democratic development of Ukraine. The reasons for the need for a radical transformation of the electoral process in Ukraine are considered from a theoretical standpoint. The main goal and sub-goals of the research are formulated. The classical mathematical models of electoral technologies, selected for comparison with modern approaches, are described. The basic principles of selecting methods for measuring the results of approval voting are analyzed. The issues of constructing a verbal-numerical scale, assessing the consistency of voter decisions and applying statistical criteria to obtain a consolidated result are considered. The models selected for calculating the final election rating are analyzed. Mathematical algorithms for multicriteria selection based on the qualimetric approach and pairwise comparison on four variants of scales are given. Protocols for determining consensus alternatives using the TOPSIS method, the Kemeny–Young median, the Schulze heuristic procedure, and the fuzzy set approach are described. The results of testing the selected approval voting protocols on an election model with 4 candidates and 7 ballot questions are given. The algorithm for generating, by the Monte Carlo method, arrays of initial data of 10,000 records with uniform and normal distributions under three variants of the bias parameter is presented, together with the results. To assess the sensitivity of the studied protocols to violations of the transitivity of individual preference profiles, the primary data arrays were transformed by replacing the non-transitive profiles with an equivalent number of transitive ones without giving preference to any alternative. Based on the assessment of the correlation of the final ratings and their sensitivity to the type of distribution and to violations of the transitivity of individual judgments, it was concluded that it is advisable to use the Kemeny median to determine the voting results. The proposed method for transforming primary data also makes it possible to use the Condorcet, Dodgson, Saaty and Schulze protocols. The results of this study indicate that there is a fundamental possibility of transition to a new digital paradigm of the electoral process based on the approval principle of voting.
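The Kemeny–Young median mentioned above is the ranking minimizing the total Kendall tau distance to all individual rankings. The following is a minimal brute-force Python sketch with hypothetical ballots; it is only feasible for a handful of candidates and does not reproduce the protocols or data-generation procedures studied in the article.

```python
from itertools import permutations

# Hypothetical ballots: each is a ranking of candidates, best first.
ballots = [("A", "B", "C", "D"),
           ("B", "A", "D", "C"),
           ("A", "C", "B", "D"),
           ("D", "A", "B", "C")]

def kendall_tau_distance(r1, r2):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    cands = list(r1)
    return sum(1
               for i in range(len(cands))
               for j in range(i + 1, len(cands))
               if (pos1[cands[i]] - pos1[cands[j]]) * (pos2[cands[i]] - pos2[cands[j]]) < 0)

def kemeny_median(ballots):
    """Ranking minimizing the total Kendall tau distance to all ballots."""
    candidates = ballots[0]
    return min(permutations(candidates),
               key=lambda r: sum(kendall_tau_distance(r, b) for b in ballots))

print(kemeny_median(ballots))
```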
Article
Full-text available
In a situation where two raters are classifying a series of observations, it is useful to have an index of agreement among the raters that takes into account both the simple rate of agreement and the complexity of the rating task. Information theory provides a measure of the quantity of information in a list of classifications which can be used to produce an appropriate index of agreement. A normalized weighted mutual information index improves upon the traditional intercoder agreement index in a number of ways: there is no need to develop a model of error generation before use, comparison across experiments is easier, and ratings are based on the distribution of agreement across categories, not just an overall agreement level.
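The cited index is a normalized weighted mutual information measure. As a simplified, unweighted sketch of the underlying idea, the Python snippet below computes the mutual information between two raters' classifications and normalizes it by the average of the two marginal entropies (one of several possible normalizations); the rating vectors are hypothetical.

```python
import numpy as np

def normalized_mutual_information(ratings1, ratings2):
    """Mutual information between two raters' category assignments,
    normalized by the mean of the two marginal entropies."""
    cats = sorted(set(ratings1) | set(ratings2))
    idx = {c: i for i, c in enumerate(cats)}
    joint = np.zeros((len(cats), len(cats)))
    for a, b in zip(ratings1, ratings2):
        joint[idx[a], idx[b]] += 1
    joint /= joint.sum()
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    mi = sum(joint[i, j] * np.log2(joint[i, j] / (p1[i] * p2[j]))
             for i in range(len(cats)) for j in range(len(cats)) if joint[i, j] > 0)
    return mi / ((entropy(p1) + entropy(p2)) / 2)

rater1 = ["x", "x", "y", "z", "y", "x"]
rater2 = ["x", "x", "y", "z", "x", "x"]
print(round(normalized_mutual_information(rater1, rater2), 3))
```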
Article
Full-text available
In many research contexts where a multilevel data structure is present, researchers have to aggregate individual responses to the group level (e.g., team climate) to test specific hypotheses. An established way of justifying this aggregation is to show sufficient within-group agreement. An increasingly common measure of within-group agreement is the Average Deviation Index (AD(M)). The study elaborates on the properties of the AD(M) within small groups. This is a crucial topic, as many multilevel studies incorporate teams of fewer than ten members. Comparing practical and critical values for interpreting AD(M) magnitudes shows that in small groups the AD(M) is more likely to overestimate agreement, because smaller AD(M) values can arise by chance. We assume the calculation procedure of the AD(M) to be one reason for the overestimation and therefore propose an adjusted calculation. After exploring the properties of the adjustment with simulated data, the original and adjusted calculations are applied to an empirical example with multiple ratings from 48 experts. Compared to the original AD(M), using the adjusted AD(M) leads to a less biased agreement estimate within smaller groups. However, researchers are encouraged to use practical as well as critical values to interpret the level of agreement when using the AD(M).
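For reference, a minimal Python sketch of the basic (unadjusted) AD index around the item mean follows. The team ratings are hypothetical, and the adjusted calculation proposed in the article is not reproduced here.

```python
import numpy as np

def average_deviation_mean(ratings):
    """Average Deviation around the item mean, AD_M(1) = mean |x_i - x_bar|."""
    x = np.asarray(ratings, dtype=float)
    return np.mean(np.abs(x - x.mean()))

# Hypothetical ratings of one item by a small team on a 1-5 scale.
team = [4, 4, 5, 3, 4]
print(round(average_deviation_mean(team), 2))  # 0.4
```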
Article
Full-text available
A recent article by Lindell and Brandt raised two concerns regarding the use of James, Demaree, and Wolf’s interrater agreement indices rWG and rWG(J). First, they noted that the multi-item rWG(J) equation is mathematically equivalent to inserting rWG into the Spearman-Brown prophecy formula and questioned whether applying a formula developed for a reliability index was also appropriate for an agreement index. Second, they questioned the appropriateness of James et al.’s suggestion of replacing obtained negative values of rWG with zeros. This article addresses these concerns by demonstrating that rWG(J) can be derived independently from the Spearman-Brown prophecy formula and that negative values of rWG can be avoided by reparameterizing the structural equation underlying the data when there is systematic disagreement between subgroups of raters.
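The equivalence noted here can be checked numerically: applying the Spearman-Brown prophecy formula to the single-item index computed from the mean item variance gives the same value as the direct multi-item rWG(J) formula. The Python sketch below uses hypothetical ratings and a uniform null variance; it is an illustration of that identity, not the authors' derivation.

```python
import numpy as np

def rwg_j_direct(ratings_matrix, sigma_e_sq):
    """rWG(J) from the mean of the J item variances (multi-item form)."""
    s_bar = np.mean(np.var(ratings_matrix, axis=0, ddof=1))
    j = ratings_matrix.shape[1]
    num = j * (1 - s_bar / sigma_e_sq)
    return num / (num + s_bar / sigma_e_sq)

def spearman_brown(r, j):
    """Spearman-Brown prophecy formula applied to a single-item index."""
    return j * r / (1 + (j - 1) * r)

# Hypothetical ratings: 6 raters x 3 items on a 5-point scale.
ratings = np.array([[4, 5, 4],
                    [4, 4, 4],
                    [5, 5, 4],
                    [3, 4, 4],
                    [4, 4, 5],
                    [4, 5, 4]])
sigma_uniform = (5 ** 2 - 1) / 12        # uniform null variance for a 5-point scale

r_bar = 1 - np.mean(np.var(ratings, axis=0, ddof=1)) / sigma_uniform
print(rwg_j_direct(ratings, sigma_uniform))      # direct multi-item formula
print(spearman_brown(r_bar, ratings.shape[1]))   # same value via Spearman-Brown
```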
Article
Full-text available
Currently, guidelines do not exist for applying interrater agreement indices to the vast majority of methodological and theoretical problems that organizational and applied psychology researchers encounter. For a variety of methodological problems, we present critical values for interpreting the practical significance of observed average deviation (AD) values relative to either single items or scales. For a variety of theoretical problems, we present null ranges for AD values, relative to either single items or scales, to be used for determining whether an observed distribution of responses within a group is consistent with a theoretically specified distribution of responses. Our discussion focuses on important ways to extend the usage of interrater agreement indices beyond problems relating to the aggregation of individual level data.
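A rule of thumb often associated with this line of work is that AD values for a single item below c/6, where c is the number of response options, indicate practically significant agreement. The sketch below applies that cutoff to hypothetical data; the specific critical values tabulated in the article are not reproduced, so the c/6 criterion here should be read as an assumption.

```python
import numpy as np

def ad_m(ratings):
    """Average deviation around the mean for a single item."""
    x = np.asarray(ratings, dtype=float)
    return np.mean(np.abs(x - x.mean()))

def practically_significant_agreement(ratings, n_options):
    """Compare the observed AD with the assumed c/6 rule of thumb."""
    return ad_m(ratings) <= n_options / 6

group = [5, 4, 5, 5, 4]                  # hypothetical 5-point item ratings
print(ad_m(group), practically_significant_agreement(group, n_options=5))
```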
Article
Full-text available
Statistics Canada’s General Social Survey (GSS) and Canadian Community Health Survey (CCHS) offer a valuable opportunity to examine the stability of life satisfaction responses and their correlates from year to year within a consistent analytical framework. Capitalizing on the strengths of these surveys, this paper addresses two questions. First, how much variability is observed from year to year and across surveys in the distribution of life satisfaction responses and what accounts for it? Second, how much variability is observed in the direction and magnitude of the correlation between life satisfaction and a consistent set of socioeconomic characteristics? The study shows that the mean level of life satisfaction reported varies from year to year in the GSS but remains stable in the CCHS. This pattern in variability is associated with survey content preceding the life satisfaction question. In contrast, the direction and magnitude of the relationships between life satisfaction and common socioeconomic characteristics are generally consistent between the two surveys and over time.
Book
Agreement among at least two evaluators is an issue of prime importance to statisticians, clinicians, epidemiologists, psychologists, and many other scientists. Measuring interobserver agreement is a method used to evaluate inconsistencies in findings from different evaluators who collect the same or similar information. Applications are highlighted throughout the book.
Article
Performance assessments have become the norm for evaluating language learners' writing abilities in international examinations of English proficiency. Two aspects of these assessments are usually systematically varied: test takers respond to different prompts, and their responses are read by different raters. This raises the possibility of undue prompt and rater effects on test takers' scores, which can affect the validity, reliability, and fairness of these tests. This study uses data from the Michigan English Language Assessment Battery (MELAB), including all official ratings given over a period of over four years (n=29,831), to examine these issues related to scoring validity. It uses the multi-facet extension of Rasch methodology to model this data, producing measures on a common, interval scale. First, the study investigates the comparability of prompts that differ on topic domain, rhetorical task, prompt length, task constraint, expected grammatical person of response, and number of tasks. It also considers whether prompts are differentially difficult for test takers of different genders, language backgrounds, and proficiency levels. Second, the study investigates the quality of raters' ratings, whether these are affected by time and by raters' experience and language background. It also considers whether raters alter their rating behavior depending on their perceptions of prompt difficulty and of test takers' prompt selection behavior. The results show that test takers' scores reflect actual ability in the construct being measured as operationalized in the rating scale, and are generally not affected by a range of prompt dimensions, rater variables, or test taker characteristics. It can be concluded that scores on this test and others whose particulars are like it have score validity, and assuming that other inferences in the validity argument are similarly warranted, can be used as a basis for making appropriate decisions. Further studies to develop a framework of task difficulty and a model of rater development are proposed.
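As a rough sketch of the many-facet Rasch machinery referred to here, the following Python function computes rating-category probabilities from ability, prompt difficulty, rater severity and threshold parameters, all of them hypothetical logit values; it is a schematic rating-scale formulation, not the model specification used in the MELAB study.

```python
import numpy as np

def mfrm_category_probs(ability, prompt_difficulty, rater_severity, thresholds):
    """Category probabilities under a many-facet Rasch (rating scale) sketch:
    the log-odds of category k over k-1 is ability - prompt difficulty
    - rater severity - threshold_k (the baseline step is fixed at zero)."""
    steps = ability - prompt_difficulty - rater_severity - np.asarray(thresholds)
    cumulative = np.concatenate([[0.0], np.cumsum(steps)])   # log numerators
    probs = np.exp(cumulative - cumulative.max())
    return probs / probs.sum()

# Hypothetical facet estimates in logits and three thresholds for a 4-category scale.
print(mfrm_category_probs(ability=1.2, prompt_difficulty=0.3,
                          rater_severity=0.5, thresholds=[-1.0, 0.0, 1.0]))
```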
Article
Measurements of agreement are needed to assess the acceptability of a new or generic process, methodology, and formulation in areas of laboratory performance, instrument or assay validation, method comparisons, statistical process control, goodness of fit, and individual bioequivalence. In all of these areas, one needs measurements that capture a large proportion of data that are within a meaningful boundary from target values. Target values can be considered random (measured with error) or fixed (known), depending on the situation. Various meaningful measures to cope with such diverse and complex situations have become available only in the last decade. These measures often assume that the target values are random. This article reviews the literature and presents methodologies in terms of “coverage probability.” In addition, analytical expressions are introduced for all of the aforementioned measurements when the target values are fixed and when the error structure is homogenous or heterogeneous (proportional to target values). This article compares the asymptotic power of accepting the agreement across all competing methods and discusses the pros and cons of each. Data when the target values are random or fixed are used for illustration. A SAS macro program to compute all of the proposed methods is available for download at http://www.uic.edu/~hedayat/.
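In the spirit of the coverage-probability formulation described here, the minimal Python sketch below estimates the proportion of measurements falling within a prespecified boundary of fixed target values. The tolerance and the data are hypothetical, and none of the article's analytical expressions are reproduced.

```python
import numpy as np

def coverage_probability(measurements, targets, delta):
    """Estimated probability that a measurement lies within +/- delta of its target."""
    measurements = np.asarray(measurements, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs(measurements - targets) <= delta)

# Hypothetical assay readings against known (fixed) target concentrations.
targets = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
readings = np.array([10.4, 19.1, 30.2, 41.5, 49.6])
print(coverage_probability(readings, targets, delta=1.0))  # 0.8
```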
Article
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters' scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM-SDT model is applied to data from a large-scale assessment and is shown to provide a useful summary of various aspects of the raters' performance.
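A toy version of a signal-detection rater model of the kind described can be written as follows: the rater perceives the latent essay quality with Gaussian noise and applies fixed response criteria to produce an ordinal rating. This Python sketch, with hypothetical parameter values, is only a schematic analogue and not the latent class SDT model estimated in the article.

```python
from math import inf
from statistics import NormalDist

def sdt_rating_probs(latent_quality_mean, criteria, noise_sd=1.0):
    """Probability of each ordinal rating for a rater who perceives the latent
    essay quality with Gaussian noise and applies fixed response criteria."""
    norm = NormalDist(mu=latent_quality_mean, sigma=noise_sd)
    cuts = [-inf] + list(criteria) + [inf]
    return [norm.cdf(hi) - norm.cdf(lo) for lo, hi in zip(cuts[:-1], cuts[1:])]

# Hypothetical latent class locations (essay quality) and rater criteria.
for quality in (0.0, 1.0, 2.0):
    print(quality, [round(p, 2) for p in sdt_rating_probs(quality, criteria=[0.5, 1.5, 2.5])])
```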
Article
For continuous constructs, the most frequently used index of interrater agreement (rWG(1)) can be problematic. Typically, rWG(1) is estimated with the assumption that a uniform distribution represents no agreement. The authors review the limitations of this uniform null rWG(1) index and discuss alternative methods for measuring interrater agreement. A new interrater agreement statistic, aWG(1), is proposed. The authors derive the aWG(1) statistic and demonstrate that aWG(1) is an analogue to Cohen’s kappa, an interrater agreement index for nominal data. A comparison is made between agreement estimates based on the uniform rWG(1) and aWG(1), and issues such as minimum sample size and practical significance levels are discussed. The authors close with recommendations regarding the use of rWG(1)/rWG(J) when a uniform null is assumed, rWG(1)/rWG(J) indices that do not assume a uniform null, aWG(1)/aWG(J) indices, and generalizability estimates of interrater agreement.
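The abstract describes aWG(1) as a kappa-like alternative to the uniform-null rWG(1). A hedged Python sketch of such a statistic is given below, assuming (as the kappa analogy suggests) that the observed variance is compared with the maximum variance attainable at the observed mean on the rating scale; the published formula may differ in detail, so treat this form and the example data as an assumption.

```python
import numpy as np

def awg_single(ratings, scale_min, scale_max):
    """aWG(1)-style agreement: 1 - 2 * observed variance / maximum variance
    attainable at the observed mean (an assumed form with a kappa-like [-1, 1] range)."""
    x = np.asarray(ratings, dtype=float)
    n, m = len(x), x.mean()
    max_var = (scale_max - m) * (m - scale_min) * n / (n - 1)
    return 1 - 2 * np.var(x, ddof=1) / max_var

# Hypothetical ratings of a single target by five raters on a 1-5 scale.
print(round(awg_single([4, 4, 5, 3, 4], scale_min=1, scale_max=5), 3))
```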