ArticlePDF Available

Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation


Abstract and Figures

Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable intera-annotator agreement depends on the use the annotated data will be put to.
Content may be subject to copyright.
Measuring Agreement on Set-valued Items (MASI)
for Semantic and Pragmatic Annotation
Rebecca Passonneau
Columbia University
New York, New York, USA
Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that
oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the
dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, distance metric for
comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs)
generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper
presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC).
The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous
application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of
inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other
phenomena. In effect, what counts as sufficiently reliable intera-annotator agreement depends on the use the annotated data will be put to.
1. Introduction
To capture gradations in meaning or function,
semantic and pragmatic annotation projects have taken
various approaches. The project on Interlingual
Annotation of Multilingual Text Corpora (IAMTC;
Farwell et al., 2004) explicitly directed annotators to
make multiple selections when no single selection seems
sufficient. A related approach was taken in an email
domain in which annotators were allowed to make
multiple selections, but were asked to designate one as
primary (Rosenberg & Binkowski, 2004). A contrasting
implicit method relies on frequency of a category across
multiple annotators to represent stronger or weaker
presence of pragmatic units (Passonneau & Litman, 1997)
or semantic ones (Nenkova & Passonneau, 2004).
Annotations with multiple choices or graded
categories require new approaches to measuring
agreement. Rosenberg & Binkowski (2004), for example,
developed an augmented version of Cohen’s kappa (1960)
to assess inter-annotator agreement for the email domain.
It yields different kappa scores, depending on the weight
assigned to the primary selection. If the weight is 1, the
secondary selection is ignored; if it is .5, both are
considered equally. (Passonneau, 2004) presented a
weighted metric for measuring agreement on set-valued
items (referred to here as MASI) and compared it with
other measures of agreement on co-reference annotation.
MASI has also been applied to the IAMTC data
(Passonneau et al., 2006). It will be demonstrated here for
a semantic annotation task pertaining to the evaluation of
automatic summarization: the creation of pyramids from
human model summaries. A pyramid is a weighted model
of the semantic content in a set of human model
summaries (Nenkova & Passonneau, 2004), and can be
used to score machine-generated summaries. It was used
in the 2005 Document Understanding Conference (DUC)
(Passonneau et al., 2005) and will be used in DUC 2006.
This paper will address the inter-annotator reliability of
pyramid construction for a DUC 2003 dataset.
Section two describes the annotation task, and gives
an example of a representative pair of SCUs from
pyramids created by different annotators. Section three
gives an overview of a standard framework for assessing
reliability, a definition and simple illustration of MASI,
and a brief discussion of related work. Section four
compares inter-annotator agreement results for the five
pairs of pyramids using three metrics, including MASI.
Together, the three metrics indicate a very high degree of
overlap in pyramid annotations.
2. Pyramid Annotation Task
Summaries written by different humans will share
information, but will also have information that does not
appear in any other summary. This long observed fact
was dramatically quantified by (van Halteren & Teufel,
2003) for a set of fifty summaries of a single source text.
Pyramids represent shared content in summaries by
having annotators select spans of words, or contributors,
from different summaries such that each expresses more
or less the same information. We refer to a set of
contributors as a Summary Content Unit (SCU). An SCU
will have at most the same number of contributors as
there are model summaries. The cardinality of an SCU,
its weight, indicates how many of the model summaries
expresses the given content. The set of all SCUs found in
the models constitutes a pyramid. Annotators assign a
label that serves as a mnemonic for the meaning.
A1’s SCU: Weight=4
[Label: Americans asked Saudi officials for help]
Sum1 <Saudi Arabian officials, under American
Sum2 <sought help from Saudi officials>2
Sum3 <Through the Saudis, the United States
Sum4 <U.S. and Saudi Arabian requests>4
A2’s SCU: Weight=5
[Label: Through the Saudis, the U.S. tried to get
cooperation from the Taliban]
Sum1 <Saudi Arabian officials, under American
pressure,>1 <asked Afghan leaders>5
Sum2 <U.S. and Saudi officials then attempted>6
Sum3 <sought help from Saudi officials>2,
<who tried to convince Taliban leaders>7
Sum4 <Through the Saudis, the United States
Sum5 <U.S. and Saudi Arabian requests>4
Equivalence classes:
A1 { 1, 2, 3, 4 } {5, 7} {6}
A2 { 1, 2, 3, 4, 5, 6, 7}
Figure 1. Semantically similar SCUs from two
annotators, A1 and A2.
In the original annotation method (Passonneau &
Nenkova, 2003), the contributors constituted equivalence
classes over the words in the model summaries, and SCUs
were equivalence classes over the contributors. This is the
annotation style whose results are reported on here. For
the annotations in DUC 2005 (Passonneau et al., 2005),
the annotation constraints were relaxed to allow a word or
phrase to be part of multiple contributors, and a
contributor could be part of multiple SCUs.
For the 2003 Document Understanding Conference,
NIST assembled thirty clusters of documents to use in the
evaluation of automatic summarizers. In addition, four
100-word human summaries per document cluster
(Docset) were collected. We recruited journalism majors,
English majors and others in the Columbia University
community who demonstrated high verbal skills (such as
high verbal GREs scores) to write additional summaries.
A contributor can have discontinuities in the word string, e.g.,
for discontinuous constituents.
For five of the document clusters from DUC 2003, we
had two annotators work independently to create
pyramids, each using seven model summaries per Docset.
Figure 1 shows a pair of SCUs from the two
independently annotated pyramids for one of the five
Docsets (31038). It is typical of what we see from
different annotators across the five pairs of pyramids
investigated here. The two SCUs are very similar, but not
identical. They differ in the weight (four versus five), and
in the constituency of the contributors. By giving each
span of words a unique identifier, we can see that there
are two contributors that are the same for both annotators
(spans 3 and 4), two that partly overlap (spans 1 and 2 for
A1, versus A2’s combination of spans 1 with 5, and of
spans 2 with 7), and one that is unique to annotator A2
(span 6). Annotator A1 placed spans 5 and 7 in a distinct
SCU, and span 6 was a singleton SCU.
The two annotators’ equivalence classes (of spans
rather than words) are shown at the bottom of Figure 1.
3. Inter-annotator Agreement
3.1. Standard Approach
Different types of tables have been used as a basis for
computing inter-annotator agreement, including
contingency tables, and simple agreement tables having
rows for each unit and columns for each category and
where cells record how often each unit was assigned each
category. For two coders, Di Eugenio & Glass (2004)
prefer contingency tables. They note that in comparison to
contingency tables, simple agreement tables lose
information about what choices individual coders make.
Contingency tables also lose information: they don’t
represent the coding units, and are inconvenient for more
than two coders. An agreement matrix in Krippendorff’s
(1986) canonical form has one row per coder, and one
column per coding unit. Cell values indicate the category
k assigned by the ith coder to the jth unit. Such matrices
lose no information, and can be used to tabulate counts of
the number of coders who assigned the kth category to
the jth unit, the number of categories assigned by the ith
coder to the jth unit, and so on.
Apart from the assumptions used for computing the
probability of the kth, most reliability metrics use the
same or equivalent general formula to factor out chance
agreement (Passonneau, 1997) (Arstein & Poesio, 2005).
Where p(A
) and p(A
) are the probabilities of observed
and expected agreement, the general formula is:
() ()
The metrics all have the same range: one for perfect
agreement, to zero for no difference from chance, to
values that approach minus one for ever greater than
chance disagreement. The devil is in the details, namely
how to estimate p(A
The family of metrics including Scott’s pi (1955) and
Siegel & Castellan’s K (1988) use a single probability
distribution for all coders, based on the observed rate of
each category k across all coders. Cohen’s kappa (1960)
uses a distinct probability distribution for each coder
based on the rate at which the kth category appears in the
ith coder’s annotation. Cohen’s (1960) kappa makes
fewer assumptions, so in principle it provides stronger
support for inferences about reliability. In practice, kappa
may not always be the best choice.
Di Eugenio & Glass (2004) argue that kappa suffers
from coder bias. The size of kappa will be relatively
higher than Siegel & Castellan’s K if two coders assign
the categories k at different rates. Whether one views bias
as an obstacle depends on one’s goals. If the probability
distributions over the values k are very different for two
coders, then the probability that they will agree will
necessarily be lower, and kappa accounts for this.
Whether the difference in distribution arises from the
inherent subjectivity of the task, insufficient specification
in the annotation guidelines of when to use each category,
or differences in the skill and attention of the annotators,
cannot be answered by one metric in one comparison.
Artstein and Poesio (2005) review several families of
reliability metrics, the associated assumptions, and
differences in the resulting values that arise given the
same data. The quantitative differences tend to be small.
In order to illustrate the impact of different distance
metrics, results are reported here using a single method of
computing p(A
), Krippendorff’s Alpha (1980).
The formula for Alpha, given m coders and r units, is:
bc bc
rm n n
The numerator is a summation over the product of counts
of all pairs of values b and c, times the distance metric δ,
across rows. The denominator is a summation of
agreements and disagreements within columns. For
categorical scales, because Alpha measures
disagreements, δ is 0 when b=c, and 1 when b
c. For
very large samples, Alpha is equivalent to Scott’s (1955)
pi; it corrects for small sample sizes, applies to multiple
coders, and generalizes to many scales of annotation data.
Interpreting inter-annotator reliability raises two
questions: what value of reliability is good enough, and
how does one decide. Krippendorff (1980) is often cited
as recommending a threshold of 0.67 to support cautious
conclusions. The comment he made that introduced his
discussion should be quoted more often. For the question
of how reliable is reliable enough, he said: “there is no set
answer” (p. 146). He offered the 0.67 threshold in the
context of reliability studies in which the same variables
also played a role in independent significance tests. In his
data, variables below the 0.67 threshold happened never
to be significant. He noted that in contrast, “some content
analyses are very robust in the sense that unreliabilities
become hardly noticeable in the result” (p. 147).
I will refer to the simultaneous investigation of
reliability values of annotated data, and significance tests
of the annotated variables with respect to independent
measures, as a paradigmatic reliability study.
(Passonneau et al., 2005) includes an analysis of the
reliability of peer annotations for pyramid evaluation, and
of the significance of correlations of pyramid scores using
peer annotations from different annotators. It is a
paradigmatic reliability study of peer annotation. The
average Kappa across six document sets was .57, the
average Alpha with Dice (1945) as a distance metric was
0.62, and Pearson’s correlations were highly significant.
A distance metric was used to count partial agreement for
annotators who agreed that a given SCU occurred in a
peer summary, but disagreed as to how often. MASI was
not relevant here, because the counts of SCUs per
summary did not constitute a unit of representation.
In concurrent work (Passonneau, 2005), we present
results of a study in which the five pyramids discussed
here were used to score summaries. Thus the present
paper in combination with (Passonneau, 2005) constitutes
a paradigmatic reliability study of pyramid annotation.
3.3. MASI
MASI is a distance metric for comparing two sets,
much like an association measure such as Jaccard (1908)
or Dice (1945). In fact, it incorporates Jaccard, as
explained below. When used to weight the computation of
inter-annotator agreement, it is independent of the method
in which probability is computed, thus of the expected
agreement. It can be used in any weighted agreement
metric, such as Krippendorff's Alpha (Passonneau, 2004)
or Artstein & Poesio's (2005) Beta
In (Passonneau, 2004), MASI was used for measuring
agreement on co-reference annotations. Earlier work on
assessing co-reference annotations did not use reliability
measures of canonical agreement matrices, in part
because of the data representation problem of determining
what the coding values should be. The annotation task in
co-reference does not involve selecting categories from a
predefined set, but instead requires annotators to group
expressions together into sets of those that co-refer.
(Passonneau, 2004) proposed a means for casting co-
reference annotation into a conventional agreement matrix
by treating the equivalence classes that annotators
grouped NPs into as the coding values. Application of
MASI for comparing the equivalence classes that
annotators assign an NP to made it possible to quantify
the degree of similarity across annotations. Since it is
typically the case that annotators assign the same NP to
very similar, but rarely identical, equivalence classes,
applying an unweighted metric to the agreement matrices
yields misleadingly low values.
The annotation task in creating pyramids has similar
properties to the NP co-coreference annotation task.
Neither the number of distinct referents, nor the number
of distinct SCUs, is given in advance: both are the
outcome of the annotation. The annotations both yield
equivalence classes in which every NP token, or every
word token, belongs to exactly one class (corresponding
to a referent, or an SCU). NPs that are not grouped with
other NPs (e.g., NPs annotated as non-referential), and
words that are not grouped with other words (e.g., closed-
class lexical items like “and” that contribute little or
nothing to the semantics of an SCU, form singleton sets.
Figure 2 and Figure 3 schematically represent
agreement matrices using set-based annotations. A3 and
A4 stand for two annotators; x, y and z are the units from
which to create sets, and the coding values are the sets
shown in the cells of the matrices.
Annotator x y z
A3 {x, y} {x, y} {x}
A4 {x, y, z} {x, y, z} {x, y, z}
Figure 2. Annotation with set subsumption.
Annotator x y z
A3 {x, y } {x, y} {z}
A4 {x } { y, z} { y, z}
Figure 3. Annotation with symmetric difference in
column “y”.
Figure 2 is like the SCU example in Figure 1 in that
there is a monotonic relationship among all the sets in the
matrix. Within columns, A3’s sets always share properties
with A4’s sets, and there are no conflicting properties.
This is not the case in Figure 3, where A3 has a set {x,y},
and A4 has a set {y,z}. The two sets have a non-null
intersection ({y}), and non-null set-differences ({x},{z}).
Figure 3 represents a case where A3 thinks x and y have
the same set of properties, not shared by z; A3 thinks y
and z have the same set of properties, not shared by x,
thus the semantic or pragmatic elements being
represented are in conflict.
MASI ranges from 1, when two sets are identical, to 0,
when they are disjoint. It has two terms which weight
different aspects of set comparison: MASI = J*M. The
Jaccard (1908) metric (the J term) is used to weight the
differences in size of two sets, independent of whether
sets are monotonic. The M term is for monotonicity, and
penalizes a case like Figure 3 more heavily than Figure 2.
Their role in computing MASI will now be illustrated.
Spans A1 A2
{ 1, 2, 3, 4 } { 1, 2, 3, 4, 5, 6, 7}
{ 1, 2, 3, 4 } { 1, 2, 3, 4, 5, 6, 7}
{ 1, 2, 3, 4 } { 1, 2, 3, 4, 5, 6, 7}
{ 1, 2, 3, 4 } { 1, 2, 3, 4, 5, 6, 7}
{5, 7} { 1, 2, 3, 4, 5, 6, 7}
{6} { 1, 2, 3, 4, 5, 6, 7}
{5, 7} { 1, 2, 3, 4, 5, 6, 7}
Figure 4. Agreement matrix for Figure 1, using spans
(instead of words) as the coding units.
Taking the two MASI terms in turn, J is the ratio of
the cardinality of the intersection to the cardinality of the
union of the two sets. For two sets P and Q, it is one if
P=Q, and grows closer to one the more members P and Q
have in common. J is zero if P and Q are disjoint, and is
closer to zero the larger P and Q are, and the fewer
members they have in common.
The value of Jaccard is 2/3 for the x and y columns of
Figure 2, and 1/3 for the z column. Similarly, it is 2/3 for
the y column of Figure 3, and 1/3 for the x and z columns.
The mean Jaccard for Figure 2 is 5/9, and for Figure 3 it is
4/9. Thus Figure 2 is appropriately closer to one than
Figure 3, but the quantitative difference is small.
The second term of MASI (M, for monotonicity)
penalizes the case in Figure 3 more heavily than that in
Figure 2. If two sets Q and P are identical, M is 1. If one
set is a subset of the other, M is 2/3. If the intersection
and the two set differences are all non-null, then M is 1/3.
If the sets are disjoint, M is 0.
Before comparing the sets assigned by A3 and A4 to a
coding unit y, the coding unit itself must be removed.
Otherwise, the coding values will necessarily intersect.
For column y in Figure 2, {x} would be compared with
{x, z}. For column y in Figure 3, {x} would be compared
with {z}. The mean MASI for Figure 2 is 10/27 (.37) and
for Figure 3 it is 6/27 (.22).
A1’s SCU-101: Weight=2
[Label: Worker’s agree to Estrada’s terms]
Sum1 <61% voted yes>1
Sum2 <Unions agreed to some employee cuts with
separation benefits>2
A1’s SCU-102: Weight=1
[Label: Ground crew accepts 2 weeks after initial
Sum3 <which it accepted two weeks later>3
A2’s SCU-201: Weight=2
[Label: The settlement was finally accepted]
Sum1 <61% voted yes>1
Sum2 <which it accepted two weeks later>3
A2’s SCU-202: Weight=1
[Label: Unions agree to some employee cuts]
Sum3 < Unions agreed to some employee cuts with
separation benefits >2
Figure 5. Pairs of SCUs from two annotators
illustrating non-monotonicity.
Figure 4 shows a canonical agreement matrix for the
example from Figure 1; to save space it is presented with
the coding units in rows instead of columns. For the sake
of illustration, the coding units are spans instead of words.
The set of coding categories consists of the equivalence
classes from both annotations. Annotator A2 placed all
the spans shown in a single SCU labeled [Through the
Saudis, the U.S. tried to get cooperation from the
Taliban]. Annotator A1 created a similar SCU labeled,
[Americans asked Saudi officials for help], but did not
include spans 5 (“asked Afghan leaders”) and 7 (“who
tried to convince Taliban leaders”). A1 placed 5 and 7 in
a distinct SCU (with other contributing spans, omitted
from discussion), labeled [Saudi officials asked Afghan
leaders to release Bin Laden].
The labels assigned by A1 and A2 in Figure 1 reflect
the difference in content. A2 chose a more
comprehensive label expressing a 3-way relation in which
the Saudis would mediate between the U.S. and the
Taliban. In comparison, A1’s labels describe two binary
relations, one relating the U.S. and the Saudis, and one
relating the Saudis and the Taliban. The labels would
suggest that A2’s annotation subsumes A1’s, and the SCU
representation confirms this.
In contrast to the SCU example illustrated in Figure 1,
we occasionally find groups of SCUs across annotators
that are semantically more distinct, corresponding to cases
like Figure 3. Figure 5 gives an example from a pyramid
whose reliability was reported on in (Nenkova &
Passonneau, 2004).
Table 1 shows the reliability values for the data from
Figure 1 using Krippendorff’s Alpha with three different
distance metrics. Because Krippendorff’s Alpha measures
disagreements, one minus Jaccard, and one minus MASI,
are used in computing Alpha. The “Nominal” column
shows the results treating all non-identical sets as
categorically distinct (see section 3.1). For illustrative
purposes, the top portion of the table uses spans as the
coding units, i.e., computing Alpha from the agreement
matrix given in Figure 4. Since spans were not given in
advance, but were decided on by coders, this
underestimates the number of decisions that annotators
were required to make. The very low value in the Jaccard
column is due to the disparity in size between the two
annotations for rows five through seven of Figure 4.
The lower portion of Table 1 shows the results using
words as the coding units. The values across the three
columns are similar to those for the full dataset as we will
see in the discussion of Table 2.
Coding units Nominal Jaccard MASI
spans 0 -.44 0.14
words 0 .64 .81
Table 1. Reliability values for data from Figure 1,
using spans versus words as coding units.
3.4. Related Work
As noted above, Teufel and van Halteren (2004)
perform an annotation addressing a goal similar to the
pyramid method. They create lists of factoids, atomic
units of information. To compare sets of factoids that
were independently created by two annotators, they first
create a list of subsumption relations between factoids
across annotations. Then they construct a table that lists
all (subsumption-relation, summary) pairs, with counts of
how often each subsumption relation occurs in each
summary. Figure 6 reproduces their Figure 2. Every
factoid is given an index, and in Figure 6, P30 represents
a factoid created by one annotator that subsumes two
created by the other annotator. Symbols a through e
represent five summaries. They compute kappa from this
type of agreement table.
SCU-201 has been simplified for illustrative purposes; in the
actual data, it had a third contributor.
A1 A2 A1
P30 F9.21 -a 1 1 P30 F9.22 -a 1 0
P30 F9.21 -b 0 0 P30 F9.22 -b 0 0
P30 F9.21 -c 1 0 P30 F9.22 -c 1 1
P30 F9.21 -d 0 0 P30 F9.22 -d 0 0
P30 F9.21 -e 1 0 P30 F9.22 -e 1 1
Figure 6. Agreement table representation used in
Teufel and van Halteren (2004).
While this representation does not suffer from the loss
of information De Eugenio & Glass (2004) fault Siegel &
Castellan (1988) for, note that it differs from an
agreement matrix or a contingency table in that it is not
the case that each count represents an individual decision
made by an annotator. We can see from the table that A1
is the annotator who created P30 and A2 is the one who
created F9.21 and F9.22. Although there are two cells in
A1’s column for the two subsumption relations P30
F9.21 and P30F9.22, it is unlikely that A1’s original
annotation involved decisions about F9.21 and F9.22. If
the number of decisions is overestimated, p(A
) will be
underestimated, leading to higher kappa values.
Another issue in using such an agreement table from
two independently created factoid lists is that it requires
the creation of a new level of representation that would
itself be subject to reliability issues.
4. Results and Discussion
Canonical greement matrices of the form shown in
Figure 4, but with words as the coding units, were
computed for the five pairs of independently created
pyramids for the Docsets listed in Table 2. The mean
number of words per pyramid was 725; the mean number
of distinct SCUs was 92. Results are shown for Alpha
with the same three distance metrics used in Table 1.
Docset Nominal Jaccard MASI
30016 .19 .55 0.79
30040 .24 .58 0.80
31001 .01 .40 0.68
31010 .03 .39 0.69
31038 .09 .40 0.71
Table 2. Inter-annotator agreement on 5 pyramids
using unweighted Krippendorff’s Alpha (nominal),
and Alpha with Jaccard and MASI as δ.
The low values for the nominal distance metric are
expected, given that there are few cases of word-for-word
identity of SCUs across annotations. With Jaccard as the
distance metric, the values increase manyfold, indicating
that over all the comparisons of pairs of SCUs across
annotators for a given pyramid, the size of the set
intersection is closer to the size of the set union than not.
With MASI, values increase by approximately half of
the difference between the Jaccard value and the
maximum value of one. Since MASI rewards
overlapping sets twice as much if one is a subset of the
other than if they are not, this degree of increase indicates
that most of the differences between SCUs are monotonic.
By including several metrics whose relationship to
each other is known, Table 2 indicates that the pyramid
annotations do not have many cases of exact agreement
(nominal), that the sets being compared have more
members in common than not (Jaccard), and that the
commonality is more often monotonic than not (MASI).
Whether these results are sufficiently reliable depends
on the uses of the data. In a separate investigation
(Passonneau, 2005), the pairs of pyramids for Docsets
30016 and 30014 have been used to produce parallel sets
of scores for summaries from sixteen summarization
systems that participated in DUC 2003. Pearson’s
correlations of two types of scores (original pyramid and
modified) range from 0.84 to 0.91with p values always
zero. This constitutes evidence that the pyramid
annotations are more than reliable enough.
5. Conclusion
Measuring inter-annotator reliability involves more
than a single number or a single study. Di Eugenio &
Glass (2004) argue that using multiple reliability metrics
with different methods for computing p(A
) can be more
revealing of than a single metric. Passonneau et al. (2005)
present a similar argument for the case of comparing
different distance metrics. Here, inter-annotator reliability
results have been presented using three metrics in order to
more fully characterize the dataset.
This paper argues that full interpretation of a
reliability measure is best carried out in a paradigmatic
reliability study: a series of studies that link one or more
measures of the reliability of a dataset to an independent
assessment, such as a significance test. If the same
dataset is used in different tasks, what is reliable for one
task may not be for another.
Investigators faced with complex annotation data have
shown ingenuity in proposing new data representations
(Teufel & van Halteren, 2004), new reliability measures
(Rosenberg & Binkowski, 2004), and techniques new to
computational linguistics, as discussed in (Artstein &
Poesio). While this paper argues for placing a greater
burden on the interpretation of inter-annotator agreement,
proposals such as these provide an expanding suite of
tools for accomplishing this task.
This work was supported by DARPA NBCH105003
and NUU01-00-1-8919. The author thanks many
annotators, especially Ani Nenkova and David Elson.
Artstein, R. and M. Poesio. 2005. Kappa
= Alpha (or
Beta). University of Essex NLE Technote 2005-01.
Dice, L. R. (1945). Measures of the amount of ecologic
association between species. Ecology 26:297-302.
Farwell, D.; Helmreich, S.; Dorr, B. J.; Habash, N.;
Reeder, F.; Miller, K.; Levin, L.; Mitamura, T.; Hovy,
E.; Rambow, O.; Siddharthan, A. (2004). Interlingual
annotation of multilingual text corpora. In Proceedings
of the North American Chapter of the Association for
Computational Linguistics Workshop on Frontiers in
Corpus Annotation, Boston, MA, pp. 55-62, 2004.
van Halteren, H. and S. Teufel. 2003. Examining the
consensus between human summaries. In Proceedings
of the Document Understanding Workshop.
Jaccard, P. (1908). Nouvelles recherches sur la
distribution florale. Bulletin de la Societe Vaudoise des
Sciences Naturelles 44:223-270.
Krippendorff, K. 1980. Content Analysis. Newbury Park,
CA: Sage Publications.
Nenkova, A. and R. Passonneau. 2004. Evaluating content
selection in summarization: The pyramid method.
Proceedings of the Joint Annual Meeting of Human
Language Technology (HLT) and the North American
chapter of the Association for Computational
Linguistics (NACL). Boston, MA.
Passonneau, R.; Nenkova, A.; McKeown, K.; Sigelman,
S. 2005. Applying the pyramid method in DUC 2005.
Document Understanding Conference Workshop.
Passonneau, R.; Habash, N.; Rambow, O. 2006. Inter-
annotator agreement on a multilingual semantic
annotation task. Proceedings of the International
Conference on Language Resources and Evaluation
(LREC). Genoa, Italy.
Passonneau, R. 2005. Evaluating an evaluation method:
The pyramid method applied to 2003 Document
Understanding Conference (DUC) Data. Technical
Report CUCS-010-06, Department of Computer
Science, Columbia University.
Passonneau, R. 2004. Computing reliability for
coreference annotation. Proceedings of the
International Conference on Language Resources and
Evaluation (LREC). Portugal.
Passonneau, R. 1997. Applying reliability metrics to co-
reference annotation. Technical Report CUCS-017-97,
Department of Computer Science, Columbia
Passonneau, R. and D. Litman. 1997. Discourse
segmentation by human and automated means.
Computational Linguistics 23.1: 103-139.
Rosenberg, A. and Binkowski, E. (2004). Augmenting the
kappa statistic to determine inter-annotator reliability
for multiply labeled data points. In Proceedings of the
Human Language Technology Conference and Meeting
of the North American Chapter of the Association for
Computational Linguistics (HLT/NAACL).
Siegel, S. and N. John Castellan, Jr. (1988) Non-
parametric Statistics for the Behavioral Sciences, 2
edition. McGraw-Hill, New York.
Teufel, S. and H. van Halteren, 2004: Evaluating
information content by factoid analysis: human
annotation and stability. In Proceedings of Empirical
Methods in Natural Language Processing. Barcelona.
... To check the Inter Annotator Agreement (IAA), we calculated the Krippendorff's alpha agreement score (using MASI distance [26]). The agreement score came out as 0.9557 on the final dataset (considering only the labels assigned by at least two annotators) which shows very high agreement. ...
Full-text available
Convincing people to get vaccinated against COVID-19 is a key societal challenge in the present times. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have towards these vaccines, such as potential side-effects, ineffectiveness, political factors, and so on. Though there are datasets that broadly classify social media posts into Anti-vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled into various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset also provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that this is a very challenging dataset for multi-label explainable classification and tweet summarization, as is evident by the moderate scores achieved by some state-of-the-art models. Our dataset and codes are available at:
... We calculate the inter-annotator agreement using Krippendorf's alpha [49] coefficient which is one minus the ratio of observed and expected disagreement. Since our topic dataset is multilabel (i.e. a sample can be assigned any number of the topic labels), we use MASI distance [50] to measure the distance between the sets of labels When considering all crowdsourced samples without filtering, the Krippendorff's alpha is 0.35 which corresponds to low agreement. Possible reasons for this result are the large number of labels, ambiguous samples or workers who did not understand the task or cheated. ...
Full-text available
Toxicity on the Internet, such as hate speech, offenses towards particular users or groups of people, or the use of obscene words, is an acknowledged problem. However, there also exist other types of inappropriate messages which are usually not viewed as toxic, e.g. as they do not contain explicit offences. Such messages can contain covered toxicity or generalizations, incite harmful actions (crime, suicide, drug use), provoke "heated" discussions. Such messages are often related to particular sensitive topics, e.g. on politics, sexual minorities, social injustice which more often than other topics, e.g. cars or computing, yield toxic emotional reactions. At the same time, clearly not all messages within such flammable topics are inappropriate. Towards this end, in this work, we present two text collections labelled according to binary notion of inapropriateness and a multinomial notion of sensitive topic. Assuming that the notion of inappropriateness is common among people of the same culture, we base our approach on human intuitive understanding of what is not acceptable and harmful. To objectivise the notion of inappropriateness, we define it in a data-driven way though crowdsourcing. Namely we run a large-scale annotation study asking workers if a given chatbot textual statement could harm reputation of a company created it. Acceptably high values of inter-annotator agreement suggest that the notion of inappropriateness exists and can be uniformly understood by different people. To define the notion of sensitive topics in an objective way we use on guidelines suggested commonly by specialists of legal and PR department of a large public company as potentially harmful.
... Finally, the two annotators will discuss to resolve any disagreement in the annotation, thus producing a final version of our MACRONYM dataset. To study the challenges of AE in each language, following (Veyseh et al., 2020), we compute the inter-annotator agreement (IAA) scores using Krippendorff's alpha (Krippendorff, 2011) with the MASI distance metric (Passonneau, 2006) for the initial independent annotations of the two annotators, i.e., before resolving the conflicts. Table 1 shows the IAA scores for each language. ...
Full-text available
Acronym extraction is the task of identifying acronyms and their expanded forms in texts that is necessary for various NLP applications. Despite major progress for this task in recent years, one limitation of existing AE research is that they are limited to the English language and certain domains (i.e., scientific and biomedical). As such, challenges of AE in other languages and domains is mainly unexplored. Lacking annotated datasets in multiple languages and domains has been a major issue to hinder research in this area. To address this limitation, we propose a new dataset for multilingual multi-domain AE. Specifically, 27,200 sentences in 6 typologically different languages and 2 domains, i.e., Legal and Scientific, is manually annotated for AE. Our extensive experiments on the proposed dataset show that AE in different languages and different learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.
... The similarity of two annotators can be measured with Krippendorff 's inter-annotator-agreement coefficient α (Krippendorff, 2018, pp. 221-250), using the MASI distance (Passonneau, 2006) for multi-label annotations. Table 10 shows inter-annotator agreements for all pairs of annotators. ...
Full-text available
Literary narratives regularly contain passages that different readers attribute to different speakers: a character, the narrator, or the author. Since literary narratives are highly ambiguous constructs, it is often impossible to decide between diverging attributions of a specific passage by hermeneutic means. Instead, we hypothesise that attribution decisions are often influenced by annotator bias, in particular an annotator's literary preferences and beliefs. We present first results on the correlation between the literary attitudes of an annotator and their attribution choices. In a second set of experiments, we present a neural classifier that is capable of imitating individual annotators as well as a common-sense annotator, and reaches accuracies of up to 88% (which improves the majority baseline by 23%).
... The rendition of emotion clue in the text is highly subjective, which directs to a discrepancy in the annotations by distinct annotators. The difference in experiences, the annotation task itself, and focus on the annotators contribute to a disagreement between the annotators [66]. Thus, we investigate to find how much the annotators agree in assigning emotion classes by using Cohen's kappa [67]. ...
Full-text available
Emotion classification in text has growing interest among NLP experts due to the enormous availability of people’s emotions and its emergence on various Web 2.0 applications/services. Emotion classification in the Bengali texts is also gradually being considered as an important task for sports, e-commerce, entertainments, and security applications. However, It is a very critical task to develop an automatic emotion classification system for low-resource languages such as, Bengali. Scarcity of resources and deficiency of benchmark corpora make the task more complicated. Thus, the development of a benchmark corpus is the prerequisite to develop an emotion classifier for Bengali texts. This paper describes the development of an emotional corpus (hereafter called ‘BEmoC’) for classifying six emotions in Bengali texts. The corpus development process consists of four key steps: data crawling, pre-processing, labelling, and verification. A total of 7000 texts are labelled into six basic emotion categories such as anger, fear, surprise, sadness, joy, and disgust, respectively. Dataset evaluation with 0.969 Cohen’s κ score indicates the close agreement between the corpus annotators and the expert. The analysis of evaluation also represents that the distribution of emotion words obeys Zipf’s law. Moreover, the results of BEmoC analysis shown in terms of coding reliability, emotion density, and most frequent emotion words, respectively.
... We calculated inter-annotator agreement for 30 double-annotated articles. Since each sentence could be labeled with multiple items, we chose MASI measure (Measuring Agreement on Set-Valued Items) [44] to calculate agreement. MASI is a distance metric for comparing two sets and incorporates the Jaccard index. ...
Objective To annotate a corpus of randomized controlled trial (RCT) publications with the checklist items of CONSORT reporting guidelines and using the corpus to develop text mining methods for RCT appraisal. Methods We annotated a corpus of 50 RCT articles at the sentence level using 37 fine-grained CONSORT checklist items. A subset (31 articles) was double-annotated and adjudicated, while 19 were annotated by a single annotator and reconciled by another. We calculated inter-annotator agreement at the article and section level using MASI (Measuring Agreement on Set-Valued Items) and at the CONSORT item level using Krippendorff’s α. We experimented with two rule-based methods (phrase-based and section header-based) and two supervised learning approaches (support vector machine and BioBERT-based neural network classifiers), for recognizing 17 methodology-related items in the RCT Methods sections. Results We created CONSORT-TM consisting of 10,709 sentences, 4,845 (45%) of which were annotated with 5,246 labels. A median of 28 CONSORT items (out of possible 37) were annotated per article. Agreement was moderate at the article and section levels (average MASI: 0.60 and 0.64, respectively). Agreement varied considerably among individual checklist items (Krippendorff’s α= 0.06-0.96). The model based on BioBERT performed best overall for recognizing methodology-related items (micro-precision: 0.82, micro-recall: 0.63, micro-F1: 0.71). Combining models using majority vote and label aggregation further improved precision and recall, respectively. Conclusion Our annotated corpus, CONSORT-TM, contains more fine-grained information than earlier RCT corpora. Low frequency of some CONSORT items made it difficult to train effective text mining models to recognize them. For the items commonly reported, CONSORT-TM can serve as a testbed for text mining methods that assess RCT transparency, rigor, and reliability, and support methods for peer review and authoring assistance. Minor modifications to the annotation scheme and a larger corpus could facilitate improved text mining models. CONSORT-TM is publicly available at
... Otherwise, a fourth annotator is hired to resolve the conflict. The inter-annotator agreement (IAA) using Krippendorff's alpha (Krippendorff, 2011) with the MASI distance metric (Passonneau, 2006) for short-forms (i.e., acronyms) is 0.80 and for long-forms (i.e., phrases) is 0.86. Out of the 17,506 annotated sentences, 1% and 24% of the sentences do not contain any short form or long form, respectively. ...
Full-text available
Lack of large quantities of annotated data is a major barrier in developing effective text mining models of biomedical literature. In this study, we explored weak supervision strategies to improve the accuracy of text classification models developed for assessing methodological transparency of randomized controlled trial (RCT) publications. Specifically,we used Snorkel, a framework to programmatically build training sets, and UMLS-EDA, a data augmentation method that leverages a small number of existing examples to generate new training instances, for weak supervision and assessed their effect on a BioBERT-based text classification model proposed for the task in previous work. Performance improvements due to weak supervision were limited and were surpassed by gains from hyperparameter tuning. Our analysis suggests that refinements to the weak supervision strategies to better deal with multi-label case could be beneficial.
In recent years, the amount of digital text contents or documents in the Bengali language has increased enormously on online platforms due to the effortless access of the Internet via electronic gadgets. As a result, an enormous amount of unstructured data is created that demands much time and effort to organize, search or manipulate. To manage such a massive number of documents effectively, an intelligent text document classification system is proposed in this paper. Intelligent classification of text document in a resource-constrained language (like Bengali) is challenging due to unavailability of linguistic resources, intelligent NLP tools, and larger text corpora. Moreover, Bengali texts are available in two morphological variants (i.e., Sadhu-bhasha and Cholito-bhasha) making the classification task more complicated. The proposed intelligent text classification model comprises GloVe embedding and Very Deep Convolution Neural Network (VDCNN) classifier. Due to the unavailability of standard corpus, this work develops a large Embedding Corpus (EC) containing 969,000 unlabelled texts and Bengali Text Classification Corpus (BDTC) containing 156,207 labelled documents arranged into 13 categories. Moreover, this work proposes the Embedding Parameters Identification (EPI) Algorithm, which selects the best embedding parameters for low-resource languages (including Bengali). Evaluation of 165 embedding models with intrinsic evaluators (semantic & syntactic similarity measures) shows that the GloVe model is more suitable (regarding Spearman & Pearson correlation) than other embeddings (Word2Vec, FastText, m-BERT) in Bengali text. Experimental results on the test dataset confirm that the proposed GloVe + VDCNN model outperformed (achieving the highest 96.96% accuracy) the other classification models and existing methods to perform the Bengali text classification task.
Full-text available
A pyramid evaluation dataset was created for DUC 2003 in order to compare re- sults with DUC 2005, and to provide an independent test of the evaluation metric. The main differences between DUC 2003 and 2005 datasets pertain to the docu- ment length, cluster sizes, and model sum- mary length. For five of the DUC 2003 document sets, two pyramids each were constructed by annotators working inde- pendently. Scores of the same peer us- ing different pyramids were highly corre- lated. Sixteen systems were evaluated on eight document sets. Analysis of variance using Tukey's Honest Significant Differ- ence method showed significant differ- ences among all eight document sets, and more significant differences among the sixteen systems than for DUC 2005.
Full-text available
Co-reference annotation is annotation of language corpora to indicate which expressions have been used to co-specify the same discourse entity. When annotations of the same data are collected from two or more coders, the reliability of the data may need to be quantified. Two obstacles have stood in the way of applying reliability metrics: incommensurate units across annotations, and lack of a convenient representation of the coding values. Given N coders and M coding units, reliability is computed from an N-by-M matrix that records the value assigned to unit Mj by coder Nk. The solution I present accommodates a wide range of coding choices for the annotator, while preserving the same units across codings. As a consequence, it permits a straightforward application of reliability measurement. In addition, in coreference annotation, disagreements can be complete or partial. The representation I propose has the advantage of incorporating a distance metric that can scale disagreements accordingly. It also allows the investigator to experiment with alternative distance metrics. Finally, the coreference representation proposed here can be useful for other tasks, such as multivariate distributional analysis. The same reliability methodology has already been applied to another coding task, namely semantic annotation of summaries.
Full-text available
Six sites participated in the Interlingual Annotation of Multilingual Text Corpora (IAMTC) project (Dorr et al., 2004; Farwell et al., 2004; Mitamura et al., 2004). Parsed versions of English translations of news articles in Arabic, French, Hindi, Japanese, Korean and Spanish were annotated by up to ten annotators. Their task was to match open-class lexical items (nouns, verbs, adjectives, adverbs) to one or more concepts taken from the Omega ontology (Philpot et al., 2003), and to identify theta roles for verb arguments. The annotated corpus is intended to be a resource for meaning-based approaches to machine translation. Here we discuss inter-annotator agreement for the corpus. The annotation task is characterized by annotators' freedom to select multiple concepts or roles per lexical item. As a result, the annotation categories are sets, the number of which is bounded only by the number of distinct annotator-lexical item pairs. We use a reliability metric designed to handle partial agreement between sets. The best results pertain to the part of the ontology derived from WordNet. We examine change over the course of the project, differences among annotators, and differences across parts of speech. Our results suggest a strong learning effect early in the project.
Full-text available
The need to model the relation between discourse structure and linguistic features of utterances is almost universally acknowledged in the literature on discourse. However, there is only weak consensus on what the units of discourse structure are, or the criteria for recognizing and generating them. We present quantitative results of a two-part study using a corpus of spontaneous, narrative monologues. The first part of our paper presents a method for empirically validating multitutterance units referred to as discourse segments. We report highly significant results of segmentations performed by naive subjects, where a commonsense notion of speaker intention is the segmentation criterion. In the second part of our study, data abstracted from the subjects' segmentations serve as a target for evaluating two sets of algorithms that use utterance features to perform segmentation. On the first algorithm set, we evaluate and compare the correlation of discourse segmentation with three types of linguistic cues (referential noun phrases, cue words, and pauses). We then develop a second set using two methods: error analysis and machine learning. Testing the new algorithms on a new data set shows that when multiple sources of linguistic knowledge are used concurrently, algorithm performance improves.
Full-text available
This paper describes a method for evaluating interannotator reliability in an email corpus annotated for type (e.g., question, answer, so-cial chat) when annotators are allowed to as-sign multiple labels to a message. An augmentation is proposed to Cohen's kappa statistic which permits all data to be included in the reliability measure and which further permits the identification of more or less re-liably annotated data points.
Conference Paper
Full-text available
We present an empirically grounded method for evaluating content selection in summariza- tion. It incorporates the idea that no single best model summary for a collection of documents exists. Our method quantifies the relative im- portance of facts to be conveyed. We argue that it is reliable, predictive and diagnostic, thus im- proves considerably over the shortcomings of the human evaluation method currently used in the Document Understanding Conference.
Full-text available
Studies of the contextual and linguistic factors that constrain discourse phenomena such as reference are coming to depend increasingly on annotated language corpora. In preparing the corpora, it is important to evaluate the reliability of the annotation, but methods for doing so have not been readily available. In this report, I present a method for computing reliability of coreference annotation. First I review a method for applying the information retrieval metrics of recall and precision to coreference annotation proposed by Marc Vilain and his collaborators. I show how this method makes it possible to construct contingency tables for computing Cohen's Kappa, a familiar reliability metric. By comparing recall and precision to reliability on the same data sets, I also show that recall and precision can be misleadingly high. Because Kappa factors out chance agreement among coders, it is a preferable measure for developing annotated corpora where no pre-existing target annotation exists.
Conference Paper
We present a new approach to intrinsic sum- mary evaluation, based on initial experiments in van Halteren and Teufel (2003), which com- bines two novel aspects: comparison of infor- mation content (rather than string similarity) in gold standard and system summary, mea- sured in shared atomic information units which we call factoids, and comparison to more than one gold standard summary (in our data: 20 and 50 summaries respectively). In this paper, we show that factoid annotation is highly re- producible, introduce a weighted factoid score, estimate how many summaries are required for stable system rankings, and show that the fac- toid scores cannot be su-ciently approximated by unigrams and the DUC information overlap measure.