Measuring Agreement on Set-valued Items (MASI)
for Semantic and Pragmatic Annotation
Rebecca Passonneau
Columbia University
New York, New York, USA
becky@cs.columbia.edu
Abstract
Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that
oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the
dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, a distance metric for
comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs)
generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper
presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC).
The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous
application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of
inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other
phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use to which the annotated data will be put.
1. Introduction
To capture gradations in meaning or function,
semantic and pragmatic annotation projects have taken
various approaches. The project on Interlingual
Annotation of Multilingual Text Corpora (IAMTC;
Farwell et al., 2004) explicitly directed annotators to
make multiple selections when no single selection seemed
sufficient. A related approach was taken in an email
domain in which annotators were allowed to make
multiple selections, but were asked to designate one as
primary (Rosenberg & Binkowski, 2004). A contrasting
implicit method relies on frequency of a category across
multiple annotators to represent stronger or weaker
presence of pragmatic units (Passonneau & Litman, 1997)
or semantic ones (Nenkova & Passonneau, 2004).
Annotations with multiple choices or graded
categories require new approaches to measuring
agreement. Rosenberg & Binkowski (2004), for example,
developed an augmented version of Cohen’s kappa (1960)
to assess inter-annotator agreement for the email domain.
It yields different kappa scores, depending on the weight
assigned to the primary selection. If the weight is 1, the
secondary selection is ignored; if it is .5, both are
considered equally. (Passonneau, 2004) presented a
weighted metric for measuring agreement on set-valued
items (referred to here as MASI) and compared it with
other measures of agreement on co-reference annotation.
MASI has also been applied to the IAMTC data
(Passonneau et al., 2006). It will be demonstrated here for
a semantic annotation task pertaining to the evaluation of
automatic summarization: the creation of pyramids from
human model summaries. A pyramid is a weighted model
of the semantic content in a set of human model
summaries (Nenkova & Passonneau, 2004), and can be
used to score machine-generated summaries. It was used
in the 2005 Document Understanding Conference (DUC)
(Passonneau et al., 2005) and will be used in DUC 2006.
This paper will address the inter-annotator reliability of
pyramid construction for a DUC 2003 dataset.
Section two describes the annotation task, and gives
an example of a representative pair of SCUs from
pyramids created by different annotators. Section three
gives an overview of a standard framework for assessing
reliability, a definition and simple illustration of MASI,
and a brief discussion of related work. Section four
compares inter-annotator agreement results for the five
pairs of pyramids using three metrics, including MASI.
Together, the three metrics indicate a very high degree of
overlap in pyramid annotations.
2. Pyramid Annotation Task
Summaries written by different humans will share
information, but will also have information that does not
appear in any other summary. This long observed fact
was dramatically quantified by (van Halteren & Teufel,
2003) for a set of fifty summaries of a single source text.
Pyramids represent shared content in summaries by
having annotators select spans of words, or contributors,¹ from different summaries such that each expresses more
or less the same information. We refer to a set of
contributors as a Summary Content Unit (SCU). An SCU
will have at most the same number of contributors as
there are model summaries. The cardinality of an SCU,
its weight, indicates how many of the model summaries
express the given content. The set of all SCUs found in
the models constitutes a pyramid. Annotators assign a
label that serves as a mnemonic for the meaning.
A1’s SCU: Weight=4
[Label: Americans asked Saudi officials for help]
Sum1 <Saudi Arabian officials, under American
pressure>1
Sum2 <sought help from Saudi officials>2
Sum3 <Through the Saudis, the United States
asked>3
Sum4 <U.S. and Saudi Arabian requests>4
A2’s SCU: Weight=5
[Label: Through the Saudis, the U.S. tried to get
cooperation from the Taliban]
Sum1 <Saudi Arabian officials, under American
pressure,>1 <asked Afghan leaders>5
Sum2 <U.S. and Saudi officials then attempted>6
Sum3 <sought help from Saudi officials>2,
<who tried to convince Taliban leaders>7
Sum4 <Through the Saudis, the United States
asked>3
Sum5 <U.S. and Saudi Arabian requests>4
Equivalence classes:
A1 { 1, 2, 3, 4 } {5, 7} {6}
A2 { 1, 2, 3, 4, 5, 6, 7}
Figure 1. Semantically similar SCUs from two
annotators, A1 and A2.
In the original annotation method (Passonneau &
Nenkova, 2003), the contributors constituted equivalence
classes over the words in the model summaries, and SCUs
were equivalence classes over the contributors. This is the
annotation style whose results are reported on here. For
the annotations in DUC 2005 (Passonneau et al., 2005),
the annotation constraints were relaxed to allow a word or
phrase to be part of multiple contributors, and a
contributor could be part of multiple SCUs.
For the 2003 Document Understanding Conference,
NIST assembled thirty clusters of documents to use in the
evaluation of automatic summarizers. In addition, four
100-word human summaries per document cluster
(Docset) were collected. We recruited journalism majors,
English majors and others in the Columbia University
community who demonstrated high verbal skills (such as
high verbal GRE scores) to write additional summaries.
¹ A contributor can have discontinuities in the word string, e.g., for discontinuous constituents.
For five of the document clusters from DUC 2003, we
had two annotators work independently to create
pyramids, each using seven model summaries per Docset.
Figure 1 shows a pair of SCUs from the two
independently annotated pyramids for one of the five
Docsets (31038). It is typical of what we see from
different annotators across the five pairs of pyramids
investigated here. The two SCUs are very similar, but not
identical. They differ in the weight (four versus five), and
in the constituency of the contributors. By giving each
span of words a unique identifier, we can see that there
are two contributors that are the same for both annotators
(spans 3 and 4), two that partly overlap (spans 1 and 2 for
A1, versus A2’s combination of spans 1 with 5, and of
spans 2 with 7), and one that is unique to annotator A2
(span 6). Annotator A1 placed spans 5 and 7 in a distinct
SCU, and span 6 was a singleton SCU.
The two annotators’ equivalence classes (of spans
rather than words) are shown at the bottom of Figure 1.
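For concreteness, the equivalence classes at the bottom of Figure 1 can be represented directly as collections of sets. The sketch below is illustrative Python only, not part of the annotation tools used for this study, and the names are hypothetical.

    # A1's and A2's equivalence classes over span identifiers 1-7, read off Figure 1.
    a1_classes = [frozenset({1, 2, 3, 4}), frozenset({5, 7}), frozenset({6})]
    a2_classes = [frozenset({1, 2, 3, 4, 5, 6, 7})]

    def class_of(span, classes):
        """Return the equivalence class (SCU) that contains the given span."""
        return next(c for c in classes if span in c)

    print(class_of(5, a1_classes))  # the class containing span 5 for A1: {5, 7}
    print(class_of(5, a2_classes))  # A2 groups all seven spans into one class

Looking up the class that contains a given span in this way yields exactly the coding values used in the agreement matrix of Figure 4 below.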
3. Inter-annotator Agreement
3.1. Standard Approach
Different types of tables have been used as a basis for
computing inter-annotator agreement, including
contingency tables, and simple agreement tables having
rows for each unit and columns for each category and
where cells record how often each unit was assigned each
category. For two coders, Di Eugenio & Glass (2004)
prefer contingency tables. They note that in comparison to
contingency tables, simple agreement tables lose
information about what choices individual coders make.
Contingency tables also lose information: they don’t
represent the coding units, and are inconvenient for more
than two coders. An agreement matrix in Krippendorff’s
(1980) canonical form has one row per coder, and one
column per coding unit. Cell values indicate the category
k assigned by the ith coder to the jth unit. Such matrices
lose no information, and can be used to tabulate counts of
the number of coders who assigned the kth category to
the jth unit, the number of categories assigned by the ith
coder to the jth unit, and so on.
Apart from the assumptions used for computing the probability of the kth category, most reliability metrics use the same or equivalent general formula to factor out chance agreement (Passonneau, 1997; Artstein & Poesio, 2005).
Where p(A_O) and p(A_E) are the probabilities of observed and expected agreement, the general formula is:

\[
\frac{p(A_O) - p(A_E)}{1 - p(A_E)}
\]
The metrics all have the same range: one for perfect
agreement, to zero for no difference from chance, to
values that approach minus one for ever greater than
chance disagreement. The devil is in the details, namely
how to estimate p(A_E).
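For example, with hypothetical values p(A_O) = 0.80 and p(A_E) = 0.50, the chance-corrected agreement would be (0.80 − 0.50)/(1 − 0.50) = 0.60.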
The family of metrics that includes Scott's pi (1955) and Siegel & Castellan's K (1988) uses a single probability distribution for all coders, based on the observed rate of each category k across all coders. Cohen's kappa (1960)
uses a distinct probability distribution for each coder
based on the rate at which the kth category appears in the
ith coder’s annotation. Cohen’s (1960) kappa makes
fewer assumptions, so in principle it provides stronger
support for inferences about reliability. In practice, kappa
may not always be the best choice.
Di Eugenio & Glass (2004) argue that kappa suffers
from coder bias. The size of kappa will be relatively
higher than Siegel & Castellan’s K if two coders assign
the categories k at different rates. Whether one views bias
as an obstacle depends on one’s goals. If the probability
distributions over the values k are very different for two
coders, then the probability that they will agree will
necessarily be lower, and kappa accounts for this.
Whether the difference in distribution arises from the
inherent subjectivity of the task, insufficient specification
in the annotation guidelines of when to use each category,
or differences in the skill and attention of the annotators,
cannot be answered by one metric in one comparison.
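As a hypothetical illustration of the bias effect: if coder 1 uses category A for 90% of the units and coder 2 uses it for 50%, Cohen's expected agreement is (0.9)(0.5) + (0.1)(0.5) = 0.50, whereas the pooled distribution (0.7, 0.3) used by K gives 0.7² + 0.3² = 0.58; for the same observed agreement of, say, 0.80, kappa is (0.80 − 0.50)/0.50 = 0.60 while K is (0.80 − 0.58)/0.42 ≈ 0.52.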
Artstein and Poesio (2005) review several families of
reliability metrics, the associated assumptions, and
differences in the resulting values that arise given the
same data. The quantitative differences tend to be small.
In order to illustrate the impact of different distance
metrics, results are reported here using a single method of
computing p(A_E), Krippendorff's Alpha (1980).
The formula for Alpha, given m coders and r units, is:

\[
\alpha = 1 - \frac{(rm - 1)\sum_{i}\sum_{b>c}\delta_{bc}\, n_{bi}\, n_{ci}}{(m - 1)\sum_{b>c}\delta_{bc}\, n_{b}\, n_{c}}
\]

where n_{bi} is the number of coders who assigned value b to unit i, and n_{b} is the total number of times value b was assigned across the matrix.
The numerator sums, within each coding unit, the δ-weighted products of the counts of each pair of values b and c; the denominator sums the corresponding δ-weighted products of the overall value totals. For
categorical scales, because Alpha measures disagreements, δ is 0 when b = c, and 1 when b ≠ c. For
very large samples, Alpha is equivalent to Scott’s (1955)
pi; it corrects for small sample sizes, applies to multiple
coders, and generalizes to many scales of annotation data.
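To make the computation concrete, the following is a minimal sketch of Alpha for a complete coder-by-unit matrix, using the standard coincidence-based formulation with a pluggable distance function. It is illustrative code with hypothetical names, not the implementation used to produce the results reported here.

    from itertools import combinations

    def nominal_delta(b, c):
        """Nominal distance: 0 for identical values, 1 otherwise."""
        return 0.0 if b == c else 1.0

    def krippendorff_alpha(matrix, delta=nominal_delta):
        """Alpha for matrix[coder][unit] = value, with no missing data."""
        n_coders, n_units = len(matrix), len(matrix[0])
        # Observed disagreement: mean pairwise distance within each unit,
        # averaged over units.
        d_o = 0.0
        for u in range(n_units):
            column = [matrix[c][u] for c in range(n_coders)]
            pairs = list(combinations(column, 2))
            d_o += sum(delta(b, c) for b, c in pairs) / len(pairs)
        d_o /= n_units
        # Expected disagreement: mean pairwise distance over all values
        # pooled from the whole matrix.
        values = [v for row in matrix for v in row]
        pairs = list(combinations(values, 2))
        d_e = sum(delta(b, c) for b, c in pairs) / len(pairs)
        return 1.0 if d_e == 0 else 1.0 - d_o / d_e

    # Toy example (hypothetical data): two coders, four units;
    # prints a value of about 0.125 for this partly agreeing matrix.
    print(krippendorff_alpha([["A", "A", "B", "B"],
                              ["A", "B", "B", "A"]]))

A set-valued distance such as one minus MASI (Section 3.3) can be passed in place of the nominal distance to obtain a weighted Alpha of the kind reported in Tables 1 and 2.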
Interpreting inter-annotator reliability raises two
questions: what value of reliability is good enough, and
how does one decide. Krippendorff (1980) is often cited
as recommending a threshold of 0.67 to support cautious
conclusions. The comment he made that introduced his
discussion should be quoted more often. For the question
of how reliable is reliable enough, he said: “there is no set
answer” (p. 146). He offered the 0.67 threshold in the
context of reliability studies in which the same variables
also played a role in independent significance tests. In his
data, variables below the 0.67 threshold happened never
to be significant. He noted that in contrast, “some content
analyses are very robust in the sense that unreliabilities
become hardly noticeable in the result” (p. 147).
I will refer to the simultaneous investigation of
reliability values of annotated data, and significance tests
of the annotated variables with respect to independent
measures, as a paradigmatic reliability study.
(Passonneau et al., 2005) includes an analysis of the
reliability of peer annotations for pyramid evaluation, and
of the significance of correlations of pyramid scores using
peer annotations from different annotators. It is a
paradigmatic reliability study of peer annotation. The
average Kappa across six document sets was .57, the
average Alpha with Dice (1945) as a distance metric was
0.62, and Pearson’s correlations were highly significant.
A distance metric was used to count partial agreement for
annotators who agreed that a given SCU occurred in a
peer summary, but disagreed as to how often. MASI was
not relevant here, because the counts of SCUs per
summary did not constitute a unit of representation.
In concurrent work (Passonneau, 2005), we present
results of a study in which the five pyramids discussed
here were used to score summaries. Thus the present
paper in combination with (Passonneau, 2005) constitutes
a paradigmatic reliability study of pyramid annotation.
3.3. MASI
MASI is a distance metric for comparing two sets,
much like an association measure such as Jaccard (1908)
or Dice (1945). In fact, it incorporates Jaccard, as
explained below. When used to weight the computation of
inter-annotator agreement, it is independent of the method
in which probability is computed, thus of the expected
agreement. It can be used in any weighted agreement
metric, such as Krippendorff's Alpha (Passonneau, 2004)
or Artstein & Poesio's (2005) Beta.
In (Passonneau, 2004), MASI was used for measuring
agreement on co-reference annotations. Earlier work on
assessing co-reference annotations did not use reliability
measures of canonical agreement matrices, in part
because of the data representation problem of determining
what the coding values should be. The annotation task in
co-reference does not involve selecting categories from a
predefined set, but instead requires annotators to group
expressions together into sets of those that co-refer.
(Passonneau, 2004) proposed a means for casting co-
reference annotation into a conventional agreement matrix
by treating the equivalence classes that annotators
grouped NPs into as the coding values. Application of
MASI for comparing the equivalence classes that
annotators assign an NP to made it possible to quantify
the degree of similarity across annotations. Since it is
typically the case that annotators assign the same NP to
very similar, but rarely identical, equivalence classes,
applying an unweighted metric to the agreement matrices
yields misleadingly low values.
The annotation task in creating pyramids has similar
properties to the NP co-reference annotation task.
Neither the number of distinct referents, nor the number
of distinct SCUs, is given in advance: both are the
outcome of the annotation. The annotations both yield
equivalence classes in which every NP token, or every
word token, belongs to exactly one class (corresponding
to a referent, or an SCU). NPs that are not grouped with
other NPs (e.g., NPs annotated as non-referential), and words that are not grouped with other words (e.g., closed-class lexical items like "and" that contribute little or nothing to the semantics of an SCU), form singleton sets.
Figure 2 and Figure 3 schematically represent
agreement matrices using set-based annotations. A3 and
A4 stand for two annotators; x, y and z are the units from
which to create sets, and the coding values are the sets
shown in the cells of the matrices.
Units
Annotator x y z
A3 {x, y} {x, y} {x}
A4 {x, y, z} {x, y, z} {x, y, z}
Figure 2. Annotation with set subsumption.
Units
Annotator x y z
A3 {x, y } {x, y} {z}
A4 {x } { y, z} { y, z}
Figure 3. Annotation with symmetric difference in
column “y”.
Figure 2 is like the SCU example in Figure 1 in that
there is a monotonic relationship among all the sets in the
matrix. Within columns, A3’s sets always share properties
with A4’s sets, and there are no conflicting properties.
This is not the case in Figure 3, where A3 has a set {x,y},
and A4 has a set {y,z}. The two sets have a non-null
intersection ({y}), and non-null set-differences ({x},{z}).
Figure 3 represents a case where A3 thinks x and y have the same set of properties, not shared by z, while A4 thinks y and z have the same set of properties, not shared by x; thus the semantic or pragmatic elements being represented are in conflict.
MASI ranges from 1, when two sets are identical, to 0,
when they are disjoint. It has two terms which weight
different aspects of set comparison: MASI = J*M. The
Jaccard (1908) metric (the J term) is used to weight the
differences in size of two sets, independent of whether
sets are monotonic. The M term is for monotonicity, and
penalizes a case like Figure 3 more heavily than Figure 2.
Their role in computing MASI will now be illustrated.
Spans   A1              A2
1       {1, 2, 3, 4}    {1, 2, 3, 4, 5, 6, 7}
2       {1, 2, 3, 4}    {1, 2, 3, 4, 5, 6, 7}
3       {1, 2, 3, 4}    {1, 2, 3, 4, 5, 6, 7}
4       {1, 2, 3, 4}    {1, 2, 3, 4, 5, 6, 7}
5       {5, 7}          {1, 2, 3, 4, 5, 6, 7}
6       {6}             {1, 2, 3, 4, 5, 6, 7}
7       {5, 7}          {1, 2, 3, 4, 5, 6, 7}
Figure 4. Agreement matrix for Figure 1, using spans
(instead of words) as the coding units.
Taking the two MASI terms in turn, J is the ratio of
the cardinality of the intersection to the cardinality of the
union of the two sets. For two sets P and Q, it is one if
P=Q, and grows closer to one the more members P and Q
have in common. J is zero if P and Q are disjoint, and is
closer to zero the larger P and Q are, and the fewer
members they have in common.
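In symbols, the definition just given is, for two sets P and Q:

\[
J(P, Q) = \frac{|P \cap Q|}{|P \cup Q|}
\]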
The value of Jaccard is 2/3 for the x and y columns of Figure 2, and 1/3 for the z column. For Figure 3, it is 1/3 for the y column, and 1/2 for the x and z columns.
The mean Jaccard for Figure 2 is 5/9, and for Figure 3 it is
4/9. Thus Figure 2 is appropriately closer to one than
Figure 3, but the quantitative difference is small.
The second term of MASI (M, for monotonicity)
penalizes the case in Figure 3 more heavily than that in
Figure 2. If two sets Q and P are identical, M is 1. If one
set is a subset of the other, M is 2/3. If the intersection
and the two set differences are all non-null, then M is 1/3.
If the sets are disjoint, M is 0.
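As a concrete illustration, the following sketch computes MASI = J * M for two sets exactly as defined above; it is illustrative code, not the implementation used for the results in Section 4.

    def masi(p, q):
        """MASI for two sets: 1 for identical sets, 0 for disjoint sets."""
        p, q = set(p), set(q)
        if not p and not q:
            return 1.0                    # two empty sets are identical
        inter = p & q
        j = len(inter) / len(p | q)       # Jaccard term
        if p == q:
            m = 1.0
        elif p <= q or q <= p:
            m = 2 / 3                     # one set subsumes the other
        elif inter:
            m = 1 / 3                     # overlapping but non-monotonic
        else:
            m = 0.0                       # disjoint
        return j * m

    # Cell values from Figures 2 and 3:
    print(masi({"x", "y"}, {"x", "y", "z"}))  # subset case: (2/3) * (2/3) = 4/9
    print(masi({"x", "y"}, {"y", "z"}))       # non-monotonic: (1/3) * (1/3) = 1/9

Because Alpha measures disagreements, one minus this value is what enters a weighted agreement computation (see Table 1).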
Before comparing the sets assigned by A3 and A4 to a
coding unit y, the coding unit itself must be removed.
Otherwise, the coding values will necessarily intersect.
For column y in Figure 2, {x} would be compared with
{x, z}. For column y in Figure 3, {x} would be compared
with {z}. The mean MASI for Figure 2 is 10/27 (.37) and
for Figure 3 it is 6/27 (.22).
A1’s SCU-101: Weight=2
[Label: Worker’s agree to Estrada’s terms]
Sum1 <61% voted yes>1
Sum2 <Unions agreed to some employee cuts with
separation benefits>2
A1’s SCU-102: Weight=1
[Label: Ground crew accepts 2 weeks after initial
rejection]
Sum3 <which it accepted two weeks later>3
A2’s SCU-201: Weight=2
[Label: The settlement was finally accepted]
Sum1 <61% voted yes>1
Sum2 <which it accepted two weeks later>3
A2’s SCU-202: Weight=1
[Label: Unions agree to some employee cuts]
Sum3 < Unions agreed to some employee cuts with
separation benefits >2
Figure 5. Pairs of SCUs from two annotators
illustrating non-monotonicity.
Figure 4 shows a canonical agreement matrix for the
example from Figure 1; to save space it is presented with
the coding units in rows instead of columns. For the sake
of illustration, the coding units are spans instead of words.
The set of coding categories consists of the equivalence
classes from both annotations. Annotator A2 placed all
the spans shown in a single SCU labeled [Through the
Saudis, the U.S. tried to get cooperation from the
Taliban]. Annotator A1 created a similar SCU labeled
[Americans asked Saudi officials for help], but did not
include spans 5 (“asked Afghan leaders”) and 7 (“who
tried to convince Taliban leaders”). A1 placed 5 and 7 in
a distinct SCU (with other contributing spans, omitted
from discussion), labeled [Saudi officials asked Afghan
leaders to release Bin Laden].
The labels assigned by A1 and A2 in Figure 1 reflect
the difference in content. A2 chose a more
comprehensive label expressing a 3-way relation in which
the Saudis would mediate between the U.S. and the
Taliban. In comparison, A1’s labels describe two binary
relations, one relating the U.S. and the Saudis, and one
relating the Saudis and the Taliban. The labels would
suggest that A2’s annotation subsumes A1’s, and the SCU
representation confirms this.
In contrast to the SCU example illustrated in Figure 1,
we occasionally find groups of SCUs across annotators
that are semantically more distinct, corresponding to cases
like Figure 3. Figure 5 gives an example from a pyramid
whose reliability was reported on in (Nenkova &
Passonneau, 2004).²
Table 1 shows the reliability values for the data from
Figure 1 using Krippendorff’s Alpha with three different
distance metrics. Because Krippendorff’s Alpha measures
disagreements, one minus Jaccard, and one minus MASI,
are used in computing Alpha. The “Nominal” column
shows the results treating all non-identical sets as
categorically distinct (see section 3.1). For illustrative
purposes, the top portion of the table uses spans as the
coding units, i.e., computing Alpha from the agreement
matrix given in Figure 4. Since spans were not given in
advance, but were decided on by coders, this
underestimates the number of decisions that annotators
were required to make. The very low value in the Jaccard
column is due to the disparity in size between the two
annotations for rows five through seven of Figure 4.
The lower portion of Table 1 shows the results using
words as the coding units. The values across the three
columns are similar to those for the full dataset as we will
see in the discussion of Table 2.
                         Alpha
Coding units   Nominal   Jaccard   MASI
spans          0         -0.44     0.14
words          0         0.64      0.81
Table 1. Reliability values for data from Figure 1,
using spans versus words as coding units.
3.4. Related Work
As noted above, Teufel and van Halteren (2004)
perform an annotation addressing a goal similar to the
pyramid method. They create lists of factoids, atomic
units of information. To compare sets of factoids that
were independently created by two annotators, they first
create a list of subsumption relations between factoids
across annotations. Then they construct a table that lists
all (subsumption-relation, summary) pairs, with counts of
how often each subsumption relation occurs in each
summary. Figure 6 reproduces their Figure 2. Every
factoid is given an index, and in Figure 6, P30 represents
a factoid created by one annotator that subsumes two
created by the other annotator. Symbols a through e
represent five summaries. They compute kappa from this
type of agreement table.
² SCU-201 has been simplified for illustrative purposes; in the actual data, it had a third contributor.
                A1   A2                       A1   A2
P30 F9.21 -a     1    1      P30 F9.22 -a      1    0
P30 F9.21 -b     0    0      P30 F9.22 -b      0    0
P30 F9.21 -c     1    0      P30 F9.22 -c      1    1
P30 F9.21 -d     0    0      P30 F9.22 -d      0    0
P30 F9.21 -e     1    0      P30 F9.22 -e      1    1
Figure 6. Agreement table representation used in
Teufel and van Halteren (2004).
While this representation does not suffer from the loss
of information Di Eugenio & Glass (2004) fault Siegel &
Castellan (1988) for, note that it differs from an
agreement matrix or a contingency table in that it is not
the case that each count represents an individual decision
made by an annotator. We can see from the table that A1
is the annotator who created P30 and A2 is the one who
created F9.21 and F9.22. Although there are two cells in
A1's column for the two subsumption relations P30 F9.21 and P30 F9.22, it is unlikely that A1's original annotation involved decisions about F9.21 and F9.22. If the number of decisions is overestimated, p(A_E) will be underestimated, leading to higher kappa values.
Another issue in using such an agreement table from
two independently created factoid lists is that it requires
the creation of a new level of representation that would
itself be subject to reliability issues.
4. Results and Discussion
Canonical agreement matrices of the form shown in
Figure 4, but with words as the coding units, were
computed for the five pairs of independently created
pyramids for the Docsets listed in Table 2. The mean
number of words per pyramid was 725; the mean number
of distinct SCUs was 92. Results are shown for Alpha
with the same three distance metrics used in Table 1.
                   Alpha
Docset   Nominal   Jaccard   MASI
30016    0.19      0.55      0.79
30040    0.24      0.58      0.80
31001    0.01      0.40      0.68
31010    0.03      0.39      0.69
31038    0.09      0.40      0.71
Table 2. Inter-annotator agreement on 5 pyramids
using unweighted Krippendorff’s Alpha (nominal),
and Alpha with Jaccard and MASI as δ.
The low values for the nominal distance metric are
expected, given that there are few cases of word-for-word
identity of SCUs across annotations. With Jaccard as the
distance metric, the values increase manyfold, indicating
that over all the comparisons of pairs of SCUs across
annotators for a given pyramid, the size of the set
intersection is closer to the size of the set union than not.
With MASI, values increase by approximately half of
the difference between the Jaccard value and the
maximum value of one. Since MASI rewards
overlapping sets twice as much if one is a subset of the
other than if they are not, this degree of increase indicates
that most of the differences between SCUs are monotonic.
By including several metrics whose relationship to
each other is known, Table 2 indicates that the pyramid
annotations do not have many cases of exact agreement
(nominal), that the sets being compared have more
members in common than not (Jaccard), and that the
commonality is more often monotonic than not (MASI).
Whether these results are sufficiently reliable depends
on the uses of the data. In a separate investigation
(Passonneau, 2005), the pairs of pyramids for Docsets
30016 and 30014 have been used to produce parallel sets
of scores for summaries from sixteen summarization
systems that participated in DUC 2003. Pearson’s
correlations of two types of scores (original pyramid and
modified) range from 0.84 to 0.91, with p values always
zero. This constitutes evidence that the pyramid
annotations are more than reliable enough.
5. Conclusion
Measuring inter-annotator reliability involves more
than a single number or a single study. Di Eugenio &
Glass (2004) argue that using multiple reliability metrics
with different methods for computing p(A_E) can be more revealing than a single metric. Passonneau et al. (2005)
present a similar argument for the case of comparing
different distance metrics. Here, inter-annotator reliability
results have been presented using three metrics in order to
more fully characterize the dataset.
This paper argues that full interpretation of a
reliability measure is best carried out in a paradigmatic
reliability study: a series of studies that link one or more
measures of the reliability of a dataset to an independent
assessment, such as a significance test. If the same
dataset is used in different tasks, what is reliable for one
task may not be for another.
Investigators faced with complex annotation data have
shown ingenuity in proposing new data representations
(Teufel & van Halteren, 2004), new reliability measures
(Rosenberg & Binkowski, 2004), and techniques new to
computational linguistics, as discussed in (Artstein &
Poesio, 2005). While this paper argues for placing a greater
burden on the interpretation of inter-annotator agreement,
proposals such as these provide an expanding suite of
tools for accomplishing this task.
Acknowledgments
This work was supported by DARPA NBCH105003
and NUU01-00-1-8919. The author thanks many
annotators, especially Ani Nenkova and David Elson.
References
Artstein, R. and M. Poesio. 2005. Kappa³ = Alpha (or Beta). University of Essex NLE Technote 2005-01.
Dice, L. R. (1945). Measures of the amount of ecologic
association between species. Ecology 26:297-302.
Farwell, D.; Helmreich, S.; Dorr, B. J.; Habash, N.;
Reeder, F.; Miller, K.; Levin, L.; Mitamura, T.; Hovy,
E.; Rambow, O.; Siddharthan, A. (2004). Interlingual
annotation of multilingual text corpora. In Proceedings
of the North American Chapter of the Association for
Computational Linguistics Workshop on Frontiers in
Corpus Annotation, Boston, MA, pp. 55-62, 2004.
van Halteren, H. and S. Teufel. 2003. Examining the
consensus between human summaries. In Proceedings
of the Document Understanding Workshop.
Jaccard, P. (1908). Nouvelles recherches sur la
distribution florale. Bulletin de la Societe Vaudoise des
Sciences Naturelles 44:223-270.
Krippendorff, K. 1980. Content Analysis. Newbury Park,
CA: Sage Publications.
Nenkova, A. and R. Passonneau. 2004. Evaluating content
selection in summarization: The pyramid method.
Proceedings of the Joint Annual Meeting of Human
Language Technology (HLT) and the North American
chapter of the Association for Computational
Linguistics (NACL). Boston, MA.
Passonneau, R.; Nenkova, A.; McKeown, K.; Sigelman,
S. 2005. Applying the pyramid method in DUC 2005.
Document Understanding Conference Workshop.
Passonneau, R.; Habash, N.; Rambow, O. 2006. Inter-
annotator agreement on a multilingual semantic
annotation task. Proceedings of the International
Conference on Language Resources and Evaluation
(LREC). Genoa, Italy.
Passonneau, R. 2005. Evaluating an evaluation method:
The pyramid method applied to 2003 Document
Understanding Conference (DUC) Data. Technical
Report CUCS-010-06, Department of Computer
Science, Columbia University.
Passonneau, R. 2004. Computing reliability for
coreference annotation. Proceedings of the
International Conference on Language Resources and
Evaluation (LREC). Portugal.
Passonneau, R. 1997. Applying reliability metrics to co-
reference annotation. Technical Report CUCS-017-97,
Department of Computer Science, Columbia
University.
Passonneau, R. and D. Litman. 1997. Discourse
segmentation by human and automated means.
Computational Linguistics 23.1: 103-139.
Rosenberg, A. and Binkowski, E. (2004). Augmenting the
kappa statistic to determine inter-annotator reliability
for multiply labeled data points. In Proceedings of the
Human Language Technology Conference and Meeting
of the North American Chapter of the Association for
Computational Linguistics (HLT/NAACL).
Siegel, S. and N. John Castellan, Jr. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd edition. McGraw-Hill, New York.
Teufel, S. and H. van Halteren, 2004: Evaluating
information content by factoid analysis: human
annotation and stability. In Proceedings of Empirical
Methods in Natural Language Processing. Barcelona.