Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution.
ABSTRACT We present a sequence of unsupervised, nonparametric Bayesian models for clus tering complex linguistic objects. In this approach, we consider a potentially infi nite number of features and categorical outcomes. We evaluated these models for the task of within and crossdocument event coreference on two corpora. All the models we investigated show significant improvements when c ompared against an existing baseline for this task.
 Citations (28)
 Cited In (0)

Conference Paper: Infinite latent feature models and the Indian buffet process.
[Show abstract] [Hide abstract]
ABSTRACT: We define a probability distribution over equivalence classes of binary ma trices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We derive the distribution by taking the limit of a distribution over N × K binary matrices as K ! 1, a strategy inspired by the derivation of the Chinese restaurant process (Aldous, 1985; Pitman, 2002) as the limit of a Dirichletmultinomial model. This strategy preserves the exchangeability of the rows of matrices. We define several simple generative processes that result in the same distri bution over equivalence classes of binary matrices, one of which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algo rithm for inference in this model and applying this algorithm to an artificial dataset.Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, December 58, 2005, Vancouver, British Columbia, Canada]; 01/2005  SourceAvailable from: arxiv.org[Show abstract] [Hide abstract]
ABSTRACT: The author describes a class of slice sampling methods that can be applied to a wide variety of distributions. Section 2 summarizes generalpurpose Markov chain sampling methods as the Gibbs sampling, the adaptive rejection sampling, the adaptive rejection Metropolis sampling etc. Section 3 presents the basic ideas of a slice sampling and thoroughly discusses different predecessors more or less connected to it. The principal message of the paper is concentrated in chapters 4–7. At first, simple variable slice samplings methods are described. Then the author concentrates on multivariate slice sampling methods and reflective slice sampling. An example forms the final section. I liked the paper and I must say that despite it is a paper for Annals of Statistics, the author really concentrates on the ideas and not on the formal proofs as is typical for this journal. I am sure that everybody who want to get an idea of what is slice sampling will be satisfied. The paper is complemented by an interesting discussion prepared by MingHui Chen, B. W. Schmeiser, O. B. Downs, A. Mira, G. O. Roberts, J. Skilling, D. J. C. MacKey and G. S. Walker.The Annals of Statistics 01/2003; 31(3). · 2.53 Impact Factor  SourceAvailable from: Aria Haghighi
Conference Paper: Unsupervised Coreference Resolution in a Nonparametric Bayesian Model
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics; 01/2007
Page 1
Nonparametric Bayesian Models for Unsupervised
Event Coreference Resolution
Cosmin Adrian Bejan1, Matthew Titsworth2, Andrew Hickl2, & Sanda Harabagiu1
1Human Language Technology Research Institute, University of Texas at Dallas
2Language Computer Corporation, Richardson, Texas
ady@hlt.utdallas.edu
Abstract
We present a sequence of unsupervised, nonparametric Bayesian models for clus
tering complex linguistic objects. In this approach, we consider a potentially infi
nite number of features and categorical outcomes. We evaluated these models for
the task of within and crossdocument event coreference on two corpora. All the
modelswe investigatedshow significantimprovementswhen comparedagainst an
existing baseline for this task.
1
In Natural Language Processing (NLP), the task of event coreference has numerous applications,
including question answering, multidocument summarization, and information extraction. Two
event mentions are coreferential if they share the same participants and spatiotemporalgroundings.
Moreover, two event mentions are identical if they have the same causes and effects. For example,
the threedocumentslisted in Table1 containsfourmentionsofidenticaleventsbutonlythe arrested,
apprehended, and arrest mentions from the documents 1 and 2 are coreferential. These definitions
were used in the tasks of Topic Detection and Tracking (TDT), as reported in [24].
Introduction
Previous approaches to event coreference resolution [3] used the same lexeme or synonymy of the
verb describing the event to decide coreference. Event coreference was also tried by using the
semantictypesofanontology[17]. However,thefeaturesusedbytheseapproachesarehardtoselect
and require the design of domain specific constraints. To address this problems, we have explored
a sequence of unsupervised, nonparametric Bayesian models that are used to probabilistically infer
coreference clusters of event mentions from a collection of unlabeled documents. Our approach
is motivated by the recent success of unsupervised approaches for entity coreference resolution
[16, 22, 25] and by the advantages of using a large amount of data at no cost.
One model was inspired by the fully generative Bayesian model proposed by Haghighi and Klein
[16] (henceforth, H&K). However, to employ the H&K’s model for tasks that require clustering
objects with rich linguistic features (such as event coreferenceresolution), or to extend this model in
order to enclose additional observable properties is a challenging task [22, 25]. In order to counter
this limitation, we make a conditional independence assumption between the observable features
and propose a generalized framework (Section 3) that is able to easily incorporate new features.
During the process of learning the model described in Section 3, it was observedthat a large amount
of time was requiredto incorporateand tune new features. This lead us to the challengeof creating a
framework which considers an unbounded number of features where the most relevant are selected
automatically. To accomplish this new goal, we propose two novel approaches (Section 4). The
first incorporates a Markov Indian Buffet Process (mIBP) [30] into a Hierarchical Dirichlet Process
(HDP) [28]. The second uses an Infinite Hidden Markov Model (iHMM) [5] coupled to an Infinite
Factorial Hidden Markov Model (iFHMM) [30].
In this paper, we focus on event coreference resolution, thoughadaptations for event identity resolu
tion can be easily made. We evaluated the models on the ACE 2005 event corpus [18] and on a new
annotated corpus encoding within and crossdocument event coreference information (Section 5).
1
Page 2
Document 1: San Diego Chargers receiver Vincent Jackson was arrested on suspicion of drunk driving on
Tuesday morning, five days before a key NFL playoff game.
...
Police apprehended Jackson in San Diego at 2:30 a.m. and booked him for the misdemeanour before his
release.
Document 2: Despite his arrest on suspicion of driving under the influence yesterday, Chargers receiver
Vincent Jackson will play in Sunday’s AFC divisional playoff game at Pittsburgh.
Document 3: In another antipiracy operation, Navy warship on Saturday repulsed an attack on a merchant
vessel in the Gulf of Aden and nabbed 23 Somali and Yemeni sea brigands.
Table 1: Examples of coreferential and identical events.
2
Models for solving event coreference and event identity can lead to the generation of adhoc event
hierarchies from text. A sample of a hierarchy capturing corefering and identical events, including
those from the example presented in Section 1, is illustrated in Figure 1.
Event Coreference Resolution
arrest
arrest
Event properties:
Suspect:
Authorities:
Time:
Location:
sea brigands
Navy warship
Saturday
Gulf of Aden
... nabbed ...
Document 3
... arrested ... apprehended ... arrest ...
Document 2
mentions
event
generic
events
arrest
Document 1
Suspect:
Authorities:
Time:
Location:
Vincent Jackson
police
Tuesday
San Diego
Event properties:
events
Figure 1: A portion of the event hierarchy.
First, we introduce some basic notation.1Next, to cluster the mentions that share common event
properties (as shown in Figure 1), we briefly describe the linguistic features of event mentions.
2.1Notation
As input for our models, we consider a collection of I documents, each document i having Jievent
mentions. Each event mention is characterized by L feature types, FT, and each feature type is
represented by a finite number of feature values, fv. Therefore, we can represent the observable
properties of an event mention, em, as a vector of pairs ?(FT1:fv1i),...,(FTL:fvLi)?, where each
feature value index i ranges in the feature value space associated with a feature type.
2.2 Linguistic Features
We consider the following set of features associated to an event mention:2
Lexical Features (LF) To capture the lexical context of an event mention, we extract the following
features: the head word of the mention (HW), the lemma of the HW (HL), lemmas of left and right
words of the mention (LHL,RHL), and lemmas of left and right mentions (LHE,RHE).
Class Features (CF) These features aim to classify mentions into several types of classes: the
mention HW’s partofspeech (POS), the word class of the HW (HWC), which can take one of the
following values ?verb, noun, adjective, other?, and the event class of the mention (EC). To extract
the event class associated to every event mention, we employed the event identifier described in [6].
WordNet Features (WF) We build three types of clusters over all the words from WordNet [9]
and use them as features for the mention HW. First cluster type associates an unique id to each
(word:HWC) pair (WNW). The second cluster type uses the transitive closure of the synonymous
relations to group words from WordNet (WNS). Finally, the third cluster type considers as grouping
criteria the category from WordNet lexicographer’s files that is associated to each word (WNL). For
cases when a new word does not belong to any of these WordNet clusters, we create a new cluster
with a new id for each of the three cluster types.
Semantic Features (SF) To extract features that characterize participants and properties of event
mentions, we use s semantic parser [8] trained on PropBank(PB) [23] and FrameNet(FN) [4] cor
pora. (For instance, for the apprehended mention from our example, Jackson is the feature value
1For consistency, we try to preserve the notation of the original models.
2In this subsection and the following section, the feature term is used in context of a feature type.
2
Page 3
for A0 PB argument3and the SUSPECT frame element (FEA0) of the ARREST frame.) Another se
mantic feature is the semantic frame (FR) that is evoked by an event mention. (For instance, all the
emphasized mentions from our example evoke the ARREST frame from FN.)
Feature Combinations (FC) We also explore various combinations of features presented above.
Examples include HW+POS, HL+FR, FE+A1, etc.
3 Finite Feature Models
Inthis section, we presenta sequenceof HDP mixturemodelsforsolvingeventcoreference. Forthis
type of approach, a Dirichlet Process (DP) [10] is associated with each document, and each mixture
component, which in our case corresponds to an event, is shared across documents. To describe
these models, we consider Z the set of indicator random variables for indices of events, φzthe set
of parameters associated to an event z, φ a notation for all model parameters, and X a notation for
all random variables that represent observable features.
Given a document collection annotated with event mentions, the goal is to find the best assignment
of event indices, Z∗, which maximize the posterior probability P(ZX). In a Bayesian approach,
this probability is computed by integrating out all model parameters:
P(ZX) =
?
P(Z,φX)dφ =
?
P(ZX,φ)P(φX)dφ
In orderto describe ourmodifications,we first revisit a basic modelfromthe set of models described
in H&K’s paper.
3.1The One Feature Model
The one feature model, HDP1f, constitutes the simplest representation of an HDP model. In this
model, which is depicted graphically in Figure 2(a), the observable components are characterized
by only one feature. The distribution over events associated to each document β is generated by a
Dirichlet process with a concentration parameter α > 0. Since this setting enables a clustering of
event mentions at the document level, it is desirable that events are shared across documents and
the number of events K is inferred from data. To ensure this flexibility, a global nonparametric
DP prior with a hyperparameter γ and a global base measure H can be considered for β [28]. The
global distribution drawnfrom this DP prior, denotedas β0in Figure 2(a), encodes the event mixing
weights. Thus, same global events are used for each document, but each event has a document
specific distribution βithat is drawn from a DP prior centered on β0.
To infer the true posterior probability of P(ZX), we follow [28] in using a Gibbs sampling algo
rithm [12] based on the direct assignment sampling scheme. In this sampling scheme, the β and φ
parameters are integrated out analytically. The formula for sampling an event index for mention j
from document i, Zi,j, is given by:4
P(Zi,j Z−i,j,HL) ∝ P(Zi,j Z−i,j)P(HLi,j Z,HL−i,j)
where HLi,jis the head lemma of the event mention j from the document i.
First, in the generative process of an event mention, an event index z is sampled by using a mecha
nism that facilitates sampling from a prior for infinite mixture models called the Chinese Restaurant
Franchise (CRF) representation [28]:
?
Here, nzis the number of event mentions with the event index z, znewis a new event index not used
already in Z−i,j, βz
weight for the unknown mixture component.
P(Zi,j= z  Z−i,j,β0) ∝
αβu
nz+ αβz
0,
if z = znew
otherwise
0,
0are the global mixing proportions associated to the K events, and βu
0is the
Then,to generatethementionheadlemma(inthis model,X = ?HL?), theeventz is associatedwith
a multinomial emission distribution over the HL feature values having the parameters φ = ?φhl
We assume that this emission distribution is drawn from a symmetric Dirichlet distribution with
concentration λHL:
Z?.
3A0 annotates in PB a specific type of semantic role which represents the AGENT, the DOER, or the ACTOR
of a specific event. Another PB argument is A1, which plays the role of the PATIENT, the THEME, or the
EXPERIENCER of an event.
4Z−i,jrepresents a notation for Z − {Zi,j}.
3
Page 4
H
Zi
∞
β
α
γ
φ
∞
Xi
β0∞
JiI
L
H
φ
∞
HLi
FRi
POSi
α
γ
∞
β
β0∞
I
θ
Ji
Zi
H
Zi
HLi
FRi
φ
∞
γ
α
∞
β
β0∞
I
Ji
H
Zi
HLi
φ
∞
α
γ
∞
β
Ji
β0∞
I
(b)(c)(d)(a)
Figure 2: Graphical representation of four HDP models. Each node corresponds to a random variable. In
particular, shaded nodes denotes observable variables. Each rectangle captures the replication of the structure
it contains. The number of replications is indicated in the bottomright corner of the rectangle. The model
depicted in (a) is an HDP model using one feature; the model in (b) employs HL and FR features; (c) illustrates
a flat representation of a limited number of features in a generalized framework (henceforth, HDPflat); and (d)
captures a simple example of structured network topology of three feature variables (henceforth, HDPstruct).
The dependencies involving parameters φ and θ in models (b), (c), and (d) are omitted for clarity.
P(HLi,j= hl  Z,HL−i,j) ∝ nhl,z+ λHL
where HLi,jis the head lemma of mention j from document i, and nhl,zis the number of times
the feature value hl has been associated with the event index z in (Z,HL−i,j). We also apply the
Lidstone’s smoothing method to this distribution.
3.2Adding More Features
A model in which observable components are represented only by one feature has the tendency to
cluster these components based on their feature value. To address this limitation, H&K proposed
a more complex model that is strictly customized for entity coreference resolution. On the other
hand, event coreference involves clustering complex objects characterized by richer features than
entity coreference (or topic detection), and therefore it is desirable to extend the HDP1fmodel with
a generalized model where additional features can be easily incorporated.
To facilitate this extension, we assume that feature variables are conditionally independent given Z.
This assumption considerably reduces the complexity of computing P(ZX). For example, if we
want to incorporate another feature (e.g., FR) in the previous model, the formula becomes:
P(Zi,jHL,FR) ∝ P(Zi,j)P(HLi,j,FRi,jZ) = P(Zi,j)P(HLi,jZ)P(FRi,jZ)
In this formula, we omit the conditioning components of Z, HL, and FR for clarity. The graphical
representation corresponding to this model is illustrated in Figure 2(b). In general, if X consists of
L feature variables, the inference formula for the Gibbs sampler is defined as:
P(Zi,jX) ∝ P(Zi,j)
?
FT∈X
P(FTi,jZ)
The graphical model for this general setting is depicted in Figure 2(c). Drawing an analogy, the
graphical representation involving feature variables and Z variables resembles the graphical repre
sentation of a Naive Bayes classifier.
When dependencies between feature variables exist (e.g., in our case, frame elements are dependent
of the semantic frames that define them, and frames are dependent of the words that evoke them),
various global distributions are involved for computing P(Z  X). For instance, for the model
depicted in Figure 2(d) the posterior probability is given by:
P(Zi,j)P(FRi,jHLi,j,θ)
?
FT∈X
P(FTi,jZ)
In this model, P(FRi,j  HLi,j,θ) is a global distribution parameterized by θ, and the feature
variables considered are X=?HL,POS,FR?.
4
Page 5
For all these extended models, we compute the prior and likelihood factors as described in the one
feature model. Also, following H&K, in the inference mechanism we assign soft counts for missing
features (e.g., unspecified PB argument).
4Unbounded Feature Models
First, we presenta generativemodelcalled the MarkovIndianBuffet Process (mIBP) that providesa
mechanisminwhicheachobjectcanberepresentedbya sparsesubsetofapotentiallyunboundedset
of latent features [15, 14, 30].5Then, to overcome the limitations regarding the number of mixture
components and the number of features associated with objects, we combine this mechanism with
an HDP model to form an mIBPHDP hybrid. Finally, to account for temporal dependencies, we
employ an mIBP extension, called the Infinite Factorial Hidden Markov Model (iFHMM) [30], in
combination with an Infinite Hidden Markov Model (iHMM) to form the iFHMMiHMM model.
4.1 The Markov Indian Buffet Process
Asdescribedin[30],themIBPdefinesadistributionoveranunboundedsetofbinaryMarkovchains,
where each chain can be associated to a binary latent feature that evolves over time according to
Markov dynamics. Specifically, if we denote by M the total number of feature chains and by T
the number of observable components (event mentions), the mIBP defines a probability distribution
over a binary matrix F with T rows, which correspond to observations, and an unbounded number
of columns (M → ∞), which correspond to features. An observation ytcontains a subset from
the unbounded set of features {f1,f2,...,fM} that is represented in the matrix by a binary vector
Ft=?F1
t,F2
t,...,FM
t?, where Fi
t=1 indicates that fiis associated to yt.
Therefore, F decomposes the observations and represents them as feature factors, which can then
be associated to hidden variables in an iFHMM as depicted in Figure 3(a). The transition matrix of
a binary Markov chain associated to a feature fmis defined as
W(m)=
?1 − am
1 − bm
am
bm
?
where W(m)
and the initial state Fm
object yt, Fm
t∼Bernoulli(a
To compute the probability of the feature matrix F6, in which the parameters a and b are integrated
out analytically, we use the counting variables c00
1→0, and 1→1 transitions fmhas made in the binary chain m. The stochastic process that derives
the probability distribution in terms of these variables is defined as follows. The first component
samples a number of Poisson(α′) features. In general, depending on the value that was sampled in
the previous step (t − 1), a feature fmis sampled for the tthcomponent according to the following
probabilities:
ij
=P(Fm
t+1=j Fm
0 = 0. In the generative process, the hidden variable of feature fmfor an
1−Fm
t−1
m
b
m
).
t =i), the parameters am∼Beta(α′/M,1) and bm∼Beta(γ′,δ′),
Fm
t−1
m, c01
m, c10
m, and c11
mto record the 0 → 0, 0 → 1,
P(Fm
t= 1Fm
t−1=1) =
c11
m+ δ′
γ′+ δ′+ c10
c00
m
c00
m+ c11
m
P(Fm
t= 1Fm
t−1=0) =
m+ c01
m
The tthcomponent then repeats the same mechanism for sampling the next features until it finishes
the current number of sampled features M. After all features are sampled for the tthcomponent,
a number of Poisson(α′/t) new features are assigned for this component and M gets incremented
accordingly.
4.2 The mIBPHDP Model
One direct application of the mIBP is to integrate it into the HDP models proposed in Section 3. In
this way, the new nonparametric extension will have the benefits of capturing uncertainty regarding
thenumberofmixturecomponentsthatarecharacterizedbyapotentiallyinfinitenumberoffeatures.
Since one observablecomponentis associated with an unboundedcountableset of features, we have
to provide a mechanism in which only a finite set of features will represent the component in the
HDP inference process.
5In this section, a feature is represented by a (feature type:feature value) pair.
6Technical details for computing this probability are described in [30].
5
Page 6
Y1
F2
1
F1
1
FM
1
FM
2
Y2
F2
2
F1
2
FM
T
YT
F2
T
F1
T
(a)
F2
0
F1
0
FM
0
(b)
F2
0
F2
1
F2
2
F2
T
F1
0
Y1
F1
1
Y2
F1
2
YT
F1
T
S0
FM
0
FM
1
FM
2
FM
T
S1
S2
ST
Figure 3: (a) The Infinite Factorial Hidden Markov Model. (b) The iFHMMiHMM model. (M→∞)
The idea behind this mechanism is to use slice sampling7[21] in order to derive a finite set of
features for yt. Letting qmbe the number of times feature fmwas sampled in the mIBP, and vtan
auxiliary variable for ytsuch that vt∼Uniform(1,max{qm Fm
set Btfor the observation ytas:
t=1}), we define the finite feature
Bt= {fm Fm
t
= 1 ∧ qm≥ vt}
The finiteness of this feature set is based on the observation that, in the generative process of the
mIBP, only a finite set of features are sampled for a component. Another observation worth men
tioning regarding the way this set is constructed is that only the most representative features of yt
get selected in Bt.
4.3The iFHMMiHMM Model
The iFHMM is a nonparametric Bayesian factor model that extends the Factorial Hidden Markov
Model (FHMM) [13] by letting the number of parallel Markov chains M be learned from data.
Although the iFHMM allows a more flexible representation of the latent structure, it can not be
used as a framework where the number of clustering components K is infinite. On the other hand,
the iHMM represents a nonparametric extension of the Hidden Markov Model (HMM) [27] that
allows performing inference on an infinite number of states K. In order to further increase the
representationalpowerformodelingdiscretetime series data, we proposea nonparametricextension
that combines the best of the two models, and lets the parameters M and K be learned from data.
Each step in the new generative process, whose graphical representation is depicted in Figure 3(b),
is performedin two phases: (i) the latent feature variables from the iFHMM framework are sampled
using the mIBP mechanism; and (ii) the features sampled so far, which become observable during
this second phase, are used in an adapted beam sampling algorithm [29] to infer the clustering
components (or, in our case, latent events).
To describe the beam sampler for event coreference resolution, we introduce additional notation.
We denote by (s1,...,sT) the sequence of hidden states corresponding to the sequence of event
mentions (y1,...,yT), where each state stbelong to one of the K events, st∈ {1,...,K}, and
each mention ytis represented by a sequence of latent features ?F1
the transition probability π is defined as πij= P(st= j  st−1= i) and a mention ytis generated
according to a likelihood model F that is parameterized by a statedependent parameter φst(yt
st∼F(φst)). The observation parameters φ are iid drawn from a prior base distribution H.
The beam sampling algorithm combines the ideas of slice sampling and dynamic programming for
an efficient sampling of state trajectories. Since in time series models the transition probabilities
have independent priors [5], Van Gael and colleagues [29] also used the HDP mechanism to al
low couplings across transitions. For sampling the whole hidden state trajectory s, this algorithm
employs a forward filteringbackward sampling technique.
t,F2
t,...,FM
t ?. One element of
In the forward step of our implementation, we sample the feature variables using the mIBP as de
scribed in Section 4.1, and the auxiliary variable ut∼ Uniform(0,πst−1st) for each mention yt.
As explained in [29], the auxiliary variables u are used to filter only those trajectories s for which
7The idea of using this procedure is inspired from [29] where a slice variable was used to sample a finite
number of state trajectories in the iHMM.
6
Page 7
πst−1st≥ utfor all t. Also, in this step, we compute the probabilities P(st y1:t,u1:t) for all t as
described in [29]:
P(st y1:t,u1:t) ∝ P(yt st)
?
st−1:ut<πst−1st
P(st−1 y1:t−1,u1:t−1)
Here, the dependencies involving parameters π and φ are omitted for clarity.
In the backward step, we first sample the event for the last state sTdirectly from P(sTy1:T,u1:T)
and then, for all t : T − 1,1, we sample each state st given st+1by using the formula P(st
st+1,y1:T,u1:T)∝P(sty1:t,u1:t)P(st+1st,ut+1).
To sample the emission distribution φ efficiently, and to ensure that each mention is characterized
by a finite set of representative features, we set the base distribution H to be conjugate with the
data distribution F in a Dirichletmultinomial model with the sufficient statistics of the multinomial
distribution (o1,...,oK) defined as:
ok=
T
?
t=1
?
fm∈Bt
nmk
where nmkcounts how many times feature fmwas sampled for event k, and Btstores a finite set
of features for ytas it is defined in Section 4.2.
5 Evaluation
Event Coreference Data One corpus used for evaluation is ACE 2005 [18]. This corpus annotates
withindocument coreference information of specific types of events (such as Conflict, Justice, and
Life). After an initial processing phase, we extracted from ACE 6553 event mentions and 4946
events. To increase the diversity of events and to evaluate the models for both within and cross
document event coreference, we created the EventCorefBank corpus (ECB).8This new corpus con
tains 43 topics, 1744eventmentions,1302withindocumentevents,and339crossdocumentevents.
For a more realistic approach, we trained the models on all the event mentions from the two corpora
and not only on the mentions manually annotated for event coreference(the true event mentions). In
this regard, we ran the event identifier described in [6] on the ACE and ECB corpora, and extracted
45289 and 21175 system mentions respectively.
The Experimental Setup Table 2 lists the recall (R), precision (P), and Fscore (F) of our exper
iments averaged over 5 runs of the generative models. Since there is no agreement on the best
coreference resolution metric, we employed four metrics for our evaluation: the linkbased MUC
metric [31], the mentionbased B3metric [2], the entitybased CEAF metric [19], and the pairwise
F1 (PW) metric. In the evaluation process, we considered only the true mentions of the ACE test
dataset and of the test sets of a 5fold cross validation scheme on the ECB dataset. For evaluating
the crossdocument coreference annotations, we adopted the same approach as described in [3] by
merging all the documents from the same topic into a metadocument and then scoring this docu
ment as performed for withindocument evaluation. Also, for both corpora, we considered a set of
132 feature types, where each feature type consists on average of 3900 distinct feature values.
The Baseline A simple baseline for event coreference consists in grouping events by their event
classes [1]. To extract event classes, we employed the event identifier described in [6]. Therefore,
this baseline will categorize events into a small number of clusters, since the event identifier is
trained to predict the five event classes annotated in TimeBank [26]. As it was already observed
[20, 11], consideringveryfew categoriesfor coreferenceresolutiontasks will result in overestimates
of the MUC scorer. For instance, a baseline that groups all entity mentions into the same entity
achievesthehighestMUC scorethananypublishedsystemforthetaskofentitycoreference. Similar
behaviour of the MUC metric is observed for event coreference resolution. For example, for cross
document evaluation on ECB, a baseline that clusters all mentions into one event achieves 73.2%
MUC Fscore, while the baseline listed in Table 2 achieves 72.9% MUC Fscore.
HDP Extensions Due to memory limitations, we evaluated the HDPflatand HDPstructmodels
only on a restricted subset of manually selected feature types. In general, as shown in Table 2,
the HDPflat model achieved the best performance results on the ACE test dataset, whereas the
8This resource is available at http://www.hlt.utdallas.edu/∼ady. The annotation process is described in [7].
7
Page 8
ModelMUC
P
B3
P
CEAF
P
PW
PRFRFRFRF
ACE (withindocument event coreference)
49.097.925.0
50.9 86.070.6
53.9 83.4 84.2
54.7
86.276.9
45.1 81.776.4
48.7 81.9 82.2
ECB (withindocument event coreference)
55.697.7 55.8
50.4 84.389.0
53.4 82.199.2
60.1
84.397.1
48.982.195.3
53.9 82.598.1
ECB (crossdocument event coreference)
72.9
93.849.6
56.8 67.086.2
60.565.098.7
65.7 69.3 95.8
53.263.194.1
62.767.096.4
Baseline
HDP1f (HL)
HDPflat
HDPstruct
mIBPHDP
iFHMMiHMM
94.3
62.2
53.5
61.9
48.7
48.7
33.1
43.1
54.2
49.0
41.9
48.8
39.9
77.5
83.8
81.3
79.0
82.1
14.7
62.3
76.9
69.0
68.8
74.6
64.4
76.4
76.5
77.5
73.8
74.5
24.0
68.6
76.7
73.0
71.2
74.5
93.5
50.5
43.3
53.2
37.4
37.2
8.2
27.7
47.1
38.1
28.9
39.0
15.2
35.8
45.1
44.4
32.6
38.1
Baseline
HDP1f (HL)
HDPflat
HDPstruct
mIBPHDP
iFHMMiHMM
92.2
46.9
37.8
47.4
38.2
39.5
39.8
54.8
92.9
82.7
68.8
85.2
71.0
86.5
89.8
90.2
88.2
89.6
44.5
83.4
93.9
92.7
90.3
93.1
80.1
79.6
78.2
81.1
78.5
78.8
57.2
81.4
85.3
86.5
84.0
85.3
93.7
36.6
27.0
34.4
26.5
29.4
25.4
53.4
92.4
83.0
67.9
86.6
39.8
42.6
41.3
48.6
37.7
43.7
Baseline
HDP1f (HL)
HDPflat
HDPstruct
mIBPHDP
iFHMMiHMM
Table 2: Evaluation results for within and crossdocument event coreference resolution.
90.5
47.7
44.4
51.9
40.0
48.4
61.1
70.5
95.3
89.5
79.8
89.0
64.9
75.3
78.3
80.4
75.5
79.0
36.6
76.2
86.9
86.2
82.7
85.5
72.7
57.1
56.0
60.1
54.6
58.0
48.7
65.2
68.0
70.8
65.7
69.1
90.7
34.9
29.2
37.5
26.1
33.3
28.6
58.9
95.1
85.6
77.0
88.3
43.3
43.5
44.4
52.1
38.9
48.2
HDPstructmodel, which also considers dependencies between feature types, proved to be more
effective on the ECB dataset for both within and crossdocumentevent coreferenceevaluation. The
set of feature types used to achieve these results consists of combinations of types from all feature
categories described in Section 2.2. For the results of the HDPstructmodel listed in Table 2, we also
explored the conditional dependencies between the HL, FR, and FEA types.
As can be observed fromTable 2, the results of the HDPflatand HDPstructmodels show an Fscore
increase by 410% over the HDP1fmodel, and therefore prove that the HDP extensions provide a
more flexible representation for clustering objects characterized by rich properties.
mIBPHDP In spite of its advantage of working with a potentially infinite number of features in an
HDP framework, the mIBPHDP model did not achieve a satisfactory performance in comparison
with the other proposed models. However, the results were obtained by automatically selecting
only 2% of distinct feature values from the entire set of values extracted from both corpora. When
compared with the restricted set of features considered by the HDPflatand HDPstructmodels, the
percentage of values selected by mIBPHDP is only 6%. A future research area for improving this
model is to consider other distributions for automatic selection of salient feature values.
iFHMMiHMMInspiteoftheautomaticfeatureselectionemployedfortheiFHMMiHMMmodel,
its results remain competitive against the results of the HDP extensions (where the feature types
were hand tuned). As shown in Table 2, most of the iFHMMiHMM results fall in between the
HDPflatand HDPstructmodels. Also, these results indicate that the iFHMMiHMM model is a
better framework than HDP in capturing the event mention dependencies simulated by the mIBP
feature sampling scheme. Similar to the mIBPHDP model, to achieve these results, the iFHMM
iHMM model uses only 2% values from the entire set of distinct feature values. For the experiments
of the iFHMMiHMM results reported in Table 2, we set α′=50, γ′=0.5, and δ′=0.5.
6Conclusion
In this paper, we have described how a sequence of unsupervised, nonparametric Bayesian models
can be employed to cluster complexlinguistic objects that are characterized by a rich set of features.
The experimental results proved that these models are able to solve real data applications in which
the feature and cluster numbers are treated as free parameters, and the selection of features is per
formed automatically. While the results of event coreference resolution are promising, we believe
that the classes of models proposed in this paper have a real utility for a wide range of applications.
8
Page 9
References
[1] David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and
Reasoning about Time and Events, pages 1–8.
[2] Amit Bagga and Breck Baldwin. 1998. Algorithms for Scoring Coreference Chains. In Proc. of LREC.
[3] Amit Bagga and Breck Baldwin. 1999. CrossDocument Event Coreference: Annotations, Experiments,
and Observations. In Proceedings of the ACL99 Workshop on Coreference and its Applications.
[4] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In
Proceedings of COLINGACL.
[5] Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. 2002. The Infinite Hidden Markov
Model. In Proceedings of NIPS.
[6] Cosmin Adrian Bejan. 2007. Deriving Chronological Information from Texts through a Graphbased
Algorithm. In Proceedings of FLAIRS2007.
[7] CosminAdrian Bejan and Sanda Harabagiu. 2008. A Linguistic Resource for Discovering Event Structures
and Resolving Event Coreference. In Proceedings of LREC2008.
[8] CosminAdrian Bejan and ChrisHathaway. 2007. UTDSRL:A Pipeline Architecturefor ExtractingFrame
Semantic Structures. In Proceedings of SemEval2007.
[9] Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
[10] Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of
Statistics, 1(2):209–230.
[11] Jenny Rose Finkel and Christopher D. Manning. 2008. Enforcing Transitivity in Coreference Resolution.
In Proceedings of ACL/HLT2008, pages 45–48.
[12] Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. . IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.
[13] Z. Ghahramani and M. Jordan. 1997. Factorial Hidden Markov Models. Machine Learning, 29:245–273.
[14] Zoubin Ghahramani, T. L. Griffiths, and Peter Sollich, 2007. Bayesian Statistics 8, chapter Bayesian
nonparametric latent feature models, pages 201–225. Oxford University Press.
[15] Tom Griffiths and Zoubin Ghahramani. 2006. Infinite Latent Feature Models and the Indian Buffet
Process. In Proceedings of NIPS, pages 475–482.
[16] Aria Haghighi and Dan Klein. 2007. Unsupervised Coreference Resolution in a Nonparametric Bayesian
Model. In Proceedings of the ACL.
[17] Kevin Humphreys, Robert Gaizauskas, Saliha Azzam. 1997. Event Coreference for Information Extrac
tion. In Proceedings of the Workshop on Operational Factors in Practical, Robust Anaphora Resolution
for Unrestricted Texts, 35th Meeting of ACL, pages 75–81.
[18] LDCACE05. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events.
[19] X. Luo. 2005. On Coreference Resolution Performance Metrics. In Proceedings of EMNLP.
[20] X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S.Roukos 2004. A MentionSynchronous Coreference
Resolution Algorithm Based On the Bell Tree. In Proceedings of ACL2004.
[21] Radford M. Neal. 2003. Slice Sampling. The Annals of Statistics, 31:705–741.
[22] Vincent Ng. 2008. Unsupervised Models for Coreference Resolution. In Proceedings of EMNLP.
[23] Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus
of Semantic Roles. Computational Linguistics, 31(1):71–105.
[24] Ron Papka. 1999. Online New Event Detection, Clustering and Tracking. Ph.D. thesis, Department of
Computer Science, University of Massachusetts.
[25] Hoifung Poon and Pedro Domingos. 2008. Joint Unsupervised Coreference Resolution with Markov
Logic. In Proceedings of EMNLP.
[26] J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L.
Ferro, and M. Lazo. 2003. The TimeBank Corpus. In Corpus Linguistics, pages 647–656.
[27] Lawrence R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. In Proceedings of the IEEE, pages 257–286.
[28] Yee Whye Teh, Michael Jordan, Matthew Beal, and David Blei. 2006. Hierarchical Dirichlet Processes.
Journal of the American Statistical Association, 101(476):1566–1581.
[29] Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. 2008. Beam Sampling for the
Infinite Hidden Markov Model. In Proceedings of ICML, pages 1088–1095.
[30] Jurgen Van Gael, Yee Whye Teh, and Zoubin Ghahramani. 2008. The Infinite Factorial Hidden Markov
Model. In Proceedings of NIPS.
[31] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A Model
Theoretic Coreference Scoring Scheme. In Proceedings of MUC6, pages 45–52.
9