# Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution.

**ABSTRACT** We present a sequence of unsupervised, nonparametric Bayesian models for clus- tering complex linguistic objects. In this approach, we consider a potentially infi- nite number of features and categorical outcomes. We evaluated these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when c ompared against an existing baseline for this task.

**0**Bookmarks

**·**

**77**Views

- [Show abstract] [Hide abstract]

**ABSTRACT:**Coreference resolution tries to identify all expressions (called mentions) in observed text that refer to the same entity. Beside entity extraction and relation extraction, it represents one of the three complementary tasks in Information Extraction. In this paper we describe a novel coreference resolution system SkipCor that reformulates the problem as a sequence labeling task. None of the existing supervised, unsupervised, pairwise or sequence-based models are similar to our approach, which only uses linear-chain conditional random fields and supports high scalability with fast model training and inference, and a straightforward parallelization. We evaluate the proposed system against the ACE 2004, CoNLL 2012 and SemEval 2010 benchmark datasets. SkipCor clearly outperforms two baseline systems that detect coreferentiality using the same features as SkipCor. The obtained results are at least comparable to the current state-of-the-art in coreference resolution.PLoS ONE 06/2014; 9(6):e100101. · 3.53 Impact Factor

Page 1

Nonparametric Bayesian Models for Unsupervised

Event Coreference Resolution

Cosmin Adrian Bejan1, Matthew Titsworth2, Andrew Hickl2, & Sanda Harabagiu1

1Human Language Technology Research Institute, University of Texas at Dallas

2Language Computer Corporation, Richardson, Texas

ady@hlt.utdallas.edu

Abstract

We present a sequence of unsupervised, nonparametric Bayesian models for clus-

tering complex linguistic objects. In this approach, we consider a potentially infi-

nite number of features and categorical outcomes. We evaluated these models for

the task of within- and cross-document event coreference on two corpora. All the

modelswe investigatedshow significantimprovementswhen comparedagainst an

existing baseline for this task.

1

In Natural Language Processing (NLP), the task of event coreference has numerous applications,

including question answering, multi-document summarization, and information extraction. Two

event mentions are coreferential if they share the same participants and spatio-temporalgroundings.

Moreover, two event mentions are identical if they have the same causes and effects. For example,

the threedocumentslisted in Table1 containsfourmentionsofidenticaleventsbutonlythe arrested,

apprehended, and arrest mentions from the documents 1 and 2 are coreferential. These definitions

were used in the tasks of Topic Detection and Tracking (TDT), as reported in [24].

Introduction

Previous approaches to event coreference resolution [3] used the same lexeme or synonymy of the

verb describing the event to decide coreference. Event coreference was also tried by using the

semantictypesofanontology[17]. However,thefeaturesusedbytheseapproachesarehardtoselect

and require the design of domain specific constraints. To address this problems, we have explored

a sequence of unsupervised, nonparametric Bayesian models that are used to probabilistically infer

coreference clusters of event mentions from a collection of unlabeled documents. Our approach

is motivated by the recent success of unsupervised approaches for entity coreference resolution

[16, 22, 25] and by the advantages of using a large amount of data at no cost.

One model was inspired by the fully generative Bayesian model proposed by Haghighi and Klein

[16] (henceforth, H&K). However, to employ the H&K’s model for tasks that require clustering

objects with rich linguistic features (such as event coreferenceresolution), or to extend this model in

order to enclose additional observable properties is a challenging task [22, 25]. In order to counter

this limitation, we make a conditional independence assumption between the observable features

and propose a generalized framework (Section 3) that is able to easily incorporate new features.

During the process of learning the model described in Section 3, it was observedthat a large amount

of time was requiredto incorporateand tune new features. This lead us to the challengeof creating a

framework which considers an unbounded number of features where the most relevant are selected

automatically. To accomplish this new goal, we propose two novel approaches (Section 4). The

first incorporates a Markov Indian Buffet Process (mIBP) [30] into a Hierarchical Dirichlet Process

(HDP) [28]. The second uses an Infinite Hidden Markov Model (iHMM) [5] coupled to an Infinite

Factorial Hidden Markov Model (iFHMM) [30].

In this paper, we focus on event coreference resolution, thoughadaptations for event identity resolu-

tion can be easily made. We evaluated the models on the ACE 2005 event corpus [18] and on a new

annotated corpus encoding within- and cross-document event coreference information (Section 5).

1

Page 2

Document 1: San Diego Chargers receiver Vincent Jackson was arrested on suspicion of drunk driving on

Tuesday morning, five days before a key NFL playoff game.

...

Police apprehended Jackson in San Diego at 2:30 a.m. and booked him for the misdemeanour before his

release.

Document 2: Despite his arrest on suspicion of driving under the influence yesterday, Chargers receiver

Vincent Jackson will play in Sunday’s AFC divisional playoff game at Pittsburgh.

Document 3: In another anti-piracy operation, Navy warship on Saturday repulsed an attack on a merchant

vessel in the Gulf of Aden and nabbed 23 Somali and Yemeni sea brigands.

Table 1: Examples of coreferential and identical events.

2

Models for solving event coreference and event identity can lead to the generation of ad-hoc event

hierarchies from text. A sample of a hierarchy capturing corefering and identical events, including

those from the example presented in Section 1, is illustrated in Figure 1.

Event Coreference Resolution

arrest

arrest

Event properties:

Suspect:

Authorities:

Time:

Location:

sea brigands

Navy warship

Saturday

Gulf of Aden

... nabbed ...

Document 3

... arrested ... apprehended ... arrest ...

Document 2

mentions

event

generic

events

arrest

Document 1

Suspect:

Authorities:

Time:

Location:

Vincent Jackson

police

Tuesday

San Diego

Event properties:

events

Figure 1: A portion of the event hierarchy.

First, we introduce some basic notation.1Next, to cluster the mentions that share common event

properties (as shown in Figure 1), we briefly describe the linguistic features of event mentions.

2.1Notation

As input for our models, we consider a collection of I documents, each document i having Jievent

mentions. Each event mention is characterized by L feature types, FT, and each feature type is

represented by a finite number of feature values, fv. Therefore, we can represent the observable

properties of an event mention, em, as a vector of pairs ?(FT1:fv1i),...,(FTL:fvLi)?, where each

feature value index i ranges in the feature value space associated with a feature type.

2.2 Linguistic Features

We consider the following set of features associated to an event mention:2

Lexical Features (LF) To capture the lexical context of an event mention, we extract the following

features: the head word of the mention (HW), the lemma of the HW (HL), lemmas of left and right

words of the mention (LHL,RHL), and lemmas of left and right mentions (LHE,RHE).

Class Features (CF) These features aim to classify mentions into several types of classes: the

mention HW’s part-of-speech (POS), the word class of the HW (HWC), which can take one of the

following values ?verb, noun, adjective, other?, and the event class of the mention (EC). To extract

the event class associated to every event mention, we employed the event identifier described in [6].

WordNet Features (WF) We build three types of clusters over all the words from WordNet [9]

and use them as features for the mention HW. First cluster type associates an unique id to each

(word:HWC) pair (WNW). The second cluster type uses the transitive closure of the synonymous

relations to group words from WordNet (WNS). Finally, the third cluster type considers as grouping

criteria the category from WordNet lexicographer’s files that is associated to each word (WNL). For

cases when a new word does not belong to any of these WordNet clusters, we create a new cluster

with a new id for each of the three cluster types.

Semantic Features (SF) To extract features that characterize participants and properties of event

mentions, we use s semantic parser [8] trained on PropBank(PB) [23] and FrameNet(FN) [4] cor-

pora. (For instance, for the apprehended mention from our example, Jackson is the feature value

1For consistency, we try to preserve the notation of the original models.

2In this subsection and the following section, the feature term is used in context of a feature type.

2

Page 3

for A0 PB argument3and the SUSPECT frame element (FEA0) of the ARREST frame.) Another se-

mantic feature is the semantic frame (FR) that is evoked by an event mention. (For instance, all the

emphasized mentions from our example evoke the ARREST frame from FN.)

Feature Combinations (FC) We also explore various combinations of features presented above.

Examples include HW+POS, HL+FR, FE+A1, etc.

3 Finite Feature Models

Inthis section, we presenta sequenceof HDP mixturemodelsforsolvingeventcoreference. Forthis

type of approach, a Dirichlet Process (DP) [10] is associated with each document, and each mixture

component, which in our case corresponds to an event, is shared across documents. To describe

these models, we consider Z the set of indicator random variables for indices of events, φzthe set

of parameters associated to an event z, φ a notation for all model parameters, and X a notation for

all random variables that represent observable features.

Given a document collection annotated with event mentions, the goal is to find the best assignment

of event indices, Z∗, which maximize the posterior probability P(Z|X). In a Bayesian approach,

this probability is computed by integrating out all model parameters:

P(Z|X) =

?

P(Z,φ|X)dφ =

?

P(Z|X,φ)P(φ|X)dφ

In orderto describe ourmodifications,we first revisit a basic modelfromthe set of models described

in H&K’s paper.

3.1The One Feature Model

The one feature model, HDP1f, constitutes the simplest representation of an HDP model. In this

model, which is depicted graphically in Figure 2(a), the observable components are characterized

by only one feature. The distribution over events associated to each document β is generated by a

Dirichlet process with a concentration parameter α > 0. Since this setting enables a clustering of

event mentions at the document level, it is desirable that events are shared across documents and

the number of events K is inferred from data. To ensure this flexibility, a global nonparametric

DP prior with a hyperparameter γ and a global base measure H can be considered for β [28]. The

global distribution drawnfrom this DP prior, denotedas β0in Figure 2(a), encodes the event mixing

weights. Thus, same global events are used for each document, but each event has a document

specific distribution βithat is drawn from a DP prior centered on β0.

To infer the true posterior probability of P(Z|X), we follow [28] in using a Gibbs sampling algo-

rithm [12] based on the direct assignment sampling scheme. In this sampling scheme, the β and φ

parameters are integrated out analytically. The formula for sampling an event index for mention j

from document i, Zi,j, is given by:4

P(Zi,j| Z−i,j,HL) ∝ P(Zi,j| Z−i,j)P(HLi,j| Z,HL−i,j)

where HLi,jis the head lemma of the event mention j from the document i.

First, in the generative process of an event mention, an event index z is sampled by using a mecha-

nism that facilitates sampling from a prior for infinite mixture models called the Chinese Restaurant

Franchise (CRF) representation [28]:

?

Here, nzis the number of event mentions with the event index z, znewis a new event index not used

already in Z−i,j, βz

weight for the unknown mixture component.

P(Zi,j= z | Z−i,j,β0) ∝

αβu

nz+ αβz

0,

if z = znew

otherwise

0,

0are the global mixing proportions associated to the K events, and βu

0is the

Then,to generatethementionheadlemma(inthis model,X = ?HL?), theeventz is associatedwith

a multinomial emission distribution over the HL feature values having the parameters φ = ?φhl

We assume that this emission distribution is drawn from a symmetric Dirichlet distribution with

concentration λHL:

Z?.

3A0 annotates in PB a specific type of semantic role which represents the AGENT, the DOER, or the ACTOR

of a specific event. Another PB argument is A1, which plays the role of the PATIENT, the THEME, or the

EXPERIENCER of an event.

4Z−i,jrepresents a notation for Z − {Zi,j}.

3

Page 4

H

Zi

∞

β

α

γ

φ

∞

Xi

β0∞

JiI

L

H

φ

∞

HLi

FRi

POSi

α

γ

∞

β

β0∞

I

θ

Ji

Zi

H

Zi

HLi

FRi

φ

∞

γ

α

∞

β

β0∞

I

Ji

H

Zi

HLi

φ

∞

α

γ

∞

β

Ji

β0∞

I

(b)(c)(d)(a)

Figure 2: Graphical representation of four HDP models. Each node corresponds to a random variable. In

particular, shaded nodes denotes observable variables. Each rectangle captures the replication of the structure

it contains. The number of replications is indicated in the bottom-right corner of the rectangle. The model

depicted in (a) is an HDP model using one feature; the model in (b) employs HL and FR features; (c) illustrates

a flat representation of a limited number of features in a generalized framework (henceforth, HDPflat); and (d)

captures a simple example of structured network topology of three feature variables (henceforth, HDPstruct).

The dependencies involving parameters φ and θ in models (b), (c), and (d) are omitted for clarity.

P(HLi,j= hl | Z,HL−i,j) ∝ nhl,z+ λHL

where HLi,jis the head lemma of mention j from document i, and nhl,zis the number of times

the feature value hl has been associated with the event index z in (Z,HL−i,j). We also apply the

Lidstone’s smoothing method to this distribution.

3.2Adding More Features

A model in which observable components are represented only by one feature has the tendency to

cluster these components based on their feature value. To address this limitation, H&K proposed

a more complex model that is strictly customized for entity coreference resolution. On the other

hand, event coreference involves clustering complex objects characterized by richer features than

entity coreference (or topic detection), and therefore it is desirable to extend the HDP1fmodel with

a generalized model where additional features can be easily incorporated.

To facilitate this extension, we assume that feature variables are conditionally independent given Z.

This assumption considerably reduces the complexity of computing P(Z|X). For example, if we

want to incorporate another feature (e.g., FR) in the previous model, the formula becomes:

P(Zi,j|HL,FR) ∝ P(Zi,j)P(HLi,j,FRi,j|Z) = P(Zi,j)P(HLi,j|Z)P(FRi,j|Z)

In this formula, we omit the conditioning components of Z, HL, and FR for clarity. The graphical

representation corresponding to this model is illustrated in Figure 2(b). In general, if X consists of

L feature variables, the inference formula for the Gibbs sampler is defined as:

P(Zi,j|X) ∝ P(Zi,j)

?

FT∈X

P(FTi,j|Z)

The graphical model for this general setting is depicted in Figure 2(c). Drawing an analogy, the

graphical representation involving feature variables and Z variables resembles the graphical repre-

sentation of a Naive Bayes classifier.

When dependencies between feature variables exist (e.g., in our case, frame elements are dependent

of the semantic frames that define them, and frames are dependent of the words that evoke them),

various global distributions are involved for computing P(Z | X). For instance, for the model

depicted in Figure 2(d) the posterior probability is given by:

P(Zi,j)P(FRi,j|HLi,j,θ)

?

FT∈X

P(FTi,j|Z)

In this model, P(FRi,j | HLi,j,θ) is a global distribution parameterized by θ, and the feature

variables considered are X=?HL,POS,FR?.

4

Page 5

For all these extended models, we compute the prior and likelihood factors as described in the one

feature model. Also, following H&K, in the inference mechanism we assign soft counts for missing

features (e.g., unspecified PB argument).

4Unbounded Feature Models

First, we presenta generativemodelcalled the MarkovIndianBuffet Process (mIBP) that providesa

mechanisminwhicheachobjectcanberepresentedbya sparsesubsetofapotentiallyunboundedset

of latent features [15, 14, 30].5Then, to overcome the limitations regarding the number of mixture

components and the number of features associated with objects, we combine this mechanism with

an HDP model to form an mIBP-HDP hybrid. Finally, to account for temporal dependencies, we

employ an mIBP extension, called the Infinite Factorial Hidden Markov Model (iFHMM) [30], in

combination with an Infinite Hidden Markov Model (iHMM) to form the iFHMM-iHMM model.

4.1 The Markov Indian Buffet Process

Asdescribedin[30],themIBPdefinesadistributionoveranunboundedsetofbinaryMarkovchains,

where each chain can be associated to a binary latent feature that evolves over time according to

Markov dynamics. Specifically, if we denote by M the total number of feature chains and by T

the number of observable components (event mentions), the mIBP defines a probability distribution

over a binary matrix F with T rows, which correspond to observations, and an unbounded number

of columns (M → ∞), which correspond to features. An observation ytcontains a subset from

the unbounded set of features {f1,f2,...,fM} that is represented in the matrix by a binary vector

Ft=?F1

t,F2

t,...,FM

t?, where Fi

t=1 indicates that fiis associated to yt.

Therefore, F decomposes the observations and represents them as feature factors, which can then

be associated to hidden variables in an iFHMM as depicted in Figure 3(a). The transition matrix of

a binary Markov chain associated to a feature fmis defined as

W(m)=

?1 − am

1 − bm

am

bm

?

where W(m)

and the initial state Fm

object yt, Fm

t∼Bernoulli(a

To compute the probability of the feature matrix F6, in which the parameters a and b are integrated

out analytically, we use the counting variables c00

1→0, and 1→1 transitions fmhas made in the binary chain m. The stochastic process that derives

the probability distribution in terms of these variables is defined as follows. The first component

samples a number of Poisson(α′) features. In general, depending on the value that was sampled in

the previous step (t − 1), a feature fmis sampled for the tthcomponent according to the following

probabilities:

ij

=P(Fm

t+1=j |Fm

0 = 0. In the generative process, the hidden variable of feature fmfor an

1−Fm

t−1

m

b

m

).

t =i), the parameters am∼Beta(α′/M,1) and bm∼Beta(γ′,δ′),

Fm

t−1

m, c01

m, c10

m, and c11

mto record the 0 → 0, 0 → 1,

P(Fm

t= 1|Fm

t−1=1) =

c11

m+ δ′

γ′+ δ′+ c10

c00

m

c00

m+ c11

m

P(Fm

t= 1|Fm

t−1=0) =

m+ c01

m

The tthcomponent then repeats the same mechanism for sampling the next features until it finishes

the current number of sampled features M. After all features are sampled for the tthcomponent,

a number of Poisson(α′/t) new features are assigned for this component and M gets incremented

accordingly.

4.2 The mIBP-HDP Model

One direct application of the mIBP is to integrate it into the HDP models proposed in Section 3. In

this way, the new nonparametric extension will have the benefits of capturing uncertainty regarding

thenumberofmixturecomponentsthatarecharacterizedbyapotentiallyinfinitenumberoffeatures.

Since one observablecomponentis associated with an unboundedcountableset of features, we have

to provide a mechanism in which only a finite set of features will represent the component in the

HDP inference process.

5In this section, a feature is represented by a (feature type:feature value) pair.

6Technical details for computing this probability are described in [30].

5

Page 6

Y1

F2

1

F1

1

FM

1

FM

2

Y2

F2

2

F1

2

FM

T

YT

F2

T

F1

T

(a)

F2

0

F1

0

FM

0

(b)

F2

0

F2

1

F2

2

F2

T

F1

0

Y1

F1

1

Y2

F1

2

YT

F1

T

S0

FM

0

FM

1

FM

2

FM

T

S1

S2

ST

Figure 3: (a) The Infinite Factorial Hidden Markov Model. (b) The iFHMM-iHMM model. (M→∞)

The idea behind this mechanism is to use slice sampling7[21] in order to derive a finite set of

features for yt. Letting qmbe the number of times feature fmwas sampled in the mIBP, and vtan

auxiliary variable for ytsuch that vt∼Uniform(1,max{qm| Fm

set Btfor the observation ytas:

t=1}), we define the finite feature

Bt= {fm| Fm

t

= 1 ∧ qm≥ vt}

The finiteness of this feature set is based on the observation that, in the generative process of the

mIBP, only a finite set of features are sampled for a component. Another observation worth men-

tioning regarding the way this set is constructed is that only the most representative features of yt

get selected in Bt.

4.3The iFHMM-iHMM Model

The iFHMM is a nonparametric Bayesian factor model that extends the Factorial Hidden Markov

Model (FHMM) [13] by letting the number of parallel Markov chains M be learned from data.

Although the iFHMM allows a more flexible representation of the latent structure, it can not be

used as a framework where the number of clustering components K is infinite. On the other hand,

the iHMM represents a nonparametric extension of the Hidden Markov Model (HMM) [27] that

allows performing inference on an infinite number of states K. In order to further increase the

representationalpowerformodelingdiscretetime series data, we proposea nonparametricextension

that combines the best of the two models, and lets the parameters M and K be learned from data.

Each step in the new generative process, whose graphical representation is depicted in Figure 3(b),

is performedin two phases: (i) the latent feature variables from the iFHMM framework are sampled

using the mIBP mechanism; and (ii) the features sampled so far, which become observable during

this second phase, are used in an adapted beam sampling algorithm [29] to infer the clustering

components (or, in our case, latent events).

To describe the beam sampler for event coreference resolution, we introduce additional notation.

We denote by (s1,...,sT) the sequence of hidden states corresponding to the sequence of event

mentions (y1,...,yT), where each state stbelong to one of the K events, st∈ {1,...,K}, and

each mention ytis represented by a sequence of latent features ?F1

the transition probability π is defined as πij= P(st= j | st−1= i) and a mention ytis generated

according to a likelihood model F that is parameterized by a state-dependent parameter φst(yt|

st∼F(φst)). The observation parameters φ are iid drawn from a prior base distribution H.

The beam sampling algorithm combines the ideas of slice sampling and dynamic programming for

an efficient sampling of state trajectories. Since in time series models the transition probabilities

have independent priors [5], Van Gael and colleagues [29] also used the HDP mechanism to al-

low couplings across transitions. For sampling the whole hidden state trajectory s, this algorithm

employs a forward filtering-backward sampling technique.

t,F2

t,...,FM

t ?. One element of

In the forward step of our implementation, we sample the feature variables using the mIBP as de-

scribed in Section 4.1, and the auxiliary variable ut∼ Uniform(0,πst−1st) for each mention yt.

As explained in [29], the auxiliary variables u are used to filter only those trajectories s for which

7The idea of using this procedure is inspired from [29] where a slice variable was used to sample a finite

number of state trajectories in the iHMM.

6

Page 7

πst−1st≥ utfor all t. Also, in this step, we compute the probabilities P(st| y1:t,u1:t) for all t as

described in [29]:

P(st| y1:t,u1:t) ∝ P(yt| st)

?

st−1:ut<πst−1st

P(st−1| y1:t−1,u1:t−1)

Here, the dependencies involving parameters π and φ are omitted for clarity.

In the backward step, we first sample the event for the last state sTdirectly from P(sT|y1:T,u1:T)

and then, for all t : T − 1,1, we sample each state st given st+1by using the formula P(st|

st+1,y1:T,u1:T)∝P(st|y1:t,u1:t)P(st+1|st,ut+1).

To sample the emission distribution φ efficiently, and to ensure that each mention is characterized

by a finite set of representative features, we set the base distribution H to be conjugate with the

data distribution F in a Dirichlet-multinomial model with the sufficient statistics of the multinomial

distribution (o1,...,oK) defined as:

ok=

T

?

t=1

?

fm∈Bt

nmk

where nmkcounts how many times feature fmwas sampled for event k, and Btstores a finite set

of features for ytas it is defined in Section 4.2.

5 Evaluation

Event Coreference Data One corpus used for evaluation is ACE 2005 [18]. This corpus annotates

within-document coreference information of specific types of events (such as Conflict, Justice, and

Life). After an initial processing phase, we extracted from ACE 6553 event mentions and 4946

events. To increase the diversity of events and to evaluate the models for both within- and cross-

document event coreference, we created the EventCorefBank corpus (ECB).8This new corpus con-

tains 43 topics, 1744eventmentions,1302within-documentevents,and339cross-documentevents.

For a more realistic approach, we trained the models on all the event mentions from the two corpora

and not only on the mentions manually annotated for event coreference(the true event mentions). In

this regard, we ran the event identifier described in [6] on the ACE and ECB corpora, and extracted

45289 and 21175 system mentions respectively.

The Experimental Setup Table 2 lists the recall (R), precision (P), and F-score (F) of our exper-

iments averaged over 5 runs of the generative models. Since there is no agreement on the best

coreference resolution metric, we employed four metrics for our evaluation: the link-based MUC

metric [31], the mention-based B3metric [2], the entity-based CEAF metric [19], and the pairwise

F1 (PW) metric. In the evaluation process, we considered only the true mentions of the ACE test

dataset and of the test sets of a 5-fold cross validation scheme on the ECB dataset. For evaluating

the cross-document coreference annotations, we adopted the same approach as described in [3] by

merging all the documents from the same topic into a meta-document and then scoring this docu-

ment as performed for within-document evaluation. Also, for both corpora, we considered a set of

132 feature types, where each feature type consists on average of 3900 distinct feature values.

The Baseline A simple baseline for event coreference consists in grouping events by their event

classes [1]. To extract event classes, we employed the event identifier described in [6]. Therefore,

this baseline will categorize events into a small number of clusters, since the event identifier is

trained to predict the five event classes annotated in TimeBank [26]. As it was already observed

[20, 11], consideringveryfew categoriesfor coreferenceresolutiontasks will result in overestimates

of the MUC scorer. For instance, a baseline that groups all entity mentions into the same entity

achievesthehighestMUC scorethananypublishedsystemforthetaskofentitycoreference. Similar

behaviour of the MUC metric is observed for event coreference resolution. For example, for cross-

document evaluation on ECB, a baseline that clusters all mentions into one event achieves 73.2%

MUC F-score, while the baseline listed in Table 2 achieves 72.9% MUC F-score.

HDP Extensions Due to memory limitations, we evaluated the HDPflatand HDPstructmodels

only on a restricted subset of manually selected feature types. In general, as shown in Table 2,

the HDPflat model achieved the best performance results on the ACE test dataset, whereas the

8This resource is available at http://www.hlt.utdallas.edu/∼ady. The annotation process is described in [7].

7

Page 8

ModelMUC

P

B3

P

CEAF

P

PW

PRFRFRFRF

ACE (within-document event coreference)

49.097.925.0

50.9 86.070.6

53.9 83.4 84.2

54.7

86.276.9

45.1 81.776.4

48.7 81.9 82.2

ECB (within-document event coreference)

55.697.7 55.8

50.4 84.389.0

53.4 82.199.2

60.1

84.397.1

48.982.195.3

53.9 82.598.1

ECB (cross-document event coreference)

72.9

93.849.6

56.8 67.086.2

60.565.098.7

65.7 69.3 95.8

53.263.194.1

62.767.096.4

Baseline

HDP1f (HL)

HDPflat

HDPstruct

mIBP-HDP

iFHMM-iHMM

94.3

62.2

53.5

61.9

48.7

48.7

33.1

43.1

54.2

49.0

41.9

48.8

39.9

77.5

83.8

81.3

79.0

82.1

14.7

62.3

76.9

69.0

68.8

74.6

64.4

76.4

76.5

77.5

73.8

74.5

24.0

68.6

76.7

73.0

71.2

74.5

93.5

50.5

43.3

53.2

37.4

37.2

8.2

27.7

47.1

38.1

28.9

39.0

15.2

35.8

45.1

44.4

32.6

38.1

Baseline

HDP1f (HL)

HDPflat

HDPstruct

mIBP-HDP

iFHMM-iHMM

92.2

46.9

37.8

47.4

38.2

39.5

39.8

54.8

92.9

82.7

68.8

85.2

71.0

86.5

89.8

90.2

88.2

89.6

44.5

83.4

93.9

92.7

90.3

93.1

80.1

79.6

78.2

81.1

78.5

78.8

57.2

81.4

85.3

86.5

84.0

85.3

93.7

36.6

27.0

34.4

26.5

29.4

25.4

53.4

92.4

83.0

67.9

86.6

39.8

42.6

41.3

48.6

37.7

43.7

Baseline

HDP1f (HL)

HDPflat

HDPstruct

mIBP-HDP

iFHMM-iHMM

Table 2: Evaluation results for within- and cross-document event coreference resolution.

90.5

47.7

44.4

51.9

40.0

48.4

61.1

70.5

95.3

89.5

79.8

89.0

64.9

75.3

78.3

80.4

75.5

79.0

36.6

76.2

86.9

86.2

82.7

85.5

72.7

57.1

56.0

60.1

54.6

58.0

48.7

65.2

68.0

70.8

65.7

69.1

90.7

34.9

29.2

37.5

26.1

33.3

28.6

58.9

95.1

85.6

77.0

88.3

43.3

43.5

44.4

52.1

38.9

48.2

HDPstructmodel, which also considers dependencies between feature types, proved to be more

effective on the ECB dataset for both within- and cross-documentevent coreferenceevaluation. The

set of feature types used to achieve these results consists of combinations of types from all feature

categories described in Section 2.2. For the results of the HDPstructmodel listed in Table 2, we also

explored the conditional dependencies between the HL, FR, and FEA types.

As can be observed fromTable 2, the results of the HDPflatand HDPstructmodels show an F-score

increase by 4-10% over the HDP1fmodel, and therefore prove that the HDP extensions provide a

more flexible representation for clustering objects characterized by rich properties.

mIBP-HDP In spite of its advantage of working with a potentially infinite number of features in an

HDP framework, the mIBP-HDP model did not achieve a satisfactory performance in comparison

with the other proposed models. However, the results were obtained by automatically selecting

only 2% of distinct feature values from the entire set of values extracted from both corpora. When

compared with the restricted set of features considered by the HDPflatand HDPstructmodels, the

percentage of values selected by mIBP-HDP is only 6%. A future research area for improving this

model is to consider other distributions for automatic selection of salient feature values.

iFHMM-iHMMInspiteoftheautomaticfeatureselectionemployedfortheiFHMM-iHMMmodel,

its results remain competitive against the results of the HDP extensions (where the feature types

were hand tuned). As shown in Table 2, most of the iFHMM-iHMM results fall in between the

HDPflatand HDPstructmodels. Also, these results indicate that the iFHMM-iHMM model is a

better framework than HDP in capturing the event mention dependencies simulated by the mIBP

feature sampling scheme. Similar to the mIBP-HDP model, to achieve these results, the iFHMM-

iHMM model uses only 2% values from the entire set of distinct feature values. For the experiments

of the iFHMM-iHMM results reported in Table 2, we set α′=50, γ′=0.5, and δ′=0.5.

6Conclusion

In this paper, we have described how a sequence of unsupervised, nonparametric Bayesian models

can be employed to cluster complexlinguistic objects that are characterized by a rich set of features.

The experimental results proved that these models are able to solve real data applications in which

the feature and cluster numbers are treated as free parameters, and the selection of features is per-

formed automatically. While the results of event coreference resolution are promising, we believe

that the classes of models proposed in this paper have a real utility for a wide range of applications.

8

Page 9

References

[1] David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and

Reasoning about Time and Events, pages 1–8.

[2] Amit Bagga and Breck Baldwin. 1998. Algorithms for Scoring Coreference Chains. In Proc. of LREC.

[3] Amit Bagga and Breck Baldwin. 1999. Cross-Document Event Coreference: Annotations, Experiments,

and Observations. In Proceedings of the ACL-99 Workshop on Coreference and its Applications.

[4] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In

Proceedings of COLING-ACL.

[5] Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. 2002. The Infinite Hidden Markov

Model. In Proceedings of NIPS.

[6] Cosmin Adrian Bejan. 2007. Deriving Chronological Information from Texts through a Graph-based

Algorithm. In Proceedings of FLAIRS-2007.

[7] CosminAdrian Bejan and Sanda Harabagiu. 2008. A Linguistic Resource for Discovering Event Structures

and Resolving Event Coreference. In Proceedings of LREC-2008.

[8] CosminAdrian Bejan and ChrisHathaway. 2007. UTD-SRL:A Pipeline Architecturefor ExtractingFrame

Semantic Structures. In Proceedings of SemEval-2007.

[9] Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

[10] Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of

Statistics, 1(2):209–230.

[11] Jenny Rose Finkel and Christopher D. Manning. 2008. Enforcing Transitivity in Coreference Resolution.

In Proceedings of ACL/HLT-2008, pages 45–48.

[12] Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian

restoration of images. . IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

[13] Z. Ghahramani and M. Jordan. 1997. Factorial Hidden Markov Models. Machine Learning, 29:245–273.

[14] Zoubin Ghahramani, T. L. Griffiths, and Peter Sollich, 2007. Bayesian Statistics 8, chapter Bayesian

nonparametric latent feature models, pages 201–225. Oxford University Press.

[15] Tom Griffiths and Zoubin Ghahramani. 2006. Infinite Latent Feature Models and the Indian Buffet

Process. In Proceedings of NIPS, pages 475–482.

[16] Aria Haghighi and Dan Klein. 2007. Unsupervised Coreference Resolution in a Nonparametric Bayesian

Model. In Proceedings of the ACL.

[17] Kevin Humphreys, Robert Gaizauskas, Saliha Azzam. 1997. Event Coreference for Information Extrac-

tion. In Proceedings of the Workshop on Operational Factors in Practical, Robust Anaphora Resolution

for Unrestricted Texts, 35th Meeting of ACL, pages 75–81.

[18] LDC-ACE05. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events.

[19] X. Luo. 2005. On Coreference Resolution Performance Metrics. In Proceedings of EMNLP.

[20] X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S.Roukos 2004. A Mention-Synchronous Coreference

Resolution Algorithm Based On the Bell Tree. In Proceedings of ACL-2004.

[21] Radford M. Neal. 2003. Slice Sampling. The Annals of Statistics, 31:705–741.

[22] Vincent Ng. 2008. Unsupervised Models for Coreference Resolution. In Proceedings of EMNLP.

[23] Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus

of Semantic Roles. Computational Linguistics, 31(1):71–105.

[24] Ron Papka. 1999. On-line New Event Detection, Clustering and Tracking. Ph.D. thesis, Department of

Computer Science, University of Massachusetts.

[25] Hoifung Poon and Pedro Domingos. 2008. Joint Unsupervised Coreference Resolution with Markov

Logic. In Proceedings of EMNLP.

[26] J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L.

Ferro, and M. Lazo. 2003. The TimeBank Corpus. In Corpus Linguistics, pages 647–656.

[27] Lawrence R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech

Recognition. In Proceedings of the IEEE, pages 257–286.

[28] Yee Whye Teh, Michael Jordan, Matthew Beal, and David Blei. 2006. Hierarchical Dirichlet Processes.

Journal of the American Statistical Association, 101(476):1566–1581.

[29] Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. 2008. Beam Sampling for the

Infinite Hidden Markov Model. In Proceedings of ICML, pages 1088–1095.

[30] Jurgen Van Gael, Yee Whye Teh, and Zoubin Ghahramani. 2008. The Infinite Factorial Hidden Markov

Model. In Proceedings of NIPS.

[31] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A Model-

Theoretic Coreference Scoring Scheme. In Proceedings of MUC-6, pages 45–52.

9