EXPLORING MULTIPLE STRATEGIES TO IMPROVE
MULTILINGUAL COREFERENCE RESOLUTION IN COREFUD
Ondřej Pražák
New Technologies for the Information Society,
Faculty of Applied Sciences,
University of West Bohemia
Pilsen
ondfa@ntis.zcu.cz
Miloslav Konopík
Department of Computer Science and Engineering,
Faculty of Applied Sciences,
University of West Bohemia
Pilsen
konopik@kiv.zcu.cz
ABSTRACT
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a
critical component in various natural language processing (NLP) applications. This paper presents our
end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17
datasets across 12 languages. Our model is based on the end-to-end neural coreference resolution system by [1]. We first establish strong baseline models, including monolingual and cross-lingual variations,
and then propose several extensions to enhance performance across diverse linguistic contexts. These
extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model
for optimized headword prediction, and advanced singleton modeling. We also experiment with
headword span representation and long-document modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long-document prediction, significantly improve performance across most datasets. We also perform zero-shot cross-
lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference
resolution. Our findings contribute to the development of robust and scalable coreference systems
for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model of comparable size from the CRAC 2023 shared task by a large margin. Our model is available on GitHub: https://github.com/ondfa/coref-multiling
Keywords coreference resolution · cross-lingual model · Transformers · end-to-end model
1 Introduction
Coreference resolution is the task of identifying language expressions that refer to the same real-world entity (antecedent)
within a text. These coreferential expressions can sometimes appear within a single sentence, but often, they are spread
across multiple sentences. In some challenging cases, it is necessary to consider the entire document to determine
whether two expressions refer to the same entity. The task can be divided into two main subtasks: identifying entity
mentions and grouping these mentions based on the real-world entities they refer to. Coreference resolution is closely
related to anaphora resolution, as discussed in [2].
Historically, coreference resolution was a standard preprocessing step in various natural language processing (NLP)
tasks, such as machine translation, summarization, and information extraction. Although recent large language models
have achieved state-of-the-art results in coreference resolution, they are expensive to train and deploy, and traditional
(discriminative) approaches remain competitive. Expressing this task in natural language is challenging, and to the best
of our knowledge, there have been no successful attempts to utilize large chatbots (like ChatGPT-4) to achieve superior
results.
Coreference resolution becomes particularly challenging in low-resource languages. One strategy to address this
challenge is to train a multilingual model on datasets from multiple languages, thereby transferring knowledge from
resource-rich languages to those with fewer resources. However, a significant challenge with this approach lies in the
differences in annotations across available corpora. The CorefUD initiative [3] tries to harmonize the datasets and create a single annotation scheme for coreference in multiple languages, similarly to Universal Dependencies [4] for syntactic annotations.
This paper describes our approach to multilingual coreference resolution. The task is based on the CorefUD dataset [3].
The CorefUD 1.1 corpus comprises 17 different datasets spanning 12 languages, all within a harmonized annotation
scheme. As CorefUD is intended to extend Universal Dependencies with coreference annotations, all datasets within
CorefUD are treebanks. For some languages, dependency annotations were provided by human annotators, while for
others, these annotations were generated automatically using a parser. Coreference annotations are built upon these
dependencies, meaning that mentions are represented as subtrees in the dependency tree and can be captured by their
heads. In some datasets, there are non-treelet mentions—mentions that do not form a single subtree—but even for these,
a single headword is selected. Notable differences exist between the datasets, with one of the most prominent being the
presence of singletons. Singletons are clusters that contain only one mention and do not participate in any coreference
relation, yet they are still annotated as mentions. For further details on the dataset, refer to [3] or [5]. The task was
simplified to involve only the prediction of non-singleton mentions and their grouping into entity clusters. Building
upon the baseline model proposed by [1], we introduce several novel extensions aimed at enhancing the performance of
coreference resolution across multiple languages and datasets within the CorefUD collection. Our primary goal is to
develop a universal model capable of handling the diverse and complex nature of these datasets effectively.
First, recognizing the challenge posed by small dataset sizes in the CorefUD collection, we propose a cross-lingual
training approach. By pretraining our model on a concatenated dataset that includes all available training data across
languages, we aim to improve the model’s ability to generalize across languages. This approach is particularly beneficial
for low-resource languages, where training large models from scratch is impractical due to the limited data available.
Next, we incorporate syntactic information into our model to leverage the dependency structures inherent in the
CorefUD datasets. This extension adds depth to the token representations by encoding their paths to the ROOT in the
dependency tree, thereby enriching the model’s understanding of syntactic relationships, which are crucial for accurate
mention detection and coreference resolution.
In response to the evaluation metrics used in the CRAC 2022 Shared Task, we also propose the Span2head model, which
shifts the focus from span-based mention representation to headword prediction. This adjustment aligns the model’s
output more closely with the evaluation criteria, optimizing performance by accurately predicting the headwords that
serve as minimal spans in coreference chains.
Additionally, we explore the potential of head representations as a simplified approach to coreference resolution. By
modeling mentions solely based on their syntactic heads, we reduce the computational complexity from quadratic to
linear, making the model more efficient and less prone to errors, particularly in the case of long and complex mentions.
We further address the issue of singletons—mentions that do not participate in coreference chains—by introducing
mechanisms to incorporate these into the training process. By modeling singletons explicitly, we ensure that valuable
training data from singleton-rich datasets is not discarded, thus improving the model’s robustness and accuracy across
different datasets.
Finally, to overcome the limitations of short sequence lengths in models like XLM-R, we propose an innovative
approach that utilizes overlapping segments with a cluster merging algorithm. This method ensures that coreference
chains spanning multiple segments are correctly identified and merged, even when document segmentation is necessary
due to memory constraints.
Through these extensions, our model aims to advance the state of multilingual coreference resolution by addressing the
specific challenges posed by the diverse datasets in the CorefUD collection, ultimately contributing to more accurate
and generalizable coreference systems.
2 Related Work
2.1 End-to-end Neural Coreference Resolution Models
The first end-to-end neural coreference resolution system was introduced by [6], and many subsequent neural coreference resolution systems are based on their model.
In the model, we start by modeling the probability P(y_i | D) of a mention i coreferring with the antecedent y_i in a document D. Since the model adopts the end-to-end approach, the mentions are identified together with the coreference links. We consider every continuous sequence of words as a mention i. Therefore, we work with N = T(T+1)/2 possible mentions, where T is the number of words in a document D.
Figure 1: Architecture of the model from [6]: (a) Mention Ranking; (b) Antecedent Ranking
We model the score of a mention i coreferring with an antecedent y_i as a combination of two types of scores, s_m(i) and s_a(i, y_i). The s_m(i) is the score of a sequence of words (a span) i being a mention. The s_a(i, y_i) score is the score of a span y_i being an antecedent of span i. The scores are combined as a sum of s_m(i), s_m(y_i), and s_a(i, y_i) as follows:

$$s(i, y_i) = \begin{cases} 0 & y_i = \epsilon \\ s_m(i) + s_m(y_i) + s_a(i, y_i) & y_i \neq \epsilon \end{cases} \qquad (1)$$
where ε is an empty (dummy) antecedent. Both scores s_m(i) and s_a(i, y_i) are estimated with a feed-forward neural network over the BERT-based encoder.
The probability of an antecedent y_i can be expressed as the softmax normalization over all possible antecedents y' ∈ Y(i) for a mention i:

$$P(y_i \mid D) = \frac{\exp(s(i, y_i))}{\sum_{y' \in Y(i)} \exp(s(i, y'))} \qquad (2)$$
The formula for all antecedents uses a product of multinomials of all individual antecedents:

$$P(y_1, \ldots, y_N \mid D) = \prod_{i=1}^{N} P(y_i \mid D) \qquad (3)$$
In the training phase, we maximize the marginal log-likelihood of all correct antecedents:

$$J(D) = \log \prod_{i=1}^{N} \sum_{\hat{y} \in Y(i) \cap \mathrm{GOLD}(i)} P(\hat{y}) \qquad (4)$$

where GOLD(i) is the set of spans (mentions) in the training data that are antecedents of mention i.
The schema of the model is shown in Figure 1.
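The scoring and training objective above can be summarized in a short sketch. The tensor names, shapes, and the convention of putting the dummy antecedent in column 0 are illustrative assumptions, not the actual implementation.

```python
import torch

def marginal_antecedent_loss(scores, gold_antecedent_mask):
    """Marginal log-likelihood over gold antecedents (Eqs. 2-4), a minimal sketch.

    scores: [N, N+1] pairwise scores s(i, y); column 0 is the dummy antecedent
            epsilon with a fixed score of 0 (Eq. 1).
    gold_antecedent_mask: [N, N+1] boolean; True where column y is a gold
            antecedent of span i (only the dummy column for non-mentions).
    """
    # P(y | D): softmax over all candidate antecedents of each span (Eq. 2).
    log_probs = torch.log_softmax(scores, dim=-1)
    # Marginalize over all correct antecedents of each span, then sum the
    # per-span log-likelihoods over the document (Eqs. 3-4); negate for minimization.
    gold_log_probs = log_probs.masked_fill(~gold_antecedent_mask, float("-inf"))
    return -torch.logsumexp(gold_log_probs, dim=-1).sum()
```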
This end-to-end model has proven to be highly successful, especially on datasets where singletons are not annotated
(e.g., CoNLL 2012/OntoNotes), a scenario where traditional two-stage models tend to struggle. The model was later
extended by [7] to higher-order coreference resolution (CR) through iterative refinement of span representations with a gated-attention mechanism. They further optimize the model speed by pruning the mentions with a simple scorer in the first step and rescoring only the top k spans with the precise scorer in the second step. [8] uses the model from [7] but with BERT as an encoder.
While many higher-order CR approaches have been proposed, [1] demonstrated that their impact is marginal when a strong encoder is used, as seen in their experiments with SpanBERT [9].
2.2 Word-level Coreference Resolution
[10] proposed a word-level coreference resolution approach based on the model by [6]. Their method reduces the computational complexity by reducing the mention space. Instead of iterating over all possible spans, they first map each gold mention to a single word—specifically, the headword, which they select based on the syntax tree. Antecedent
prediction is then performed on the word level. In the next step, they use a span extraction model to predict the original spans from headwords. This model is trained to classify each word in the sentence, together with a headword, to decide whether the word is the start or end token of the span corresponding to the headword. During training, the span extraction model is trained simultaneously with antecedent prediction as another classification head.
Later, [11] proposed a modification in headword selection. They suggested making a coordinate conjunction the head when it appears within an entity, to resolve span ambiguity. For example, in the sentence "Tom and Mary are playing," the entities "Tom" and "Tom and Mary" would typically share the same head in the original approach. Their
modification eliminates this ambiguity.
2.3 CRAC Shared Task on Multilingual Coreference Resolution
The CRAC Shared Task on Multilingual Coreference Resolution (CRAC-coref) [12] is an annual shared task that began in 2022 and is built upon the CorefUD collection. In fact, this paper is an extension of the CRAC22-coref [13] participant system [14], which is based on [15].
The best-performing system in CRAC 2022 was submitted by [16]. It is a two-stage model with a joint BERT-like encoder. In the mention detection stage, they use extended BIO tagging with stack-manipulation instructions to handle overlapping mentions. Coreference resolution is then done by an antecedent prediction head, which is very similar to the one used in end-to-end neural CR [6]. They use a concatenation of the start and end tokens as the mention representation and classify mention pairs to find the best antecedent of each mention. The schema of the CorPipe model is shown in Figure 2.

Figure 2: Schema of the CorPipe model (from [16])
Other cross-lingual experiments include Portuguese by learning from Spanish [17]; Spanish and Chinese relying on an English corpus [18]; and Basque based on an English corpus as well [19]. All these approaches employ neural networks,
and they transfer the model via cross-lingual word embeddings.
Table 1: Dataset Statistics (taken from [3])
total size division [%]
CorefUD dataset docs sents words empty train dev test
Catalan-AnCora 1298 13,613 429,313 6,377 77.5 11.4 11.1
Czech-PCEDT 2312 49,208 1,155,755 35,844 80.9 14.2 4.9
Czech-PDT 3165 49,428 834,720 22,389 78.3 10.6 11.1
English-GUM 195 10,761 187,416 99 78.9 10.5 10.6
English-ParCorFull 19 543 10,798 0 81.2 10.7 8.1
French-Democrat 126 13,057 284,883 0 80.1 9.9 10.0
German-ParCorFull 19 543 10,602 0 81.6 10.4 8.1
German-PotsdamCC 176 2,238 33,222 0 80.3 10.2 9.5
Hungarian-KorKor 94 1,351 24,568 1,988 79.2 10.3 10.5
Hungarian-SzegedKoref 400 8,820 123,968 4,857 81.1 9.6 9.3
Lithuanian-LCC 100 1,714 37,014 0 81.3 9.1 9.6
Norwegian-BokmaalNARC 346 15,742 245,515 0 82.8 8.8 8.4
Norwegian-NynorskNARC 394 12,481 206,660 0 83.6 8.7 7.7
Polish-PCC 1828 35,874 538,885 470 80.1 10.0 9.9
Russian-RuCor 181 9,035 156,636 0 78.9 13.5 7.6
Spanish-AnCora 1356 14,159 458,418 8,112 80.0 10.0 10.0
Turkish-ITCC 24 4,733 55,341 0 81.5 8.8 9.7
3 Datasets & Metrics
For our experiments, we use the CorefUD 1.1 dataset. The collection consists of 17 datasets in 12 different languages.
Table 1 provides a summary of these datasets and their respective sizes.
3.1 Dataset Differences
As highlighted by [3], significant differences remain among the individual datasets within CorefUD, not only in data
distribution but also in annotation practices. In this section, we discuss these differences that are particularly relevant to
our model and experiments.
3.1.1 Empty Nodes
Some datasets contain empty nodes (also called zeros). These are virtual nodes in the syntactic tree that refer to unexpressed entities, i.e., entities that are not explicitly mentioned in the text but are implicitly referenced. Empty nodes can make it hard for a general-purpose pre-trained model to understand the language. We cannot discard them from the datasets because most of them are coreferential.
3.1.2 Singletons
Another key difference between the datasets lies in the annotation of singletons. Singletons are entities with only a
single mention throughout the document. In some datasets, singletons are annotated, while in others they are not, i.e., an entity with only a single mention is not considered an entity during annotation at all. Table 2 provides the percentage of singleton entities across the datasets (entities with a length of 1).
3.2 Dataset Lengths
Tables 1, 2, and 3 show basic statistics of all the datasets in the CorefUD collection. All these tables are taken from the CorefUD paper [5]. Table 1 shows the basic sizes of the datasets and the proportions of the train/dev/test split. The column empty shows the number of empty nodes. Table 2 shows statistics about entities. The length of an entity means the number of its mentions; thus entities with length 1 are singletons. Table 3 shows statistics of mentions. Here the length of a mention means the number of its words. The last three columns show the percentage of specific mention types: w/empty are mentions with at least one empty node, w/gap are mentions with at least one gap, i.e., discontinuous mentions, and non-tree are mentions which do not form a single subtree of the syntax tree.
Table 2: Entity Statistics (taken from [3])
Entities distribution of lengths
CorefUD dataset total per 1k length 1 2 3 4 5+
count words max avg. [%] [%] [%] [%] [%]
Catalan-AnCora 18,030 42 101 3.5 2.6 54.0 18.1 8.6 16.8
Czech-PCEDT 52,721 46 236 3.3 6.6 59.6 14.7 6.3 12.7
Czech-PDT 78,747 94 175 2.4 40.8 35.1 10.4 4.9 8.9
English-GUM 27,757 148 131 1.9 73.8 14.2 4.9 2.2 4.9
English-ParCorFull 202 19 38 4.2 6.9 54.0 13.9 5.9 19.3
French-Democrat 39,023 137 895 2.0 81.6 10.7 3.0 1.4 3.3
German-ParCorFull 259 24 43 3.5 6.2 64.9 11.6 5.0 12.4
German-PotsdamCC 3,752 113 15 1.4 76.5 13.9 5.0 1.8 2.7
Hungarian-KorKor 1,134 46 41 3.6 0.9 55.1 17.1 9.0 17.9
Hungarian-SzegedKoref 5,182 42 36 3.0 8.0 51.1 19.0 9.1 12.9
Lithuanian-LCC 1,224 33 23 3.7 11.2 45.3 11.8 8.2 23.5
Norwegian-BokmaalNARC 53,357 217 298 1.4 89.4 5.4 1.9 1.0 2.4
Norwegian-NynorskNARC 44,847 217 84 1.4 88.7 5.5 2.1 1.1 2.7
Polish-PCC 127,688 237 135 1.5 82.6 9.8 2.9 1.4 3.2
Russian-RuCor 3,636 23 141 4.5 3.3 53.7 15.6 6.9 20.5
Spanish-AnCora 20,115 44 110 3.5 3.3 53.7 17.0 8.8 17.1
Turkish-ITCC 690 12 66 5.3 0.0 39.0 19.6 10.7 30.7
Table 3: Mention Statistics (taken from [3])
mentions distribution of lengths mention type
CorefUD dataset total per 1k length 0 1 2 3 4 5+ w/empty w/gap non-tree
count words max avg. [%] [%] [%] [%] [%] [%] [%] [%] [%]
Catalan-AnCora 62,417 145 141 4.8 10.2 28.2 21.7 7.9 5.3 26.8 12.4 0.0 3.7
Czech-PCEDT 168,138 145 79 3.6 19.5 29.9 16.8 8.6 4.1 20.9 25.7 0.8 8.4
Czech-PDT 154,983 186 99 3.1 10.7 39.6 20.3 9.1 4.2 16.1 13.7 1.3 2.2
English-GUM 32,323 172 95 2.6 0.0 56.5 19.5 8.0 3.8 12.1 0.0 0.0 1.2
English-ParCorFull 835 77 37 2.1 0.0 59.6 23.4 6.3 3.4 7.3 0.0 0.6 0.6
French-Democrat 46,487 163 71 1.7 0.0 64.2 21.8 6.3 2.4 5.3 0.0 0.0 2.0
German-ParCorFull 896 85 30 2.0 0.0 64.8 17.5 6.2 4.0 7.4 0.0 0.3 1.5
German-PotsdamCC 2,519 76 34 2.6 0.0 34.8 32.4 15.6 6.4 10.9 0.0 6.3 3.8
Hungarian-KorKor 4,103 167 42 2.2 30.8 20.6 23.2 9.8 4.8 10.8 35.3 0.6 5.5
Hungarian-SzegedKoref 15,165 122 36 1.6 15.1 37.4 32.5 10.2 2.6 2.2 15.2 0.4 1.1
Lithuanian-LCC 4,337 117 19 1.5 0.0 69.1 16.6 11.1 1.2 2.0 0.0 0.0 4.3
Norwegian-BokmaalNARC 26,611 108 51 1.9 0.0 74.5 9.7 6.1 2.1 7.6 0.0 0.6 1.5
Norwegian-NynorskNARC 21,847 106 57 2.1 0.0 70.2 10.1 7.7 2.8 9.2 0.0 0.4 1.4
Polish-PCC 82,804 154 108 2.1 0.3 68.7 14.9 5.2 2.7 8.2 0.5 1.0 4.7
Russian-RuCor 16,193 103 18 1.7 0.0 69.1 16.3 6.6 3.5 4.6 0.0 0.5 1.4
Spanish-AnCora 70,663 154 101 4.8 11.4 31.6 18.8 7.2 4.5 26.3 14.0 0.0 0.3
Turkish-ITCC 3,668 66 25 1.9 0.0 67.3 17.5 5.8 2.9 6.6 0.0 0.0 1.9
3.3 Evaluation Metrics
For evaluation, we employ the official metric of the CRAC 2023 Multilingual Coreference Resolution shared task, implemented in the CorefUD scorer¹. The metric is a modification of the standard F1 metric for coreference resolution, averaging three metrics: CEAF, B3, and MUC. The main modification of the CorefUD scorer lies in using head matching: the gold and system mentions are considered identical if and only if they have the same syntactic head. Another important aspect is that the primary metric ignores singletons. The CorefUD scorer also
provides additional metrics, such as BLANC [20] and LEA [21]. To assess the quality of mention matching while
ignoring the assignment of mentions to coreferential entities, it uses the MOR score (mention overlap ratio) [13].
3.3.1 A Link-Based Metric: The MUC Score
The MUC official scorer [22] introduced a link-based metric. A link-based metric measures the extent to which the links in the response match the links in the key. For example, recall is computed by summing up the correctly recalled links for each coreference chain in the key and then dividing by the total number of correct links in the key. The number of missing links—the links found in the key entities but not in the response entities—is computed by counting the number of partitions of key K induced by response R, as follows:

$$\mathrm{Recall}_{\mathrm{MUC}} = \frac{\sum_i \left( |K_i| - |P(K_i, R)| \right)}{\sum_i \left( |K_i| - 1 \right)} \qquad (5)$$
where P(K_i, R) is the partition function, which returns all the partitions of key entity K_i with respect to a system's response R. Precision is computed by summing up the correct links in each coreference chain in the response and dividing that by the total number of links in the response—that is, by swapping key and response in the formula above.
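As an illustration of the formula above, here is a small sketch that computes MUC recall from entities given as plain sets of mention identifiers (precision is obtained by swapping the arguments); it is not the official scorer implementation.

```python
def muc_recall(key_entities, response_entities):
    """MUC recall (Eq. 5), a minimal sketch; entities are sets of mention ids."""
    # Map every response mention to the index of its response entity.
    mention_to_resp = {m: idx for idx, ent in enumerate(response_entities) for m in ent}
    numerator, denominator = 0, 0
    for key_entity in key_entities:
        # Partition of the key entity induced by the response: mentions sharing
        # a response entity stay together; unaligned mentions become singleton cells.
        cells = {mention_to_resp.get(m, ("unaligned", m)) for m in key_entity}
        numerator += len(key_entity) - len(cells)
        denominator += len(key_entity) - 1
    return numerator / denominator if denominator else 0.0

# Key entity {a, b, c} split by the response into {a, b} and {c}:
# recall = (3 - 2) / (3 - 1) = 0.5
print(muc_recall([{"a", "b", "c"}], [{"a", "b"}, {"c"}]))
```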
3.3.2 A Mention-Based Metric: B3
One problem with the MUC score is that, by definition, it only scores a system's ability to identify links between mentions; its ability to recognize that a mention does not belong to any coreference chain—that is, its ability to classify a mention as a singleton—does not get any reward. The B3 metric [23] was proposed to correct this problem. It does this by computing recall and precision for each mention m, even if m is a singleton.
B3 computes the intersection |K_i ∩ R_j| between every coreference chain K_i in the key and every coreference chain R_j in the response, and then sums up recall and precision for each pair (i, j) and normalizes. In turn, recall and precision for (i, j) are computed by summing up recall and precision for each mention m in K_i ∩ R_j. For instance, recall for m is the proportion between |K_i ∩ R_j| and the number of mentions in K_i:

$$\mathrm{Recall}_{B^3}(m) = \frac{|K_i \cap R_j|}{|K_i|}, \quad m \in K_i \cap R_j \qquad (6)$$

Precision for m is the proportion between |K_i ∩ R_j| and |R_j|.
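The mention-level computation can be sketched as follows. The handling of unaligned (twinless) mentions varies between scorer implementations, so the default below is only an assumption.

```python
def b3_recall(key_entities, response_entities):
    """B3 recall (Eq. 6), a minimal sketch averaged over all key mentions."""
    mention_to_resp = {m: ent for ent in response_entities for m in ent}
    scores = []
    for key_entity in key_entities:
        for m in key_entity:
            # Assumed handling of unaligned mentions: treat them as response singletons.
            resp_entity = mention_to_resp.get(m, {m})
            scores.append(len(key_entity & resp_entity) / len(key_entity))
    return sum(scores) / len(scores)

# Precision is the same computation with key and response swapped.
print(b3_recall([{"a", "b", "c"}], [{"a", "b"}, {"c"}]))  # (2/3 + 2/3 + 1/3) / 3
```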
3.3.3 An Entity-Based Metric: CEAF
B3 also suffers from a problem—namely, that a single chain in the key or response can be credited several times. This leads to anomalies; for instance, if all coreference chains in the key are merged into one in the response, the B3 recall is one. The CEAF metric was proposed by [24] to correct this problem. The key idea of CEAF is to align chains (entities) in the key and response using a map g in such a way that each chain K_i in the key is aligned with only one chain g(K_i) in the response, and to then use the similarity φ(K_i, g(K_i)) to compute recall and precision. Because different maps are possible, the one that achieves optimal similarity is used [25].
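A sketch of this alignment using the Hungarian algorithm is shown below. The entity-based similarity φ4 = 2|K ∩ R| / (|K| + |R|) is an assumption here, since several φ functions exist for CEAF.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf(key_entities, response_entities):
    """CEAF recall and precision, a minimal sketch with the phi4 similarity."""
    phi = np.zeros((len(key_entities), len(response_entities)))
    for i, k in enumerate(key_entities):
        for j, r in enumerate(response_entities):
            phi[i, j] = 2 * len(k & r) / (len(k) + len(r))
    # linear_sum_assignment minimizes total cost, so negate the similarities
    # to obtain the alignment g with maximal total similarity.
    rows, cols = linear_sum_assignment(-phi)
    total = phi[rows, cols].sum()
    return total / len(key_entities), total / len(response_entities)
```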
4 Baseline Model & Extensions
We adopted the model from [1] and use it as our baseline (without higher-order inference). We use UDAPI² to load the data in CorefUD format. We use XLM-Roberta-large as the encoder.
Building on this baseline, we propose several extensions to the standard end-to-end model presented earlier in this
paper. Our objective is to create a universal model suitable for all datasets in the CorefUD collection.
¹ https://github.com/ufal/corefud-scorer
² https://github.com/udapi/udapi-python
Table 4: Number of trainable parameters of the models
Model Pretrained params New params
mBERT 180M 40M
XLM-R 550M 50M
4.1 Cross-lingual Training
Some of the datasets are relatively small (see Table 1). To enhance performance, we propose pretraining the model on a
concatenation of all training datasets within the CorefUD collection.
As Table 4 shows, approximately 50 million parameters are trained from scratch for XLM-R. For smaller datasets, it is practically impossible to train so many randomly initialized parameters. To address this issue, we first pre-train the model on the joined dataset and then fine-tune it for a specific language.
4.2 Syntactic Information
We believe that incorporating dependency information can significantly enhance the model, particularly when manually
annotated dependencies are available, such as in the Czech PDT dataset. Moreover, dependency information is essential
for identifying mention heads, as some datasets in the CorefUD collection annotate mentions as nodes in the dependency
tree rather than spans in the linear text flow.
To encode syntactic information, we add to each token representation its path to the ROOT in the dependency tree. In more detail, we first set a maximum tree depth parameter and then concatenate the BERT representations of all parents up to the maximum depth with the embedding of the corresponding dependency relation. Thus, the resulting tree-structure representation has a size of max_tree_depth × (bert_emb_size + deprel_emb_size). This representation is then concatenated with the BERT embedding of each token.
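A minimal sketch of this tree-path encoding follows. The module name, the argument conventions (heads as parent indices with -1 for the root), and the default sizes are illustrative assumptions; the exact composition of the representation in our implementation may differ.

```python
import torch
import torch.nn as nn

class TreePathEncoder(nn.Module):
    """Token representation enriched with its path to ROOT, a sketch of Section 4.2.

    For each token we walk up to `max_depth` ancestors; at every step we concatenate
    the ancestor's contextual embedding with an embedding of the dependency relation
    used to reach it. Names and shapes are illustrative.
    """

    def __init__(self, hidden_size, num_deprels, deprel_dim=32, max_depth=3):
        super().__init__()
        self.max_depth = max_depth
        self.deprel_emb = nn.Embedding(num_deprels + 1, deprel_dim, padding_idx=num_deprels)
        self.pad_rel = num_deprels
        self.out_size = hidden_size + max_depth * (hidden_size + deprel_dim)

    def forward(self, token_emb, heads, deprels):
        # token_emb: [T, H] contextual (BERT) embeddings
        # heads:     list of length T, parent index of each token (-1 for the root)
        # deprels:   list of length T, dependency-relation id of each token
        T, H = token_emb.shape
        pad_vec = torch.cat([token_emb.new_zeros(H), self.deprel_emb.weight[self.pad_rel]])
        enriched = []
        for t in range(T):
            parts, node = [token_emb[t]], t
            for _ in range(self.max_depth):
                parent = heads[node] if node >= 0 else -1
                if parent >= 0:
                    rel = self.deprel_emb(torch.tensor(deprels[node]))
                    parts.append(torch.cat([token_emb[parent], rel]))
                else:  # at (or above) the root: pad the remaining levels
                    parts.append(pad_vec)
                node = parent
            enriched.append(torch.cat(parts))
        return torch.stack(enriched)  # [T, out_size]
```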
4.3 Span2head Model
Given that the official scorer uses min-span evaluation with headwords as the min spans, we decided to train the model
to predict heads rather than entire spans, to optimize the evaluation metric. Since the model has all the useful information (including the dependency trees), it should be able to learn the original rules for selecting the head.
A straightforward approach to obtaining mention heads is to represent each mention by its headword on the input. We describe this approach in detail later. However, this method is not ideal because multiple mentions can share the same head, which could result in merging distinct mentions and their clusters. To prevent this, we represent mentions with their full spans, predict the head of each mention at the top of our model, and then output only the headword(s). This way, when building clusters, mentions are represented by their spans, ensuring that clusters of different mentions with the same head are not erroneously merged.
We implemented two versions of the head prediction model, both as separate classification heads on top of our
coreference resolution model. The first model predicts the relative position of the headword(s) within a span using the
hidden representation of the span from the coreference model. The output probabilities of head positions are obtained
using sigmoid activation, allowing the model to predict multiple heads, even though only a single headword is present in
the gold data. This serves as an optimization of the evaluation metric: if multiple words are likely headword candidates,
it is statistically advantageous to output all of them.
The second model uses a binary classification of each span and head-candidate pair, so again, multiple headwords can be predicted for a single span.
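A sketch of the first variant (relative head-position classification with sigmoid outputs) is given below; the module name, the maximum span width, and the 0.5 decision threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RelativeHeadPredictor(nn.Module):
    """First Span2Head variant, a sketch: predict head positions inside a span.

    From the span representation we score each relative position up to
    `max_span_width`; sigmoid outputs let the model emit several candidate
    headwords for one span.
    """

    def __init__(self, span_dim, max_span_width=20, hidden=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(span_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_span_width),
        )

    def forward(self, span_repr, span_lengths):
        # span_repr: [S, span_dim]; span_lengths: [S]
        logits = self.scorer(span_repr)                      # [S, max_span_width]
        positions = torch.arange(logits.size(1), device=logits.device)
        mask = positions.unsqueeze(0) < span_lengths.unsqueeze(1)
        probs = torch.sigmoid(logits) * mask                 # zero out positions past the span end
        return probs > 0.5                                   # multi-hot head positions per span
```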
4.4 Head Representations
Later, inspired by the word-level coreference resolution model, we decided to model mentions only with their syntactic head. As mentioned above, for the CRAC official evaluation metric, reconstructing the original span is unnecessary, as predicting the correct head suffices. Therefore, we explored a simplified word-level coreference resolution approach that bypasses the span prediction step. Given that syntactic information is available in all CorefUD datasets, this is not an unrealistic scenario, because in most cases the spans can be reconstructed by simply taking the
whole subtree of the predicted node (head). Heads are heuristically selected only in cases where a mention does not form a single subtree. The proportion of such mentions is shown in Table 3 (column non-tree). Using a word-level model reduces the mention space from quadratic to linear, making the model more efficient and minimizing potential false-positive mentions. Moreover, we believe that for very long mentions, the standard representation (sum of the start token, end token, and attended sum of all tokens) becomes insufficient.

Figure 3: First Span2Head Model

Figure 4: Second Span2Head Model
4.5 Singletons
Some datasets in the CorefUD collection have singletons annotated and others do not. Specifically, in CorefUD 1.1, 8 out of 17 datasets have more than 10% singletons, and 6 of these have more than 70% singletons, which is probably a sign of consistent entity annotation independent of the coreference annotation. The baseline model completely ignores singletons during training (the loss is a sum over all correct antecedents, and since singletons have no gold antecedents, they do not affect the loss). As a result, for these 6 singleton-rich datasets, we discard more than 70% of the training data for the mention identification task. To leverage this data, we incorporate singleton modeling into our model and propose several approaches to do so.
4.5.1 Another Dummy Antecedent
In the first proposed approach, we introduce another virtual (learned) antecedent representing that the mention has no real antecedent but is a valid mention, i.e., a singleton. For these mentions, we avoid using the binary score because it does not make sense here: it would model the similarity between all the singletons (all the singleton representations would be trained to be similar to the dummy antecedent). Additionally, we include an option to use a separate feed-forward network for predicting singleton scores (different from the one used for other mention scores).
4.5.2 Mention Modeling
The second approach modifies the loss function to model mentions independently of coreference relations. In this
approach, we simply add a binary cross-entropy of each span being a mention to the loss function. In other words, we
add another classification head for the mention classification:
$$J(D) = \log \prod_{i=1}^{N} \sum_{\hat{y} \in Y(i) \cap \mathrm{GOLD}(i)} P(\hat{y}) + \underbrace{y_m^{(i)} \cdot \sigma(s_m(i)) + (1 - y_m^{(i)}) \cdot \sigma(-s_m(i))}_{\text{singletons binary cross-entropy}} \qquad (7)$$
where y_m^{(i)} is 1 if span i corresponds to a gold mention, and 0 otherwise.

In the prediction step, the mention score is evaluated only for potential singletons. If a mention has no real antecedent, we look at its mention score: if the span is likely to be a mention, we make it a singleton; otherwise, it is not a mention at all.
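The extra term and the prediction rule can be sketched as follows, assuming the singleton term is implemented as a standard binary cross-entropy over mention-score logits; the function names and the decision threshold are illustrative assumptions.

```python
import torch.nn.functional as F

def loss_with_mention_term(coref_loss, mention_scores, gold_mention_labels):
    """Add the mention-classification (singleton) term of Eq. 7 to the base loss; a sketch."""
    bce = F.binary_cross_entropy_with_logits(mention_scores, gold_mention_labels,
                                             reduction="sum")
    return coref_loss + bce

def resolve_span(best_antecedent, mention_score, threshold=0.0):
    """Prediction rule of Section 4.5.2: a span whose best antecedent is the dummy
    one becomes a singleton only if its mention score is high enough."""
    if best_antecedent is not None:
        return "coreferent"
    return "singleton" if mention_score > threshold else "not a mention"
```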
4.6 Overlapping Segments
One limitation of the XLM-R model is the short sequence length. In the employed model, the input document is split
into segments which are processed individually with a BERT-like encoder and merged in the antecedent-prediction head.
To propagate errors correctly across segments, we need to process all segments in a single gradient update. However,
due to GPU memory constraints, we set a maximum number of segments, and if a document exceeds this limit, we split
it into multiple individual documents without any mutual coreferences.
We propose using segment overlapping with a cluster merging algorithm to address this issue. The cluster merging algorithm is straightforward: we iterate over all mentions in the new part, and if a mention is present in a cluster in the previous part, we simply take the union of its clusters from both parts. We recommend using maximal segment overlap. For instance, if a document has 6 segments and the maximum number of segments for a single example is 4, then we split it into 3 examples overlapping by 3 segments. An example is shown in Figure 5. We further extend this method by only using clusters from the continuing examples that have at least one mention in the new segments, because if there is no such mention, then all of the cluster's mentions were already seen in the previous part with a longer left context.
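A sketch of the merging step, assuming mentions are identified by (start, end) offsets in the full document so that the same mention receives the same identifier in both overlapping parts:

```python
def merge_overlapping_predictions(prev_clusters, new_clusters):
    """Cluster merging across overlapping document parts, a sketch of Section 4.6."""
    merged = [set(c) for c in prev_clusters]
    for new_cluster in new_clusters:
        # Find every previous cluster sharing at least one mention with the new one.
        overlapping = [c for c in merged if c & new_cluster]
        union = set(new_cluster).union(*overlapping) if overlapping else set(new_cluster)
        merged = [c for c in merged if c not in overlapping]
        merged.append(union)
    return merged

# Example: a mention seen in both parts links two partial chains into one cluster.
prev = [{(0, 1), (10, 11)}]                  # e.g., General Electric ... the company
new = [{(10, 11), (25, 26)}]                 # e.g., the company ... it
print(merge_overlapping_predictions(prev, new))  # one merged cluster of three mentions
```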
5 Experiments
5.1 Training
Figure 5: Example of overlapping segments splitting

We trained all models on NVIDIA A40 GPUs using online learning (batch size of 1 document). We limit the maximum sequence length to 6 non-overlapping segments of 512 tokens. During training, if a document is longer than 6 × 512 tokens, a random segment offset is sampled to take a random contiguous block of 6 segments, and the rest are discarded. During prediction, longer documents are split into independent sub-documents (for simplicity, non-overlapping again). For the heads-only model, we double the maximum number of segments, since the model is much more memory-efficient and we can afford it. We train a model for each dataset for approximately 80k updates in our monolingual experiments. For joined pre-trained models, we use 80k steps of pre-training on all the datasets and approximately 30k steps of fine-tuning on each dataset. Each training run took from 8 to 14 hours.
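The random truncation of long documents during training can be sketched as follows; the function name and segment representation are illustrative assumptions.

```python
import random

def sample_segment_block(segments, max_segments=6):
    """Training-time truncation of long documents, a sketch of Section 5.1:
    keep a random contiguous block of at most `max_segments` segments."""
    if len(segments) <= max_segments:
        return segments
    offset = random.randint(0, len(segments) - max_segments)
    return segments[offset:offset + max_segments]
```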
5.2 Baseline and Extensions
We began by training a monolingual model for each language. For these experiments, we use XLM-Roberta-large and
specific monolingual models for each language. Monolingual models for individual languages are listed in Table 5.
Then we trained a joined cross-lingual model on all the datasets, which we fine-tune for each dataset afterwards.
To evaluate the effect of each extension, we first run all its variants on a basic XLM-R-large model and measure the
performance for each language (Sections 6.1.1 and 6.1.2). Then we compare each extension in its best configuration to
the baseline joined model in Section 6.1.3 (since joined models achieve the best results among our three baseline models). In the last step, we evaluate the effect of overlapping segments only on the best model configuration for each language
(because it is orthogonal to all other extensions).
5.2.1 Monolingual Models
We try a specific monolingual model for each language, using a large variant where available. For Czech, we use Czert [26]; for English, RoBERTa-large [27]; for German, gbert-large [28]. For the other languages, we use the models from [29, 30, 31, 32, 33, 34, 35, 36]. The complete list can be found in Table 5.
5.3 Zero-shot Cross-lingual Experiments
We also conducted zero-shot cross-lingual experiments in two variants: dataset zero-shot and language zero-shot. In the dataset zero-shot setting, we train the model on all the datasets except one and then evaluate the model on the excluded
dataset. However, this approach is not a true cross-lingual zero-shot test, as multiple datasets exist for some languages,
meaning the model was still exposed to the language of the evaluation data during training.
To address this, we performed language zero-shot experiments, where we removed all datasets for a particular language
from the training data, and then evaluated the model on all datasets in that language. Additionally, we conducted a final
experiment focusing on the ParCor corpora, which consists of parallel English and German datasets. This allowed us to
perform a true zero-shot evaluation on both ParCor corpora.
Table 5: Results of baseline models on dev data. CRAC 2023 official evaluation metric
Dataset/Model monolingual model name reference Monoling XLM-R joined
ca_ancora PlanTL-GOB-ES/roberta-base-ca [29] 74.23 ±.55 70.51 ±.65 74.42 ±.25
cs_pcedt Czert-B-base-cased [26] 73.85 ±.24 72.28 ±.28 73.23 ±.08
cs_pdt Czert-B-base-cased [26] 70.29 ±.35 71.87 ±.7 73.47 ±.23
de_parcorfull deepset/gbert-base [28] 59.83 ±2 71.79 ±3.6 76.57 ±1.5
de_potsdamcc deepset/gbert-large [28] 64.78 ±1.7 68.98 ±2.4 76.3 ±1.1
en_gum roberta-large [27] 67.415 ±2.5 65.73 ±.36 69.04 ±.49
en_parcorfull roberta-large [27] 66.39 ±7.0 67.66 ±2.9 69.16 ±2.0
es_ancora PlanTL-GOB-ES/roberta-large-bne [30] 70.73 ±.23 73.27 ±.7 76.09 ±.13
fr_democrat camembert/camembert-large [31] 57.08 ±.9 57.35 ±1.6 65.24 ±.49
hu_korkor SZTAKI-HLT/hubert-base-cc [32] 60.68 ±1.6 59.44 ±1.0 68.99 ±.95
hu_szegedkoref SZTAKI-HLT/hubert-base-cc [32] 66.75 ±1.7 67.67 ±.44 69.52 ±.59
lt_lcc EMBEDDIA/litlat-bert [33] 76.84 ±1.1 72.52 ±.48 73.59 ±1.1
no_bokmaalnarc ltg/norbert3-large [36] 73.47 ±1.2 72.11 ±.78 74.79 ±.44
no_nynorsknarc ltg/norbert3-large [36] 73.51 ±.67 72.32 ±.55 75.98 ±.17
pl_pcc allegro/herbert-large-cased [34] 74.3 ±.29 72.48 ±.61 74.1 ±.23
ru_rucor DeepPavlov/rubert-base-cased [35] 64.69 ±.65 68.52 ±.35 70.38 ±.64
tr_itcc dbmdz/electra-base-tur.-cased-dis. 18.39 ±1.3 21.49 ±1.6 45.39 ±1.3
avg 65.48 66.23 70.96
5.4 Seq2Seq
6 Results & Discussion
Table 5 presents the results of the baseline models on the development data. The Monoling column shows the result of the monolingual model specific to each language. The XLM-R column presents the results of XLM-Roberta-large trained separately for each dataset. The Joined column corresponds to the joined model described in Section 4.1. Several key observations can be drawn from these results. Most notably, monolingual models tend to outperform the cross-lingual XLM-Roberta model, especially for the Lithuanian and Czech PCEDT datasets, where the monolingual models—even though they are smaller—exhibit significantly better performance. The reason probably lies in how different these two datasets are from the others. We can see that for these two datasets, joined training does not help as much as for other datasets. For the rest of the datasets, joined training surpasses both models (monolingual and multilingual) trained from scratch.
6.1 Effects of Extensions
6.1.1 Span2Head
Table 6 shows the performance of different variants of the Span2Head model. The None column shows the results of the baseline XLM-R-large model without any extensions. The Multi column reflects the results of the multi-class relative position classification (the first method described in Section 4.3). The Binary column represents the second method (binary classification of span-headword candidate pairs). From the results, we can draw several conclusions:
1. The Span2Head model helps slightly.
2. The binary model achieves better results than the multi-class model.
3. For all the datasets where the Span2Head model helps, the heads-only model is much better, so it does not make sense to use the Span2Head model in the next experiments.
Another interesting point is that for a monolingual model trained from scratch, the heads-only model improves the results on 13 out of 17 datasets, and the results are worse only for a single dataset (English-ParCor).
6.1.2 Singletons
Table 7 shows the results of all proposed variants of the singletons extension. Again, the None column represents the baseline XLM-R-large trained from scratch for each dataset. The Dummy column shows the results of the plain dummy antecedent
Table 6: Results of XLM-R baseline with different types of Span2Head model
Dataset/S2H model None Multi Binary Heads
ca_ancora 70.7 65.08 65.16 78.93
cs_pcedt 72.06 62.73 69.77 74.83
cs_pdt 71.21 66.19 71.64 78.08
de_parcorfull 72.99 64.44 65.73 73.18
de_potsdamcc 68.26 66.03 70.77 73.28
en_gum 64.06 61.13 64.64 72.95
en_parcorfull 69.65 58.4 64.42 67.42
es_ancora 73.18 66.75 65.2 79.7
fr_democrat 55.97 50.18 51.3 60.78
hu_korkor 60.08 46.52 58.32 60.34
hu_szegedkoref 68.02 63.76 67.44 68.87
lt_lcc 72.48 69.5 72.01 75.13
no_bokmaalnarc 72.83 67.52 72.31 72.93
no_nynorsknarc 72.2 67.3 71.34 75.05
pl_pcc 72.32 69.97 72.39 74.19
ru_rucor 68.68 66.03 69.42 72.05
tr_itcc 20.51 17.32 17.08 27.82
avg 66.19
Table 7: Results of XLM-R baseline with different types of singletons model
Model None Dummy Mask Separate Mentions % of singletons
+params 0 3092 3092 40M 40M -
ca_ancora 70.7 70.81 68.75 70.82 71.61 2.6
cs_pcedt 72.06 72.03 71.65 71.81 72.44 6.6
cs_pdt 71.21 70.75 70.04 72.23 72.84 40.8
de_parcorfull 72.99 70.23 70.09 65.47 71.29 6.2
de_potsdamcc 68.26 69.67 68.24 68.44 67.49 76.5
en_gum 64.06 70.4 69.81 70.89 71.61 73.8
en_parcorfull 69.65 67.49 64.78 65.89 64.43 6.9
es_ancora 73.18 72.88 70.98 73.74 73.85 3.3
fr_democrat 55.97 55.6 53.96 53.86 59.75 81.6
hu_korkor 60.08 60.56 56.6 55.59 57.91 0.9
hu_szegedkoref 68.02 67.1 66.57 67.1 67.03 8.0
lt_lcc 72.48 72.15 75.59 73.3 74.48 11.2
no_bokmaalnarc 72.83 73.28 73.33 73.84 74.15 89.4
no_nynorsknarc 72.2 72.99 73.57 74.77 74.26 88.7
pl_pcc 72.32 73.78 72.81 74.4 74.49 82.6
ru_rucor 68.68 67.56 67.91 68.55 69.12 3.3
tr_itcc 20.51 21.11 16.07 20.24 13.21 0.0
avg 66.19 66.38 65.34 65.94 66.47 -
for singletons. Mask extends the previous method with binary-score masking for singletons, which means that the method does not use the binary score for similarities to the singleton dummy antecedent. Separate uses a separate FFNN for singleton score prediction (different from the one used for standard mentions). The last method, Mentions, represents the method described in Section 4.5.2. Additionally, the table shows the number of parameters added by each variant of the model. The last column shows the percentage of singleton entities in each dataset (datasets with few singletons do not need any singleton model).

From the table, we can see that on average the best results were achieved with the Mentions model. The Mask model has very poor results, but this makes sense because the number of parameters specific to singleton modeling is very low (3,092 parameters used as the embedding of the dummy antecedent), and if we discard the binary score, we reduce the learning power of the model even more. The Dummy model does not help statistically significantly for any dataset. The Separate and Mentions models are significantly better for Spanish (Mentions even for Catalan), which has almost no singletons. One reasonable explanation is that the baseline model is under-parametrized for this dataset. We evaluated this hypothesis by
Table 8: Performance gains over the best base model with the proposed extensions, together with 95% confidence intervals. The models that surpass the baseline significantly are in bold. The results that are better only at a lower confidence level are underlined. Variants selected for the best models are in green.
Dataset/Extension Joined Trees Span2head Singletons Only Heads Best Comb.
ca_ancora 74.42 ±.25 74.17 ±.27 72.65 ±.39 74.93 ±.33 82.22 ±.4 82.12 ±.13
cs_pcedt 73.23 ±.08 73.2 ±.1 71.27 ±.11 73.17 ±.22 75.95 ±.13 75.89 ±.09
cs_pdt 73.47 ±.23 73.2 ±.46 72.98 ±.29 73.37 ±.28 79.43 ±.1 79.51 ±.02
de_parcorfull 76.57 ±1.5 79.4 ±1.1 76.35 ±1.8 79.57 ±.93 78.63 ±1.9 79.13 ±.89
de_potsdamcc 76.3 ±1.1 78.5 ±1.6 76.28 ±.94 76.62 ±1.2 75.16 ±.79 78.41 ±.89
en_gum 69.04 ±.49 70.19 ±.75 70.84 ±.27 71.29 ±.62 75.18 ±.11 75.51 ±.16
en_parcorfull 69.16 ±2.0 74.1 ±2.7 67.19 ±1.4 70.4 ±2.3 61.77 ±1.4 72.8 ±2.5
es_ancora 76.09 ±.13 76.07 ±.15 72.97 ±.28 76.2 ±.18 82.39 ±.08 82.43 ±.07
fr_democrat 65.24 ±.49 65.46 ±.4 65.38 ±.49 66.23 ±.44 67.72 ±.81 68.58 ±.23
hu_korkor 68.99 ±.95 67.75 ±1.5 71 ±.69 69.82 ±1.5 73.51 ±.86 73.55 ±.57
hu_szegedkoref 69.52 ±.59 69.59 ±.54 69.32 ±.34 69.47 ±.82 70.68 ±.46 70.67 ±.31
lt_lcc 73.59 ±1.1 74.7 ±.72 73.65 ±.62 76.28 ±.79 76.77 ±.98 77.65 ±.71
no_bokmaalnarc 74.79 ±.44 75.48 ±.49 75.64 ±.66 74.67 ±.24 77.84 ±.29 78.15 ±.29
no_nynorsknarc 75.98 ±.17 74.79 ±.48 74.96 ±1 76.59 ±.42 78.51 ±.21 78.81 ±.16
pl_pcc 74.1 ±.23 74.35 ±.13 74.19 ±.24 75.22 ±.2 76.1 ±.11 76 ±.17
ru_rucor 70.38 ±.64 70.23 ±.42 70.49 ±.42 70.6 ±.41 75.8 ±.58 75.98 ±.42
tr_itcc 45.39 ±1.3 34.79 ±7.7 29.53 ±5.2 45.63 ±.4 45.17 ±3.9 44.17 ±1.8
avg 61.30 62.03 66.21
increasing the model size, but the improvement was not statistically significant. In the following section, we evaluate singletons on the model with joined pretraining and compute confidence intervals; in Table 13 we can see that the improvement is below statistical significance (at the 95% confidence level). Another possibility is that the model learns to classify properly most of the 3.3% singletons, and this leads to a slight improvement of the overall score.
Another important conclusion from this table is that for all the datasets that have some singletons, the singletons model
achieves better results than the baseline.
6.1.3 Best Combinations
Table 8 presents the performance of all the proposed extensions. The table also shows the 95% confidence intervals. Span2Head and Only Heads extend Trees (both also use syntax-tree representations). We can see that tree representations alone improve the results only for the ParCor corpora. The Span2Head model helps significantly only for the English-GUM and Hungarian-KorKor corpora, but it is not better than the heads-only model. Singleton modeling helps for most of the datasets that have singletons annotated; the exceptions are the Norwegian Bokmaal and German Potsdam datasets. Another interesting point is that for the German ParCor dataset, singleton modeling helps significantly even though de-parcor itself does not contain singletons. One possible explanation lies in the joined model: adding the singletons from other languages to the training data can help to predict the mentions in the German ParCor dataset. The results in Table 7 support this theory, because we can see that when we train the model from scratch, singleton modeling does not help. The heads-only model helps for almost all the datasets, since it reduces the mention space and improves precision significantly. The extensions selected for the best model are highlighted in green, and the last column shows the results of the best combination. The experiments were run again, so even for the datasets where a single extension was selected the results are slightly different, but the confidence intervals match.
6.1.4 Overlapping Segments
Table 9 shows the evaluation of the longer-context implementation through overlapping segments. We use a maximum length of 8 segments and evaluate the minimum and maximum overlap (1 and 7 segments, respectively). We also try variants with filtering of already-seen mentions, as described in Section 4.6. For most of the datasets the results do not change at all, so we additionally run the same experiment with just 4 segments (Table 10) and count basic statistics of long coreference relations (Table 11). Table 11 presents three types of information:
1. Column cross N segment corefs shows the percentage of coreference links that cross the boundary of N segments. These links cannot be predicted without the overlapping segments.
2. Column nearest coref cross N is the percentage of mentions whose nearest antecedent is already in a different segment block, so all of their antecedents are in a different block and the antecedent cannot be predicted with the standard model.
3. Column segments over N shows the percentage of segments that come after the first N-segment block. Only those segments are affected by prediction with segment overlapping.
Table 9: Results of the best model configurations with different approaches to merge segments during prediction step.
Maximum 8 segments
max. segments 8 None min min_filter max max_filter
ca_ancora 82.39 ±.25 82.57 ±.25 82.54 ±.24 82.5 ±.24 82.5 ±.25
cs_pcedt 78.23 ±.13 78.18 ±.11 78.23 ±.15 78.19 ±.15 78.12 ±.1
cs_pdt 80.02 ±.21 79.95 ±.19 79.97 ±.2 79.98 ±.2 79.96 ±.19
de_parcorfull 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3
de_potsdamcc 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2
en_gum 76.25 ±.12 76.25 ±.12 76.25 ±.12 76.25 ±.12 76.25 ±.12
en_parcorfull 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1
es_ancora 82.65 ±.27 82.65 ±.27 82.65 ±.27 82.65 ±.27 82.65 ±.27
fr_democrat 69.49 ±.46 68.44 ±.69 69.01 ±.44 69.03 ±.58 68.9 ±.91
fr_split 69.44 ±.42 69.97 ±.15 69.98 ±.20 69.43 ±.50 69.66 ±.48
hu_korkor 73.41 ±.43 73.41 ±.43 73.41 ±.43 73.41 ±.43 73.41 ±.43
hu_szegedkoref 71.2 ±.14 71.2 ±.14 71.2 ±.14 71.2 ±.14 71.2 ±.14
lt_lcc-corefud 76.82 ±.41 76.82 ±.41 76.82 ±.41 76.82 ±.41 76.82 ±.41
no_bokmaalnarc 78.96 ±.37 78.96 ±.37 78.96 ±.37 78.96 ±.37 78.96 ±.37
no_nynorsknarc 80.5 ±.14 80.5 ±.14 80.5 ±.14 80.5 ±.14 80.5 ±.14
pl_pcc 75.97 ±.17 76.14 ±.23 76.05 ±.22 76.06 ±.23 76.1 ±.15
ru_rucor 75.27 ±.42 77.23 ±.63 76.65 ±.39 76.77 ±.39 77.19 ±.57
tr_itcc 44.32 ±2.2 45.22 ±1.5 45.26 ±1.6 45.22 ±1.4 45.32 ±.98
We can see that a significant number of long-distant coreferences appear in the French, Turkish, and Russian datasets. Some of them are also in both Czech datasets, Catalan, and Polish. By far the most long-distant coreferences are in the French dataset, but from Table 9 we can see that segment overlapping decreases the performance there. We investigated the dataset closely and found out that in this dataset multiple documents are quite often merged into a single document. When this happens, cross-coreferences between those concatenated documents are not annotated, so a model which can predict long-distant coreferences is penalized. To evaluate the models correctly, we add a variant where we split the documents in the French dev data manually (row fr_split). After splitting, the results with segment overlapping improve significantly, but they are still not as high as we expected in comparison with Russian, where the overlap helps significantly. To evaluate this a bit further, we also computed the recall for cross-segment coreference links. The recall is 72% for Russian and only 44% for French. By comparing these datasets manually, we conclude that the coreference chains are very long in the French dataset and it is probably too hard to predict them correctly from the antecedents (one mistake in the chain might have a huge effect on the overall performance). For the Turkish dataset, it is hard to make any significant improvement due to the low quality of the dataset.
6.2 Cross-lingual Transfer
Table 12 shows the results of the cross-lingual transfer evaluation. We can see that for most of the datasets, fine-tuning the model on the specific data helps, but the difference is not very significant; for several datasets, the results are even comparable. For small datasets, the fine-tuned models achieve much better results than the joined model. Note that we do not use any weighting of particular datasets in the joined collection, so small datasets have a marginal effect on the training. The results of dataset zero-shot and language zero-shot are mostly very predictable. For the languages that have more datasets in the CorefUD collection, the drop of zero-shot compared to the joined model was smaller than for the rest of the datasets. When we discard all the datasets for each language, the results are very similar for most of the datasets. There are several exceptions that should be analyzed further. The strangest results are on the Turkish dataset, where the zero-shot model achieves the best performance. Together with the fact that the numbers are very low for this dataset, this suggests that there is some noise in the dataset annotation. If we train the model on the training part of the Turkish dataset, the noise is present on two sides (training and testing); if we train a zero-shot model, we eliminate one part.
Table 10: Results of the best model configurations with different approaches to merge segments during prediction step.
Maximum 4 segments
max. segments 4 none min min filter max max filter
ca_ancora 82.39 ±.24 82.52 ±.25 82.5 ±.22 82.42 ±.24 82.21 ±.24
cs_pcedt 77.82 ±.088 77.29 ±.17 77.47 ±.17 77.84 ±.096 77.65 ±.11
cs_pdt 79.9 ±.22 79.87 ±.16 79.84 ±.2 79.91 ±.22 79.74 ±.19
de_parcorfull 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3 81.54 ±1.3
de_potsdamcc 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2 75.88 ±2.2
en_gum 76.25 ±.12 76.25 ±.12 76.25 ±.12 76.25 ±.12 76.25 ±.12
en_parcorfull 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1 67.92 ±3.1
es_ancora 82.59 ±.27 82.48 ±.31 82.56 ±.26 82.59 ±.27 82.57 ±.28
fr_democrat 68.81 ±.84 68.64 ±.89 68.33 ±.79 68.81 ±.92 67.77 ±1.7
hu_korkor 73.41 ±.43 73.41 ±.43 73.41 ±.43 73.41 ±.43 73.41 ±.43
hu_szegedkoref 71.2 ±.14 71.2 ±.14 71.2 ±.14 71.2 ±.14 71.2 ±.14
lt_lcc 76.82 ±.41 76.82 ±.41 76.82 ±.41 76.82 ±.41 76.82 ±.41
no_bokmaalnarc 78.96 ±.36 78.91 ±.38 78.87 ±.34 78.96 ±.36 78.47 ±.37
no_nynorsknarc 79.87 ±.57 80.32 ±.18 80.15 ±.37 79.87 ±.57 78.94 ±.12
pl_pcc 75.94 ±.15 76.05 ±.18 76.03 ±.22 75.93 ±.15 75.71 ±.17
ru_rucor 75.58 ±.37 76.54 ±.31 75.89 ±.55 75.62 ±.42 72.8 ±.25
tr_itcc 43.61 ±1.5 43.85 ±1.8 43.81 ±1.3 43.59 ±1.5 42.19 ±2.7
Table 11: Statistics of long documents and distant coreference relations across the datasets.
cross N segment corefs [%]    nearest coref cross N [%]    segments over N [%]
4 8    4 8    4 8
ca_ancora 5.975 1.459 .6044 .2747 4.739 1.422
cs_pcedt 15.82 7.398 1.107 .1441 13.55 2.133
cs_pdt 9.787 1.54 .4787 .06769 4.175 .3976
de_parcorfull 0.0 0.0 0.0 0.0 0.0 0.0
de_potsdamcc 0.0 0.0 0.0 0.0 0.0 0.0
en_gum 0.0 0.0 0.0 0.0 0.0 0.0
en_parcorfull 0.0 0.0 0.0 0.0 0.0 0.0
es_ancora 3.012 0.0 .1396 0.0 .5076 0.0
fr_democrat 52.86 25.12 1.664 .8106 34.86/27.52 27.52/12.84
hu_korkor 0.0 0.0 0.0 0.0 0.0 0.0
hu_szegedkoref 0.0 0.0 0.0 0.0 0.0 0.0
lt_lcc 0.0 0.0 0.0 0.0 0.0 0.0
no_bokmaalnarc 9.026 0.0 .1092 0.0 3.846 0.0
no_nynorsknarc 15.49 0.0 .5405 0.0 10.29 0.0
pl_pcc 11.79 2.724 .3591 .1554 2.954 .8439
ru_rucor 43.77 21.76 3.273 1.283 35.44 16.46
tr_itcc 55.6 25.07 5.703 2.037 55.56 11.11
Table 12: Zero-shot cross-lingual transfer results
Dataset Finetuned Joined Dataset Zero-shot Lang. Zero-shot
ca_ancora 82.12 ±.13 81.9225 ±0.38 72.01 ±.17 58.53 ±.49
cs_pcedt 75.89 ±.09 74.7775 ±0.23 62.26 ±.02 51.78 ±.24
cs_pdt 79.51 ±.02 78.8575 ±0.25 69.92 ±.16 64.55 ±.28
de_parcorfull 79.13 ±.89 65.095 ±3.02 56.77 ±.82 62.33 ±.3
de_potsdamcc 78.41 ±.89 75.1325 ±2.24 67.15 ±1.4 64.33 ±.64
en_gum 75.51 ±.16 74.71 ±0.41 60.02 ±.39 62.03 ±.38
en_parcorfull 72.8 ±.2.5 46.3575 ±1.29 58.06 ±1.9 46.67 ±.81
es_ancora 82.43 ±.07 82.245 ±0.35 75.09 ±.26 61.68 ±.38
fr_democrat 68.58 ±.23 67.7225 ±0.5 60.32 ±.14 -
hu_korkor 73.55 ±.57 73.2625 ±0.54 58.26 ±.31 53.75 ±.35
hu_szegedkoref 70.67 ±.31 69.75 ±0.65 55.16 ±.11 53.36 ±.17
lt_lcc 77.65 ±.71 75.65 ±0.78 47.93 ±.59 -
no_bokmaalnarc 78.15 ±.29 77.8225 ±0.67 74.83 ±.05 64.77 ±.76
no_nynorsknarc 78.81 ±.16 77.955 ±0.46 72.7 ±.37 63.24 ±.94
pl_pcc 76 ±.17 76.0625 ±0.08 58.44 ±.13 -
ru_rucor 75.98 ±.42 72.665 ±0.83 62.12 ±.28 -
tr_itcc 44.17 ±1.8 38.52 ±1.61 46.26 ±.2 -
avg 74.67 71.09 62.19 58.92
Another interesting aspect is the superior performance of the German ParCor and English GUM datasets in the language zero-shot scenario.
6.3 Final Results
Table 13 presents the final results of the best-performing models on the test sets. For comparison, we use the best model of the CRAC 2023 Multilingual Coreference Resolution shared task from [37], but not its best submission. The best submission is much larger than ours and uses ensembling, so as the main model for comparison we use their RemBERT version without ensembling, which has a size comparable to our model.

From the table, we can see that our model outperformed CorPipe on most of the datasets. For some datasets (mostly the ones without singletons annotated), it outperformed even the large CorPipe model, which has approximately 3 times more trainable parameters. The only dataset where CorPipe outperformed our model by a large margin is the Polish one, which we consider an anomaly, because for this dataset the CorPipe results on the test set are much better than on the dev set; on the dev set we achieve similar results. The same happened in the opposite direction for the Russian dataset. Surprisingly, we even outperformed the large CorPipe model in the average score, but this is caused mainly by our large margin on Turkish and German ParCor.
7 Conclusion
In this paper, we explored and evaluated various approaches to multilingual coreference resolution using the CorefUD
1.1 dataset. Our experiments revealed that monolingual models typically outperform cross-lingual models, especially
for languages with datasets that are distinct in their characteristics. However, joint training across languages can still
provide benefits for most of the datasets, where sufficient cross-linguistic similarities exist.
We proposed several extensions to enhance the baseline models, including cross-lingual training, Span2Head modeling,
syntactic information integration, headword mention representation, and long-context prediction. Among these,
the heads-only model and singleton modeling showed the most consistent improvements across different datasets,
demonstrating the importance of targeted adaptations for coreference resolution tasks. For several datasets, long-context prediction also brings a significant improvement. Additionally, our zero-shot cross-lingual experiments provided insights
into the challenges and opportunities of cross-lingual transfer, with the Turkish dataset results highlighting potential
issues related to noise in the training data.
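As a rough illustration of the long-context prediction idea mentioned above, the sketch below splits a tokenized document into overlapping segments so that mentions near a segment boundary are always seen with context on both sides; the window size, stride, and the overlapping_segments helper are illustrative assumptions, not the exact configuration used in our experiments:

```python
# Illustrative sketch only: split a tokenized document into overlapping
# windows so that mentions near segment boundaries keep context on both sides.
# Window and stride sizes are assumptions chosen for illustration.
def overlapping_segments(tokens, window=512, stride=384):
    """Yield (start_offset, segment) pairs covering the whole document."""
    start = 0
    while start < len(tokens):
        yield start, tokens[start:start + window]
        if start + window >= len(tokens):
            break
        start += stride  # consecutive windows overlap by (window - stride) tokens

# Example: a 1200-token document produces segments starting at 0, 384, and 768.
doc = [f"tok{i}" for i in range(1200)]
for offset, segment in overlapping_segments(doc):
    print(offset, len(segment))
```

In practice, predictions from neighbouring windows have to be merged back onto the original token indices; the offsets yielded by the generator are meant to support exactly that step.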
Overall, our findings emphasize the need for tailored approaches in coreference resolution, particularly when dealing with diverse languages and annotation schemes. Future work could further investigate the impact of dataset-specific characteristics on model performance and explore additional strategies for enhancing cross-lingual transfer, especially for low-resource languages. Our source code is publicly available for subsequent research at https://github.com/ondfa/coref-multiling.

Table 13: Final results on test sets
Dataset Our Best CorPipe CorPipe-large
ca_ancora 82.05 79.93 82.39
cs_pcedt 78.85 76.02 77.93
cs_pdt 78.23 76.76 77.85
de_parcorfull 77.62 63.3 69.94
de_potsdamcc 71.07 72.63 67.93
en_gum 73.19 72.33 75.02
en_parcorfull 61.62 57.58 64.79
es_ancora 82.71 81.18 82.26
fr_democrat 69.59 65.42 68.22
hu_korkor 69.03 66.18 67.95
hu_szegedkoref 69.02 65.4 69.16
lt_lcc 70.69 68.63 75.63
no_bokmaalnarc 76.24 75.43 78.94
no_nynorsknarc 74.54 73.64 77.24
pl_pcc 77 79.04 78.93
ru_rucor 81.48 78.43 80.35
tr_itcc 56.45 42.47 49.97
avg 73.49 70.26 73.21
Acknowledgement
Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures. This work has been supported by Grant No. SGS-2022-016 Advanced methods of data processing and analysis.
References
[1]
Liyan Xu and Jinho D. Choi. Revealing the myth of higher-order inference in coreference resolution. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
8527–8533, Online, November 2020. Association for Computational Linguistics.
[2]
Rhea Sukthanker, Soujanya Poria, Erik Cambria, and Ramkumar Thirunavukarasu. Anaphora and coreference
resolution: A review. Information Fusion, 59:139–162, 2020.
[3]
Anna Nedoluzhko, Michal Novák, Martin Popel, Zdeněk Žabokrtský, Amir Zeldes, and Daniel Zeman. CorefUD 1.0: Coreference meets Universal Dependencies. In Proceedings of LREC, 2022.
[4]
Joakim Nivre, Daniel Zeman, Filip Ginter, and Francis Tyers. Universal Dependencies. In Proceedings of the
15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts,
Valencia, Spain, April 2017. Association for Computational Linguistics.
[5]
Anna Nedoluzhko, Michal Novák, Martin Popel, Zdeněk Žabokrtský, and Daniel Zeman. Coreference meets
Universal Dependencies – a pilot experiment on harmonizing coreference datasets for 11 languages. ÚFAL MFF
UK, Praha, Czechia, 2021.
[6]
Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197,
Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[7]
Kenton Lee, Luheng He, and Luke Zettlemoyer. Higher-order coreference resolution with coarse-to-fine inference.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, 2018.
[8]
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel S Weld. Bert for coreference resolution: Baselines and
analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the
9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5803–5808, 2019.
[9]
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT:
Improving pre-training by representing and predicting spans. Transactions of the Association for Computational
Linguistics, 8:64–77, 2020.
[10]
V Dobrovolskii. Word-level coreference resolution. In EMNLP 2021-2021 Conference on Empirical Methods in
Natural Language Processing, Proceedings, pages 7670–7675, 2021.
[11]
Karel D’Oosterlinck, Semere Kiros Bitew, Brandon Papineau, Christopher Potts, Thomas Demeester, and Chris
Develder. Caw-coref: Conjunction-aware word-level coreference resolution. In Proceedings of The Sixth Workshop
on Computational Models of Reference, Anaphora and Coreference (CRAC 2023), pages 8–14, 2023.
[12]
Zdeněk Žabokrtský, Miloslav Konopik, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondrej Prazak, Jakub Sido, and Daniel Zeman. Findings of the second shared task on multilingual coreference resolution. In Zdeněk Žabokrtský and Maciej Ogrodniczuk, editors, Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution, pages 1–18, Singapore, December 2023. Association for Computational Linguistics.
[13]
Zdeněk Žabokrtský, Miloslav Konopík, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondřej Pražák, Jakub Sido, Daniel Zeman, and Yilun Zhu. Findings of the shared task on multilingual coreference resolution. In Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference, Gyeongju, Republic of Korea, October 2022. Association for Computational Linguistics.
[14]
Ondřej Pražák and Miloslav Konopik. End-to-end multilingual coreference resolution with mention head prediction. In Zdeněk Žabokrtský and Maciej Ogrodniczuk, editors, Proceedings of the CRAC 2022 Shared Task on Multilingual Coreference Resolution, pages 23–27, Gyeongju, Republic of Korea, October 2022. Association for Computational Linguistics.
[15]
Ondřej Pražák, Miloslav Konopík, and Jakub Sido. Multilingual coreference resolution with harmonized annotations. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1119–1123, 2021.
[16]
Milan Straka and Jana Straková. ÚFAL CorPipe at CRAC 2022: Effectivity of multilingual models for coreference
resolution. In Zdeněk Žabokrtský and Maciej Ogrodniczuk, editors, Proceedings of the CRAC 2022 Shared Task on Multilingual Coreference Resolution, pages 28–37, Gyeongju, Republic of Korea, October 2022. Association for Computational Linguistics.
[17]
André Ferreira Cruz, Gil Rocha, and Henrique Lopes Cardoso. Exploring spanish corpora for portuguese
coreference resolution. In 2018 Fifth International Conference on Social Networks Analysis, Management and
Security (SNAMS), pages 290–295, 2018.
[18]
Gourab Kundu, Avi Sil, Radu Florian, and Wael Hamza. Neural cross-lingual coreference resolution and its applica-
tion to entity linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 395–400, Melbourne, Australia, July 2018. Association for Computational
Linguistics.
[19]
Gorka Urbizu, Ander Soraluze, and Olatz Arregi. Deep cross-lingual coreference resolution for less-resourced
languages: The case of Basque. In Proceedings of the Second Workshop on Computational Models of Refer-
ence, Anaphora and Coreference, pages 35–41, Minneapolis, USA, June 2019. Association for Computational
Linguistics.
[20]
Marta Recasens and Eduard H. Hovy. BLANC: Implementing the Rand index for coreference evaluation. Natural
Language Engineering, 17(4):485–510, 2011.
[21]
Nafise Sadat Moosavi and Michael Strube. Which coreference evaluation metric do you trust? a proposal for
a link-based entity aware metric. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages 632–642, Berlin, Germany, August 2016. Association for
Computational Linguistics.
[22]
Marc Vilain, John D Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. A model-theoretic
coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference
Held in Columbia, Maryland, November 6-8, 1995, 1995.
[23]
Amit Bagga and Breck Baldwin. Algorithms for scoring coreference chains. In The first international conference
on language resources and evaluation workshop on linguistics coreference, volume 1, pages 563–566. Citeseer,
1998.
[24]
Xiaoqiang Luo. On coreference resolution performance metrics. In Proceedings of human language technology
conference and conference on empirical methods in natural language processing, pages 25–32, 2005.
[25]
Massimo Poesio, Juntao Yu, Silviu Paun, Abdulrahman Aloraini, Pengcheng Lu, Janosch Haber, and Derya Cokal.
Computational models of anaphora. Annual Review of Linguistics, 9(1):561–587, 2023.
[26]
Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, and Miloslav Konopík. Czert – Czech BERT-like model for language representation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1326–1338, 2021.
[27]
Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. A robustly optimized bert pre-training approach with post-training.
In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218–1227, 2021.
[28]
Branden Chan, Stefan Schweter, and Timo Möller. German’s next language model. In Proceedings of the 28th
International Conference on Computational Linguistics, pages 6788–6796, 2020.
[29]
Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme
Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. Are multilingual models the
best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Chengqing
Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 4933–4946, Online, August 2021. Association for Computational Linguistics.
[30]
Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquin Silveira-Ocampo,
Casimiro Pio Carrino, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Aitor Gonzalez-Agirre, and Marta
Villegas. Maria: Spanish language models. Procesamiento del Lenguaje Natural, 68(0):39–60, 2022.
[31]
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte
de la Clergerie, Djamé Seddah, and Benoît Sagot. Camembert: a tasty french language model. In ACL 2020-58th
Annual Meeting of the Association for Computational Linguistics, 2020.
[32]
Dávid Márk Nemeskey. Introducing huBERT. In XVII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2021), page TBA, Szeged, 2021.
[33]
Matej Ulčar and Marko Robnik-Šikonja. Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages. arXiv preprint arXiv:2112.10553, 2021.
[34]
Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, and Ireneusz Gawlik. HerBERT: Efficiently pretrained
transformer-based language model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural
Language Processing, pages 1–10, Kiyv, Ukraine, April 2021. Association for Computational Linguistics.
[35]
Yuri Kuratov and Mikhail Arkhipov. Adaptation of deep bidirectional multilingual transformers for russian
language. arXiv preprint arXiv:1905.07213, 2019.
[36]
David Samuel, Andrey Kutuzov, Samia Touileb, Erik Velldal, Lilja Øvrelid, Egil Rønningstad, Elina Sigdel, and
Anna Palatkina. NorBench – a benchmark for Norwegian language models. In Tanel Alumäe and Mark Fishel,
editors, Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 618–633,
Tórshavn, Faroe Islands, May 2023. University of Tartu Library.
[37]
Milan Straka. ÚFAL CorPipe at CRAC 2023: Larger context improves multilingual coreference resolution. In Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution, pages 41–51, 2023.