Figure - uploaded by Maciej Ogrodniczuk
Main results: the CoNLL metric macro-averaged over all datasets. The table shows the primary metric (head-match excluding singletons) and three alternative metrics: partial-match excluding singletons, exact-match excluding singletons and head-match with singletons. A difference relative to the primary metric is reported in parenthesis. The best score in each column is in bold. The systems are ordered by the primary metric.
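For readers unfamiliar with the metric: the CoNLL score is the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores, and the table macro-averages it over datasets so each dataset contributes equally regardless of size. A minimal sketch (function names are illustrative, not from the shared task scorer):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """CoNLL score: unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

def macro_average(per_dataset_scores):
    """Macro-average: each dataset counts equally, regardless of its size."""
    return sum(per_dataset_scores) / len(per_dataset_scores)

# Two hypothetical datasets with per-metric F1 scores:
scores = [conll_f1(70.0, 65.0, 60.0),   # -> 65.0
          conll_f1(80.0, 75.0, 70.0)]   # -> 75.0
overall = macro_average(scores)          # -> 70.0
```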

Contexts in source publication

Context 1
... main results are summarized in Table 3. The CorPipe system is the best one according to the official primary metric (head-match excluding singletons) as well as according to three alternative metrics: partial-match excluding singletons (which was the primary metric last year), exact-match excluding singletons and head-match including singletons. ...

Similar publications

Preprint
Full-text available
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a...
Chapter
Full-text available
Abstract This study explores growth mindset among plurilingual students learning French at Nanyang Technological University (NTU) in Singapore. Drawing on Carol Dweck's work, the study examines how a growth mindset influences learners' perseverance and motivation in a context...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multil...
Preprint
Full-text available
This chapter is intended to provide a succinct overview of linguistic and communicative variations clinicians may encounter in speakers whose first language is not English. Speakers of varieties other than the Standard dialect of American English (SAE) are also considered. The chapter's goal is to prepare clinicians for working with speakers from a...

Citations

... These datasets vary based on features such as domain, annotation schemes, and types of references labeled. These variations often lead to annotation inconsistencies, evaluation challenges, and domain limitations (Žabokrtský et al., 2023; Aloraini et al., 2024; Nedoluzhko et al., 2021b). This, along with the need for relatively large annotated datasets to train current state-of-the-art models, motivated us to look into automatic data annotation and harmonization. ...
Preprint
Full-text available
Training models that perform well on various NLP tasks requires large amounts of data, and this becomes more apparent with nuanced tasks such as anaphora and coreference resolution. To combat the prohibitive costs of creating manually gold-annotated data, this paper explores two methods to automatically create datasets with coreferential annotations: direct conversion from existing datasets, and parsing using multilingual models capable of handling new and unseen languages. The paper details the current progress on both fronts, as well as the challenges these efforts currently face and our approach to overcoming them.
... The CRAC Shared Task on Multilingual Coreference Resolution (CRAC-coref) [12] is an annual shared task that began in 2022 and is built upon the CorefUD collection. In fact, this paper is an extension of the CRAC22-coref [13] participant system [14], which is based on [15]. The best-performing system in CRAC 2022 was submitted by [16]. ...
Preprint
Full-text available
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-document modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long-document prediction, significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model of a comparable size from the CRAC 2023 shared task by a large margin. Our model is available on GitHub: https://github.com/ondfa/coref-multiling
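The long-document strategy mentioned in the abstract above, splitting a document into overlapping segments so a fixed-length encoder can cover all of it, can be sketched as follows (parameter names and values are illustrative, not the paper's):

```python
def overlapping_segments(tokens, seg_len=512, overlap=128):
    """Split a token sequence into segments of at most seg_len tokens,
    where consecutive segments share `overlap` tokens of context."""
    stride = seg_len - overlap
    assert stride > 0, "overlap must be smaller than seg_len"
    segments = []
    start = 0
    while start < len(tokens):
        segments.append(tokens[start:start + seg_len])
        if start + seg_len >= len(tokens):
            break  # last segment reaches the end of the document
        start += stride
    return segments
```

Predictions from the overlapping regions must then be merged (e.g., preferring the segment where a mention is farther from the boundary); the sketch covers only the segmentation step.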
Article
Coreference resolution is the task of resolving mentions that refer to the same entity into clusters. The area and its tasks are crucial in natural language processing (NLP) applications. Extensive surveys of this task have been conducted for English and Chinese, but far fewer for Arabic. The few Arabic surveys do not cover recent progress and the challenges of Arabic anaphora; nor do they cover zero resolution, comprehensive resolution of zeros and full mentions, or anaphora resolution beyond coreference (e.g., bridging). In this paper, we examine the state of the art in Arabic anaphora resolution, highlighting the challenges and advances in this field. We provide a comprehensive survey of the methods employed for Arabic coreference resolution, as well as an overview of the existing datasets and challenges. The goal is to equip researchers with a thorough understanding of Arabic anaphora resolution and to suggest potential future directions in the field.
Preprint
We present CorPipe 24, the winning entry to the CRAC 2024 Shared Task on Multilingual Coreference Resolution. In this third iteration of the shared task, a novel objective is to also predict the empty nodes needed for zero coreference mentions (whereas the empty nodes were given on input in previous years). This way, coreference resolution can be performed on raw text. We evaluate two model variants: a two-stage approach (where the empty nodes are predicted first using a pretrained encoder model and then processed together with sentence words by another pretrained model) and a single-stage approach (where a single pretrained encoder model jointly generates empty nodes, coreference mentions, and coreference links). In both settings, CorPipe surpasses the other participants by a large margin of 3.9 and 2.8 percentage points, respectively. The source code and the trained model are available at https://github.com/ufal/crac2024-corpipe .