Fig 1 - uploaded by Henrique Lopes Cardoso
![Coreference resolution examples. The third example was extracted from the "Winograd Schema Challenge" [2].](profile/Henrique-Lopes-Cardoso-2/publication/329393320/figure/fig1/AS:778746999103491@1562679161463/Coreference-resolution-examples-The-third-example-was-extracted-from-the-Winograd.png)
Context in source publication
Context 1
... resolution has a high impact on several other NLP tasks, including textual entailment, summarization, information extraction, and question answering. Figure 1 shows examples of sentences and their corresponding coreference chains. A classification algorithm could, for instance, use the hyponym/hypernym semantic relation between "bee" and "insect" to classify the two mentions as co-referent, and use world knowledge to infer a strong relation between "Barack Obama" and "president". ...
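The hyponym/hypernym feature described above can be sketched as a simple mention-pair check. The tiny hand-coded taxonomy below is purely illustrative; a real system would query a lexical resource such as WordNet instead.

```python
# Toy hypernym lookup standing in for a lexical resource like WordNet.
# Keys and values are lowercase mention strings (illustrative only).
HYPERNYMS = {
    "bee": {"insect", "animal"},
    "insect": {"animal"},
    "barack obama": {"president", "person"},
}

def is_hyponym_of(mention_a: str, mention_b: str) -> bool:
    """True if mention_a is a hyponym of mention_b in the toy taxonomy."""
    return mention_b.lower() in HYPERNYMS.get(mention_a.lower(), set())

def semantic_match_feature(mention_a: str, mention_b: str) -> bool:
    """Symmetric mention-pair feature: the two mentions stand in a
    hyponym/hypernym relation, suggesting they may be co-referent."""
    return is_hyponym_of(mention_a, mention_b) or is_hyponym_of(mention_b, mention_a)

print(semantic_match_feature("bee", "insect"))             # True
print(semantic_match_feature("Barack Obama", "president")) # True
print(semantic_match_feature("bee", "president"))          # False
```

In a full mention-pair classifier, this boolean would be one feature among many (string match, distance, grammatical agreement, etc.) fed to the classification algorithm.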
Similar publications
Unlike a few decades ago, using our phones, tablets, phablets or computers, all sorts of foreign language authentic materials are now easily available and accessible outside our language centres. Nevertheless, learners might find it difficult to select what is more effective for their learning process and be daunted by some complex features of natu...
Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iterati...
The article traces the authorial, disciplinary, bibliographic and thematic profile of the journal DADOS by means of a bibliometric analysis of all texts published by the journal over its 50 years of existence (1966-2015). The database was built by processing the texts indexed on the SciELO platform and by manually reading and coding the volumes...
According to previous studies on L1 Italian and Spanish, speakers prefer different pragmatic strategies and adopt specific pragmatic patterns to express their attention to the interlocutor. This study deals with communicative strategies used in dialogic speech in L1 and L2 Spanish considering both textual structure and interaction between the two i...
Citations
... The results reported here were partially supported by PORTULAN CLARIN, Research Infrastructure for the Science and Technology of Language, funded by Lisboa2020, Alentejo2020 and FCT (Fundação para a Ciência e a Tecnologia) under the Grant PINFRA/22117/2016. The PORTULAN CLARIN Workbench comprises a number of tools that are based on a large body of research work contributed by different authors and teams, which continues to grow and is acknowledged here: (Barreto et al., 2006; Branco et al., 2010, 2011, 2012, 2014; Cruz et al., 2018; Veiga et al., 2011; Branco & Henriques, 2003; Silva et al., 2009; Rodrigues et al., 2016, 2020; Costa & Branco, 2012; Santos et al., 2019; Miranda et al., 2011). ...
While language processing services are key assets for the science and technology of language, the possible ways under which they may be made available to the widest range of their end users are critical to support an Open Science policy for this scientific domain. Although providing such processing services under some web-based interface, at large, offers itself as an immediate and cogent response to that challenge, turning this view into an effective access to language processing services is an undertaking deserving a clear conceptual direction and a corresponding robust empirical validation. Based on an extensive overview of major undertakings towards making language processing tools available and on the design principles worked out and implemented in the PORTULAN CLARIN infrastructure, in this paper we advocate for a Research-Infrastructure-as-a-Service (RIaaS) model. This model unleashes accessibility to language processing services in as many web-based interface modalities as the current stage of technological development permits to support, in order to serve as many types of end users as possible, from IT developers to Digital Humanities researchers, and including citizen scientists, teachers, students and digital artists among many others.
... Other cross-lingual experiments include Portuguese by learning from Spanish [17]; Spanish and Chinese relying on an English corpus [18]; and Basque based on an English corpus as well [19]. All these approaches employ neural networks, and they transfer the model via cross-lingual word embeddings. ...
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-document modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long-document prediction, significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model from the CRAC 2023 shared task of a comparable size by a large margin. Our model is available on GitHub: \url{https://github.com/ondfa/coref-multiling}
... Researchers have also explored cross-lingual learning for coreference resolution in particular. Cruz et al. (2018) use a large Spanish corpus to create a model for Portuguese, leveraging FastText multilingual embeddings. Urbizu et al. (2019) work on coreference resolution for Basque, relying on English data from OntoNotes to train a cross-lingual model. ...
... Other cross-lingual experiments include Portuguese by learning from Spanish (Cruz et al., 2018); Spanish and Chinese relying on an English corpus (Kundu et al., 2018); and Basque based on an English corpus as well (Urbizu et al., 2019). All these approaches employ neural networks, and they transfer the model via cross-lingual word embeddings. ...
In this paper, we present coreference resolution experiments with a newly created multilingual corpus CorefUD. We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joined models -- for Slavic languages and for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and using joined models helps significantly for the languages with smaller training data.
... Even if the vast majority of coreference resolution approaches have focused on the English language, as in many other NLP research areas, individual language-specific approaches have also been developed. Dedicated approaches for major European languages have been proposed: German [43]-[45], Spanish [46], Portuguese [47], Czech [48], French [49], and many others. Concerning Italian, very few systems for coreference resolution have been proposed so far [50], [51]. Evaluation campaigns like SemEval 2010 [52] and CoNLL 2012 [24] and the development of multilingual resources like OntoNotes [53] or ParCor [54] shifted the focus to systems that are language-independent or can be adapted to several languages simultaneously. ...
... In particular, these approaches have used English as the source language, and they have been tested for Spanish and Italian [58], Portuguese and Spanish [60], and German and Russian [61]. Other studies have tested direct transfer learning between languages by using multilingual word embeddings, applying a model trained on one language to other languages that share a common semantic space [62]: experiments have been carried out on Chinese, Spanish, Portuguese and English [47], [63]. Figure 1 shows the working process behind the proposed coreference resolution system. ...
In recent years, the impact of Neural Language Models has changed every field of Natural Language Processing. In this scenario, coreference resolution has been among the least considered tasks, especially in languages other than English. This work proposes a coreference resolution system for Italian, based on a neural end-to-end architecture integrating the ELECTRA language model and trained on OntoCorefIT, a novel Italian dataset built starting from OntoNotes. Even if some approaches for Italian have been proposed in the last decade, to the best of our knowledge, this is the first neural coreference resolver aimed specifically at Italian. The performance of the system is evaluated with respect to three different metrics and also assessed by replacing ELECTRA with the widely used BERT language model, since its usage has proven to be effective in the coreference resolution task in English. A qualitative analysis has also been conducted, showing how different grammatical categories affect performance in an inflectional and morphologically rich language like Italian. The overall results have shown the effectiveness of the proposed solution, providing a baseline for future developments of this line of research in Italian.
... For other languages, available corpora are typically smaller in size. This scarcity poses a considerable barrier to improving coreference resolution of low-resource languages, which may be tackled using unsupervised approaches (see Section 5.5) or transfer learning from higher-resourced languages [28,29], a technique that is becoming more frequent. Corpora available for this task often differ on the annotation scheme used, the domain from which they were extracted, and the type of labeled coreferences. ...
... The model can be used to form predictions on the target data (without being explicitly trained using labeled data on the target language) after updating the embedding layer for the target language using multilingual word embeddings. Consequently, a model trained on one language can be used for any other language that shares its semantic space [28,29]. Cruz et al. [28] explored the direct transfer approach to leverage a Spanish corpus for coreference resolution in the Portuguese language. They reported competitive results compared to an in-language model, which supports further exploring transfer learning techniques to address less-resourced languages using the proposed approach. ...
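The direct transfer idea, swapping only the embedding lookup table while keeping the trained weights fixed, can be sketched in a few lines. The toy "aligned" vectors and the Spanish/Portuguese word pairs below are illustrative assumptions; real systems would use pretrained multilingual embeddings (e.g. aligned FastText vectors).

```python
# Toy aligned embeddings: translation equivalents share the same vector,
# mimicking a common multilingual semantic space (illustrative only).
shared = {"cat": [0.2, -0.5, 0.1, 0.9], "dog": [0.7, 0.3, -0.4, 0.2]}
source_emb = {"gato_es": shared["cat"], "perro": shared["dog"]}  # source language
target_emb = {"gato_pt": shared["cat"], "cao": shared["dog"]}    # target language

# Scoring weights "trained" on the source language; fixed here for illustration.
w = [0.5, -1.0, 0.25, 0.8]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(word, emb):
    """Linear score over the word's embedding; the weights never change."""
    return dot(w, emb[word])

# Swapping only the embedding table transfers the model: translation
# equivalents in the aligned space receive identical scores, so the model
# makes the same predictions on the target language without retraining.
print(score("gato_es", source_emb) == score("gato_pt", target_emb))  # True
```

The same mechanism underlies the neural approaches cited above: the higher layers of the network are language-agnostic as long as the input embedding spaces are aligned.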
The task of coreference resolution has attracted considerable attention in the literature due to its importance in deep language understanding and its potential as a subtask in a variety of complex natural language processing problems. In this study, we outline the field's terminology, describe existing metrics, their differences and shortcomings, as well as the available corpora and external resources. We analyze existing state-of-the-art models and approaches, and review recent advances and trends in the field, namely end-to-end systems that jointly model different subtasks of coreference resolution, and cross-lingual systems that aim to overcome the challenges of less-resourced languages. Finally, we discuss the main challenges and open issues faced by coreference resolution systems.
... Moreover, there has been some recent research to build cross-lingual systems for coreference resolution, as cross-lingual transfer learning has given good results in some other NLP tasks such as machine translation or language modeling (Lample and Conneau, 2019). Cruz et al. (2018) used neural networks to solve coreference for Portuguese by learning from Spanish, a related language, using cross-lingual word embeddings. Kundu et al. presented a similar system for Spanish and Chinese using English for training. ...
Coreference resolution systems aim to recognize and cluster together mentions of the same underlying entity. While there exist large amounts of research on broadly spoken languages such as English and Chinese, research on coreference in other languages is comparably scarce. In this work we first present SentiCoref 1.0 - a coreference resolution dataset for Slovene language that is comparable to English-based corpora. Further, we conduct a series of analyses using various complex models that range from simple linear models to current state-of-the-art deep neural coreference approaches leveraging pre-trained contextual embeddings. Apart from SentiCoref, we evaluate models also on a smaller coref149 Slovene dataset to justify the creation of a new corpus. We investigate robustness of the models using cross-domain data and data augmentations. Models using contextual embeddings achieve the best results - up to 0.92 average F1 score for the SentiCoref dataset. Cross-domain experiments indicate that SentiCoref allows the models to learn more general patterns, which enables them to outperform models, learned on coref149 only.
We report on the application of a neural network based approach to the problem of automatically categorizing texts according to their proficiency levels and suitability for learners of Portuguese as a second language. We resort to a particular deep learning architecture, namely Transformers, as we fine-tune GPT-2 and RoBERTa on data sets labeled with respect to the standard CEFR proficiency levels, that were provided by Camões IC, the Portuguese official language institute. Despite the reduced size of the data sets available, we found that the resulting models overperform previous carefully crafted feature based counterparts in most evaluation scenarios, thus offering a new state-of-the-art for this task in what concerns the Portuguese language.
This paper investigates the ability of the multilingual BERT (mBERT) language model to transfer syntactic knowledge cross-lingually, verifying if and to what extent syntactic dependency relationships learnt in one language are maintained in other languages. In detail, the main contributions of this paper are: (i) an analysis of the cross-lingual syntactic transfer capability of the mBERT model; (ii) a detailed comparison of cross-language syntactic transfer among languages belonging to different branches of the Indo-European languages, namely English, Italian and French, which present very different syntactic constructions; (iii) a study on the transferability of a syntactic phenomenon peculiar to the Italian language, namely pronoun dropping (pro-drop), also known as omissibility of the subject. To this end, a structural probe devoted to reconstructing the dependency parse tree of a sentence has been exploited, representing the input sentences with the contextual embeddings from mBERT layers. The results of the experimental assessment have shown a transfer of syntactic knowledge of the mBERT model among these languages. Moreover, the behaviour of the probe in the transition from pro-drop to non-pro-drop languages and vice versa has proven to be more effective in the case of languages sharing a common linguistic matrix. The possibility of transferring syntactic knowledge, especially in the case of specific phenomena, both meets a theoretical need and can have important practical implications in syntactic tasks, such as dependency parsing.