Source publication
We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to...
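To give a concrete sense of how such annotation layers are typically consumed, the following is a minimal sketch assuming the dependency layer is distributed in standard 10-column CoNLL-U format; the file name is hypothetical and not taken from the corpus release.

    # Minimal CoNLL-U reader for a dependency-annotated document.
    def read_conllu(path):
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                      # blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                elif line.startswith("#"):        # sentence-level metadata
                    continue
                else:
                    cols = line.split("\t")
                    if "-" in cols[0] or "." in cols[0]:
                        continue                  # skip multi-word and empty tokens
                    current.append({
                        "id": int(cols[0]), "form": cols[1], "upos": cols[3],
                        "head": int(cols[6]), "deprel": cols[7],
                    })
        if current:
            sentences.append(current)
        return sentences

    sents = read_conllu("amalgum_document.conllu")  # hypothetical path
    print(len(sents), "sentences;", sum(len(s) for s in sents), "tokens")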
Similar publications
In a cluster of news texts on the same event, two sentences from different documents might express different multi-document phenomena (redundancy, complementarity, and contradiction). Cross-Document Structure Theory (CST) provides labels to explicitly represent these phenomena. The automatic identification of the multi-document phenomena and their...
This article is part of an ongoing investigation of the combinatorics of q, t-Catalan numbers \(\mathrm{Cat}_n(q,t)\). We develop a structure theory for integer partitions based on the partition statistics dinv, deficit, and minimum triangle height. Our goal is to decompose the infinite set of partitions of deficit k into a disjoint union o...
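For general orientation (this is the standard combinatorial characterization of these polynomials, not the article's own decomposition into deficit classes), the q,t-Catalan numbers can be written as a sum over Dyck paths of order n:

\[ \mathrm{Cat}_n(q,t) \;=\; \sum_{\pi \in \mathcal{D}_n} q^{\mathrm{dinv}(\pi)}\, t^{\mathrm{area}(\pi)}, \]

where area counts the full cells between the path and the main diagonal, and dinv is the diagonal inversion statistic that the article studies in partition form.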
Poems, as aesthetic objects, generate a subjective experience, which can be different for different readers. In this paper, we propose a method to quantify these subjective experiences. We gave participants three parallel excerpts and asked them to describe, in free text, the perceived emotive qualities of these excerpts. The descriptions were anal...
Capital structure theory is a very important question in economics; the modern capital structure theory of Modigliani and Miller was proposed based on the assumption of a perfect capital market. The core problem of capital structure is the company's choice of financing options. With the vigorous development of the world's capital markets, enterprise's financing w...
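As brief background on the Modigliani–Miller result referenced above (standard textbook statements, not this article's own analysis): under perfect-market assumptions firm value is independent of leverage, the cost of equity rises linearly with the debt-to-equity ratio, and the classical extension with a corporate tax rate \(t_c\) adds a debt tax shield:

\[ V_L = V_U \ \text{(Proposition I)}, \qquad r_E = r_0 + (r_0 - r_D)\,\frac{D}{E} \ \text{(Proposition II)}, \qquad V_L = V_U + t_c D \ \text{(with corporate taxes)}, \]

where \(V_L\) and \(V_U\) are the values of the levered and unlevered firm, \(r_0\) is the unlevered cost of capital, \(r_D\) the cost of debt, and \(D/E\) the debt-to-equity ratio.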
Nuclear electric resonance (NER) spectroscopy is currently experiencing a revival as a tool for nuclear spin-based quantum computing. Compared to magnetic or electric fields, local electron density fluctuations caused by changes in the atomic environment provide a much higher spatial resolution for the addressing of nuclear spins in qubit registers...
Citations
... We also note that although the error prediction models evaluated in Section 4.2 were primarily developed in order to gain a greater understanding of the issues in discourse parsing, they could have some practical applications. Predicting regions of low certainty in discourse parses can: 1) assist by highlighting low-confidence regions in user-facing downstream applications; 2) flag potential problems during annotation of resources, especially when relying on NLP (Gessler et al., 2020) or less trained annotators/crowd workers (Scholman et al., 2022; Pyatkin et al., 2023); and 3) help guide additional resource acquisition, either automatically using active learning (to prioritize documents predicted to have parsing problems for manual annotation, cf. Gessler et al. 2022). ... A Validation Performance: Table 6 shows our reproduced 5-run average parsing performance on the dev partition of each corpus. ...
Despite recent advances in Natural Language Processing (NLP), hierarchical discourse parsing in the framework of Rhetorical Structure Theory remains challenging, and our understanding of the reasons for this is as yet limited. In this paper, we examine and model some of the factors associated with parsing difficulties in previous work: the existence of implicit discourse relations, challenges in identifying long-distance relations, out-of-vocabulary items, and more. In order to assess the relative importance of these variables, we also release two annotated English test sets with explicit correct and distracting discourse markers associated with gold standard RST relations. Our results show that as in shallow discourse parsing, the explicit/implicit distinction plays a role, but that long-distance dependencies are the main challenge, while lack of lexical overlap is less of a problem, at least for in-domain parsing. Our final model is able to predict where errors will occur with an accuracy of 76.3% for the bottom-up parser and 76.6% for the top-down parser.
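As an illustration of how such an error-prediction model might be set up (a sketch under assumed feature names, not the authors' released code), a simple classifier over per-relation features such as marker explicitness, attachment distance, and out-of-vocabulary rate could look like this; feature extraction is left as a stub.

    # Hedged sketch: logistic regression over hand-crafted features that are
    # plausible proxies for the factors studied above. Field names are hypothetical.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def extract_features(instance):
        # Hypothetical fields; replace with real corpus-derived features.
        return [
            1.0 if instance["has_explicit_marker"] else 0.0,
            float(instance["attachment_distance_edus"]),
            float(instance["oov_rate"]),
        ]

    def train_error_predictor(instances, labels):
        X = [extract_features(i) for i in instances]
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
        return clf

Documents could then be ranked by the predicted error rate of their relations, matching the active-learning use case mentioned in the citation context above.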
... Several proposals have been made to scale up anaphoric annotation in terms of size, range of domains, and phenomena covered, including automatic data augmentation (Emami et al., 2019; Gessler et al., 2020; Aloraini and Poesio, 2021), and crowdsourcing combined with active learning (Laws et al., 2012; Li et al., 2020; Yuan et al., 2022) or through Games-With-A-Purpose (Chamberlain et al., 2008; Hladká et al., 2009; Bos et al., 2017; Kicikoglu et al., 2019). However, the largest existing anaphoric corpora created using Games-With-A-Purpose (e.g., ) are still smaller than the largest resources created with traditional methods, and the corpora created using data augmentation techniques are focused on specific aspects of anaphoric reference. ...
... Scaling up anaphoric annotation: One approach to scaling up anaphoric reference annotation is to use fully automatic methods to either annotate a dataset, such as AMALGUM (Gessler et al., 2020), or create a benchmark from scratch, such as KNOWREF (Emami et al., 2019). While entirely automatic annotation may result in datasets of arbitrarily large size, such annotations cannot expand current models' coverage to aspects of anaphoric reference they do not already handle well. ...
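A minimal sketch of the fully automatic route described here, with the resolver call left as a placeholder (no specific library API is implied):

    # Build a silver-standard coreference dataset by running an off-the-shelf
    # resolver over raw documents. `resolve_coreference` is a hypothetical
    # wrapper around whatever coreference model is actually used.
    import json

    def resolve_coreference(text):
        # Placeholder: return a list of coreference chains, each a list of
        # (start_char, end_char) mention spans.
        raise NotImplementedError("plug in a coreference resolver here")

    def build_silver_corpus(documents, out_path):
        with open(out_path, "w", encoding="utf-8") as out:
            for doc_id, text in documents.items():
                chains = resolve_coreference(text)
                out.write(json.dumps({"doc_id": doc_id, "text": text,
                                      "chains": chains}) + "\n")

As the excerpt notes, such silver data inherits the resolver's blind spots, so it cannot by itself extend coverage to phenomena the model already handles poorly.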
Although several datasets annotated for anaphoric reference/coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet, the approaches proposed to scale up anaphoric annotation haven't so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players, in part thanks to the use of a new resolve-and-aggregate paradigm to 'complete' markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparable size datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (> 2K in length).
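A rough sketch of the resolve-and-aggregate idea as described in this abstract (majority-style aggregation of player judgments, with a resolver completing markables that received no clear judgment); the data layout and function names are assumptions, not the released implementation:

    from collections import Counter

    def aggregate_markable(judgments):
        # judgments: antecedent choices from different players for one markable.
        label, freq = Counter(judgments).most_common(1)[0]
        return label if freq / len(judgments) >= 0.5 else None  # no clear majority

    def resolve_and_aggregate(markables, player_judgments, resolver):
        # player_judgments: dict markable_id -> list of player choices.
        # resolver: fallback model for markables without (agreeing) judgments.
        completed = {}
        for m in markables:
            votes = player_judgments.get(m["id"], [])
            decision = aggregate_markable(votes) if votes else None
            completed[m["id"]] = decision if decision is not None else resolver(m)
        return completed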
... Finally, we note that in the plain text scenario, all systems except for DisCoDisCo used non-contextualized tools for sentence splitting and/or automatic parsing (SegFormers: CoreNLP; disCut: stanza; TMVM: SpaCy); DisCoDisCo used the transformer-based sentence splitter from the AMALGUM corpus (Gessler et al., 2020) and DiaParser (Attardi et al., 2021), both with language-specific transformers, with the result that plain text numbers are very close to gold numbers for DisCoDisCo (and in fact insignificantly better for plain Connective Detection: 91.49 on average vs. 91.22 for gold treebanked data). This echoes results from the 2019 task (see Yu et al. 2019) which showed the crucial importance of high quality preprocessing in general, and sentence splitting in particular. ...
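To make the plain-text scenario concrete, a minimal preprocessing step of this kind, illustrated here with stanza's tokenizer (one of the tools named in the excerpt) rather than any particular system's exact pipeline, splits raw text into sentences before discourse segmentation and parsing:

    # Sentence splitting for the plain-text scenario; later stages (discourse
    # segmentation, dependency parsing) would consume these sentences.
    import stanza

    stanza.download("en")  # one-time model download
    nlp = stanza.Pipeline(lang="en", processors="tokenize")

    raw_text = "High quality sentence splitting matters. Downstream parsers depend on it."
    doc = nlp(raw_text)
    sentences = [sent.text for sent in doc.sentences]
    print(sentences)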
In this article we present Enhanced Rhetorical Structure Theory (eRST), a new theoretical framework for computational discourse analysis, based on an expansion of Rhetorical Structure Theory (RST). The framework encompasses discourse relation graphs with tree-breaking, non-projective and concurrent relations, as well as implicit and explicit signals which give explainable rationales to our analyses. We survey shortcomings of RST and other existing frameworks, such as Segmented Discourse Representation Theory, the Penn Discourse Treebank, and Discourse Dependencies, and address these using constructs in the proposed theory. We provide annotation, search, and visualization tools for data, and present and evaluate a freely available corpus of English annotated according to our framework, encompassing 12 spoken and written genres with over 200K tokens. Finally, we discuss automatic parsing, evaluation metrics, and applications for data in our framework.
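As a rough illustration of the kind of data structure such a framework implies (a sketch of my own, not the eRST reference implementation), a document can be modeled as a graph whose edges carry relation labels and optional signal annotations, allowing tree-breaking and concurrent relations alongside a primary tree:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Signal:
        kind: str          # e.g. "dm" for a discourse marker, "syntactic", ...
        tokens: List[int]  # token indices anchoring the signal

    @dataclass
    class RelationEdge:
        source: int        # id of the satellite/child unit
        target: int        # id of the nucleus/parent unit
        label: str         # relation label, e.g. "cause"
        primary: bool      # False for additional (tree-breaking/concurrent) edges
        signals: List[Signal] = field(default_factory=list)

    @dataclass
    class DiscourseGraph:
        edus: List[str]
        edges: List[RelationEdge] = field(default_factory=list)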
Named-Entity Recognition (NER) is a core part of information extraction. India is a multilingual country with 23 official languages and over 122 major languages, where a significant population is multilingual. Most conversations, whether online or in person, involve code-switching and transliteration. Code-switching is the practice of alternating between two languages or dialects during a conversation or in writing. Processing such multilingual, code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. In Natural Language Processing, Indian languages are termed low-resource languages because of a lack of large-scale supervised data and linguistic resources to make statistical Natural Language Processing viable. We propose a novel data augmentation technique for transfer learning from high-resource languages to low-resource Indian languages, with adaptive, behavioral, and task-specific fine-tuning of existing pre-trained language representations such as mBERT, for Nested Named-Entity Recognition in multilingual, code-switched natural language processing. This approach can overcome the shortcomings of traditional Named-Entity Recognition methods, which fare poorly in multilingual, code-switched, and low-resource language contexts. Named-Entity Recognition is highly computationally expensive. In our approach, we also try to bring the computational costs down by employing a unified and robust model.
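A hedged sketch of the kind of fine-tuning setup this abstract describes, simplified to a single flat label layer (nested NER is often handled with one such layer per nesting depth); the tag set and example sentence are placeholders, while "bert-base-multilingual-cased" is the standard mBERT checkpoint name:

    # Token classification with mBERT via Hugging Face transformers.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # placeholder tag set
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=len(labels))

    words = ["Mumbai", "mein", "Priya", "se", "mila"]     # code-switched example
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                    truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits                      # (1, seq_len, num_labels)
    pred_ids = logits.argmax(-1)[0].tolist()
    print([labels[i] for i in pred_ids])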
This paper introduces a multi-layered cross-genre corpus, annotated for coreference resolution, causal relations, and temporal relations, comprising a variety of genres, from news articles and children’s stories to Reddit posts. Our results reveal distinctive genre-specific characteristics at each layer of annotation, highlighting unique challenges for both annotators and machine learning models. Children’s stories feature linear temporal structures and clear causal relations. In contrast, news articles employ non-linear temporal sequences with minimal use of explicit causal or conditional language and few first-person pronouns. Lastly, Reddit posts are author-centered explanations of ongoing situations, with occasional meta-textual reference. Our annotation schemes are adapted from existing work to better suit a broader range of text types. We argue that our multi-layered cross-genre corpus not only reveals genre-specific semantic characteristics but also indicates a rich contextual interplay between the various layers of semantic information. Our MLCG corpus is shared under the open-source Apache 2.0 license.
This paper describes our submission to the DISRPT2021 Shared Task on Discourse Unit Segmentation, Connective Detection, and Relation Classification. Our system, called DisCoDisCo, is a Transformer-based neural classifier which enhances contextualized word embeddings (CWEs) with hand-crafted features, relying on tokenwise sequence tagging for discourse segmentation and connective detection, and a feature-rich, encoder-less sentence pair classifier for relation classification. Our results for the first two tasks outperform SOTA scores from the previous 2019 shared task, and results on relation classification suggest strong performance on the new 2021 benchmark. Ablation tests show that including features beyond CWEs is helpful for both tasks, and a partial evaluation of multiple pre-trained Transformer-based language models indicates that models pre-trained on the Next Sentence Prediction (NSP) task are optimal for relation classification.
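To illustrate the general pattern of enhancing contextualized embeddings with hand-crafted features for tokenwise tagging (a sketch of the pattern only, not DisCoDisCo's actual architecture, dimensions, or feature set):

    # Concatenate a contextualized embedding with a small hand-crafted feature
    # vector per token, then project to BIO segmentation tags.
    import torch
    import torch.nn as nn

    class FeatureEnhancedTagger(nn.Module):
        def __init__(self, cwe_dim=768, feat_dim=8, num_tags=3):  # e.g. B/I/O
            super().__init__()
            self.proj = nn.Linear(cwe_dim + feat_dim, num_tags)

        def forward(self, cwe, feats):
            # cwe:   (batch, seq_len, cwe_dim)  contextualized word embeddings
            # feats: (batch, seq_len, feat_dim) hand-crafted features per token
            return self.proj(torch.cat([cwe, feats], dim=-1))

    tagger = FeatureEnhancedTagger()
    scores = tagger(torch.randn(2, 10, 768), torch.randn(2, 10, 8))
    print(scores.shape)  # torch.Size([2, 10, 3])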