Figure: Statistics for relations expressed overtly via markers. The most frequent marker for each relation is given.
Source publication
The paper deals with the pilot version of the first RST discourse treebank for Russian. The project started in 2016. At present, the treebank consists of sixty news texts annotated for rhetorical relations according to the RST scheme. However, this scheme was slightly modified in order to achieve a higher inter-annotator agreement score. During the annot...
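Since the abstract reports inter-annotator agreement for the modified scheme (and Krippendorff's α figures for the treebank are quoted in the citation contexts below), the following minimal Python sketch illustrates how nominal Krippendorff's α can be computed over annotators' relation labels. The helper function and toy labels are illustrative, not taken from the treebank's evaluation code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` maps each annotated unit (e.g. an EDU pair) to the list of
    relation labels assigned by the annotators; None marks a missing label.
    """
    # Coincidence counts over ordered label pairs within each unit.
    coincidences = Counter()
    for labels in units.values():
        labels = [l for l in labels if l is not None]
        m = len(labels)
        if m < 2:
            continue  # units rated by fewer than two annotators are ignored
        for a, b in permutations(range(m), 2):
            coincidences[(labels[a], labels[b])] += 1.0 / (m - 1)

    n_c = Counter()  # marginal label frequencies
    for (c, _k), w in coincidences.items():
        n_c[c] += w
    n = sum(n_c.values())
    if n <= 1:
        return 1.0

    # Observed vs expected disagreement (nominal metric: 0 if equal, 1 otherwise).
    d_o = sum(w for (c, k), w in coincidences.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e


# Toy example: two annotators labelling three relation instances.
print(krippendorff_alpha_nominal({
    "unit1": ["cause", "cause"],
    "unit2": ["contrast", "concession"],
    "unit3": ["elaboration", "elaboration"],
}))  # ≈ 0.615
```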
Similar publications
The concept of Corporate Social Responsibility (CSR) has evolved considerably over the last 50 years. Most historical reviews in the academic literature have divided the historical development of the CSR concept along a timeline. The multiplicity of historical reviews, and the confusion of mentioning some events while neglecting others, hav...
Citations
... We have 6 corpora for English (Prasad et al., 2019; Zeldes, 2017; Carlson et al., 2001; Asher et al., 2016; Yang and Li, 2018; Nishida and Matsumoto, 2022), 4 for Chinese (Zhou et al., 2014; Cao et al., 2018; Cheng and Li, 2019; Yi et al., 2021), 2 for Spanish (da Cunha et al., 2011; Cao et al., 2018), 2 for Portuguese (Cardoso et al., 2011; Mendes and Lejeune, 2022), 1 for German (Stede and Neumann, 2014), 1 for Basque (Iruskieta et al., 2013), 1 for Farsi (Shahmohammadi et al., 2021), 1 for French, 1 for Dutch (Redeker et al., 2012), 1 for Russian (Toldova et al., 2017), 1 for Turkish (Zeyrek and Webber, 2008; Zeyrek and Kurfalı, 2017), 1 for Italian (Tonelli et al., 2010), and 1 for Thai. In addition, OOD datasets come from the multilingual TED Discourse Bank with data for English, Portuguese and Turkish (Zeyrek et al., 2018, 2020). Identifying EDU boundaries and connectives (Tasks 1 and 2) corresponds to different corpora: PDTB-based datasets have connectives annotated, but not segmentation, while the others have no connectives. ...
... in the past decades (Zeldes et al., 2019, 2021), including English (Carlson et al., 2001; Zeldes, 2017), Basque (Iruskieta et al., 2013), Bangla (Das and Stede, 2018), Brazilian Portuguese (Cardoso et al., 2011), Dutch (Redeker et al., 2012), German (Stede and Neumann, 2014), Persian (Shahmohammadi et al., 2021), Russian (Toldova et al., 2017), Spanish (da Cunha et al., 2011), and the Spanish-Chinese parallel corpus (Cao et al., 2018). ...
... We first manually tokenize according to Xia (2000b) and conduct EDU segmentation based on parts-of-speech defined in Xia (2000a). Most notably, we segment relative clauses in GCDT, following the practice in English and other corpora (Carlson et al., 2001; Zeldes, 2017; Das and Stede, 2018; Cardoso et al., 2011; Redeker et al., 2012; Toldova et al., 2017). Chinese relative clauses present a unique feature in the existing RST treebanks. ...
A lack of large-scale human-annotated data has hampered the hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report on this dataset's parsing experiments, including state-of-the-art (SOTA) scores for Chinese RST parsing and RST parsing on the English GUM dataset, using cross-lingual training in Chinese and English with multilingual embeddings.
... Recent years have seen the emergence of corpora for segmentation, prediction of discourse relations and, more seldom (or less straightforwardly), prediction of complete discourse structures. The study of discourse structure in frameworks such as RST and SDRT has seen a revival in recent years, and several researchers are now actively engaged in the creation of discourse data; we can mention the Potsdam Commentary Corpus (Stede and Neumann, 2014), the Georgetown University Multilayer (GUM) corpus (Zeldes, 2017), the RST Discourse Treebank (Carlson et al., 2002), the Dutch Discourse Treebank (NLDT) (Redeker et al., 2012), the Cross-document Structure Theory News Corpus (Cardoso et al., 2011), the Russian RST Treebank (Toldova et al., 2017), the RST Spanish Treebank (Da Cunha et al., 2011), the RST Spanish-Chinese Treebank (Cao et al., 2016, 2017a, 2017b, 2018), the Penn Discourse Treebank (PDTB) (Prasad et al., 2018), Chinese Discourse Treebank 0.5 (Zhou et al., 2014), the DISCOR project (Discourse Structure and Coreference Resolution; Reese et al., 2007), ANNODIS (Afantenos et al., 2012a), and the STAC corpus (Afantenos et al., 2012b; Asher et al., 2016). ...
The main objective of this thesis is to improve the automatic capture of semantic information with the goal of modeling and understanding human communication. We have advanced the state of the art in discourse parsing, in particular in the retrieval of discourse structure from chat, in order to implement, at the industrial level, tools to help explore conversations. These include the production of automatic summaries, recommendations, dialogue acts detection, identification of decisions, planning and semantic relations between dialogue acts in order to understand dialogues. In multi-party conversations it is important to not only understand the meaning of a participant's utterance and to whom it is addressed, but also the semantic relations that tie it to other utterances in the conversation and give rise to different conversation threads. An answer must be recognized as an answer to a particular question; an argument, as an argument for or against a proposal under discussion; a disagreement, as the expression of a point of view contrasted with another idea already expressed. Unfortunately, capturing such information using traditional supervised machine learning methods from quality hand-annotated discourse data is costly and time-consuming, and we do not have nearly enough data to train these machine learning models, much less deep learning models. Another problem is that arguably, no amount of data will be sufficient for machine learning models to learn the semantic characteristics of discourse relations without some expert guidance; the data are simply too sparse. Long distance relations, in which an utterance is semantically connected not to the immediately preceding utterance, but to another utterance from further back in the conversation, are particularly difficult and rare, though often central to comprehension. It is therefore necessary to find a more efficient way to retrieve discourse structures from large corpora of multi-party conversations, such as meeting transcripts or chats. This is one goal this thesis achieves. In addition, we not only wanted to design a model that predicts discourse structure for multi-party conversation without requiring large amounts of hand-annotated data, but also to develop an approach that is transparent and explainable so that it can be modified and improved by experts. The method detailed in this thesis achieves this goal as well.
... RST-annotated corpora have been built for various languages so far; for English, for instance, there are the RST Discourse Treebank (Carlson et al., 2003) and the GUM corpus (Zeldes, 2017). There are also RST-annotated corpora for Spanish (Da Cunha et al., 2011), Russian (Toldova et al., 2017), Dutch (Van Der Vliet et al., 2011), German (Stede, 2004), Basque (Iruskieta et al., 2013), and Bangla (Das and Stede, 2018). RST-annotated corpora could be of great use in developing natural language processing tools, especially when dealing with structures beyond the sentence level. ...
Over the past years, interest in discourse analysis and discourse parsing has steadily grown, and many discourse-annotated corpora and, as a result, discourse parsers have been built. In this paper, we present a discourse-annotated corpus for the Persian language built in the framework of Rhetorical Structure Theory, as well as a discourse parser built upon the open-source DPLP parser. Our corpus consists of 150 journalistic texts, each with an average of around 400 words. Corpus texts were annotated using 18 discourse relations and based on the annotation guideline of the English RST Discourse Treebank corpus. Our text-level discourse parser is trained using gold segmentation and is built upon the DPLP discourse parser, which uses a large-margin transition-based approach to solve the problem of discourse parsing. The performance of our discourse parser in span (S), nuclearity (N) and relation (R) detection is around 78%, 64% and 44%, respectively, in terms of F1 measure.
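For orientation, span/nuclearity/relation scores of this kind are usually computed in an RST-Parseval style, by comparing the labelled spans of the predicted tree against those of the gold tree. The sketch below is a generic illustration under that assumption; the data layout and names are ours, not the authors', and evaluation conventions vary (e.g. whether the root span is counted, or whether the original Parseval variant is used).

```python
def f1(gold, pred):
    """F1 over two sets of items."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def snr_scores(gold_nodes, pred_nodes):
    """Span / nuclearity / relation F1 for flattened RST trees.

    Each node is a tuple (start_edu, end_edu, nuclearity, relation),
    e.g. (3, 5, "Satellite", "Elaboration").
    """
    return {
        "S": f1({(s, e) for s, e, _n, _r in gold_nodes},
                {(s, e) for s, e, _n, _r in pred_nodes}),
        "N": f1({(s, e, n) for s, e, n, _r in gold_nodes},
                {(s, e, n) for s, e, n, _r in pred_nodes}),
        "R": f1({(s, e, r) for s, e, _n, r in gold_nodes},
                {(s, e, r) for s, e, _n, r in pred_nodes}),
    }

gold = {(1, 2, "Nucleus", "Joint"), (3, 5, "Satellite", "Elaboration")}
pred = {(1, 2, "Nucleus", "Joint"), (3, 5, "Nucleus", "Elaboration")}
print(snr_scores(gold, pred))  # S = 1.0, N = 0.5, R = 1.0
```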
... certain uses of anaphora, see Poesio et al. 2002), alternative lexicalizations using content words, as well as syntactic constructions (see Prasad et al. 2014 and the addition of alternative lexicalization constructions, AltLexC, in the latest version of PDTB, Prasad et al. 2019). In previous work, two main approaches to extracting the inventory of discourse signal types in an open-ended framework can be identified: data-driven approaches, which attempt to extract relevant words from distributional properties of the data, using frequencies or association measures capturing their co-occurrences with certain relation types (e.g. Torabi Asr and Demberg 2013, Toldova et al. 2017); and manual annotation efforts (e.g. Prasad et al. 2008, Taboada and Das 2013), which develop categorization schemes and guidelines for human evaluation of signaling devices. ...
Previous data-driven work investigating the types and distributions of discourse relation signals, including discourse markers such as 'however' or phrases such as 'as a result', has focused on the relative frequencies of signal words within and outside text from each discourse relation. Such approaches do not allow us to quantify the signaling strength of individual instances of a signal on a scale (e.g. more or less discourse-relevant instances of 'and'), to assess the distribution of ambiguity for signals, or to identify words that hinder discourse relation identification in context ('anti-signals' or 'distractors'). In this paper we present a data-driven approach to signal detection using a distantly supervised neural network and develop a metric, ∆s (or 'delta-softmax'), to quantify signaling strength. Ranging between -1 and 1 and relying on recent advances in contextualized word embeddings, the metric represents each word's positive or negative contribution to the identifiability of a relation in specific instances in context. Based on an English corpus annotated for discourse relations using Rhetorical Structure Theory and signal type annotations anchored to specific tokens, our analysis examines the reliability of the metric, the places where it overlaps with and differs from human judgments, and the implications for identifying features that neural models may need in order to perform better on automatic discourse relation classification.
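To make the idea concrete, here is a minimal masking-based sketch of a per-token signaling score in the spirit of ∆s, assuming a classifier that returns a softmax distribution over relation labels. The `relation_probs` interface, the `<mask>` token and the toy scorer are hypothetical stand-ins, not the authors' implementation, whose exact definition may differ.

```python
from typing import Callable, Dict, List

MASK = "<mask>"  # placeholder token; a real model's mask token may differ

def delta_softmax(tokens: List[str],
                  gold_relation: str,
                  relation_probs: Callable[[List[str]], Dict[str, float]]) -> Dict[int, float]:
    """Per-token signaling strength for one discourse-relation instance.

    `relation_probs` is assumed to map a token sequence to a softmax
    distribution over relation labels (e.g. from a distantly supervised
    neural classifier). For each position we compare the gold relation's
    probability with and without that token: positive scores mark likely
    signals, negative scores mark 'distractors' / anti-signals.
    """
    base = relation_probs(tokens)[gold_relation]
    scores = {}
    for i in range(len(tokens)):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        scores[i] = base - relation_probs(masked)[gold_relation]
    return scores


# Toy usage with a dummy scorer that keys on the connective "but".
def toy_probs(toks: List[str]) -> Dict[str, float]:
    p = 0.9 if "but" in toks else 0.3
    return {"Contrast": p, "Other": 1.0 - p}

print(delta_softmax(["good", "but", "pricey"], "Contrast", toy_probs))
# masking "but" drops the probability, so token 1 scores about 0.6; the others 0.0
```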
... certain uses of anaphora, see Poesio et al. 2002), alternative lexicalizations using content words, as well as syntactic constructions (see Prasad et al. 2014 and the addition of alternative lexicalization constructions, AltLexC, in the latest version of PDTB, Prasad et al. 2019). In previous work, two main approaches to extracting the inventory of discourse signal types in an open-ended framework can be identified: data-driven approaches, which attempt to extract relevant words from distributional properties of the data, using frequencies or association measures capturing their co-occurrences with certain relation types (e.g. Toldova et al. 2017, Torabi Asr and Demberg 2013); and manual annotation efforts (e.g. Prasad et al. 2008, Taboada and Das 2013), which develop categorization schemes and guidelines for human evaluation of signaling devices. ...
Previous data-driven work investigating the types and distributions of discourse relation signals, including discourse markers such as 'however' or phrases such as 'as a result', has focused on the relative frequencies of signal words within and outside text from each discourse relation. Such approaches do not allow us to quantify the signaling strength of individual instances of a signal on a scale (e.g. more or less discourse-relevant instances of 'and'), to assess the distribution of ambiguity for signals, or to identify words that hinder discourse relation identification in context ('anti-signals' or 'distractors'). In this paper we present a data-driven approach to signal detection using a distantly supervised neural network and develop a metric, ∆s (or 'delta-softmax'), to quantify signaling strength. Ranging between -1 and 1 and relying on recent advances in contextualized word embeddings, the metric represents each word's positive or negative contribution to the identifiability of a relation in specific instances in context. Based on an English corpus annotated for discourse relations using Rhetorical Structure Theory and signal type annotations anchored to specific tokens, our analysis examines the reliability of the metric, the places where it overlaps with and differs from human judgments, and the implications for identifying features that neural models may need in order to perform better on automatic discourse relation classification.
... The proposed work was performed as part of an on-going research project aimed at the creation of a discourse-annotated corpus of popular science texts written in Russian. It includes 68 articles on linguistics and 11 texts taken from the open corpus of Ru-RSTreebank [11], [17], [18]. Popular science discourse is defined as a way for the author-scientist (or a journalist as an intermediary) to transmit scientific knowledge or innovation projects so that a mass audience can understand them. ...
... It traditionally includes non-content lexical units, i.e. function words and phrases: subordinating conjunctions, coordinating conjunctions, adverbs, prepositional phrases, parenthetical words, particles, etc. Recent studies of discourse indicators [16], [17] take into account not only traditional discourse connectives, but also grammatical tense, punctuation marks and their combinations, as well as content words and constructions based on them [18]. ...
The proposed work is performed as part of an on-going research project aimed at the creation of a discourse-annotated corpus of popular science texts written in Russian. Annotation is carried out within the framework of a multi-level model of discourse, which considers the text from the perspective of its genre, rhetorical and argumentative organization. We conduct a comparative study of the rhetorical and argument annotations, discuss their similarities and differences on the segment and structural levels, and illustrate them with examples of the standard schemes of reasoning described in D. Walton’s theory of structured argumentation: “Argument from Expert Opinion”, “Argument from Example”, and “Argument from Cause to Effect”. Special attention is paid to discourse markers registered during annotation as key indicators of discourse structure. We report the results of an experiment with argument indicator patterns, based on the list of rhetorical markers and aimed at the extraction of “from Expert Opinion” arguments.
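As a concrete but purely illustrative picture of what marker-based argument indicator patterns can look like, the sketch below matches a few expert-opinion indicators with regular expressions. The marker list and the sample sentence are hypothetical placeholders, not the inventory of rhetorical markers used in the project.

```python
import re

# Illustrative indicator patterns for "Argument from Expert Opinion"
# (hypothetical examples; not the project's actual marker inventory).
EXPERT_OPINION_PATTERNS = [
    r"по\s+(?:мнению|словам)\s+(?:эксперт\w*|учёных|специалист\w*)",  # "according to experts / specialists"
    r"как\s+(?:отмеча\w+|подчеркива\w+)\s+(?:эксперт\w*|учён\w+)",    # "as experts note / stress"
    r"according\s+to\s+(?:the\s+)?experts?",
]

def find_expert_opinion_indicators(text: str):
    """Return (character span, matched string) pairs for indicator hits."""
    hits = []
    for pattern in EXPERT_OPINION_PATTERNS:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.span(), m.group(0)))
    return hits

# "According to experts, the new model describes the data better."
sample = "По мнению экспертов, новая модель лучше описывает данные."
print(find_expert_opinion_indicators(sample))
# [((0, 19), 'По мнению экспертов')]
```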
... With the wide adoption of RST, corpora have expanded to other languages and domains. Several of these corpora include science-related texts, a domain that is closer to the medical one, but they unfortunately also use segmentation guidelines that sometimes differ considerably from RST-DT (research articles in Basque, Chinese, English, Russian and Spanish (Iruskieta et al., 2013; Cao et al., 2017; Zeldes, 2017; Toldova et al., 2017; Da Cunha et al., 2012); encyclopedias and science news web pages in Dutch (Redeker et al., 2012)) … corpus of RST-segmented medical articles in English. ...
The first step in discourse analysis involves dividing a text into segments. We annotate the first high-quality small-scale medical corpus in English with discourse segments and analyze how well news-trained segmenters perform on this domain. While we expectedly find a drop in performance, the nature of the segmentation errors suggests some problems can be addressed earlier in the pipeline, while others would require expanding the corpus to a trainable size to learn the nuances of the medical domain.
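Segmenter performance of the kind analysed here is typically scored as precision, recall and F1 over predicted EDU boundary positions. The following sketch is a generic illustration under that assumption; representing boundaries as token indices is our choice, not necessarily the paper's.

```python
def boundary_prf(gold_boundaries, pred_boundaries):
    """Precision / recall / F1 over EDU boundary positions.

    Boundaries are given as sets of token indices at which a new EDU
    starts (the trivial sentence-initial boundary is normally excluded).
    """
    gold, pred = set(gold_boundaries), set(pred_boundaries)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Gold splits the sentence at tokens 4 and 9; the segmenter only finds 4
# and adds a spurious boundary at 12.
print(boundary_prf({4, 9}, {4, 12}))  # (0.5, 0.5, 0.5)
```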
... SDRT (Asher et al., 2003) was proposed as an extension to Discourse Representation Theory (Kamp, 1981). By including propositions as variables to reason over and discourse relations to rule out certain antecedents or promote others, it accounts for relations in a text beyond the sentence level (where dynamic semantic approaches often fail). ...

Corpus overview table embedded in this snippet (corpus name, language, annotation style, tokens):
RSTBT (Iruskieta et al., 2013) | Basque | RST | 28,658
CDTB (Zhou and Xue, 2015) | Chinese | PDTB | 63,239
SCTB (Cao et al., 2018) | Chinese | RST | 11,067
NLDT (Redeker et al., 2012) | Dutch | RST | 21,355
PDTB (Prasad et al., 2008) | English | PDTB | 1,100,990
GUM (Zeldes, 2017) | English | RST | 82,691
RSTDT (Carlson et al., 2002) | English | RST | 184,158
STAC (Asher et al., 2016) | English | SDRT | 41,666
ANNODIS (Afantenos et al., 2012) | French | SDRT | 25,050
PCC (Stede and Neumann, 2014) | German | RST | 29,883
RRST (Toldova et al., 2017) | Russian | RST | 243,896
RSTSTB (da Cunha et al., 2011) | Spanish | RST | 50,565
SCTB (Cao et al., 2018) | Spanish | RST | 12,699
CSTN (Cardoso et al., 2011) | Brazilian Portuguese | RST | 51,041
... We do note, however, that, again, for the higher-scoring corpora, IAA was relatively high; Carlson et al. (2002) note a Kappa of 0.97 for RSTDT, Asher et al. (2016) note an initial agreement of 90% for automatic segmentation in STAC (segmentation is manually improved after this automatic procedure), and Redeker et al. (2012) note an agreement of 97% for EDU segmentation in NLDT. On the other end of the spectrum, Iruskieta et al. (2013) report an EDU agreement of 81.35% for RSTBT, and Toldova et al. (2017) report Krippendorff's α figures of 0.2792, 0.3173 and 0.4965 for RRST, where they consider figures around 0.8 to be acceptable. Figure 1 plots performance for the RandomForest and LSTM approaches (and the baseline for comparison) on the Y axis (F1 score) and the corpora ordered by size (increasing from left to right) on the X axis, illustrating that there is no clear correlation between corpus size and performance. (Footnote: for highly agglutinative languages, average sentence lengths may of course be shorter depending on tokenisation procedures, but given the set of languages here, excluding Chinese, differences in morphology play a less prominent role.) ...