Computational Linguistics and Intellectual Technologies:
Proceedings of the International Conference “Dialogue 2017”
Moscow, May 31—June 3, 2017
TOWARDS BUILDING A DISCOURSE-ANNOTATED CORPUS OF RUSSIAN
Pisarevskaya D. (dinabpr@gmail.com)1,
Ananyeva M. (ananyeva@isa.ru)2,
Kobozeva M. (kobozeva@isa.ru)2,
Nasedkin A. (kloudsnuff@gmail.com)3,
Nikiforova S. (son.nik@mail.ru)3,
Pavlova I. (ispavlovais@gmail.com)3,
Shelepov A. (alexshelepov1992@gmail.com)3
1Institute for System Programming of the RAS, Moscow, Russia;
2Institute for Systems Analysis FRC CSC RAS, Moscow, Russia;
3NRU Higher School of Economics, Moscow, Russia
For many natural language processing tasks (machine translation evaluation,
anaphora resolution, information retrieval, etc.) a corpus of texts annotated
for discourse structure is essential. As for now, there are no such corpora
of written Russian, which stands in the way of developing a range of applica-
tions. This paper presents the first steps of constructing a Rhetorical Struc-
ture Corpus of the Russian language. Main annotation principles are dis-
cussed, as well as the problems that arise and the ways to solve them. Since
annotation consistency is often an issue when texts are manually annotated
for something as subjective as discourse structure, we specifically focus
on the subject of inter-annotator agreement measurement. We also propose
a new set of rhetorical relations (modified from the classic Mann & Thompson
set), which is more suitable for Russian. We aim to use the corpus for experi-
ments on discourse parsing and believe that the corpus will be of great help
to other researchers. The corpus will be made available for public use.
Keywords: rhetorical structure theory, discourse analysis, corpus linguis-
tics, corpus annotation, discourse structure, inter-annotator agreement
PRINCIPLES OF DEVELOPING A DISCOURSE CORPUS OF THE RUSSIAN LANGUAGE
Pisarevskaya D. (dinabpr@gmail.com)1,
Ananyeva M. (ananyeva@isa.ru)2,
Kobozeva M. (kobozeva@isa.ru)2,
Nasedkin A. (kloudsnuff@gmail.com)3,
Nikiforova S. (son.nik@mail.ru)3,
Pavlova I. (ispavlovais@gmail.com)3,
Shelepov A. (alexshelepov1992@gmail.com)3
1Institute for System Programming of the RAS, Moscow, Russia;
2Institute for Systems Analysis FRC CSC RAS, Moscow, Russia;
3NRU Higher School of Economics, Moscow, Russia
1. Introduction
Discourse analysis is the level of linguistic analysis that deals with language units of maximal size (Kibrik, Podlesskaya, 2009, 26). In discourse analysis the text is often represented as a hierarchical tree whose parts are connected by various rhetorical relations. Discourse and pragmatics have entered Natural Language Processing only in recent years due to the complexity of the approach. Discourse parsing can be used in a wide range of natural language processing tasks, including machine translation evaluation, sentiment analysis, information retrieval, text summarization, information extraction, anaphora resolution, question-answering systems, text classification, etc.; numerous studies have shown that it yields significant performance gains in all these applications.
The creation of corpora annotated for discourse structure has become very popular in recent years, because such corpora are used to develop machine learning algorithms for automated discourse parsing and analysis. Discourse parsers already exist for several languages, most notably for English (the RASTA, SPADE, HILDA, and CODRA parsers). However, there are no discourse-annotated corpora for written Russian at the moment, and therefore no possibility of creating an automated discourse parser; as long as only manual annotation of texts is possible, discourse analysis will not find its way into applications for Russian. That is why it is essential to develop a publicly available discourse-annotated corpus for the Russian language.
In this paper we describe the first steps of building a discourse corpus of Russian:
the annotation procedure, including establishing the appropriate set of discourse re-
lations, the process of measuring the inter-annotator agreement, and various chal-
lenges we faced along the way.
1.1. Background
There are different approaches to discourse analysis. In Rhetorical Structure Theory (RST) discourse structure amounts to a non-projective tree. The Penn Discourse Treebank (PDTB) style is connective-led (PDTB (Webber et al., 2016), the Turkish Discourse Bank (Zeyrek et al., 2013), etc.) or punctuation-led (the Chinese Discourse TreeBank (Zhou, Xue, 2015)) and is not presented in tree form. Models based on cohesive relations (Halliday, Hasan, 1976) are also not tree-like. We chose RST in order to take into consideration not only cohesive markers and discourse cues, but also the discourse structure of texts. This is important, for example, for coreference resolution in English, where the rhetorical distance rather than the linear one is sometimes the most crucial factor, cf. (Loukachevitch et al., 2011).
Therefore, for our corpus we adopt the RST framework (Mann, Thompson, 1988). It represents text as a hierarchy of elementary discourse units (EDUs) and describes relations between them and between bigger parts of text. Some EDUs are more essential and carry more important information (the nucleus) than others (the satellite). There are two types of rhetorical relations: nucleus-satellite and multinuclear. While the first type connects a nucleus and a satellite, the latter includes EDUs that are equally important in the analysed discourse. The set of rhetorical relations can vary; it can include, for instance, such relations as Elaboration, Justify, Contrast, Antithesis, Volitional Result, etc. Rhetorical Structure Theory claims to be applicable to all languages.
In our work we take into account the existing experience of constructing discourse corpora. There are many RST-annotated corpora for different languages. The most well-known one is the RST Discourse Treebank (Carlson et al., 2003), an English-language corpus of Wall Street Journal articles (385 articles, 176,383 tokens); it is the biggest discourse corpus with a detailed manual. The Potsdam Commentary Corpus (Stede, Neumann, 2014) [http://corpus.iingen.unam.mx/rst/manual_en.html] is a German-language corpus of newspaper materials (175 articles, 32,000 tokens). CorpusTCC (Pardo et al., 2004) is a corpus of Brazilian Portuguese; it includes 100 introductions to PhD theses (53,000 tokens). Corpora for other languages are also well developed: Dutch (the Dutch RUG Corpus (van der Vliet et al., 2011)), Basque (the RST Basque Treebank (Iruskieta et al., 2013)), Chinese and Spanish (a parallel Chinese/Spanish Treebank (Cao et al., 2016)), etc. Different sets of rhetorical relations have been created based on the "classic set" (Mann, Thompson, 1988). For instance, the RST Discourse Treebank makes use of 78 relation types (53 nucleus-satellite and 25 multinuclear relations), while the Potsdam Commentary Corpus is based on 31 relation types.
The only existing discourse corpus project for Russian is the TED-Multilingual Discourse Treebank. This project contains a parallel corpus of TED talk transcripts in six languages, including Russian (along with English, Turkish, European Portuguese, Polish, and German). However, it is based on the principles of the Penn Discourse Treebank annotation framework, which treats discourse connectives as discourse-level predicates with a binary argument structure at a local level (Prasad et al., 2007; Zeyrek et al., 2013), and not on the RST framework. Besides, this recent effort is still in progress and is not publicly available yet.
The foundation for the project of the Discourse-annotated corpus of Russian was laid by the following works of the research team of the Institute for Systems Analysis, FRC CSC RAS (Ananyeva, Kobozeva, 2016 [1, 2]).
2. Rhetorical Structure Corpus for the Russian Language
The Discourse-annotated corpus of Russian will include texts of different genres (science, popular science, news stories, and analytic journalism). The development of the corpus will continue for 3 years, during which time we are going to annotate more than 100,000 tokens. The corpus will be available for public use. The user will be able to view annotated texts (represented as discourse trees), search for specific relations (or sequences thereof) and word forms, and download the annotated texts in XML format.
2.1. Annotation Principles
After conducting extensive research on discourse corpora of other languages, we have developed a detailed annotation manual. As an annotation tool we have chosen an open-source application called rstWeb [https://corpling.uis.georgetown.edu/rstweb/info/], which makes it possible to edit the set of relations and change other features if needed.
International experience of discourse annotation demonstrates that, due to grammatical differences between languages, an adaptation of the classic RST is necessary for almost all of them. That is why in our project we will, among other things, aim to specify the concept of a discourse unit and the set of rhetorical relations for Russian.
Firstly, we have established a preliminary notion of an elementary discourse unit, which, from a syntactic point of view, we take to be roughly equivalent to a clause (similarly to the classic Mann & Thompson approach). There are, however, several notable exceptions: for instance, nominalization constructions with prepositions like для 'for' and из-за 'because of' are classified as EDUs, while relative clauses with restrictive semantics are not.
Secondly, we have discussed the main annotation principles and created a detailed manual to guide the annotators. It includes descriptions of the following 22 relations, based on the "classic set", with the specific features of news and scientific texts in Russian taken into account.
• 16 nucleus-satellite (mononuclear) relations: Background, Cause (with subtypes:
Volitional Cause and Non-volitional Cause), Evidence, Effect (with subtypes: Vo-
litional Effect and Non-volitional Effect), Condition, Purpose, Concession, Prepa-
ration, Conclusion, Elaboration, Antithesis, Solutionhood, Motivation, Evalua-
tion, Attribution (with subtypes: Attribution1 (precise source specification) and
Attribution2 (imprecise source specification)), Interpretation.
• 6 multinuclear relations: Contrast, Restatement, Sequence, Joint, Comparison,
Same-unit.
We decided to add Preparation and Conclusion to the set due to the genre prop-
erties of scientific and analytic texts. We divided Attribution into two subtypes due
to the differing level of precision of specifying the information source in news stories.
There are two strategies for annotators' work in RST analysis (Carlson et al., 2003). An annotator can apply relations to the segments sequentially, connecting the current node to the previous one (left-to-right). This method is suitable for short texts, such as news reports, but even in such texts there is a risk of overlooking important relations. The other method is more flexible: the annotator segments multiple units simultaneously, then builds discourse sub-trees for each segment, links nearby segments into progressively larger subtrees and finally builds the overall tree by linking the key parts of the discourse structure (top-down and bottom-up). It is more suitable for long texts. We chose the second method since it is more intuitive and easier for the annotator.
For the first three texts the annotators used the set of discourse relations specified above. The texts were of approximately the same length (34, 26, and 38 sentences respectively), and all of them were short news articles. The annotators followed the initial guidelines while annotating these pilot texts: they segmented the texts and assigned RST relations to the resulting segments. During subsequent discussion it became clear that this set of relations was not quite convenient for annotation, since some of the relations were extremely hard to differentiate between. Moreover, we realized that adopting the "classic set" requires further modifications, as some relations are probably more salient and therefore more common in English than in Russian.
2.2. Inter-annotator agreement measurement
One of the main problems with RST tagging is the subjectivity of annotators' interpretation: the same text can be annotated in very different ways (Artstein, Poesio, 2008). A simple discussion, however, is not enough to establish the level of inter-annotator agreement (IAA); it must be measured quantitatively using a valid and reliable statistic.
Much like in other discourse analysis tasks (Miltsakaki et al., 2004; Eckle-Kohler, 2015), we faced certain challenges with inter-annotator consistency computation. Although the rules for splitting text into EDUs are relatively straightforward, the segmentations of the same text produced by different people are rarely identical. That is why, for example, Cohen's kappa coefficient is not suitable in our case. The token-based Fleiss' kappa is also not applicable, as we deal with units that consist of several tokens. We have finally selected Krippendorff's unitized alpha as the statistic for measuring inter-annotator agreement: it operates on whole annotation spans instead of isolated tokens, it can be calculated for any number of annotators, it can be applied to sparse data, and it can process features of different types, including the nominal features used in our case. As splitting text into EDUs and labeling relations are two separate tasks, inter-annotator agreement can be measured separately for each of them; Krippendorff's unitized alpha can (and in our case will) be used for both measurements.
The corpus size used for inter-annotator consistency calculation varies from one project to another. Usually it covers about 30 units (Lacy, Riffe, 1996), but we decided to take texts that contain more units, so that we could check whether the relation types in the manual are suitable for further work. The total number of EDUs was approximately 190.
The RST tagging is done by the annotators in the browser by means of the rstWeb tool (see Fig. 2); the system allows the result to be exported as an XML document, which has the following structure:
Fig. 1. XML structure of an annotated file
Fig. 2. Annotation in the browser
All the relations used in the scheme are listed in the header of the XML document. Each EDU tag includes two ids and a relation type: "segment id" stands for the id of the EDU, "parent" for the id of the nucleus in the case of a nucleus-satellite relation, and "relname" for the type of the relation. If the relation is multinuclear, the "segment" and "parent" ids both represent the ids of EDUs of equal discourse importance. If the relation type is specified as "span", the EDU is included in a bigger discourse group which is assigned a new id (e.g., if EDUs 4-6 form a bigger group and EDU 4 is the main nucleus in this group, it is marked as having a parent with id 54, which is automatically assigned to the group: <segment id="4" parent="54" relname="span">).
Calculation of the IAA coefficient was implemented in Python. The xmltodict 0.10.2 package was used to read the XML of the marked-up text and convert it into a Python dictionary. The code used for IAA calculation can be accessed via GitHub [https://github.com/nasedkinav/rst_corpus_rus/blob/master/krippendorffs_alpha.py].
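As an illustration of reading such an export with xmltodict, consider the minimal sketch below. The sample fragment and the wrapper element names (rst, body) are assumptions made for this example; only the segment attributes (id, parent, relname) follow the description above. This is not the project's actual code, which is available at the link above.

import xmltodict

# A minimal fragment in the spirit of the rstWeb export described above.
# The <rst>/<body> wrapper names are assumptions for this example; the
# segment attributes (id, parent, relname) follow the paper's description.
SAMPLE_XML = """
<rst>
  <body>
    <segment id="1" parent="2" relname="cause-effect">Text of a satellite EDU</segment>
    <segment id="2" parent="54" relname="span">Text of a nucleus EDU</segment>
    <segment id="4" parent="54" relname="span">Nucleus of a larger group</segment>
  </body>
</rst>
"""

def read_segments(xml_string):
    """Convert an exported annotation into a list of plain dicts, one per EDU."""
    doc = xmltodict.parse(xml_string)
    segments = doc["rst"]["body"]["segment"]
    if not isinstance(segments, list):   # xmltodict returns a single dict for one element
        segments = [segments]
    return [
        {
            "id": seg["@id"],
            "parent": seg.get("@parent"),    # may be absent for the root of the tree
            "relname": seg.get("@relname"),
            "text": seg.get("#text", ""),
        }
        for seg in segments
    ]

if __name__ == "__main__":
    for edu in read_segments(SAMPLE_XML):
        print(edu)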
Since the ids of segments and groups may differ across the texts annotated by different people, we decided to use concatenated text spans to uniquely identify the selected relations, as text spans are the only data that can be reliably matched between distinct annotations. This format has the additional advantage of allowing identical relations to be located in different parts of the text even when the EDU segmentations differ.
During the first pass over a particular annotator's markup, each relation tree is traversed in such a way that each node is associated with the set of text segments it dominates, ordered by id. The "span" relations were not counted during the IAA measurement, since this relation plays a purely structural role in the annotation and has no actual meaning. After that, an index of the form {key: value} is produced for all the relations, where the value is the type of the relation and the key is represented as a string:
• for mononuclear relations: "nuclear: <nuclear_text>, satellite: <satellite_text>";
• for multinuclear relations: "multinuclear: <multinuclear_text>",
where <nuclear_text>, <satellite_text> and <multinuclear_text> are replaced by the corresponding parts of the text. After performing this procedure for each of the annotators, the obtained indices are merged by key, and the list of all the values assigned to that relation by the annotators is attached to the key. The length of the list can be lower than the number of annotators when the relation is absent from somebody's markup.
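A simplified sketch of this indexing step is given below. It is illustrative only: the field names of the relation nodes are assumptions, since the actual tree representation depends on how the exported XML is post-processed.

from collections import defaultdict

def relation_key(node):
    """Build the span-based key for a relation node, as described above."""
    if node["multinuclear"]:
        return "multinuclear: " + node["multinuclear_text"]
    return "nuclear: {}, satellite: {}".format(node["nucleus_text"], node["satellite_text"])

def build_index(relation_nodes):
    """relation_nodes: relation nodes extracted from one annotator's tree."""
    index = {}
    for node in relation_nodes:
        if node["relname"] == "span":    # purely structural, excluded from IAA
            continue
        index[relation_key(node)] = node["relname"]
    return index

def merge_indices(per_annotator_indices):
    """Merge per-annotator indices: each key maps to the list of labels it received."""
    merged = defaultdict(list)
    for index in per_annotator_indices:
        for key, relname in index.items():
            merged[key].append(relname)
    return dict(merged)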
According to (Krippendorff, 2013) we then build the reliability data matrix:

$$
\begin{array}{l|cccc}
 & key_1 & key_2 & \cdots & key_N \\
\hline
obs_1 & v_{1,1} & v_{1,2} & \cdots & v_{1,N} \\
obs_2 & v_{2,1} & v_{2,2} & \cdots & v_{2,N} \\
\vdots & \vdots & \vdots & & \vdots \\
obs_n & v_{n,1} & v_{n,2} & \cdots & v_{n,N} \\
\hline
\text{Number of coders who marked the unit} & m_1 & m_2 & \cdots & m_N
\end{array}
$$
where $key_u$ serves as the encoding unit and $obs_i$ stands for a particular annotator; $v_{i,u}$ is the relation type assigned by annotator $obs_i$ to unit $key_u$ (the cell is empty when the relation is absent from that annotator's markup). Using this matrix, the coincidence matrix within units is calculated (Krippendorff, 2013):

$$
\begin{array}{l|cccc|c}
 & 1 & 2 & \cdots & C & \\
\hline
1 & o_{11} & o_{12} & \cdots & o_{1C} & n_1 \\
2 & o_{21} & o_{22} & \cdots & o_{2C} & n_2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
C & o_{C1} & o_{C2} & \cdots & o_{CC} & n_C \\
\hline
 & n_1 & n_2 & \cdots & n_C & n
\end{array}
\qquad n_c = \sum_{k} o_{ck}, \qquad n = \sum_{c} n_c
$$
where $k$, $c$ are concrete relation types and

$$
o_{ck} = \sum_{u} \frac{\text{number of } c\text{-}k \text{ pairs in unit } u}{m_u - 1}\,,
$$
where $u$ is the encoded unit ($key_u$) and $m_u$ is the number of annotators who have marked up this unit.
The final calculation of the coefficient (for nominal data) can be done in the following way:

$$
\alpha = 1 - (n - 1)\,\frac{\sum_{c}\sum_{k>c} o_{ck}}{\sum_{c}\sum_{k>c} n_c\, n_k}
$$
Fig. 3. Coefficient calculation
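To make the procedure concrete, the sketch below reimplements the nominal-data calculation defined by the formulas above in Python. It assumes the merged {key: list of labels} index described in the previous section and is only an illustration, not the project's published script (see the GitHub link above).

from collections import Counter
from itertools import combinations

def krippendorff_alpha_nominal(merged_index):
    """merged_index: {unit key: list of relation labels assigned by the annotators
    who marked this unit}. Implements the nominal-data formulas given above."""
    o = Counter()                        # coincidence matrix o[(c, k)]
    for labels in merged_index.values():
        m_u = len(labels)
        if m_u < 2:                      # units marked by a single annotator are not pairable
            continue
        for a, b in combinations(labels, 2):
            o[(a, b)] += 1.0 / (m_u - 1)
            o[(b, a)] += 1.0 / (m_u - 1)
    n_c = Counter()                      # marginals n_c
    for (c, _), value in o.items():
        n_c[c] += value
    n = sum(n_c.values())
    if n <= 1:
        return 1.0
    observed = sum(value for (c, k), value in o.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - observed / expected if expected else 1.0

# Toy example: hypothetical span keys and labels from three annotators.
index = {
    "nuclear: A, satellite: B": ["elaboration", "elaboration", "background"],
    "nuclear: C, satellite: D": ["cause-effect", "cause-effect", "cause-effect"],
    "multinuclear: E F":        ["joint", "joint"],
    "nuclear: G, satellite: H": ["condition"],      # marked by only one annotator
}
print(round(krippendorff_alpha_nominal(index), 4))

Note that this sketch covers only the nominal agreement on matched span keys; the full unitized variant of alpha additionally takes the lengths of the annotated spans into account.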
We have measured the IAA coefficient for each of the three texts; the coefficients were 0.2792, 0.3173 and 0.4965 respectively. We suppose the third text has a higher IAA coefficient due to its simpler and more obvious discourse structure. An acceptable level of Krippendorff's unitized alpha for our task would be approximately 0.8, and our results for every text were much lower.
2.3. Modification of the initial relation tree
We have decided to reduce the set of RST relations used for annotation in order to reach a higher IAA coefficient and to minimize the subjectivity of the annotation.
One of the main reasons to exclude particular relations was their high specificity and the low frequency of their usage during annotation. Although the presence of such relations would not radically affect IAA, reducing the relation set makes the annotation task easier, and little is lost by removing highly specific and rare relations. Whenever a relation could be replaced with another, more common one without a great loss in semantic adequacy, this was considered an argument in favor of excluding it. The changes that we accepted after a thorough analysis and much discussion are listed below.
We have decided to exclude the following relations from the set:
• Motivation, since it is very specific and therefore extremely rare: it was used only twice in these three texts (approx. 190 EDUs).
• Antithesis (nucleus-satellite relation), since the only difference between Antith-
esis and Contrast (multinuclear relation) is that in Antithesis one part should
be more important than the other. None of the annotators could establish the
relative importance of EDUs in such cases.
• Volitional and Non-volitional subtypes of Cause and Effect, since in many cases it was impossible to determine whether the actions were intentional or not. This distinction might nevertheless be important for some of the tasks the corpus will serve; those who use the corpus for such tasks will be able to substitute the Cause/Effect relation with Volitional Cause/Effect or Non-volitional Cause/Effect themselves (as the annotated texts will be available for download in an easily editable XML format).
• Conclusion, because it is quite rare and can be considered a subtype of Restate-
ment, which we decided to use for contexts when the Conclusion relation could
be possible.
We have combined into one relation:
• Cause and Effect, since the difference between the two lies in determining the nucleus, which is the cause in the Cause relation and the effect in the Effect relation. Thus, the annotator has to decide which of the two given EDUs is more important, the cause or the effect, which is very subjective.
• Interpretation and Evaluation, since the difference between these relations
is very subtle and in order to distinguish between them, one has to determine
the degree of objectivity of the evaluation, and that is again very subjective.
• Attribution1 and Attribution2, since the level of precision required for Attribu-
tion1 is often unstable and unclear.
All of the above has resulted in a new RST relations tree. The set of relations
in Fig. 4 is final and will be used during the rest of the annotation process:
Fig. 4. Final set of relations
After modifying the set of discourse relations, three new texts were annotated and the IAA was measured again. The texts were 37, 44, and 28 sentences long respectively, and all of them were short news articles, as in the previous IAA measurement. The new IAA coefficients were 0.7768, 0.691 and 0.7615 respectively, which indicates a substantial improvement in annotation quality. These three texts, annotated in XML format, are available at [https://github.com/nasedkinav/rst_corpus_rus] along with the other texts annotated so far. The web interface for the corpus will be created as soon as the appropriate number of texts (and tokens) is reached.
3. Conclusion
By establishing a reliable set of discourse relations we have formed a sound basis for further work. The two iterations of IAA measurement give us reason to believe that using the final relation list will lead to less biased annotation from now on, which is very important given the well-known subjectivity in the interpretation of discourse relations.
During the rest of the project every text will be annotated by one person and then checked, but not annotated, by another. We plan to measure IAA regularly to ensure that the agreement level remains high enough.
After annotating approximately one hundred texts, we plan to conduct several experiments on automatic recognition of EDUs and discourse relations. Automatic rhetorical structure analysis often relies heavily on detecting linguistic discourse markers, i.e. connectives that join clauses and sentences into an interconnected piece of text. That is why during the annotation we will also record and analyze these markers in order to identify particular words and constructions that indicate discourse relations.
Acknowledgements
We would like to express our sincere gratitude to the advisor of the Discourse-annotated corpus of Russian project, Svetlana Toldova, for her continuous support of the project and related research.
References
1. Ananyeva M. I., Kobozeva M. B. (2016), Developing the corpus of Russian texts with markup based on the Rhetorical Structure Theory [Razrabotka korpusa tekstov na russkom yazyke s razmetkoj na osnove teorii ritoricheskikh struktur], Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016" [Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii: Trudy Mezhdunarodnoy Konferentsii "Dialog 2016"], Moskva, available at: http://www.dialog-21.ru/media/3460/ananyeva.pdf
2. Ananyeva M. I., Kobozeva M. B. (2016), Discourse analysis for natural language
processing tasks [Diskursivnyi analiz v zadachakh obrabotki yestestvennogo ya-
zyka], Informatics, management and systems analysis: Proceedings of the IV All-
Russian Conference for young scientists [Informatika, upravleniye i sistemnyi
analiz: Trudy IV Vserossiiskoi nauchnoi konferencii molodykh uchenykh s mezh-
dunarodnym uchastiyem], Tver', pp. 138–148, available at: http://www.isa.ru/
icsa/images/stories/%D0%A1%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%
D0%BA_%D0%A2%D0%BE%D0%BC_1.pdf#page=139
3. Artstein R., Poesio M. (2008), Inter-coder agreement for computational linguis-
tics. Computational Linguistics 34(4), pp. 555–596.
4. Cao S. Y., da Cunha I., Iruskieta M. (2016), Elaboration of a Spanish-Chinese par-
allel corpus with translation and language learning purposes, 34th International
Conference of the Spanish Society for Applied Linguistics (AESLA), to appear.
5. Carlson L., Marcu D., Okurowski M. E. (2003), Building a Discourse-Tagged Cor-
pus in the Framework of Rhetorical Structure Theory, Current directions in dis-
course and dialogue, Kluwer Academic Publishers, pp. 85–112.
6. Eckle-Kohler J., Kluge R., Gurevych I. (2015), On the Role of Discourse Markers
for Discriminating Claims and Premises in Argumentative Discourse, Proceed-
ings of the 2015 Conference on Empirical Methods in Natural Language Process-
ing (EMNLP), pp. 2236–2242.
7. Halliday, M. A. K., Hasan R. (1976), Cohesion in English. London: Longman.
8. Hayes A. F., Krippendorff K. (2007), Answering the call for a standard reliability measure for coding data, Communication Methods and Measures, Vol. 1, pp. 77–89.
9. Iruskieta M., Aranzabe M. J., Díaz de Ilarraza A., Gonzalez I., Lersundi M., Lo-
pez de la Calle O. (2013), The RST Basque TreeBank: an online search interface
to check rhetorical relations, IV Workshop RST and Discourse Studies. Fortaleza,
Brasil, Outubro 21–23, pp. 40–49.
10. Kibrik A., Podlesskaya V. (2009), Stories of dreams: A Corpus-based Research
of Russian Oral Discourse [Rasskazy o snovideniyakh: Korpusnoye issledovaniye
ustnogo russkogo diskursa], Yazyki slavyanskikh kul'tur, Moskva.
11. Krippendorff K. (2013), Computing Krippendorff's Alpha-Reliability, available
at: http://web.asc.upenn.edu/usr/krippendorff/mwebreliability5.pdf.
12. Lacy S., Riffe D. (1996), Sampling error and selecting intercoder reliability sam-
ples for nominal content categories: Sins of omission and commission in mass
communication quantitative research, Journalism & Mass Communication
Quarterly No 73, pp. 969–973.
13. Loukachevitch N. V., Dobrov G. B., Kibrik A. A., Khudiakova M. V., Linnik A. S. (2011),
Factors of referential choice: computational modeling [Faktory referencial’nogo
vybora: komp’yuternoye modelirovaniye], Computational Linguistics and Intel-
lectual Technologies: Proceedings of the International Conference “Dialogue 2011”
[Komp’yuternaya Lingvistika i Intellektual’nye Tekhnologii: Trudy Mezhdunarod-
noy Konferentsii “Dialog 2011”], Moskva, available at: http://www.dialog-21.ru/
media/1446/45.pdf
14. Mann W. C., Thompson S. A. (1988), Rhetorical Structure Theory: Toward
a Functional Theory of Text Organization, Text 8, 3, 1988, pp. 243–281.
15. Miltsakaki E., Prasad R., Joshi A., Webber B. (2004), Annotating discourse connectives and their arguments, Proceedings of the HLT/NAACL Workshop on Frontiers in Corpus Annotation, Boston, Massachusetts, USA, pp. 9–16.
16. Pardo T. A. S., Nunes M. G. V., Rino L. H. M. (2004), DiZer: An automatic discourse analyzer for Brazilian Portuguese, Brazilian Symposium on Artificial Intelligence, Springer Berlin Heidelberg, pp. 224–234.
17. Prasad R., Miltsakaki E., Dinesh N., Lee A., Joshi A., Robaldo L., Webber B. (2007).
The Penn Discourse Treebank 2.0 Annotation Manual. Technical Report 203, In-
stitute for Research in Cognitive Science, University of Pennsylvania.
18. Stede M., Neumann A. (2014), Potsdam Commentary Corpus 2.0: Annotation for
Discourse Research. Proc. of LREC, Reykjavik.
19. Van der Vliet N., Berzlanovich I., Bouma G., Egg M., Redeker G. (2011), Building
a Discourse-Annotated Dutch Text Corpus. Proceedings of the Workshop “Be-
yond Semantics: Corpus-based Investigations of Pragmatic and Discourse Phe-
nomena”, Goettingen, Germany, 23–25 February 2011, pp. 157–171.
20. Webber B., Prasad R., Lee A., Joshi A. (2016), A Discourse-Annotated Corpus of Conjoined VPs, Proceedings of the 10th Linguistic Annotation Workshop, Berlin, pp. 22–31.
21. Zeyrek D., Demirşahin I., Sevdik Çallı A. B., Çakıcı R. (2013), Turkish Discourse
Bank: Porting a discourse annotation style to a morphologically rich language.
Dialogue and Discourse, 4(2), pp. 174–184.
22. Zhou Y., Xue N. (2015), The Chinese Discourse TreeBank: A Chinese corpus annotated with discourse relations, Language Resources and Evaluation, pp. 397–431.