The First Komi-Zyrian Universal Dependencies Treebanks
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 126–132
Brussels, Belgium, November 1, 2018. © 2018 Association for Computational Linguistics
Niko Partanen¹, Rogier Blokland², KyungTae Lim³, Thierry Poibeau³, Michael Rießler⁴
niko.partanen@kotus.fi, rogier.blokland@moderna.uu.se,
kyungtae.lim@ens.fr, thierry.poibeau@ens.fr,
michael.riessler@uni-bielefeld.de
¹ Institute for the Languages of Finland
² University of Uppsala
³ LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC)
⁴ University of Bielefeld
Abstract
Two Komi-Zyrian treebanks were included in
the Universal Dependencies 2.2 release. This
article contextualizes the treebanks, discusses
the process through which they were created,
and outlines the future plans and timeline for
the next improvements. Special attention is
paid to the possibilities of using UD in the doc-
umentation and description of endangered lan-
guages.
1 Introduction
Komi-Zyrian is a Uralic language spoken in the
north-eastern corner of the European part of Rus-
sia. Smaller Komi settlements can also be found
elsewhere in northern Russia, from the Kola
Peninsula to Western Siberia. The language has
approximately 160,000 speakers and, although not
moribund, is still threatened by the local major-
ity language, Russian. There is a long history of
research on Komi, but contemporary descriptions
and computational resources could be greatly im-
proved. Over the last few years some larger docu-
mentation projects have been carried out on Komi.
These projects have focused on the most endan-
gered spoken varieties, while at the same time,
new written resources for Standard Komi have become
available.
This paper discusses the creation of two Komi
treebanks, one containing written and another spo-
ken data. Both the treebanks and the scripts used
to create them are included in this paper as sup-
plementary materials, and the treebanks are part
of the Universal Dependencies 2.2 release (Nivre
et al., 2018). The treebanks are called Lattice and
IKDP, because most of the work on
them has been carried out at the LATTICE-CNRS
laboratory in Paris, in collaboration
with the IKDP-2¹ project, which is
a continuation of earlier work that produced a language
documentation corpus of Komi called IKDP
(Blokland et al., 2009-2018). A comprehensive
descriptive grammar of Komi with a focus on syn-
tax is currently being written by members of the
team. The present treebanks are intended to sup-
port the grammatical description.
The authors’ recent research at the LATTICE laboratory
has focused on dependency parsing of low-resource
languages, using Komi and North Saami
as examples (Lim et al., 2018). The Lattice treebank
was initially created for use in testing dependency
parsers, and the IKDP treebank was created
at a later date with the aim of also including spoken
language data.
2 Language Documentation
Language documentation refers to a linguistic
practice aiming at the provision of long-lasting
and accountable records of speech events, usu-
ally carried out in the context of endangered lan-
guages and with the goal of understanding spoken
communication beyond mere structural grammar.
Himmelmann (1998) was the first to define "Doc-
umentary Linguistics" as separate from "Descrip-
tive Linguistics", although with considerable over-
lap between the two. He also pays special atten-
tion to the interface between research outputs and
primary data, ideally including audio and video
recordings (Himmelmann, 2006). This has generally
been the approach in the present work too,
so that the spoken language UD corpus is directly
connected to the documentary multimedia corpus
through matching sentence IDs. This allows the
treebank sentences to be connected to rich non-linguistic
metadata. Additionally, the coded time alignment
in the original utterances provides information
about turn-taking and overlapping at the
millisecond level. The documentary corpus refers
to the materials collected and processed within the
language documentation activities, which are usually
fieldwork-based and aim to represent various
genres and speech practices, all of which are often
under threat of disappearance.
¹ https://langdoc.github.io/IKDP-2
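The linkage through sentence IDs described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that each sentence carries a standard CoNLL-U `# sent_id` comment and that the documentary corpus can be exported as a dictionary keyed by the same IDs; the sample IDs, the placeholder tokens, and the metadata fields (`recording`, `start_ms`, `end_ms`) are hypothetical and not taken from the treebanks.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = List[str]


def iter_sentences(conllu_text: str) -> Iterator[Tuple[Optional[str], List[Row]]]:
    """Yield (sent_id, token_rows) for every sentence in a CoNLL-U string."""
    sent_id: Optional[str] = None
    rows: List[Row] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if rows:
                yield sent_id, rows
            sent_id, rows = None, []
        elif line.startswith("# sent_id") and "=" in line:
            sent_id = line.split("=", 1)[1].strip()
        elif not line.startswith("#"):
            rows.append(line.split("\t"))
    if rows:
        yield sent_id, rows


def link_metadata(conllu_text: str, metadata: Dict[str, dict]) -> Dict[str, Optional[dict]]:
    """Map each sentence ID in the treebank to its corpus record (or None)."""
    return {sid: metadata.get(sid) for sid, _ in iter_sentences(conllu_text) if sid}


# Hypothetical two-sentence sample with placeholder tokens.
SAMPLE = (
    "# sent_id = demo-001\n"
    + "\t".join(["1", "tok", "tok", "NOUN", "_", "_", "0", "root", "_", "_"]) + "\n\n"
    + "# sent_id = demo-002\n"
    + "\t".join(["1", "tok", "tok", "VERB", "_", "_", "0", "root", "_", "_"]) + "\n"
)

# Hypothetical metadata keyed by the same IDs (recording name, time codes in ms).
METADATA = {"demo-001": {"recording": "session_a.wav", "start_ms": 1200, "end_ms": 3450}}

linked = link_metadata(SAMPLE, METADATA)
```

Sentences without a matching record map to `None`, which makes gaps between the treebank and the multimedia corpus easy to spot.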
In language documentation, traditional annota-
tion methods have mainly consisted of so-called
interlinear glossing.² This is normally done manually
or semi-manually, i.e. with little or no use
of natural language processing tools (cf. Gerstenberger
et al., 2016). With the available Komi
data in our project, however, we wanted to ap-
ply an annotation method that would connect our
work more closely to established corpus linguis-
tics and NLP. Universal Dependencies appeared to
be a very attractive annotation scheme as it aims
at cross-linguistic comparability and already con-
tains several Uralic languages. Komi-Zyrian is
currently the sixth Uralic language to be included
in the project.
Work on Komi complements the developments
associated with the emergence of new
Uralic treebanks in 2017, with new repositories
created for North Saami³ and Erzya (Rueter and
Tyers, 2018). Another noteworthy trend is that
several treebanks are currently being created
for endangered languages in situations similar to
that of Komi. As far as we have been able to
ascertain, these are, at least: Dargwa spoken in
the Caucasus (Kozhukhar, 2017), Pnar⁴ spoken in
South-East Asia and Shipibo-Konibo⁵ spoken in
Peru. The description of the last treebank mentioned
does not indicate the use of language documentation
materials, but as the language is very
small, the context is comparable. To our knowledge,
the IKDP treebank discussed here is the first
treebank included in the UD release that is directly
based on language documentation material. It is
too early to say whether there will be more similar
treebanks in the future and within what timeframe,
but having more materials like these included
in UD would fit very well into the original ideas of
the multifunctional language documentation enterprise.
² Cf., e.g., the Leipzig Glossing Rules: https://www.eva.mpg.de/lingua/resources/glossing-rules.php
³ https://github.com/UniversalDependencies/UD_North_Sami-Giella
⁴ https://github.com/UniversalDependencies/UD_Pnar-PTB
⁵ https://github.com/UniversalDependencies/UD_Shipibo_Konibo-PUCP
3 Methodology
The initial analysis of Komi plain text was created
using Giellatekno’s⁶ open infrastructure (Moshagen
et al., 2014), which is currently at a rather mature
level for Komi. The syntactic analysis component
requires the most further work, which in turn
can be guided by the work on treebanks. Similar
rule-based architectures have already been used
for other treebanks as well. The North Saami
and Erzya corpora, for example, seem to have been
created using a similar approach. Some work has
been done on integrating these NLP tools
into workflows commonly used in language documentation
(Gerstenberger et al., 2016, 2017a,b).
Since these languages often lack larger annotated
resources, the use of infrastructures other than
rule-based ones has not been common or possible,
but these workflows have been implemented in a
modular fashion that would enable the integration
of other tools when they become available
or reach the needed accuracy.
It has been demonstrated that it is possible to
convert annotations from Giellatekno’s annotation
scheme into the UD scheme (Sheyanova and Tyers,
2017), and this has also worked well in our
case, although the exact procedure will continue
to be refined as the token count of the corpus
grows, which will ultimately also reveal rarer and
not-yet-analysed morphosyntactic features. After
an initial phase of manually editing CoNLL-U files,
the UD Annotatrix tool (Tyers et al., 2018) was
adopted in January 2018, which marked the midpoint
in the project’s timeline. This greatly improved
the annotation speed and consistency.
The treebank creation thus consisted of the following
steps:
1. Sending Komi sentences to the Giellatekno
morphosyntactic analyser (consisting of an
FST component for morphological categories
and a syntactic component using Constraint
Grammar)
2. Resolving the remaining ambiguity manually
3. Adding the missing syntactic relations manually
in UD Annotatrix
4. Automatically converting the analyser’s
XPOS tags into UPOS tags and converting
morphological feature tags into their UD
counterparts
5. Manual correction and verification
⁶ http://giellatekno.uit.no
The current workflow involves a rather large
amount of manual work. We are interested in
testing various approaches to morphological and
syntactic analysis so that different (rule-based,
statistical and hybrid) parsers can eventually
replace the manual work. Some tests have already
been carried out with the dependency parser used
by the Lattice team in the CoNLL 2017 Shared Task
(Lim and Poibeau, 2017) and a follow-up
project (Partanen et al., 2018).
The treebank processing pipeline is built on
several scripts and existing tools. The
primary analysis is done within the Giellatekno
toolkit (building on FST morphology and Constraint
Grammar), where tokenization, morphological
analysis and rule-based disambiguation are
tied to the script ‘kpvdep’. The script returns a
vislcg3 file that contains all ambiguities left after
the analysis. Once the ambiguities are resolved
manually, the vislcg3 file can be imported into the
UD Annotatrix tool. As a final step, the Giellatekno
POS tags and morphological features are converted
to follow the UD standard with a Python
script, originally written by Francis Tyers.⁷ A
modified version of the script, together with the conversion
pattern file, is stored in the not-to-release folder in the
dev branch of the Lattice treebank, which is the location
where all development scripts of both treebanks
will be maintained.
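The final conversion step can be illustrated with a small table-driven sketch. It is not the script referred to above, and the two mapping tables below are illustrative assumptions rather than the contents of the actual conversion pattern file; the function name `convert_tags` is likewise hypothetical. The sketch only shows the general idea of mapping analyser tags to a UPOS tag and a UD feature string.

```python
from typing import Dict, List, Tuple

# Illustrative mappings only; the real pattern file may differ.
POS_MAP: Dict[str, str] = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
    "Adv": "ADV",
    "Po": "ADP",
}
FEAT_MAP: Dict[str, Tuple[str, str]] = {
    "Sg": ("Number", "Sing"),
    "Pl": ("Number", "Plur"),
    "Nom": ("Case", "Nom"),
    "Ine": ("Case", "Ine"),
}


def convert_tags(tags: List[str]) -> Tuple[str, str, List[str]]:
    """Return (UPOS, FEATS, unmapped_tags) for one analysed token."""
    upos = "X"  # UD tag for tokens without a usable part of speech
    feats: Dict[str, str] = {}
    unmapped: List[str] = []
    for tag in tags:
        if tag in POS_MAP:
            upos = POS_MAP[tag]
        elif tag in FEAT_MAP:
            name, value = FEAT_MAP[tag]
            feats[name] = value
        else:
            # Keep unknown tags so they can be reviewed manually.
            unmapped.append(tag)
    # FEATS are joined with "|" in alphabetical order; "_" marks an empty field.
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    feat_string = "|".join(f"{name}={value}" for name, value in ordered) or "_"
    return upos, feat_string, unmapped


result = convert_tags(["N", "Sg", "Ine"])
```

Returning the unmapped tags separately fits the workflow described here, in which rarer features surface gradually and are then added to the mapping by hand.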
4 Data Sources and Design Principles
Most of the work on the Komi language is currently
being done by collaborators of FU-Lab⁸ in
Syktyvkar, the capital of the Komi Republic in
Russia. The work of FU-Lab, led by Marina Fedina,
has been exceptional, as it has
resulted in a significant number of Komi-language
books being digitalized, made available online⁹
and converted into a linguistic corpus.¹⁰ The corpus
currently contains 40 million words, and the
long-term goal is to digitalize all books and other
printed texts ever published in Komi-Zyrian. The
number of publications is approximately 4,500
books, plus tens of thousands of newspaper and
journal issues. A significant portion of the latter
are available in the Public Domain as part of
the Fenno-Ugrica project of the National Library
of Finland.¹¹ We have chosen to use exclusively
openly available data for the Lattice treebank in
order to ensure as broad and simple reuse as possible.
The forthcoming releases will include more
genres of text, such as newspaper texts and longer
sections of Wikipedia articles.
⁷ https://github.com/ftyers/ud-scripts/blob/master/conllu-feats.py
⁸ http://fu-lab.ru
All sentences in the Lattice treebank are presented
in the contemporary orthography, even
when they were originally published using various
earlier Komi writing systems. The proportion
of texts originally written in the Molodcov alphabet
will rise dramatically in the next releases,
as this is probably the most commonly used orthography
in the upcoming texts. Storing several
orthographic variants may be necessary. Conversion
between systems has been carried out using
FU-Lab’s Molodcov converter.¹² The data originates
from scanned books through text recognition,
currently with loss of page coordinates. This
connects to the question of how to retrieve arbitrary
information from different sources that can
be connected to the sentence IDs: metadata, page
positions, page images, time codes and audio segments.
We considered it very important to also include
spoken language in the treebank, ideally
eventually covering all dialects. In recent
years, one of the largest research projects investigating
spoken Komi has been the IKDP project,
led by Rogier Blokland in 2014–2016, which resulted
in a large transcribed spoken language corpus
(Blokland et al., 2009-2018). The IKDP
treebank contains dialectal texts taken from this
corpus, and since written Komi does not follow
the exact same principles employed in the transcriptions,
it seems problematic to mix these materials
together. The orthographic conventions
of the spoken treebank are basically similar to
those used in the recent Komi dialect dictionary
(Beznosikova et al., 2012), with only relatively
subtle differences. When it comes to spoken features,
corrections are kept and marked with the relation
reparandum, but features such as pauses are
not separately marked. The user can access the
original archived audio, which enables a more detailed
analysis of spoken phenomena if desired. In
their typographic simplicity, the transcribed texts
are reminiscent of some of the dialect texts published
previously in various printed text collections
(without the original audio recordings). The
context of the spoken data here is therefore not
only a faithful representation of the spoken signal,
which could also include more exact phonetic
transcriptions, but also the larger landscape of spoken
language resources which we would like to integrate
into our NLP ecosystem.
⁹ http://komikyv.org
¹⁰ http://komicorpora.ru
¹¹ https://fennougrica.kansalliskirjasto.fi
¹² http://fu-lab.ru/convertermolodcov
Furthermore, because local Komi speech and
research communities are often conscious of or-
thographic norms, we wanted to draw a clear
boundary between written and spoken representa-
tions. Additionally, the spoken language treebank
contains a large number of Russian phrases due to
code-switching, which makes it to some degree a
multilingual treebank. In the IKDP treebank, Russian
items are currently marked with a language
tag in the MISC field, but verifying that the Russian
annotations are consistent with monolingual Russian
treebanks is a topic that requires further attention.
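A language tag in the MISC field makes code-switched tokens easy to extract for this kind of verification. The sketch below is a minimal example under stated assumptions: the attribute name `Lang` is a placeholder (the name actually used in the treebank should be checked in its documentation), the rows follow the ten-column CoNLL-U layout, and the token forms are dummy values.

```python
from typing import Dict, List

Row = List[str]


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a CoNLL-U MISC field ('A=b|C=d') into a dictionary."""
    if misc in ("", "_"):
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def tokens_in_language(rows: List[Row], lang: str, key: str = "Lang") -> List[str]:
    """Return the forms of all syntactic words tagged with the given language."""
    return [
        row[1]
        for row in rows
        # Plain integer IDs only: skips multiword ranges (1-2) and empty nodes (1.1).
        if row[0].isdigit() and parse_misc(row[9]).get(key) == lang
    ]


# Dummy rows: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
ROWS = [
    ["1", "w1", "w1", "NOUN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "w2", "w2", "VERB", "_", "_", "0", "root", "_", "Lang=ru"],
    ["3", "w3", "w3", "ADV", "_", "_", "2", "advmod", "_", "SpaceAfter=No|Lang=ru"],
]

russian_forms = tokens_in_language(ROWS, "ru")
share = len(russian_forms) / len(ROWS)
```

The extracted tokens could then be compared against the annotation conventions of a monolingual Russian treebank, which is the check the paragraph above identifies as still outstanding.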
The sentences represent running texts and narra-
tives, and, to a great extent, they link together into
continuous larger text units. There are deviations
from this in situations where individual examples
have been selected in order to include instances
of each dependency relation in the treebank. This
was done particularly in the early stages of the
treebanks when it was important to gain more un-
derstanding of how different syntactic relations are
tagged consistently in UD. In the upcoming releases,
occurrences of each morphosyntactic phenomenon
present in Komi may also be hand-picked
from corpora to ensure that they occur in the treebanks,
the need for which is discussed next.
5 Some Questions Arising From Komi-Zyrian
As the majority of languages in UD are larger
Indo-European languages, the project does not yet
include many examples of languages with very
complex case systems. For example, Komi has
two values of nominal case that were not yet in-
cluded in the earlier documentation, namely the
approximative and the egressive. One issue aris-
ing when comparing current treebanks is the cross-
comparability of the case labels applied. Komi has
two cases that express a path of some sort, tra-
ditionally called prolative and transitive in Komi
and Uralic linguistics. These would match closely
with a case label already in the UD documentation,
perlative, found in Warlpiri, but the fact that there
are two very similar cases already makes the label-
ing problematic. Differences in case labeling are
related to further linguistic analyses that are possi-
ble with the corpora, as well as to parsing accuracy
in multilingual scenarios. In the present treebanks,
the traditional labels for Komi cases are used.
Another theoretical question arising from Komi
concerns the way different cases can be combined,
resulting in "double case marking". For example,
it is entirely possible to use several spatial case
markers linearly combined in one and the same
inflected noun form, and, although this is some-
what rare, examples can be easily found even for
more marginal combinations. For instance, the
case suffixes for elative and terminative can combine
to mark subtle changes in focus: vengrija-iC-edý
Hungary-ELA-TER ‘all the way from Hungary’
(see e.g. Bartens, 2003, 53). This raises the question
of how to best annotate this in UD. Of course
each combination could be labeled as a new case,
which is also sometimes seen in the literature on
Komi nominal case (Kuznetsov, 2012, p. 374),
but this would greatly increase the number of case
values that need to be documented, and most of
them would be very marginal and specific to in-
dividual languages. Another solution would be to
allow several case affixes to be added to one word
form. However, this would only help when sev-
eral cases are clearly combined and would not be
useful when new spatial cases have emerged from
postpositions, a phenomenon typical of Komi and
Udmurt dialects.
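The second option, several case values on one word form, can be sketched with the comma-separated value syntax that the CoNLL-U FEATS column already permits. Whether this notation is an appropriate analysis for stacked Komi cases is exactly the open question discussed above, so the snippet is an illustration of the mechanics only; the feature string is a constructed example, and the helper names are hypothetical.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse a FEATS string into {feature: [values]}, splitting multi-values on ','."""
    if feats in ("", "_"):
        return {}
    parsed: Dict[str, List[str]] = {}
    for item in feats.split("|"):
        name, _, values = item.partition("=")
        parsed[name] = values.split(",")
    return parsed


def has_double_case(feats: str) -> bool:
    """True if the word form carries more than one Case value."""
    return len(parse_feats(feats).get("Case", [])) > 1


# Constructed example: elative and terminative on the same noun form.
stacked = parse_feats("Case=Ela,Ter|Number=Sing")
```

One limitation of this notation is that a comma-separated list does not by itself record the linear order of the suffixes, so affix order would have to be stored elsewhere if it matters for the analysis.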
Currently, a large portion of the cases in UD
documentation are used only in Hungarian. In-
cluding more languages with large case systems,
such as Uralic or Northeast Caucasian languages
like Lezgian, would only increase the number of
names for case values used mainly in individ-
ual languages. Eventually this also boils down
to the question of how comparable the cases in
different languages actually are. Haspelmath has
argued convincingly that case labels are valid
only for particular languages (Haspelmath, 2009,
510), and the issue probably cannot be explicitly
solved within UD either, but for the sake of us-
ability of treebanks and their suitability for mul-
tilingual NLP applications, some harmonization
would seem desirable. One alternative could be
to create a higher layer of mapping that connects
language-specific labels to broader shared cate-
gories. In this way, both Komi cases expressing
a path could be connected to a concept of move-
ment along a path, but the language-specific nu-
ances would not be lost.
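Such a mapping layer can be pictured as a simple lookup from language-specific labels to a shared category. The sketch below is hypothetical throughout: the shared category name `Path`, the spelled-out label names, and the function names are assumptions made for illustration, and only the three cases mentioned in this section are included.

```python
from typing import Dict, Optional, Tuple

# (language, language-specific case label) -> broader shared category.
SHARED_CATEGORY: Dict[Tuple[str, str], str] = {
    ("Komi-Zyrian", "Prolative"): "Path",
    ("Komi-Zyrian", "Transitive"): "Path",
    ("Warlpiri", "Perlative"): "Path",
}


def shared_category(language: str, label: str) -> Optional[str]:
    """Look up the shared category; None means no mapping has been defined."""
    return SHARED_CATEGORY.get((language, label))


def comparable(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Two (language, label) pairs are comparable if they share a category."""
    cat_a, cat_b = shared_category(*a), shared_category(*b)
    return cat_a is not None and cat_a == cat_b


same_path = comparable(("Komi-Zyrian", "Prolative"), ("Warlpiri", "Perlative"))
```

Because the language-specific label stays in the key, the treebank annotation itself is left untouched and the harmonization lives entirely in the extra layer.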
6 Conclusion and Further Work
The written and spoken treebanks have 1389 and
988 tokens, respectively. Due to their small size,
they have not been split into test and development
sets. Based on this experience, it already seems
clear that providing annotations in this framework
has several advantages compared to traditional
methods used in language documentation projects.
The main benefits are comparability between different
languages and straightforward licensing
and distribution within the UD framework.
It can be argued that tagging according to the UD
principles is necessarily a compromise, and that
it may not express all particularities of individual
languages. One possible way to solve this
problem is to include further annotations in the
MISC column. Another possible approach would
be to provide different parts of the documentary
corpus with varying degrees of annotations. In
any case, based on our experience, we would
strongly encourage endangered language docu-
mentation projects to take a small segment of their
materials and add to it an additional layer of anno-
tations in the Universal Dependencies framework.
Language documentation data is usually stored in
archives that require access requests. This is not
very compatible with openly available treebanks.
Still, it should be possible to collect small subsets
of materials with the clear intention and permis-
sion for these recordings to be openly licensed, or
to use texts old enough that they are copyright free.
New material is currently being brought into the
Lattice treebank. The main genres obtained from
the Fenno-Ugrica collection are newspaper texts,
non-fiction works and schoolbooks. Samples of these,
along with some larger Wikipedia texts, will be included
in the next UD release 2.3. The next phase
of the IKDP treebank will include individual texts
from the Komi recordings made by Eric Vászolyi
in the 1950s and 1960s (Vászolyi-Vasse, 2003),
which the present authors have acquired permission
to re-publish electronically. These texts originate
from a time and place of intensive language
contact between Komi-Zyrian and Tundra Nenets,
which makes them a particularly interesting target
for further study.
One possibly useful addition to the treebank
could be English glosses in the MISC field, since
many linguists are used to working with data from
endangered languages in a format like this. The
English gloss could contain a contextual transla-
tion of the lemma, for example, which would make
the sentences in the treebank much more accessi-
ble to different linguistic audiences.
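Adding such a gloss amounts to appending one more attribute to the MISC field of each token. The following is a minimal sketch, assuming an attribute name `Gloss` (a placeholder, not an established convention of the treebanks) and the usual `|`-separated `Key=Value` layout of the field; the function name is hypothetical.

```python
def set_misc_attribute(misc: str, key: str, value: str) -> str:
    """Return a MISC string with key=value added, replacing any earlier value."""
    if any(ch in value for ch in "|\t\n"):
        # These characters would break the column or attribute structure.
        raise ValueError("value must not contain '|', tabs or newlines")
    items = [] if misc in ("", "_") else misc.split("|")
    # Drop a previous occurrence of the same key, then append the new pair.
    items = [item for item in items if item.partition("=")[0] != key]
    items.append(f"{key}={value}")
    return "|".join(items)


glossed = set_misc_attribute("SpaceAfter=No", "Gloss", "house")
```

Keeping the gloss in MISC leaves the core annotation columns unchanged, so tools that ignore the field continue to work as before.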
In terms of size, the target is to reach 5,000
tokens in both treebanks during 2018, and to in-
crease this to 20,000 in the first half of 2019. Our
long-term goal is to create a resource that would
contribute to research on Komi and provide bet-
ter resources for Natural Language Processing of
this language, which has yet to receive sufficient
attention in computational linguistic research.
7 Acknowledgements
We want to thank the reviewers for their use-
ful comments. This work has been developed
in the framework of the LAKME project funded
by a grant from Paris Sciences et Lettres (IDEX
PSL reference ANR-10-IDEX-0001-02). Thierry
Poibeau is partially supported by an RGNF-CNRS
grant (between the LATTICE-CNRS Laboratory
and the Russian State University for the Humanities
in Moscow). KyungTae Lim is partially supported
by the ERA-NET Atlantis project. Niko
Partanen’s work has been carried out at the LAT-
TICE laboratory, and besides Partanen, both Ro-
gier Blokland and Michael Rießler collaborate
within the project Language Documentation meets
Language Technology: the Next Step in the De-
scription of Komi, funded by the Kone Founda-
tion. Thanks to Jack Rueter for numerous discus-
sions on Komi and Erzya, and to Alexandra Kell-
ner for proofreading the paper.
References
Raija Bartens. 2003. Kahden kaasuspäätteen jonoista
suomalais-ugrilaisissa kielissä. In Bakró-Nagy Mar-
ianne and Károly Rédei, editors, Ünnepi kötet Honti
László tiszteletére, pages 46–54. MTA, Budapest.
L.M. Beznosikova, E.A. Ajbabina, N.K. Zaboeva, and
R.I. Kosnyreva. 2012. Komi sërnisikas kyvčukör.
Slovar dialektov komi âzyka: v 2-h tomah. Institut
âzyka, literatury i istorii Komi naučnogo centra
Uralskogo otdeleniâ Rossijskoj akademii nauk, Syktyvkar.
Rogier Blokland, Marina Fedina, Niko Partanen, and
Michael Rießler. 2009-2018. IKDP. In The
Language Archive (TLA): Donated Corpora. Max
Planck Institute for Psycholinguistics, Nijmegen.
Ciprian Gerstenberger, Niko Partanen, and Michael
Rießler. 2017a. Instant annotations in ELAN cor-
pora of spoken and written Komi, an endangered
language of the Barents Sea region. In Proceed-
ings of the 2nd Workshop on the Use of Compu-
tational Methods in the Study of Endangered Lan-
guages, pages 57–66. ACL.
Ciprian Gerstenberger, Niko Partanen, Michael
Rießler, and Joshua Wilbur. 2016. Utilizing
language technology in the documentation of
endangered Uralic languages. Northern European
Journal of Language Technology, 4:29–47.
Ciprian Gerstenberger, Niko Partanen, Michael
Rießler, and Joshua Wilbur. 2017b. Instant anno-
tations: Applying NLP methods to the annotation
of spoken language documentation corpora. In
Proceedings of the 3rd International Workshop on
Computational Linguistics for Uralic languages,
pages 25–36. ACL.
Martin Haspelmath. 2009. Terminology of case. In
Andrew Spencer and Andrej L. Malchukov, editors,
The Oxford handbook of case, pages 505–517. OUP,
Oxford.
Nikolaus Himmelmann. 2006. Language documenta-
tion: What is it and what is it good for? In Jost Gip-
pert, Ulrike Mosel, and Nikolaus Himmelmann, ed-
itors, Essentials of Language Documentation, pages
1–30. Mouton de Gruyter, Berlin.
Nikolaus P. Himmelmann. 1998. Documentary and de-
scriptive linguistics. Linguistics, 36:161–195.
Alexandra Kozhukhar. 2017. Universal dependencies
for Dargwa Mehweb. In Proceedings of the Fourth
International Conference on Dependency Linguis-
tics, pages 92–99. ACL.
Nikolay Kuznetsov. 2012. Matrix of cognitive do-
mains for Komi local cases. Journal of Estonian and
Finno-Ugric Linguistics, 3(1):373–394.
KyungTae Lim, Niko Partanen, and Thierry Poibeau.
2018. Multilingual Dependency Parsing for Low-
Resource Languages: Case Studies on North Saami
and Komi-Zyrian. In Proceedings of the Eleventh
International Conference on Language Resources
and Evaluation. ELRA.
KyungTae Lim and Thierry Poibeau. 2017. A sys-
tem for multilingual dependency parsing based on
bidirectional LSTM feature representations. In Pro-
ceedings of the CoNLL 2017 Shared Task: Multilin-
gual Parsing from Raw Text to Universal Dependen-
cies, pages 63–70. ACL.
Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond
Trosterud, and Francis M Tyers. 2014. Open-source
infrastructures for collaborative work on under-
resourced languages. In Proceedings of the Ninth
International Conference on Language Resources
and Evaluation, pages 71–77. ELRA.
Joakim Nivre, Mitchell Abrams, Željko Agić, Lars
Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe,
Gashaw Arutie, Masayuki Asahara, Luma Ateyah,
Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus,
Elena Badmaeva, Miguel Ballesteros, Esha Banerjee,
Sebastian Bank, Verginica Barbu Mititelu, John Bauer,
Sandra Bellato, Kepa Bengoetxea, Riyaz Ahmad Bhat,
Erica Biagetti, Eckhard Bick, Rogier Blokland,
Victoria Bobicev, Carl Börstell, Cristina Bosco,
Gosse Bouma, Sam Bowman, Adriane Boyd,
Aljoscha Burchardt, Marie Candito, Bernard Caron,
Gauthier Caron, Gülşen Cebiroğlu Eryiğit,
Giuseppe G. A. Celano, Savas Cetin, Fabricio Chalub,
Jinho Choi, Yongseok Cho, Jayeol Chun, Silvie Cinková,
Aurélie Collomb, Çağrı Çöltekin, Miriam Connor,
Marine Courtin, Elizabeth Davidson,
Marie-Catherine de Marneffe, Valeria de Paiva,
Arantza Diaz de Ilarraza, Carly Dickerson, Peter Dirix,
Kaja Dobrovoljc, Timothy Dozat, Kira Droganova,
Puneet Dwivedi, Marhaba Eli, Ali Elkahky,
Binyam Ephrem, Tomaž Erjavec, Aline Etienne,
Richárd Farkas, Hector Fernandez Alcalde,
Jennifer Foster, Cláudia Freitas, Katarína Gajdošová,
Daniel Galbraith, Marcos Garcia, Moa Gärdenfors,
Kim Gerdes, Filip Ginter, Iakes Goenaga,
Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg,
Xavier Gómez Guinovart, Berta Gonzáles Saavedra,
Matias Grioni, Normunds Grūzītis, Bruno Guillaume,
Céline Guillot-Barbance, Nizar Habash, Jan Hajič,
Jan Hajič jr., Linh Hà Mỹ, Na-Rae Han, Kim Harris,
Dag Haug, Barbora Hladká, Jaroslava Hlaváčová,
Florinel Hociung, Petter Hohle, Jena Hwang, Radu Ion,
Elena Irimia, Tomáš Jelínek, Anders Johannsen,
Fredrik Jørgensen, Hüner Kaşıkara, Sylvain Kahane,
Hiroshi Kanayama, Jenna Kanerva, Tolga Kayadelen,
Václava Kettnerová, Jesse Kirchner, Natalia Kotsyba,
Simon Krek, Sookyoung Kwak, Veronika Laippala,
Lorenzo Lambertino, Tatiana Lando,
Septina Dian Larasati, Alexei Lavrentiev, John Lee,
Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit,
Herman Leung, Cheuk Ying Li, Josie Li, Keying Li,
KyungTae Lim, Nikola Ljubešić, Olga Loginova,
Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz,
Aibek Makazhanov, Michael Mandl, Christopher Manning,
Ruli Manurung, Cătălina Mărănduc, David Mareček,
Katrin Marheinecke, Héctor Martínez Alonso,
André Martins, Jan Mašek, Yuji Matsumoto,
Ryan McDonald, Gustavo Mendonça, Niko Miekka,
Anna Missilä, Cătălin Mititelu, Yusuke Miyao,
Simonetta Montemagni, Amir More, Laura Moreno Romero,
Shinsuke Mori, Bjartur Mortensen, Bohdan Moskalevskyi,
Kadri Muischnek, Yugo Murawaki, Kaili Müürisep,
Pinkey Nainwani, Juan Ignacio Navarro Horñiacek,
Anna Nedoluzhko, Gunta Nešpore-Bērzkalne,
Lương Nguyễn Thị, Huyền Nguyễn Thị Minh,
Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi,
Stina Ojala, Adédayọ̀ Olúòkun, Mai Omura,
Petya Osenova, Robert Östling, Lilja Øvrelid,
Niko Partanen, Elena Pascual, Marco Passarotti,
Agnieszka Patejuk, Siyao Peng, Cenel-Augusto Perez,
Guy Perrier, Slav Petrov, Jussi Piitulainen,
Emily Pitler, Barbara Plank, Thierry Poibeau,
Martin Popel, Lauma Pretkalniņa, Sophie Prévost,
Prokopis Prokopidis, Adam Przepiórkowski,
Tiina Puolakainen, Sampo Pyysalo, Andriela Rääbis,
Alexandre Rademaker, Loganathan Ramasamy,
Taraka Rama, Carlos Ramisch, Vinit Ravishankar,
Livy Real, Siva Reddy, Georg Rehm, Michael Rießler,
Larissa Rinaldi, Laura Rituma, Luisa Rocha,
Mykhailo Romanenko, Rudolf Rosa, Davide Rovati,
Valentin Roșca, Olga Rudina, Shoval Sadde,
Shadi Saleh, Tanja Samardžić, Stephanie Samson,
Manuela Sanguinetti, Baiba Saulīte,
Yanin Sawanakunanon, Nathan Schneider,
Sebastian Schuster, Djamé Seddah, Wolfgang Seeker,
Mojgan Seraji, Mo Shen, Atsuko Shimada,
Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira,
Maria Simi, Radu Simionescu, Katalin Simkó,
Mária Šimková, Kiril Simov, Aaron Smith,
Isabela Soares-Bastos, Antonio Stella, Milan Straka,
Jana Strnadová, Alane Suhr, Umut Sulubacak,
Zsolt Szántó, Dima Taji, Yuta Takahashi,
Takaaki Tanaka, Isabelle Tellier, Trond Trosterud,
Anna Trukhina, Reut Tsarfaty, Francis Tyers,
Sumire Uematsu, Zdeňka Urešová, Larraitz Uria,
Hans Uszkoreit, Sowmya Vajjala, Daniel van Niekerk,
Gertjan van Noord, Viktor Varga, Veronika Vincze,
Lars Wallin, Jonathan North Washington, Seyi Williams,
Mats Wirén, Tsegay Woldemariam, Tak-sum Wong,
Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu,
Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman,
Manying Zhang, and Hanzhi Zhu. 2018. Universal
dependencies 2.2. LINDAT/CLARIN digital library
at the Institute of Formal and Applied Linguistics
(ÚFAL), Faculty of Mathematics and Physics,
Charles University.
Niko Partanen, KyungTae Lim, Michael Rießler, and
Thierry Poibeau. 2018. Dependency parsing of
code-switching data with cross-lingual feature rep-
resentations. In Proceedings of the 4th International
Workshop on Computational Linguistics for Uralic
languages, pages 1–17. ACL.
Jack Rueter and Francis Tyers. 2018. Towards an open-
source universal-dependency treebank for Erzya.
In Proceedings of the 4th International Workshop
on Computational Linguistics for Uralic languages,
pages 106–118. ACL.
Mariya Sheyanova and Francis M. Tyers. 2017. Anno-
tation schemes in North Sámi dependency parsing.
In Proceedings of the Third Workshop on Computa-
tional Linguistics for Uralic Languages, pages 66–
75. ACL.
Francis M Tyers, Mariya Sheyanova, and
Jonathan North Washington. 2018. UD Annotatrix:
An annotation tool for universal dependencies. In
Proceedings of the 16th International Workshop on
Treebanks and Linguistic Theories, pages 10–17.
ACL.
Eric Vászolyi-Vasse. 2003. Syrjaenica: Narratives,
folklore and folk poetry from eight dialects of the
Komi language. Vol. 1, Upper Izhma, Lower Ob,
Kanin Peninsula, Upper Jusva, Middle Inva, Udora.
Savariae, Szombathely.