Conference PaperPDF Available

The First Komi-Zyrian Universal Dependencies Treebanks

Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 126–132
Brussels, Belgium, November 1, 2018. c
2018 Association for Computational Linguistics
The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen1, Rogier Blokland2, KyungTae Lim3, Thierry Poibeau3, Michael Rießler4,,,,
1Institute for the Languages of Finland
2University of Uppsala
3LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC)
4University of Bielefeld
Two Komi-Zyrian treebanks were included in
the Universal Dependencies 2.2 release. This
article contextualizes the treebanks, discusses
the process through which they were created,
and outlines the future plans and timeline for
the next improvements. Special attention is
paid to the possibilities of using UD in the doc-
umentation and description of endangered lan-
1 Introduction
Komi-Zyrian is a Uralic language spoken in the
north-eastern corner of the European part of Rus-
sia. Smaller Komi settlements can also be found
elsewhere in northern Russia, from the Kola
Peninsula to Western Siberia. The language has
approximately 160,000 speakers and, although not
moribund, is still threatened by the local major-
ity language, Russian. There is a long history of
research on Komi, but contemporary descriptions
and computational resources could be greatly im-
proved. Over the last few years some larger docu-
mentation projects have been carried out on Komi.
These projects have focused on the most endan-
gered spoken varieties, while at the same time,
new written resources for Standard Komi have be-
came available.
This paper discusses the creation of two Komi
treebanks, one containing written and another spo-
ken data. Both the treebanks and the scripts used
to create them are included in this paper as sup-
plementary materials, and the treebanks are part
of the Universal Dependencies 2.2 release (Nivre
et al.,2018). The treebanks are called Lattice and
IKDP, due to the fact that most of the work on
them has been carried out at the LATTICE-CNRS
laboratory in Paris, and the work has been done
collaboratively with the IKDP-21project, which is
a continuation of earlier work that produced a lan-
guage documentation corpus of Komi called IKDP
(Blokland et al.,2009-2018). A comprehensive
descriptive grammar of Komi with a focus on syn-
tax is currently being written by members of the
team. The present treebanks are intended to sup-
port the grammatical description.
The authors’ recent research at LATTICE labo-
ratory has focused on dependency parsing of low-
resource languages, using Komi and North Saami
as examples (Lim et al.,2018). The Lattice tree-
bank was initially created for use in testing depen-
dency parsers, and the IKDP treebank was created
at a later date with the aim of also including spo-
ken language data.
2 Language Documentation
Language documentation refers to a linguistic
practice aiming at the provision of long-lasting
and accountable records of speech events, usu-
ally carried out in the context of endangered lan-
guages and with the goal of understanding spoken
communication beyond mere structural grammar.
Himmelmann (1998) was the first to define "Doc-
umentary Linguistics" as separate from "Descrip-
tive Linguistics", although with considerable over-
lap between the two. He also pays special atten-
tion to the interface between research outputs and
primary data, ideally including audio and video
recordings (Himmelmann,2006). This has gen-
erally been the approach in the present work too,
so that the spoken language UD corpus is directly
connected to the documentary multimedia corpus
through matching sentence IDs. This allows the
treebank sentences to be connected to rich non-
linguistic metadata. Additionally, the coded time-
alignment in the original utterances provides in-
formation about turn-taking and overlapping at the
millisecond level. The documentary corpus refers
to the materials collected and processed within the
language documentation activities, which are usu-
ally fieldwork-based and aim to represent various
genres and speech practices, all of which are often
under a threat of disappearance.
In language documentation, traditional annota-
tion methods have mainly consisted of so-called
interlinear glossing.2This is normally done man-
ually or semi-manually, i.e. with little or no use
of natural language processing tools (cf. Gersten-
berger et al.,2016). With the available Komi
data in our project, however, we wanted to ap-
ply an annotation method that would connect our
work more closely to established corpus linguis-
tics and NLP. Universal Dependencies appeared to
be a very attractive annotation scheme as it aims
at cross-linguistic comparability and already con-
tains several Uralic languages. Komi-Zyrian is
currently the sixth Uralic language to be included
in the project.
Work with Komi complements well the devel-
opments associated with the emergence of new
Uralic treebanks in 2017, with new repositories
created for North Saami3and Erzya (Rueter and
Tyers,2018). Another noteworthy trend is that
there are several treebanks currently being created
for endangered languages in situations similar to
that of Komi. As far as we have been able to
ascertain, these are, at least: Dargwa spoken in
the Caucasus (Kozhukhar,2017), Pnar4spoken in
South-East Asia and Shipibo-Konibo5spoken in
Peru. The description of the last treebank men-
tioned does not indicate the use of language doc-
umentation materials, but as the language is very
small, the context is comparable. To our knowl-
edge, the IKDP treebank discussed here is the first
treebank included in the UD release that is directly
2Cf., e.g., the Leipzig Glossing Rules https:
based on language documentation material. It is
too early to say whether there will be more simi-
lar treebanks in the future and within what time-
frame, but having more materials like these in-
cluded in UD would fit into the original ideas of
the multifunctional language documentation en-
terprise very well.
3 Methodology
The initial analysis of Komi plain text was created
using Giellatekno’s6open infrastructure (Mosha-
gen et al.,2014), which is currently at a rather ma-
ture level for Komi. The syntactic analysis compo-
nent demands the most further work, which in turn
can be guided by the work on treebanks. Simi-
lar rule-based architectures have already been used
for other treebanks as well. The Northern Saami
and Erzya corpora, for example, seem to have been
created using a similar approach. Some work has
been conducted with integrating these NLP tools
into workflows commonly used in language doc-
umentation (Gerstenberger et al.,2017a,b,2016).
Since these languages often lack larger annotated
resources, the use of infrastructures other than
rule-based ones has not been common or possible,
but these workflows have been implemented in a
modular fashion that would make enable the inte-
gration of other tools when they become available
or reach needed accuracy.
It has been demonstrated that it is possible to
convert annotations from Giellatekno’s annotation
scheme into the UD scheme (Sheyanova and Ty-
ers,2017), and this has also worked well in our
case, although the exact procedure will continue
to be refined while the token count of the corpus
grows, which will ultimately also reveal rarer and
not-yet-analysed morphosyntactic features. Af-
ter starting with manually editing CoNLL-U files,
the UD Annotatrix tool (Tyers et al.,2018) was
adopted in January 2018, which marked the mid-
point in the project’s timeline. This greatly im-
proved the annotation speed and consistency.
The treebank creation thus consisted of the fol-
lowing steps:
1. Sending Komi sentences to the Giellatekno
morphosyntactic analyser (consisting of an
FST component for morphological categories
and a syntactic component using Constraint
2. Resolving the remaining ambiguity manually
3. Adding the missing syntactic relations manu-
ally to the UD Annotatrix
4. Automatically converting the analyzer’s
XPOS-tags into UPOS-tags and converting
morphological feature tags into their UD
5. Manual correction and verification
The current workflow involves a rather large
amount of manual work. We are interested in
testing various approaches to morphological and
syntactic analysis so that different (rule-based,
statistic-based and hybrid) parsers can eventually
replace the manual work. Some tests have already
been carried out with the dependency parser used
by the Lattice team in the CoNLL-U Shared Task
2017 (Lim and Poibeau,2017) and a follow-up
project (Partanen et al.,2018).
The treebank processing pipeline has been
tied to several scripts and existing tools. The
primary analysisis done within the Giellatekno
toolkit (building on FST Morphology and Con-
straint Grammar), where tokenization, morpho-
logical analysis and rule-based disambiguation are
tied to the script ‘kpvdep’. The script returns a
vislcg3 file that contains all ambiguities left after
the analysis. Once the ambiguities are resolved
manually, the vislcg3 file can be imported into the
UD Annotatrix tool. As a final step, the Giellate-
kno POS-tags and morphological features are con-
verted to follow the UD standard with a Python
script, originally written by Francis Tyers7. A
modified version of the script with the conversion
pattern file is stored in not-to-release folder in the
dev-branch of the Lattice treebank, which is the lo-
cation where all development scripts of both tree-
banks will be maintained.
4 Data Sources and Design Principles
Most of the work on the Komi language is cur-
rently being done by collaborators of FU-Lab8in
Syktyvkar, the capital of the Komi Republic in
Russia. The work of FU-Lab, led by Marina Fe-
dina, has been particularly exceptional, as it has
resulted in a significant number of Komi-language
books being digitalized, made available online9
and converted into a linguistic corpus.10 The cor-
pus is currently 40 million words large, and the
long-term goal is to digitalize all books and other
printed texts ever published in Komi-Zyrian. The
number of publications is approximately 4,500
books, plus tens of thousands of newspaper and
journal issues. A significant portion of the lat-
ter are available in the Public Domain as part of
the Fenno-Ugrica project of the National Library
of Finland11. We have exclusively chosen to use
openly available data for the Lattice treebank in
order to ensure as broad and simple reuse as pos-
sible. The forthcoming releases will include more
genres of text, such as newspaper texts and longer
sections of Wikipedia articles.
All sentences in the Lattice treebank are pre-
sented in the contemporary orthography, even
when they were originally published using vari-
ous earlier Komi writing systems. The propor-
tion of texts originally written in the Molodcov al-
phabet will rise dramatically in the next releases,
as this is probably the most commonly used or-
thography in the upcoming texts. Storing several
orthographic variants may be necessary. Conver-
sion between systems has been carried out using
FU-Lab’s Molodcov converter12. The data orig-
inates from scanned books through text recogni-
tion, currently with loss of page coordinates. This
connects to the question of how to retrieve arbi-
trary information from different sources that can
be connected to the sentence IDs: metadata, page
positions, page images, time codes and audio seg-
We considered it very important to also in-
clude spoken language in the treebank, ideally
eventually covering all dialects. During the last
years, one of the largest research projects inves-
tigating spoken Komi has been the IKDP project,
led by Rogier Blokland in 20142016, which re-
sulted in a large transcribed spoken language cor-
pus (Blokland et al.,2009-2018). The IKDP
treebank contains dialectal texts taken from this
corpus, and since written Komi does not follow
the exact same principles employed in the tran-
scriptions, it seems problematic to mix these ma-
terials together. The orthographic conventions
of the spoken treebank are basically similar to
those used in the recent Komi dialect dictionary
(Beznosikova et al.,2012), with only relatively
subtle differences.What it comes to spoken fea-
tures, corrections are kept and marked with the re-
lation reparandum, but features such as pauses are
not separately marked. The user can access the
original archived audio, which enables a more de-
tailed analysis of spoken phenomena if desired. In
their typographic simplicity, the transcribed texts
are reminiscent of some of the dialect texts pub-
lished previously in various printed text collec-
tions (without the original audio recordings). The
context of the spoken data here is therefore not
only a faithful representation of the spoken sig-
nal, which could include also more exact phonetic
transcriptions, but also the larger landscape of spo-
ken language resources which we would like to in-
tegrate into our NLP ecosystem.
Furthermore, because local Komi speech and
research communities are often conscious of or-
thographic norms, we wanted to draw a clear
boundary between written and spoken representa-
tions. Additionally, the spoken language treebank
contains a large number of Russian phrases due to
code-switching, which makes it to some degree a
multilingual treebank. In the IKDP treebank, Rus-
sian items are currently marked with a language
tag in the misc-field, but verification that Russian
annotations are consistent with monolingual Rus-
sian treebanks is a topic that requires further atten-
The sentences represent running texts and narra-
tives, and, to a great extent, they link together into
continuous larger text units. There are deviations
from this in situations where individual examples
have been selected in order to include instances
of each dependency relation in the treebank. This
was done particularly in the early stages of the
treebanks when it was important to gain more un-
derstanding of how different syntactic relations are
tagged consistently in UD. In the upcoming re-
leases, occurrences of each morphosyntactic phe-
nomena present in Komi may also be hand-picked
from corpora to ensure that they occur in the tree-
banks, the need for which is discussed next.
5 Some Questions Arising From
As the majority of languages in UD are larger
Indo-European languages, the project does not yet
include many examples of languages with very
complex case systems. For example, Komi has
two values of nominal case that were not yet in-
cluded in the earlier documentation, namely the
approximative and the egressive. One issue aris-
ing when comparing current treebanks is the cross-
comparability of the case labels applied. Komi has
two cases that express a path of some sort, tra-
ditionally called prolative and transitive in Komi
and Uralic linguistics. These would match closely
with a case label already in the UD documentation,
perlative, found in Warlpiri, but the fact that there
are two very similar cases already makes the label-
ing problematic. Differences in case labeling are
related to further linguistic analyses that are possi-
ble with the corpora, as well as to parsing accuracy
in multilingual scenarios. In the present treebanks,
the traditional labels for Komi cases are used.
Another theoretical question arising from Komi
concerns the way different cases can be combined,
resulting in "double case marking". For example,
it is entirely possible to use several spatial case
markers linearly combined in one and the same
inflected noun form, and, although this is some-
what rare, examples can be easily found even for
more marginal combinations. For example, the
case suffixes for elative and terminative can com-
bine to mark subtle changes in focus: vengrija-iC-
edý Hungary-ELA-TER ‘all the way from Hungary’
(see e.g. (Bartens,2003, 53). This raises the ques-
tion of how to best annotate this in UD. Of course
each combination could be labeled as a new case,
which is also sometimes seen in the literature on
Komi nominal case (Kuznetsov,2012, p. 374),
but this would greatly increase the number of case
values that need to be documented, and most of
them would be very marginal and specific to in-
dividual languages. Another solution would be to
allow several case affixes to be added to one word
form. However, this would only help when sev-
eral cases are clearly combined and would not be
useful when new spatial cases have emerged from
postpositions, a phenomenon typical of Komi and
Udmurt dialects.
Currently, a large portion of the cases in UD
documentation are used only in Hungarian. In-
cluding more languages with large case systems,
such as Uralic or Northeast Caucasian languages
like Lezgian, would only increase the number of
names for case values used mainly in individ-
ual languages. Eventually this also boils down
to the question of how comparable the cases in
different languages actually are. Haspelmath has
argued convincingly that case labels are valid
only for particular languages (Haspelmath,2009,
510), and the issue probably cannot be explicitly
solved within UD either, but for the sake of us-
ability of treebanks and their suitability for mul-
tilingual NLP applications, some harmonization
would seem desirable. One alternative could be
to create a higher layer of mapping that connects
language-specific labels to broader shared cate-
gories. In this way, both Komi cases expressing
a path could be connected to a concept of move-
ment along a path, but the language-specific nu-
ances would not be lost.
6 Conclusion and Further Work
The written and spoken treebanks have 1389 and
988 tokens, respectively. Due to their small size,
they have not been split into test and development
sets. Based on this experience, it already seems
clear that providing annotations in this framework
has several advantages compared to traditional
methods used in language documentation projects.
The main benefit is the comparability between dif-
ferent languages, and also straightforward licens-
ing and distribution within UD framework.
It can be argued that tagging according the UD
principles is necessarily a compromise, and that
it may not express all particularities of individ-
ual languages. One possible way to solve this
problem is to include further annotations in the
misc-column. Another possible approach would
be to provide different parts of the documentary
corpus with varying degrees of annotations. In
any case, based on our experience, we would
strongly encourage endangered language docu-
mentation projects to take a small segment of their
materials and add to it an additional layer of anno-
tations in the Universal Dependencies framework.
Language documentation data is usually stored in
archives that require access requests. This is not
very compatible with openly available treebanks.
Still, it should be possible to collect small subsets
of materials with the clear intention and permis-
sion for these recordings to be openly licensed, or
to use texts old enough that they are copyright free.
New material is currently being brought into the
Lattice treebank. The main genres obtained from
Fenno-Ugrica collection are newspaper texts, non-
fiction works and schoolbooks. Samples of these,
along with some larger Wikipedia texts, will be in-
cluded in the next UD release 2.3. The next phase
of the IKDP treebank will include individual texts
from the Komi recordings made by Eric Vászolyi
in the 1950s and 1960s (Vászolyi-Vasse,2003),
which the present authors have acquired permis-
sion to re-publish electronically. These texts orig-
inate from a time and place of intensive language
contact between Komi-Zyrian and Tundra Nenets,
what makes them a particularly interesting target
for further study.
One possibly useful addition to the treebank
could be English glosses in the misc-field, since
many linguists are used to working with data from
endangered languages in a format like this. The
English gloss could contain a contextual transla-
tion of the lemma, for example, which would make
the sentences in the treebank much more accessi-
ble to different linguistic audiences.
In terms of size, the target is to reach 5,000
tokens in both treebanks during 2018, and to in-
crease this to 20,000 in the first half of 2019. Our
long-term goal is to create a resource that would
contribute to research on Komi and provide bet-
ter resources for Natural Language Processing of
this language, which has yet to receive sufficient
attention in computational linguistic research.
7 Acknowledgements
We want to thank the reviewers for their use-
ful comments. This work has been developed
in the framework of the LAKME project funded
by a grant from Paris Sciences et Lettres (IDEX
PSL reference ANR-10-IDEX-0001-02). Thierry
Poibeau is partially supported by a RGNF-CNRS
(grant between the LATTICE-CNRS Laboratory
and the Russian State University for the Human-
ities in Moscow). Kyungtae Lim is partially sup-
ported by the ERA-NET Atlantis project. Niko
Partanen’s work has been carried out at the LAT-
TICE laboratory, and besides Partanen, both Ro-
gier Blokland and Michael Rießler collaborate
within the project Language Documentation meets
Language Technology: the Next Step in the De-
scription of Komi, funded by the Kone Founda-
tion. Thanks to Jack Rueter for numerous discus-
sions on Komi and Erzya, and to Alexandra Kell-
ner for proofreading the paper.
Raija Bartens. 2003. Kahden kaasuspäätteen jonoista
suomalais-ugrilaisissa kielissä. In Bakró-Nagy Mar-
ianne and Károly Rédei, editors, Ünnepi kötet Honti
László tiszteletére, pages 46–54. MTA, Budapest.
L.M. Beznosikova, E.A. Ajbabina, N.K. Zaboeva, and
R.I. Kosnyreva. 2012. Komi sërnisikas kyvˇcukör.
Slovar dialektov komi âzyka: v 2-h tomah. Insti-
tut âzyka, literatury i istorii Komi naunogo centra
Uralskogo otdeleniâ Rossijskoj akademii nauk, Syk-
Rogier Blokland, Marina Fedina, Niko Partanen, and
Michael Rießler. 2009-2018. IKDP. In The
Language Archive (TLA): Donated Corpora. Max
Planck Institute for Psycholinguistics, Nijmegen.
Ciprian Gerstenberger, Niko Partanen, and Michael
Rießler. 2017a. Instant annotations in ELAN cor-
pora of spoken and written Komi, an endangered
language of the Barents Sea region. In Proceed-
ings of the 2nd Workshop on the Use of Compu-
tational Methods in the Study of Endangered Lan-
guages, pages 57–66. ACL.
Ciprian Gerstenberger, Niko Partanen, Michael
Rießler, and Joshua Wilbur. 2016. Utilizing
language technology in the documentation of
endangered Uralic languages. Northern European
Journal of Language Technology, 4:29–47.
Ciprian Gerstenberger, Niko Partanen, Michael
Rießler, and Joshua Wilbur. 2017b. Instant anno-
tations: Applying NLP methods to the annotation
of spoken language documentation corpora. In
Proceedings of the 3rd International Workshop on
Computational Linguistics for Uralic languages,
pages 25–36. ACL.
Martin Haspelmath. 2009. Terminology of case. In
Andrew Spencer and Andrej L. Malchukov, editors,
The Oxford handbook of case, pages 505–517. OUP,
Nikolaus Himmelmann. 2006. Language documenta-
tion: What is it and what is it good for? In Jost Gip-
pert, Ulrike Mosel, and Nikolaus Himmelmann, ed-
itors, Essentials of Language Documentation, pages
1–30. Mouton de Gruyter, Berlin.
Nikolaus P. Himmelmann. 1998. Documentary and de-
scriptive linguistics. Linguistics, 36:161–195.
Alexandra Kozhukhar. 2017. Universal dependencies
for Dargwa Mehweb. In Proceedings of the Fourth
International Conference on Dependency Linguis-
tics, pages 92–99. ACL.
Nikolay Kuznetsov. 2012. Matrix of cognitive do-
mains for Komi local cases. Journal of Estonian and
Finno-Ugric Linguistics, 3(1):373–394.
KyungTae Lim, Niko Partanen, and Thierry Poibeau.
2018. Multilingual Dependency Parsing for Low-
Resource Languages: Case Studies on North Saami
and Komi-Zyrian. In Proceedings of the Eleventh
International Conference on Language Resources
and Evaluation. ELRA.
KyungTae Lim and Thierry Poibeau. 2017. A sys-
tem for multilingual dependency parsing based on
bidirectional LSTM feature representations. In Pro-
ceedings of the CoNLL 2017 Shared Task: Multilin-
gual Parsing from Raw Text to Universal Dependen-
cies, pages 63–70. ACL.
Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond
Trosterud, and Francis M Tyers. 2014. Open-source
infrastructures for collaborative work on under-
resourced languages. In Proceedings of the Ninth
International Conference on Language Resources
and Evaluation, pages 71–77. ELRA.
Joakim Nivre, Mitchell Abrams, Željko Agi´
c, Lars
Ahrenberg, Lene Antonsen, Maria Jesus Aranz-
abe, Gashaw Arutie, Masayuki Asahara, Luma
Ateyah, Mohammed Attia, Aitziber Atutxa, Lies-
beth Augustinus, Elena Badmaeva, Miguel Balles-
teros, Esha Banerjee, Sebastian Bank, Verginica
Barbu Mititelu, John Bauer, Sandra Bellato, Kepa
Bengoetxea, Riyaz Ahmad Bhat, Erica Biagetti,
Eckhard Bick, Rogier Blokland, Victoria Bobicev,
Carl Börstell, Cristina Bosco, Gosse Bouma, Sam
Bowman, Adriane Boyd, Aljoscha Burchardt, Marie
Candito, Bernard Caron, Gauthier Caron, Gül¸sen
glu Eryi˘
git, Giuseppe G. A. Celano, Savas
Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho,
Jayeol Chun, Silvie Cinková, Aurélie Collomb,
grı Çöltekin, Miriam Connor, Marine Courtin,
Elizabeth Davidson, Marie-Catherine de Marneffe,
Valeria de Paiva, Arantza Diaz de Ilarraza, Carly
Dickerson, Peter Dirix, Kaja Dobrovoljc, Tim-
othy Dozat, Kira Droganova, Puneet Dwivedi,
Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž
Erjavec, Aline Etienne, Richárd Farkas, Hector
Fernandez Alcalde, Jennifer Foster, Cláudia Fre-
itas, Katarína Gajdošová, Daniel Galbraith, Mar-
cos Garcia, Moa Gärdenfors, Kim Gerdes, Filip
Ginter, Iakes Goenaga, Koldo Gojenola, Memduh
Gökırmak, Yoav Goldberg, Xavier Gómez Guino-
vart, Berta Gonzáles Saavedra, Matias Grioni, Nor-
munds Gr¯
ıtis, Bruno Guillaume, Céline Guillot-
Barbance, Nizar Habash, Jan Hajiˇ
c, Jan Hajiˇ
c jr.,
Linh Hà M˜
y, Na-Rae Han, Kim Harris, Dag Haug,
Barbora Hladká, Jaroslava Hlaváˇ
cová, Florinel
Hociung, Petter Hohle, Jena Hwang, Radu Ion,
Elena Irimia, Tomáš Jelínek, Anders Johannsen,
Fredrik Jørgensen, Hüner Ka¸sıkara, Sylvain Ka-
hane, Hiroshi Kanayama, Jenna Kanerva, Tolga
Kayadelen, Václava Kettnerová, Jesse Kirchner,
Natalia Kotsyba, Simon Krek, Sookyoung Kwak,
Veronika Laippala, Lorenzo Lambertino, Tatiana
Lando, Septina Dian Larasati, Alexei Lavrentiev,
John Lee, Phng Lê H`
ông, Alessandro Lenci, Saran
Lertpradit, Herman Leung, Cheuk Ying Li, Josie
Li, Keying Li, KyungTae Lim, Nikola Ljubeši´
Olga Loginova, Olga Lyashevskaya, Teresa Lynn,
Vivien Macketanz, Aibek Makazhanov, Michael
Mandl, Christopher Manning, Ruli Manurung,
alina M˘
anduc, David Mareˇ
cek, Katrin Marhei-
necke, Héctor Martínez Alonso, André Martins, Jan
Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo
Mendonça, Niko Miekka, Anna Missilä, C˘
Mititelu, Yusuke Miyao, Simonetta Montemagni,
Amir More, Laura Moreno Romero, Shinsuke
Mori, Bjartur Mortensen, Bohdan Moskalevskyi,
Kadri Muischnek, Yugo Murawaki, Kaili Müürisep,
Pinkey Nainwani, Juan Ignacio Navarro Horñi-
acek, Anna Nedoluzhko, Gunta Nešpore-B¯
Lng Nguy˜
ên Thi
., Huy`
ên Nguy˜
ên Thi
.Minh, Vi-
taly Nikolaev, Rattima Nitisaroj, Hanna Nurmi,
Stina Ojala, Adédayo
.Olúòkun, Mai Omura, Petya
Osenova, Robert Östling, Lilja Øvrelid, Niko
Partanen, Elena Pascual, Marco Passarotti, Ag-
nieszka Patejuk, Siyao Peng, Cenel-Augusto Perez,
Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily
Pitler, Barbara Plank, Thierry Poibeau, Mar-
tin Popel, Lauma Pretkalnin
,a, Sophie Prévost,
Prokopis Prokopidis, Adam Przepiórkowski, Ti-
ina Puolakainen, Sampo Pyysalo, Andriela Rääbis,
Alexandre Rademaker, Loganathan Ramasamy,
Taraka Rama, Carlos Ramisch, Vinit Ravishankar,
Livy Real, Siva Reddy, Georg Rehm, Michael
Rießler, Larissa Rinaldi, Laura Rituma, Luisa
Rocha, Mykhailo Romanenko, Rudolf Rosa, Da-
vide Rovati, Valentin Roca, Olga Rudina, Shoval
Sadde, Shadi Saleh, Tanja Samardži´
c, Stephanie
Samson, Manuela Sanguinetti, Baiba Saul¯
Yanin Sawanakunanon, Nathan Schneider, Sebas-
tian Schuster, Djamé Seddah, Wolfgang Seeker,
Mojgan Seraji, Mo Shen, Atsuko Shimada, Muh
Shohibussirri, Dmitry Sichinava, Natalia Silveira,
Maria Simi, Radu Simionescu, Katalin Simkó,
Mária Šimková, Kiril Simov, Aaron Smith, Is-
abela Soares-Bastos, Antonio Stella, Milan Straka,
Jana Strnadová, Alane Suhr, Umut Sulubacak,
Zsolt Szántó, Dima Taji, Yuta Takahashi, Takaaki
Tanaka, Isabelle Tellier, Trond Trosterud, Anna
Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Ue-
matsu, Zdeˇ
nka Urešová, Larraitz Uria, Hans Uszko-
reit, Sowmya Vajjala, Daniel van Niekerk, Gertjan
van Noord, Viktor Varga, Veronika Vincze, Lars
Wallin, Jonathan North Washington, Seyi Williams,
Mats Wirén, Tsegay Woldemariam, Tak-sum Wong,
Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu,
ek Žabokrtský, Amir Zeldes, Daniel Zeman,
Manying Zhang, and Hanzhi Zhu. 2018. Univer-
sal dependencies 2.2. LINDAT/CLARIN digital li-
brary at the Institute of Formal and Applied Linguis-
tics (ÚFAL), Faculty of Mathematics and Physics,
Charles University.
Niko Partanen, KyungTae Lim, Michael Rießler, and
Thierry Poibeau. 2018. Dependency parsing of
code-switching data with cross-lingual feature rep-
resentations. In Proceedings of the 4th International
Workshop on Computational Linguistics for Uralic
languages, pages 1–17. ACL.
Jack Rueter and Francis Tyers. 2018. Towards an open-
source universal-dependency treebank for Erzya.
In Proceedings of the 4th International Workshop
on Computational Linguistics for Uralic languages,
pages 106–118. ACL.
Mariya Sheyanova and Francis M. Tyers. 2017. Anno-
tation schemes in North Sámi dependency parsing.
In Proceedings of the Third Workshop on Computa-
tional Linguistics for Uralic Languages, pages 66–
75. ACL.
Francis M Tyers, Mariya Sheyanova, and
Jonathan North Washington. 2018. UD Annotatrix:
An annotation tool for universal dependencies. In
Proceedings of the 16th International Workshop on
Treebanks and Linguistic Theories, pages 10–17.
Eric Vászolyi-Vasse. 2003. Syrjaenica: Narratives,
folklore and folk poetry from eight dialects of the
Komi language. Vol. 1, Upper Izhma, Lower Ob,
Kanin Peninsula, Upper Jusva, Middle Inva, Udora.
Savariae, Szombathely.
... They adopted a multilingual parsing approach (Lim and Poibeau, 2017) and used Russian and Komi monolingual training data with bilingual Komi-Russian word embeddings. Later, this treebank expanded into the Komi-Zyrian IKDP treebank (Partanen et al., 2018a). ...
... Hence, training of the latter two treebanks are on out-of-domain data. For Kpv-Ru which includes Komi-Russian code-switching, we train the models on Komi-Zyrian Lattice UD Treebank (Partanen et al., 2018a) of monolingual Komi data. The first 562 sentences in Komi-Zyrian Lattice are used for training, the remaining 100 are used for development. ...
Conference Paper
Full-text available
Code-switching dependency parsing stands as a challenging task due to both the scarcity of necessary resources and the structural difficulties embedded in code-switched languages. In this study, we introduce novel sequence labeling models to be used as auxiliary tasks for dependency parsing of code-switched text in a semi-supervised scheme. We show that using auxiliary tasks enhances the performance of an LSTM-based dependency parsing model and leads to better results compared to an XLM-R-based model with significantly less computational and time complexity. As the first study that focuses on multiple code-switching language pairs for dependency parsing, we acquire state-of-the-art scores on all of the studied languages. Our best models outperform the previous work by 7.4 LAS points on average.
... Despite clear advantages of the UD framework for annotating CS treebanks, the annotation of multiple languages in a single treebank needs additional considerations that have not been studied before. Although there have been a few UD treebanks with code-switching (Bhat et al. 2018;Partanen et al. 2018;Seddah et al. 2020;Braggaar and van der Goot 2021), the papers describing these treebanks do not document or discuss the code-switching aspects of the annotation process, except for a brief section in (Braggaar and van der Goot 2021). ...
... The major annotation augmentation is the language IDs assigned to each token. Komi-Zyrian IKDP (Partanen et al. 2018) consists of spoken language, and some utterances include Russian phrases. In those utterances mixed and Russian tokens are marked with respective language IDs, and the Russian syntax is applied. ...
Full-text available
This paper presents the SAGT Turkish–German code-switching treebank, and observations and annotation challenges we encountered during its development. The treebank consists of transcriptions of bilingual conversations annotated with several layers: language IDs, lemmas, POS tags, morphological features, and dependency relations. The annotations follow the Universal Dependencies annotation scheme and the conventions used in monolingual treebanks as much as possible. We present and discuss a number of issues that arise because of the need for consistent multilingual annotation within a single treebank, as well as the informal language, which is where code-switching is observed most. Besides proposing solutions to these issues, we present some observations about code-switching phenomena that are only possible to observe in a data set with rich linguistic annotation. The treebank was annotated with a focus on quality of annotations through an iterative process of detecting and correcting annotation errors. We also present quantitative measures for indication of annotation quality. The code-switching treebank created in this study is released to the public through Universal Dependencies repositories.
... Unesco classifies these languages as definitely endangered (Moseley, 2010). In terms of NLP, these languages have FSTs (Rueter et al., 2020, Universal Dependencies Treebanks (Partanen et al., 2018;Rueter and Tyers, 2018) (excluding Udmurt) and constraint grammars available in Giella repositories (Moshagen et al., 2014). For some of the languages, there have also been efforts in employing neural models in disambiguation (Ens et al., 2019) and morphological tasks . ...
Full-text available
In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56\% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.
... Unesco classifies these languages as definitely endangered (Moseley, 2010). In terms of NLP, these languages have FSTs (Rueter et al., 2020, Universal Dependencies Treebanks (Partanen et al., 2018;Rueter and Tyers, 2018) (excluding Udmurt) and constraint grammars available in Giella repositories (Moshagen et al., 2014). For some of the languages, there have also been efforts in employing neural models in disambiguation (Ens et al., 2019) and morphological tasks . ...
Conference Paper
Full-text available
In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word em-beddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.
... In recent years, at least within the Uralic language family, we have seen new treebanks emerging in languages with closely related siblings that already have an existing treebank. Examples of such languages are Skolt Saami, in relation to Northern Saami (Tyers and Sheyanova, 2017), Komi-Permyak, in relation to Komi-Zyrian (Partanen et al., 2018), or Moksha in relation to Erzya (Rueter and Tyers, 2018). Although the entirety of Uralic languages is still not fully represented within the Universal Dependencies project, the situation has improved in many ways since the last survey on the state of this language family in UD was conducted (Partanen and Rueter, 2019). ...
Conference Paper
Full-text available
This study discusses the way different numerals and related expressions are currently annotated in the Universal Dependencies project, with specific focus on the Uralic language family. We analyse different annotation conventions between individual treebanks, and aim to highlight some areas where further development work and systematization could prove beneficial. At the same time the Universal Dependencies project already offers a wide range of conventions to mark nuanced variation in numerals and counting expressions, and the harmonization of conventions between different languages could be the next step to take. The discussion here refers to the UD version 2.8., and some differences found may already be harmonized in the version 2.9.
... However, HIENCS does not consider intra-word CS, and SAGT simply tags the intra-word CS with the MIXED tag, agnostic of what language codes are inside the token. UD Komi-Zyrian Lattice (Partanen et al., 2018), a UD treebank of another minority language of Russia, also explicitly annotates Russian words by specifying as OrigLang=ru, but their CS segments are unclear. NMCTT differs from these corpora by tagging each intra-word CS segment with a language code, allowing for more flexibility and expressiveness in the language tagging. ...
Conference Paper
Full-text available
This paper introduces a new Universal Dependencies treebank for the Tatar language named NMCTT. A significant feature of the corpus is that it includes code-switching (CS) information at a morpheme level, given the fact that Tatar texts contain intra-word CS between Tatar and Russian. We first outline NMCTT with a focus on differences from other treebanks of Turkic languages. Then, to evaluate the merit of the CS annotation, this study concisely reports the results of a language identification task implemented with Conditional Random Fields that considers POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the proposed annotation scheme introduced in NMCTT can improve the performance of the subword-level language identification. This annotation scheme for CS is not only universally applicable to languages with CS, but also shows a possibility to employ morphosyntactic information for CS-related downstream tasks.
... Apart from Livonian, these languages have received some digital language documentation interest. Erzya (Rueter and Tyers, 2018) and Komi-Zyrian (Partanen et al., 2018) have small Universal Dependencies tree banks and morphological transducers . ...
Conference Paper
Full-text available
Many endangered Uralic languages have multilingual machine readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs, for instance, the Livonian dictionary has some translations to Finnish, Lat-vian and Estonian, and the Komi-Zyrian dictionary has some translations to Finnish, En-glish and Russian. We utilize graph-based approaches to augment such dictionaries by predicting new translations to existing and new languages based on different dictionaries for endangered languages and Wiktionar-ies. Our study focuses on the lexical resources for Komi-Zyrian (kpv), Erzya (myv) and Livo-nian (liv). We evaluate our approach by human judges fluent in the three endangered languages in question. Based on the evaluation, the method predicted good or acceptable translations 77% of the time. Furthermore, we train a neural prediction model to predict the quality of the automatically predicted translations with an 81% accuracy. The resulting extensions to the dictionaries are made available on the online dictionary platform used by the speakers of these languages.
... Sentences contain filler words and may have abrupt boundaries. Sources range from elicited speech of native speakers (Komi Zyrian-IKDP by Partanen et al., 2018) to radio program transcriptions (Frisian Dutch-Fame by Braggaar and van der Goot, 2021). ...
Full-text available
This work provides the first in-depth analysis of genre in Universal Dependencies (UD). In contrast to prior work on genre identification which uses small sets of well-defined labels in mono-/bilingual setups, UD contains 18 genres with varying degrees of specificity spread across 114 languages. As most treebanks are labeled with multiple genres while lacking annotations about which instances belong to which genre, we propose four methods for predicting instance-level genre using weak supervision from treebank metadata. The proposed methods recover instance-level genre better than competitive baselines as measured on a subset of UD with labeled instances and adhere better to the global expected distribution. Our analysis sheds light on prior work using UD genre metadata for treebank selection, finding that metadata alone are a noisy signal and must be disentangled within treebanks before it can be universally applied.
Full-text available
There are two main written Komi varieties , Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Syktyvkar and Kudymkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages. Дженыта висьталӧм Коми кылын кык гижан кыв: пермяцкӧй да зырянскӧӥ. Öтамӧд коласын нія вежӧртанаӧсь, но аркмисӧ нія разнӧй коми диалекттэзісь. Пермяцкӧй кыв олӧ Кудымкар лапӧлын, а зырянскӧӥ-Сыктывкар ладорын. Пермяцкӧй да зырянскӧй литературнӧй кыввезын эм уна ӧткодьыс, ӧткодьӧн лоӧ и ыджыт тор лексикаын, но ны диалектнӧй чертаэзлӧн пантасьӧмыс ӧддьӧн гардчӧм. Эта статьяын мийӧ видзӧтам эна кык кывлісь ассямасӧ сы ладорсянь, мый ковсяс вежны лӧсьӧтӧм зырянскӧй вычислительнӧй ресурсісь, медбы керны сыись пермяцкӧйӧ. Медодз энӧ вежсьӧммесӧ колӧ керны FST-ын, но мийӧ сідзжӧ видзӧтам, кыдз FST лӧсялӧ Быдкодь Йитсьӧммезлӧн схемаӧ морфология ладорсянь.
Full-text available
This paper explores quantitative results based on theoretical assumptions related to the predictions on N-merge systems (Rizzi 2016) ranked from minimum to a maximum of complexity in terms of the computational devices and derivational operations they require. We investigate the nature of external arguments focussing on 2-merge systems (two elements of the lexicon merge and the created unit is again merged with a further element directly extracted from the lexicon) and 3-merge systems (merge two elements created by previous operations of merge). We add a quantitative dimension to the established qualitative dimension discussed in the theory (Rizzi 2016) by investigating large-scale corpora representative of three populations of speakers: adult grammar (102 treebanks/101 languages), typically developing children (2 corpora/English and Chinese) and children with atypical development (1 corpus). The results confirm the predictions in Rizzi (2016): every language in our data set exploits 3-merge systems and less complex systems are the preferred options in early grammars.
Kahden kaasuspäätteen jonoista suomalais-ugrilaisissa kielissä
  • Raija Bartens
Raija Bartens. 2003. Kahden kaasuspäätteen jonoista suomalais-ugrilaisissa kielissä. In Bakró-Nagy Marianne and Károly Rédei, editors, Ünnepi kötet Honti László tiszteletére, pages 46-54. MTA, Budapest.
UD Annotatrix: An annotation tool for universal dependencies
  • Mariya Francis M Tyers
  • Jonathan North Sheyanova
  • Washington
Francis M Tyers, Mariya Sheyanova, and Jonathan North Washington. 2018. UD Annotatrix: An annotation tool for universal dependencies. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, pages 10-17. ACL.
Syrjaenica: Narratives, folklore and folk poetry from eight dialects of the Komi language
  • Eric Vászolyi-Vasse
Eric Vászolyi-Vasse. 2003. Syrjaenica: Narratives, folklore and folk poetry from eight dialects of the Komi language. Vol. 1, Upper Izhma, Lower Ob, Kanin Peninsula, Upper Jusva, Middle Inva, Udora. Savariae, Szombathely.