
Towards an open-source universal-dependency treebank for Erzya

Authors: Jack Rueter, Francis Tyers
Proceedings of the 4th International Workshop for Computational Linguistics for Uralic Languages (IWCLUL 2018), pages 108–120,
Helsinki, Finland, January 8–9, 2018. © 2018 Association for Computational Linguistics
   
  
 
  
   

jack.rueter@helsinki.fi
 
 
  

ftyers@hse.ru
Abstract
          
          
             
           
          
          
            
          
           
    
Tiivistelmä
       
     
            
      
        
       
        
      
        
 
Abstract
        
       
      
      
        
         
       
       
       
       
             
http://creativecommons.org/licenses/by/4.0/
        
        
        
     
1 Introduction
          
           
          
            
          
 
         
            
            
         
           
           
           
              
   
            
           
          
             
        
            
     
             
           
              
               
            
              
         
2 Baground
2.1 Erzya
Erzya            
              
http://universaldependencies.org/
http://www.glossary.sil.org/
http://wals.info/
http://www.helsinki.fi/~kopotev/finnish_corpora_eng.pdf
             
 http://urn.fi/urn:nbn:fi:lb-2016012202
       http://urn.fi/urn:nbn:fi:lb-201407306
Case Denite Form Function
Nom  ś t́ńe   
 
Nom    
  
  
Nom  Ozo  
Gen     
   
  
Gen  Ońt́  
  
Ine  sO  

Note that, with the exception of the third person singular possessive suffix, generally no distinctions are made for number or for genitive/nominative marking in the possessive declension.
              
             
             
              
          
         
         
           
           
           
        
          
              
          
             
        
            
   nsubjroot      
        
https://efo.revues.org/1829
http://giellatekno.uit.no/cgi/index.myv.eng.html
    

     
NOUN NOUN DET NOUN ADV PRON PUNCT
     
     
    
    
 


 
        
3 Methodology
3.1 Corpus
              
            
             

Document Description Sentences Tokens Av. length
Valskeń gudok      
Kirdažt        
Veĺeń vajgeĺt́       
Pićipalakst      
Lažnɨća Sura       
Separate    
  
        
         
              
           
       
              
             
            appos  conj
         obl   
            
              
      nmod  nmod:poss   
              
            
               

4 Annotation guidelines
              
              
obl nmod nmod:poss obj root nsubj
Case=Nom|Definite=Def     
Case=Nom|Definite=Ind     
Case=Nom|Number[psor]=Sing|Person[psor]=3     
Case=Gen|Definite=Ind    
Case=Gen|Definite=Def    
Case=Dat|Definite=Ind   
Case=Dat|Definite=Def   
Case=Ine|Definite=Ind   
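
As a rough illustration of how the feature bundles in the table above can be read programmatically, the following Python sketch splits a feature string of the form used in the row labels (Key=Value pairs separated by "|") into a dictionary. The names parse_feats, declension and ROW_LABELS, as well as the three declension labels, are assumptions introduced here for illustration and are not part of the treebank or of any tool described in this paper.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a feature string such as 'Case=Nom|Definite=Def' into a dict."""
    if not feats or feats == "_":  # "_" marks an empty FEATS column in CoNLL-U
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def declension(feats: Dict[str, str]) -> str:
    """Hypothetical helper: name the declension type implied by the features."""
    # Possessor features carry the layer suffix [psor], e.g. Person[psor]=3.
    if any(name.endswith("[psor]") for name in feats):
        return "possessive"
    return "definite" if feats.get("Definite") == "Def" else "indefinite"


# Row labels copied from the table above.
ROW_LABELS = [
    "Case=Nom|Definite=Def",
    "Case=Nom|Definite=Ind",
    "Case=Nom|Number[psor]=Sing|Person[psor]=3",
    "Case=Gen|Definite=Ind",
    "Case=Gen|Definite=Def",
    "Case=Dat|Definite=Ind",
    "Case=Dat|Definite=Def",
    "Case=Ine|Definite=Ind",
]

for label in ROW_LABELS:
    feats = parse_feats(label)
    print(feats["Case"], declension(feats))
```

Grouping the rows by the Case value and the declension label in this way is only one possible reading of the table; the mapping from feature bundles to dependency relations itself is not reproduced here.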
          
 
              
            
            
          
            
         
4.1 Number
              
          
          
                
           
             
        
         
    



  
           
            
       



  
            
       



  
             
            
          
            

4.2 Copula and polarity
          
            
            
        araś    
           
           avoĺ  
         apak    
     avoĺ-    iĺa-
    a         aux:neg
        a  avoĺ      
Proh/Opt Ind.Prt1 Cnd Prc.Neg Part.Neg Part.Neg.Emp
/iĺa-/     
  
              
          
           
    
    a          
   avoĺ          
            
   iĺa-     avoĺ    
         
  iĺa- Mood=Proh         
              
4.3 Dependent copula morphology
          
             
             
            
           
            
  
Tense Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Nonpast mon odan/ ton odat/ son od/ miń odtano/ tɨń odtado/ sɨń odt/
Prt2 mon odoľiń/ ton odoľiť/ son odoľ/ miń odoľińek/ tɨń odoľiďe/ sɨń odoľť/
       
      uĺń-   uĺ-     
            nonpast 
prt2          
   prt2         nomen agentis  
prt2          
         
            
          
          
             
 
           
            
              
           
             
         
 
 
PRON NOUN
 
 
 

 


  

  
PRON PRON NOUN PUNCT
  
  
 

 

  
  
           
      lomań-eś        
         ki-jat   
            
        
4.4 Further auxiliaries
             
           
               
            
            mońeń   

   
PRON AUX VERB NOUN PUNCT
   
   
  
  

  
     
            
             
          
            
  
4.5 Compound nouns
             
             
           
              
               
            
           

 
  
ADJ NOUN NOUN
  
  
  
  
 
   

  
ADJ NOUN NOUN
  
  
  
  
 
   
             
              veď
vedra
4.6 Noun head ellipsis
             
           
              
    
                
           
        iśťamo    
            
             
         
 
   
DET ADJ ADJ NOUN
   
   
   
   
 
    

   
DET ADJ ADJ NOUN
   
   
   
   
 
   

   
NOUN DET ADJ NOUN
   
   
   
   
  
   
            
               
               
        
            
          śe  
            
          



  



  
         
         
   



  
           
              
           
 
 http://www.glossary.sil.org/term/elliptical-construction
4.7 Numerals
           
           
             
    nummod       
          
            
           
           
      tout     
  advcl  acl  det        
     

 
 
   

 
 
  

  
  
    

 
 
     
5 Future work
               
             
           
           
           
             
          
          
          
             
  
6 In conclusion
              
          
         
           
         
              

Anowledgements
             
             
             

           
        
              
       
References
   Mordvalaiskielten rakenne ja kehitys    Suomalais-
Ugrilaisen Seuran Toimituksia   
   The negation of stative relation clauses in the Mordvin languages 
   Suomalais-Ugrilaisen Seuran Toimituksia  

   Development of Mordvin Definite Conjugation   
Suomalais-Ugrilaisen Seuran Toimituksia   
   Moksha non-verbal predication    
   
   Semantics 2   
   Symmetric and Asymmetric Standard Negation 
       

   Adnominal Person in the Morphological System of Erzya  
 Suomalais-Ugrilaisen Seuran Toimituksia   
   On quantication in the Erzya language   

   Homonymy in the Uralic Two-Argument Agreement Paradigms
   Suomalais-Ugrilaisen Seuran Toimituksia 
 
   Nonverbal Predication in Erzya    
          
        Proceedings of the
16th International Workshop on Treebanks and Linguistic Theories   
          
     3rd International Conference on Turkic Languages Processing, (TurkLang 2015)  
          The Prohibitive       

    GRAMMATIK DER ERSAMORDWINISCHEN SPRACHE 
       
       

        
          
 Эрзянь кель. Синтаксис: тонавтнемапель  
 
          
Эрзянь келень орфографиянь валкс   

... As far as we know, there are no published large parallel corpora or NMT systems for Erzya. Rueter and Tyers (2018) develop an Erzya treebank with a few hundred translations to English and Finnish. Архангельский (2019) present an Erzya web corpus 5 along with the way it was collected, but the corpus is available only via the web interface. ...
... For model evaluation, we prepare a held-out corpus of 3000 aligned Erzya-Russian sentences from 6 diverse sources: the Bible, Erzya folk tales (Sheyanova, 2017), the Soviet 1938 constitution, descriptions of folk children's games (Брыжинский, 2009), modern Erzya fiction and poetry, and Wikipedia. To evaluate English and Finnish translation, we use translations from the Erzya universal-dependency treebank (Rueter and Tyers, 2018): 441 sentence pairs for en, and 309 for fi. We split all these sets into development and test parts, and report the results on the test set. ...
Preprint
Full-text available
We present the first neural machine translation system for translation between the endangered Erzya language and Russian and the dataset collected by us to train and evaluate it. The BLEU scores are 17 and 19 for translation to Erzya and Russian respectively, and more than half of the translations are rated as acceptable by native speakers. We also adapt our model to translate between Erzya and 10 other languages, but without additional parallel data, the quality on these directions remains low. We release the translation models along with the collected text corpus, a new language identification model, and a multilingual sentence encoder adapted for the Erzya language. These resources will be available at https://github.com/slone-nlp/myv-nmt.
... The UD of the endangered languages can be obtained directly from Universal Dependencies' website 4 . At the time of writing, 1,690, 167, 104 and 435 sen-tences were in Erzya's [44], Moksha's [42], Skolt Sami's [30] and Komi-Zyrian's UDs 5 [32], respectively. These numbers highlight the insufficient amount of data present for training machine learning or NLP models for endangered languages. ...
... We apply them in the task of sentiment analysis. We hand pick all positive and negative sentences from the Erzya treebank [44] based on the translations provided in the treebank in English and Finnish. This constitutes our Erzya test corpus that contains 23 negative sentences and 22 positive sentences, giving us a total of 45 sentences. ...
Preprint
Full-text available
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
... In this context, it also has to be taken into account that several other smaller Uralic languages have had their own treebanks introduced in the past couple of years, e.g. Erzya (Rueter and Tyers, 2018), Karelian (Pirinen, 2019) and North Saami (Tyers and Sheyanova, 2017). This kind of work that concentrates more on manually annotated corpora complements the descriptive work on morphological analyzers extremely well. ...
Article
There are two main written Komi varieties, Zyrian and Permyak. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Syktyvkar and Kudymkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages.
Дженыта висьталӧм Коми кылын кык гижан кыв: пермяцкӧй да зырянскӧӥ. Öтамӧд коласын нія вежӧртанаӧсь, но аркмисӧ нія разнӧй коми диалекттэзісь. Пермяцкӧй кыв олӧ Кудымкар лапӧлын, а зырянскӧӥ-Сыктывкар ладорын. Пермяцкӧй да зырянскӧй литературнӧй кыввезын эм уна ӧткодьыс, ӧткодьӧн лоӧ и ыджыт тор лексикаын, но ны диалектнӧй чертаэзлӧн пантасьӧмыс ӧддьӧн гардчӧм. Эта статьяын мийӧ видзӧтам эна кык кывлісь ассямасӧ сы ладорсянь, мый ковсяс вежны лӧсьӧтӧм зырянскӧй вычислительнӧй ресурсісь, медбы керны сыись пермяцкӧйӧ. Медодз энӧ вежсьӧммесӧ колӧ керны FST-ын, но мийӧ сідзжӧ видзӧтам, кыдз FST лӧсялӧ Быдкодь Йитсьӧммезлӧн схемаӧ морфология ладорсянь.
... We [sms] have UD treebanks (Rueter and Tyers, 2018; Rueter, 2018; Pirinen, 2019; Rueter, 2014; Rueter et al., 2020; Partanen et al., 2018; Sheyanova and Tyers, 2017), but these are considerably smaller in size. Although none of these languages are officially supported by any of the language models we evaluate, we train crosslingual models and find that the models have remarkable crosslingual capabilities. ...
Preprint
Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. When monolingual models are available (currently only et, fi, hu), these perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special efforts toward hyperparameter optimization, yields what appear to be state-of-the-art POS and NER tools for the minority Uralic languages where there is sufficient data for finetuning.
... In short, our method requires only a corpus with OCRed text that we want to automatically correct, a word list, a morphological analyzer and any corpus of error-free text. Since we focus on Finnish only, it is important to note that such resources exist for many endangered Uralic languages as well, as they have extensive XML dictionaries and FSTs available (see Hämäläinen and Rueter, 2018) together with a growing number of Universal Dependencies (Nivre et al., 2016) treebanks such as Komi-Zyrian (Lim et al., 2018), Erzya (Rueter and Tyers, 2018), Komi-Permyak (Rueter et al., 2020) and North Sami (Sheyanova and Tyers, 2017). ...
Conference Paper
Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
... Since 1998, Rueter has been working on the finite-state description of the Erzya and Moksha languages. Most of the applications of this work, however, did not appear until the last decade: in the form of vocabularies and morphology, on the one hand (see Rueter 2014; Moshagen et al. 2014; Hämäläinen 2019), and in the description of Universal Dependencies (Rueter & Tyers 2018) and the task of improving their harmonization (Partanen & Rueter 2019), on the other. With this in mind, it becomes possible to evaluate language research studies with the help of a rule-based translation system in the Apertium project, based on shallow transfer. ...
Preprint
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline the initial problem types to be encountered in the construction of an Apertium-based shallow-transfer machine translation system for the Mordvin language forms. We indicate reference points within Mordvin studies and other parts of Uralic studies as a point of departure for outlining linguistic studies with a means of measuring their own progress and for developing a roadmap for further studies.
Article
This paper explores quantitative results based on theoretical assumptions related to the predictions on N-merge systems (Rizzi 2016) ranked from minimum to a maximum of complexity in terms of the computational devices and derivational operations they require. We investigate the nature of external arguments focussing on 2-merge systems (two elements of the lexicon merge and the created unit is again merged with a further element directly extracted from the lexicon) and 3-merge systems (merge two elements created by previous operations of merge). We add a quantitative dimension to the established qualitative dimension discussed in the theory (Rizzi 2016) by investigating large-scale corpora representative of three populations of speakers: adult grammar (102 treebanks/101 languages), typically developing children (2 corpora/English and Chinese) and children with atypical development (1 corpus). The results confirm the predictions in Rizzi (2016): every language in our data set exploits 3-merge systems and less complex systems are the preferred options in early grammars.
Article
This dissertation is a synchronic description of adnominal person in the highly synthetic morphological system of Erzya as attested in extensive Erzya-language written-text corpora consisting of nearly 140 publications with over 4.5 million words and over 285,000 unique lexical items. Insights for this description have been obtained from several source grammars in German, Russian, Erzya, Finnish, Estonian and Hungarian, as well as from bounteous discussions on the understanding of the language with native speakers and grammarians in 1993–2010. Introductory information includes the discussion of the status of Erzya as a language, the enumeration of phonemes generally used in the transliteration of texts and an in-depth description of adnominal morphology. The reader is then made aware of typological and Erzya-specific work in the study of adnominal-type person. Methods of description draw upon the prerequisite information required in the development of a two-level morphological analyzer, as can be obtained in the typological description of allomorphic variation in the target language. Indication of original author or dialect background is considered important in the attestation of linguistic phenomena, such that variation might be plotted for a synchronic description of the language. The phonological description includes the establishment of a 6-vowel, 29-consonant phoneme system for use in the transliteration of annotated texts, i.e. two phonemes more than are generally recognized, and numerous rules governing allophonic variation in the language. Erzya adnominal morphology is demonstrated to have a three-way split in stem types and a three-layer system of non-derivative affixation. The adnominal-affixation layers are broken into (a) declension (the categories of case, number and deictic marking); (b) nominal conjugation (non-verb grammatical and oblique-case items can be conjugated), and (c) clitic marking. Each layer is given statistical detail with regard to concatenability. Finally, individual subsections are dedicated to the matters of: possessive declension compatibility in the distinction of sublexica; genitive and dative-case paradigmatic defectivity in the possessive declension, where it is demonstrated to be parametrically diverse; and secondary declension, a proposed typology of modifiers without nouns, as compatible with adnominal person.
Väitöskirjatyöni on synkroninen kuvaus ersän kielen monipuolisesta omistusliitteiden käytöstä. Tutkimusaineistona on käytetty ersänkielisiä tekstikorpuksia, jotka koostuvat lähes 140 julkaisusta, yli 4,5 miljoonasta sanasta ja yli 285000 erillisestä sanamuodosta. Kuvauksen pohjana ovat erikieliset ersän kieliopit. Keskusteluilla niin ersää äidinkielenään puhuvien kuin muiden kielioppien kirjoittajien kanssa vuosina 1993–2010 on ollut tärkeä merkitys sille, miten työ kokonaisuudessaan on muotoutunut. Väitöskirjan alkuosassa pohditaan ersän kielen asemaa ja sen tulevaisuutta. Myös ersän äännejärjestelmän kuvaus sekä perusteellinen ja monipuolinen ersän nominaalilausekkeiden rakenteiden kuvaus sisältyy väitöskirjan alkuosaan. Luvussa 1.3. käsitellään persoonatutkimuksen taustaa ja tutkimuksia, jotka koskevat typologiaa ja eri kieliopeissa käsiteltyjä ersän persoonarakenteita. Kuvauksen menetelmissä (luku 2.) on hyödynnetty kielen morfologisen tason kuvausta varten kehitettyä kaksitasomallia, jota voidaan käyttää sanamuotojen morfologisessa analyysissa. Tätä analyysia voidaan taas hyödyntää ersän allomorfisen variaation typologisesta kuvauksesta. Hypoteesina on, että tekstin alkuperäisen kirjoittajan taustaa ja myös murretaustaa koskeva tieto on tärkeä kielellisten ilmiöiden kuvauksissa; näiden tietojen käyttäminen mahdollistaa kielen variaation synkronisen kuvauksen. Fonologiseen kuvaukseen (luku 3.) kuuluu 6-vokaalinen ja 29-konsonanttinen foneemijärjestelmä (2 uutta), jota on käytetty automaattisesti jäsennettyjen tekstien tarkekirjoituksessa. Lisäksi tarjotaan lukuisia sääntöjä, joiden avulla kuvataan allofonista vaihtelua. Ersän nominaalilausekkeiden taivutus esitetään kolmena kerrostumana. Nämä kerrostumat jaetaan (luku 4.2.1.-3.) substiivityyppiseen, johon sisältyy sijan, luvun ja deiksiksen merkintä, (luku 4.2.4.) nominaalikonjugaatioon: tämä koskee ersän nominatiivi- ja oblikvisijaisia nomineja, postpositioita, adverbeja ja infinitiivejä, ja (luku 4.2.5.) partikkelien merkintään. Jokaisen kerrostuman kuvauksessa esitetään tilastollista tietoa siitä, miten eri elementit voidaan liittää toisiinsa. Erikseen käsitellään luvussa 4.3. possessiivitaivutuksessa esiintyviä sijamuotoja suhteessa semanttisiin alileksikoihin, luvussa 4.4. possessiivitaivutuksen genetiivin ja datiivin vajaaparadigmaisuutta ja parametrista eroavaisuutta, ja luvussa 4.5 laajennettua modifioija ilman pääsanoja typologiaa, ja sen yhteen sopivuus adnominaalisen persoonan kanssa.
Article
Standard negation can be defined as the basic way (or ways) a language has for negating declarative verbal main clauses. Negative constructions that fall outside standard negation include the negation of existential, copular or non-verbal clauses, the negation of subordinate clauses, and the negation of non-declarative clauses like imperatives (see chapter 71). These negatives are not taken into account here, but it is of course possible that languages use their standard negation constructions for the negation of these clause types too. This map shows how symmetric and asymmetric standard negation are distributed among the languages of the world. In symmetric negation the structure of the negative is identical to the structure of the affirmative, except for the presence of the negative marker(s). In asymmetric negation the structure of the negative differs from the structure of the affirmative in various other ways too, i.e. there is asymmetry between affirmation and negation. Affirmative and negative structures can be symmetric or asymmetric in two ways: there can be (a)symmetry either between the affirmative and negative constructions, or between the paradigms that the affirmative and negative constructions form. Symmetric negative constructions do not differ from the corresponding affirmative constructions in any other way than by the presence of the negative marker(s), whereas asymmetric negative constructions show structural differences in comparison to the corresponding affirmative constructions. In symmetric paradigms, all (verbal) categories or forms have corresponding affirmative and negative forms, whereas in asymmetric paradigms such one-to-one correspondences do not obtain.
Raija Bartens. 1999. Mordvalaiskielten rakenne ja kehitys, volume 232 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
László Keresztes. 1999. Development of Mordvin Definite Conjugation, volume 233 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
Maria Kholodilova. 2016. Moksha non-verbal predication, Printon, Tallinn, pages 229-259. Uralica Helsingiensia 10.
Jack Rueter. 2013. On quantification in the Erzya language, LINCOM, Muenchen, pages 99-118.
Trond Trosterud. 2006. Homonymy in the Uralic Two-Argument Agreement Paradigms, volume 251 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
Rigina Turunen. 2010. Nonverbal Predication in Erzya. A. S. Pakett, Tallinn.
Francis M. Tyers, Mariya Sheyanova, and Jonathan North Washington. 2018. UD Annotatrix: An annotation tool for Universal Dependencies. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories. page [to appear].