
Towards an open-source universal-dependency treebank for Erzya

Authors: Jack Rueter and Francis Tyers
Proceedings of the 4th International Workshop for Computational Linguistics for Uralic Languages (IWCLUL 2018), pages 108–120, Helsinki, Finland, January 8–9, 2018. © 2018 Association for Computational Linguistics
   
  
 
  
   

jack.rueter@helsinki.fi
 
 
  

ftyers@hse.ru
Abstract
          
          
             
           
          
          
            
          
           
    
Abstract (Finnish: Tiivistelmä)
       
     
            
      
        
       
        
      
        
 
Abstract
        
       
      
      
        
         
       
       
       
       
             
http://creativecommons.org/licenses/by/4.0/
        
        
        
     
1 Introduction
          
           
          
            
          
 
         
            
            
         
           
           
           
              
   
            
           
          
             
        
            
     
             
           
              
               
            
              
         
2 Baground
2.1 Erzya
Erzya            
              
http://universaldependencies.org/
http://www.glossary.sil.org/
http://wals.info/
http://www.helsinki.fi/~kopotev/finnish_corpora_eng.pdf
             
 http://urn.fi/urn:nbn:fi:lb-2016012202
       http://urn.fi/urn:nbn:fi:lb-201407306
Case Denite Form Function
Nom  ś t́ńe   
 
Nom    
  
  
Nom  Ozo  
Gen     
   
  
Gen  Ońt́  
  
Ine  sO  

Note that, with the exception of the third person singular possessive suffix, there is generally no distinction made for number or genitive/nominative marking in the possessive declension.
              
             
             
              
          
         
         
           
           
           
        
          
              
          
             
        
            
    nsubj root
        
https://efo.revues.org/1829
http://giellatekno.uit.no/cgi/index.myv.eng.html
    

     
[Dependency-tree figure; only the part-of-speech row survives: NOUN NOUN DET NOUN ADV PRON PUNCT]
3 Methodology
3.1 Corpus
              
            
             

Document Description Sentences Tokens Av. length
Valskeń gudok      
Kirdažt        
Veĺeń vajgeĺt́       
Pićipalakst      
Lažnɨća Sura       
Separate    
  
        
         
              
           
       
              
             
            appos  conj
         obl   
            
              
      nmod  nmod:poss   
              
            
               

4 Annotation guidelines
              
              
obl nmod nmod:poss obj root nsubj
Case=Nom|Definite=Def     
Case=Nom|Definite=Ind     
Case=Nom|Number[psor]=Sing|Person[psor]=3     
Case=Gen|Definite=Ind    
Case=Gen|Definite=Def    
Case=Dat|Definite=Ind   
Case=Dat|Definite=Def   
Case=Ine|Definite=Ind   
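
The feature bundles listed in the rows above follow the Universal Dependencies FEATS notation, in which `|` separates features and `=` separates a feature name from its value. As a minimal sketch, assuming only that notation (the helper name `parse_feats` is hypothetical and is not part of the treebank tooling), such strings could be unpacked in Python like this:

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a UD FEATS string into a {feature: value} dictionary.

    Hypothetical helper for illustration only. An empty string or the
    CoNLL-U placeholder '_' is treated as "no features".
    """
    if feats in ("", "_"):
        return {}
    # Split on the first '=' only, so layered names such as
    # 'Number[psor]' stay intact as dictionary keys.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature bundles copied from the table rows above.
rows = [
    "Case=Nom|Definite=Def",
    "Case=Nom|Number[psor]=Sing|Person[psor]=3",
    "Case=Ine|Definite=Ind",
]
parsed = [parse_feats(row) for row in rows]

for row, feats in zip(rows, parsed):
    print(row, "->", feats)
```

Keeping the bundles as dictionaries makes it straightforward to filter tokens by a single feature, for example all forms with `Case=Gen`, without string matching on the full bundle.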
          
 
              
            
            
          
            
         
4.1 Number
              
          
          
                
           
             
        
         
    



  
           
            
       



  
            
       



  
             
            
          
            

4.2 Copula and polarity
          
            
            
        araś    
           
           avoĺ  
         apak    
     avoĺ-    iĺa-
    a         aux:neg
        a  avoĺ      
Proh/Opt Ind.Prt1 Cnd Prc.Neg Part.Neg Part.Neg.Emp
/iĺa-/     
  
              
          
           
    
    a          
   avoĺ          
            
   iĺa-     avoĺ    
         
  iĺa- Mood=Proh         
              
4.3 Dependent copula morphology
          
             
             
            
           
            
  
Tense Sg1 Sg2 Sg3 Pl1 Pl2 Pl3
Nonpast mon odan/ ton odat/ son od/ miń odtano/ tɨń odtado/ sɨń odt/
Prt2 mon odoľiń/ ton odoľiť/ son odoľ/ miń odoľińek/ tɨń odoľiďe/ sɨń odoľť/
       
      uĺń-   uĺ-     
            nonpast 
prt2          
   prt2         nomen agentis  
prt2          
         
            
          
          
             
 
           
            
              
           
             
         
 
 
[Dependency-tree figure; only the part-of-speech row survives: PRON NOUN]

[Dependency-tree figure; only the part-of-speech row survives: PRON PRON NOUN PUNCT]
           
      lomań-eś        
         ki-jat   
            
        
4.4 Further auxiliaries
             
           
               
            
            mońeń   

   
[Dependency-tree figure; only the part-of-speech row survives: PRON AUX VERB NOUN PUNCT]
            
             
          
            
  
4.5 Compound nouns
             
             
           
              
               
            
           

 
  
[Dependency-tree figure; only the part-of-speech row survives: ADJ NOUN NOUN]

[Dependency-tree figure; only the part-of-speech row survives: ADJ NOUN NOUN]
             
              veď
vedra
4.6 Noun head ellipsis
             
           
              
    
                
           
        iśťamo    
            
             
         
 
   
[Dependency-tree figure; only the part-of-speech row survives: DET ADJ ADJ NOUN]

[Dependency-tree figure; only the part-of-speech row survives: DET ADJ ADJ NOUN]

[Dependency-tree figure; only the part-of-speech row survives: NOUN DET ADJ NOUN]
            
               
               
        
            
          śe  
            
          



  



  
         
         
   



  
           
              
           
 
 http://www.glossary.sil.org/term/elliptical-construction
4.7 Numerals
           
           
             
    nummod       
          
            
           
           
      tout     
  advcl  acl  det        
     

 
 
   

 
 
  

  
  
    

 
 
     
5 Future work
               
             
           
           
           
             
          
          
          
             
  
6 In conclusion
              
          
         
           
         
              

Anowledgements
             
             
             

           
        
              
       
References
   Mordvalaiskielten rakenne ja kehitys    Suomalais-Ugrilaisen Seuran Toimituksia
   The negation of stative relation clauses in the Mordvin languages
   Suomalais-Ugrilaisen Seuran Toimituksia  

   Development of Mordvin Definite Conjugation
Suomalais-Ugrilaisen Seuran Toimituksia   
   Moksha non-verbal predication    
   
   Semantics 2   
   Symmetric and Asymmetric Standard Negation 
       

   Adnominal Person in the Morphological System of Erzya  
 Suomalais-Ugrilaisen Seuran Toimituksia   
   On quantication in the Erzya language   

   Homonymy in the Uralic Two-Argument Agreement Paradigms
   Suomalais-Ugrilaisen Seuran Toimituksia 
 
   Nonverbal Predication in Erzya    
          
        Proceedings of the
16th International Workshop on Treebanks and Linguistic Theories
          
     3rd International Conference on Turkic Languages Processing (TurkLang 2015)
          The Prohibitive

    GRAMMATIK DER ERSAMORDWINISCHEN SPRACHE 
       
       

        
          
 Эрзянь кель. Синтаксис: тонавтнемапель  
 
          
Эрзянь келень орфографиянь валкс   

... As far as we know, there are no published large parallel corpora or NMT systems for Erzya. Rueter and Tyers (2018) develop an Erzya treebank with a few hundred translations to English and Finnish. Архангельский (2019) present an Erzya web corpus 5 along with the way it was collected, but the corpus is available only via the web interface. ...
... For model evaluation, we prepare a held-out corpus of 3000 aligned Erzya-Russian sentences from 6 diverse sources: the Bible, Erzya folk tales (Sheyanova, 2017), the Soviet 1938 constitution, descriptions of folk children's games (Брыжинский, 2009), modern Erzya fiction and poetry, and Wikipedia. To evaluate English and Finnish translation, we use translations from the Erzya universal-dependency treebank (Rueter and Tyers, 2018): 441 sentence pairs for en, and 309 for fi. We split all these sets into development and test parts, and report the results on the test set. ...
Preprint
Full-text available
We present the first neural machine translation system for translation between the endangered Erzya language and Russian and the dataset collected by us to train and evaluate it. The BLEU scores are 17 and 19 for translation to Erzya and Russian respectively, and more than half of the translations are rated as acceptable by native speakers. We also adapt our model to translate between Erzya and 10 other languages, but without additional parallel data, the quality on these directions remains low. We release the translation models along with the collected text corpus, a new language identification model, and a multilingual sentence encoder adapted for the Erzya language. These resources will be available at https://github.com/slone-nlp/myv-nmt.
... The available corpora of the Erzya language (especially parallel ones) are not numerous. Rueter and Tyers (2018) presented the Erzya corpus with morphosyntactic markup, including the translation of several hundred sentences into English and Finnish. Arkhangelskiy (2019) has compiled a web corpus of the Erzya language (available for download); there is also a corpus of the literary Erzya language, avaliable only for search 5 . ...
... The method itself does not require any training or additional annotated data. However, to evaluate our method, we use the Universal Dependences treebanks for Erzya (Rueter and Tyers, 2018) and Skolt Sami (Nivre et al., 2022). These treebanks have word forms and their correct lemmas for each word in each sentence. ...
Conference Paper
Full-text available
We showcase that ChatGPT can be used to disambiguate lemmas in two endangered languages ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our prompt by providing dictionary translations of the candidate lemmas to a majority language-Finnish in our case. This dictionary augmented generation approach results in 50% accuracy for Skolt Sami and 41% accuracy for Erzya. On a closer inspection, many of the error types were of the kind even an untrained human annotator would make.
... The method itself does not require any training or additional annotated data. However, to evaluate our method, we use the Universal Dependences treebanks for Erzya (Rueter and Tyers, 2018) and Skolt Sami (Nivre et al., 2022). These treebanks have word forms and their correct lemmas for each word in each sentence. ...
Preprint
Full-text available
We showcase that ChatGPT can be used to disambiguate lemmas in two endangered languages ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our prompt by providing dictionary translations of the candidate lemmas to a majority language - Finnish in our case. This dictionary augmented generation approach results in 50\% accuracy for Skolt Sami and 41\% accuracy for Erzya. On a closer inspection, many of the error types were of the kind even an untrained human annotator would make.
... Rule-based morphosyntactic analyzers are being developed for the Erzya and Moksha languages on the Giella infrastructure, where they are automatically rendered reusable as spell checkers and the motor for morphologically savvy dictionaries [Rueter et al. 2020] 24 . Multiple use of the rule-based description extends to the Universal Dependencies projects [see Rueter and Tyers, 2018;Zeman et al. 2023-11], and the shallow-transfer machine translation projects at Apertium [cf Rueter 2020a]. Figures 4a and 4b, above, show how this metadata appears in the right margin of the Korp interface. ...
Article
Description of Mordvin language corpora development at the Language Bank of Finland.
... Apart from structured dictionaries and rule-based tools, we have Universal Dependencies treebanks for Skolt Saami, Moksha, Erzya (Rueter and Tyers, 2018), Komi-Zyrian (Partanen et al., 2018) and Komi-Permyak (Rueter et al., 2020b). These treebanks contain syntactic annotations together with the morphological feature tags of Universal Dependencies. ...
Conference Paper
We present our work towards building an infrastructure for documenting endangered languages, with a focus on Uralic languages in particular. Our infrastructure consists of tools for writing dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools with state-of-the-art neural models, generating training data through rules and lexica.
... In this context, it also has to be taken into account that several other smaller Uralic languages have had their own treebanks introduced in the past couple of years, e.g. Erzya (Rueter and Tyers, 2018), Karelian (Pirinen, 2019) and North Saami (Tyers and Sheyanova, 2017). This kind of work that concentrates more on manually annotated corpora complements the descriptive work on morphological analyzers extremely well. ...
Article
There are two main written Komi varieties, Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Kudymkar and Syktyvkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages. Komi abstract (translated): The Komi language has two written languages, Permyak and Zyrian. They are mutually intelligible, but they arose from different Komi dialects. Permyak is spoken around Kudymkar, and Zyrian around Syktyvkar. The Permyak and Zyrian literary languages have much in common, including a large part of the lexicon, but the overlap of their dialect features is very tangled. In this article we look at the difference between these two languages in terms of what needs to be changed in the existing Zyrian computational resource in order to make a Permyak one from it. First of all, these changes need to be made in the FST, but we also look at how the FST fits the Universal Dependencies scheme with regard to morphology.
Article
This dissertation is a synchronic description of adnominal person in the highly synthetic morphological system of Erzya as attested in extensive Erzya-language written-text corpora consisting of nearly 140 publications with over 4.5 million words and over 285,000 unique lexical items. Insights for this description have been obtained from several source grammars in German, Russian, Erzya, Finnish, Estonian and Hungarian, as well as from extensive discussions on the understanding of the language with native speakers and grammarians in 1993-2010. Introductory information includes the discussion of the status of Erzya as a language, the enumeration of phonemes generally used in the transliteration of texts and an in-depth description of adnominal morphology. The reader is then made aware of typological and Erzya-specific work in the study of adnominal-type person. Methods of description draw upon the prerequisite information required in the development of a two-level morphological analyzer, as can be obtained in the typological description of allomorphic variation in the target language. Indication of original author or dialect background is considered important in the attestation of linguistic phenomena, such that variation might be plotted for a synchronic description of the language. The phonological description includes the establishment of a 6-vowel, 29-consonant phoneme system for use in the transliteration of annotated texts, i.e. two phonemes more than are generally recognized, and numerous rules governing allophonic variation in the language. Erzya adnominal morphology is demonstrated to have a three-way split in stem types and a three-layer system of non-derivative affixation. The adnominal-affixation layers are broken into (a) declension (the categories of case, number and deictic marking); (b) nominal conjugation (non-verb grammatical and oblique-case items can be conjugated), and (c) clitic marking. Each layer is given statistical detail with regard to concatenability. Finally, individual subsections are dedicated to the matters of: possessive declension compatibility in the distinction of sublexica; genitive and dative-case paradigmatic defectivity in the possessive declension, where it is demonstrated to be parametrically diverse; and secondary declension, a proposed typology of "modifiers without nouns", as compatible with adnominal person.
Finnish abstract (translated): My dissertation is a synchronic description of the versatile use of possessive suffixes in the Erzya language. The research material consists of Erzya-language text corpora comprising nearly 140 publications, over 4.5 million words and over 285,000 distinct word forms. The description is based on Erzya grammars written in various languages. Discussions in 1993-2010 with native speakers of Erzya as well as with other grammar writers have played an important role in how the work as a whole has taken shape. The first part of the dissertation considers the status of the Erzya language and its future. It also includes a description of the Erzya sound system and a thorough, wide-ranging description of the structure of Erzya noun phrases. Chapter 1.3 covers the background of research on person, and studies concerning typology and the Erzya person constructions treated in various grammars. The methods of description (Chapter 2) make use of the two-level model developed for describing the morphological level of language, which can be used in the morphological analysis of word forms. This analysis can in turn be used in the typological description of allomorphic variation in Erzya. The hypothesis is that information on the background of a text's original author, including dialect background, is important in describing linguistic phenomena; using this information makes a synchronic description of variation in the language possible. The phonological description (Chapter 3) includes a phoneme system of 6 vowels and 29 consonants (2 of them new), which has been used in the transcription of automatically parsed texts. In addition, numerous rules are provided for describing allophonic variation. The inflection of Erzya noun phrases is presented as three layers. These layers are divided into the substantive type (Chapters 4.2.1-4.2.3), which comprises the marking of case, number and deixis; nominal conjugation (Chapter 4.2.4), which applies to Erzya nominals in the nominative and oblique cases, postpositions, adverbs and infinitives; and particle marking (Chapter 4.2.5). The description of each layer presents statistical information on how the different elements can be combined. Treated separately are the case forms occurring in the possessive declension in relation to semantic sublexica (Chapter 4.3), the paradigmatic defectivity and parametric diversity of the genitive and dative in the possessive declension (Chapter 4.4), and an extended typology of "modifiers without head nouns" and its compatibility with adnominal person (Chapter 4.5).
Article
Standard negation can be defined as the basic way (or ways) a language has for negating declarative verbal main clauses. Negative constructions that fall outside standard negation include the negation of existential, copular or non-verbal clauses, the negation of subordinate clauses, and the negation of non-declarative clauses like imperatives (see chapter 71). These negatives are not taken into account here, but it is of course possible that languages use their standard negation constructions for the negation of these clause types too. This map shows how symmetric and asymmetric standard negation are distributed among the languages of the world. In symmetric negation the structure of the negative is identical to the structure of the affirmative, except for the presence of the negative marker(s). In asymmetric negation the structure of the negative differs from the structure of the affirmative in various other ways too, i.e. there is asymmetry between affirmation and negation. Affirmative and negative structures can be symmetric or asymmetric in two ways: there can be (a)symmetry either between the affirmative and negative constructions, or between the paradigms that the affirmative and negative constructions form. Symmetric negative constructions do not differ from the corresponding affirmative constructions in any other way than by the presence of the negative marker(s), whereas asymmetric negative constructions show structural differences in comparison to the corresponding affirmative constructions. In symmetric paradigms, all (verbal) categories or forms have corresponding affirmative and negative forms, whereas in asymmetric paradigms such one-to-one correspondences do not obtain.
Raija Bartens. 1999. Mordvalaiskielten rakenne ja kehitys, volume 232 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
László Keresztes. 1999. Development of Mordvin Definite Conjugation, volume 233 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
Maria Kholodilova. 2016. Moksha non-verbal predication, Printon, Tallinn, pages 229-259. Uralica Helsingiensia 10.
Jack Rueter. 2013. On quantification in the Erzya language, LINCOM, Muenchen, pages 99-118.
Trond Trosterud. 2006. Homonymy in the Uralic Two-Argument Agreement Paradigms, volume 251 of Suomalais-Ugrilaisen Seuran Toimituksia. Suomalais-Ugrilainen Seura, Helsinki.
Rigina Turunen. 2010. Nonverbal Predication in Erzya. A. S. Pakett, Tallinn.
Francis M. Tyers, Mariya Sheyanova, and Jonathan North Washington. 2018. UD Annotatrix: An annotation tool for Universal Dependencies. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories. page [to appear].