# Neural machine translation with a polysynthetic low resource language

Low-resource languages (LRLs) with complex morphology are known to be more difficult to translate automatically. Some LRLs are particularly difficult to translate because of a lack of research interest or collaboration. In this article, we experiment with one such LRL, Quechua, which is spoken by millions of people in South America yet has not been the subject of a neural translation approach until now. We improve on the latest published results, reporting baseline BLEU scores obtained with state-of-the-art recurrent neural network approaches to translation. Additionally, we experiment with several morphological segmentation techniques and introduce a new one in order to decompose the language’s suffix-based morphemes. We extend our work to high-resource languages (HRLs) such as Finnish and Spanish to show that Quechua, for qualitative purposes, can be considered compatible with and translatable into major European languages, with measurements comparable to those of state-of-the-art HRLs at this time. We conclude by making our two best Quechua–Spanish translation engines available online.
JohnE.Ortega1 · RichardCastroMamani2· KyunghyunCho1
Received: 24 February 2020 / Accepted: 13 December 2020 / Published online: 4 February 2021
Keywords: Neural machine translation · Low resource languages · Morphology · Quechua · Finnish · Spanish
* John E. Ortega
jortega@cs.nyu.edu
Richard Castro Mamani
rcastro@hinant.in
Kyunghyun Cho
kyunghyun.cho@nyu.edu
1 New York University, New York, USA
2 Hinantin Software, Cusco, Peru
Languages are disappearing at an alarming rate, and the linguistic rights of speakers of most of the 7000 languages are at risk. Information and communication technologies (ICT) play a key role in the preservation of endangered languages. As the ultimate use of ICT, natural language processing must be highlighted, since in this century the lack of such support hampers literacy acquisition and prevents the use of the Internet and other electronic means. The first step is building resources for processing; we therefore introduce Siminchik, the first speech corpus of Southern Quechua, suitable for training and evaluating speech recognition systems. The corpus consists of 97 hours of spontaneous conversations recorded from radio programs in the southern regions of Peru. The annotation was carried out by native speakers from those regions using the unified written convention. We present initial experiments on speech recognition and language modeling and explain the challenges inherent to the nature and current status of this ancestral language.
In this paper, we present the implementation of an Automatic Speech Recognition (ASR) system for the Southern Quechua language. The software can recognize both continuous speech and isolated words. The ASR system was developed using the Hidden Markov Model Toolkit (HTK) and the corpus collected by Siminchikkunarayku. A dictionary provides the system with a mapping from vocabulary words to sequences of phonemes. The audio files were processed to extract speech feature vectors (MFCC), and the acoustic model was then trained on the MFCC files until convergence. The paper also describes the detailed architecture of an ASR system built with HTK library modules and tools. The system was tested on audio recorded by volunteers and obtained a word error rate of 12.70%.
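The word error rate reported above is conventionally defined as the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used by the authors (HTK ships its own scoring tool). The function name `word_error_rate` and the placeholder tokens in the usage note are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch of the usual WER definition; tokenization here
    is plain whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, with the placeholder transcripts `"w1 w2 w3 w4"` (reference) and `"w1 x w3"` (hypothesis), there is one substitution and one deletion over four reference words, so the function returns 0.5. A corpus-level figure is usually obtained by summing the edit distances and the reference lengths over all test utterances before dividing.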
