
# Neural machine translation with a polysynthetic low resource language

Authors: John E. Ortega, Richard Castro Mamani, Kyunghyun Cho

## Abstract

Low-resource languages (LRL) with complex morphology are known to be more difficult to translate in an automatic way. Some LRLs are particularly more difficult to translate than others due to the lack of research interest or collaboration. In this article, we experiment with a specific LRL, Quechua, that is spoken by millions of people in South America yet has not undertaken a neural approach for translation until now. We improve the latest published results with baseline BLEU scores using the state-of-the-art recurrent neural network approaches for translation. Additionally, we experiment with several morphological segmentation techniques and introduce a new one in order to decompose the language’s suffix-based morphemes. We extend our work to other high-resource languages (HRL) like Finnish and Spanish to show that Quechua, for qualitative purposes, can be considered compatible with and translatable into other major European languages with measurements comparable to the state-of-the-art HRLs at this time. We finalize our work by making our best two Quechua–Spanish translation engines available on-line.
Machine Translation (2020) 34:325–346
https://doi.org/10.1007/s10590-020-09255-9
John E. Ortega¹ · Richard Castro Mamani² · Kyunghyun Cho¹
Received: 24 February 2020 / Accepted: 13 December 2020 / Published online: 4 February 2021
© The Author(s), under exclusive licence to Springer Nature B.V. part of Springer Nature 2021
Keywords: Neural machine translation · Low resource languages · Morphology · Quechua · Finnish · Spanish

* John E. Ortega, jortega@cs.nyu.edu
Richard Castro Mamani, rcastro@hinant.in
Kyunghyun Cho, kyunghyun.cho@nyu.edu

¹ New York University, New York, USA
² Hinantin Software, Cusco, Peru
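The abstract above mentions decomposing Quechua's suffix-based morphemes before translation. As a rough illustration of what suffix-based segmentation means (not the paper's actual algorithm), the sketch below greedily strips a tiny, hand-picked inventory of real Quechua suffixes (-kuna plural, -pi locative, among others) from the right edge of a word:

```python
# Toy longest-match suffix segmenter for an agglutinative language.
# The suffix inventory is illustrative only; a real segmenter uses a
# full morphological grammar or learned subword units.
SUFFIXES = ["kuna", "man", "wan", "pi", "ta"]

def segment(word):
    """Greedily peel known suffixes off the right edge of a word."""
    morphs = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            # keep at least a short stem so we never consume the root
            if word.endswith(suf) and len(word) > len(suf) + 2:
                morphs.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + morphs

print(segment("wasikunapi"))  # ['wasi', 'kuna', 'pi']: house-PL-LOC
```

The greedy longest-match strategy is the simplest possible baseline; the paper's experiments compare several more principled segmentation techniques.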
... When translating, the source text is first analyzed semantically to extract its semantic content, which is then expressed in the text of the translation. Pure rule-based machine translation is rarely seen in use today [12,13]; rules are more commonly combined with other machine translation approaches. ...
Article
Full-text available
The importance of translation services has grown with accelerating economic globalization. Compared with human translation, machine translation is cheaper and faster, and therefore better suited to the current era. The mainstream approach is neural machine translation, which trains translation models on parallel corpora; the learning and generalization abilities of neural networks have substantially enhanced its effectiveness. This work applies machine learning and wireless network technology to build an online system for real-time translation. First, it proposes a multigranularity feature fusion method based on a directed acyclic graph, which fuses inputs of different granularities and derives a position representation. Second, it extends the Transformer model with multigranularity position encoding and multigranularity self-attention. Third, building on the multigranularity input features, it improves the word embedding module with dynamic word vectors obtained from the ELMo model. Finally, it combines these strategies into a multigranularity-feature, dynamic-word-vector machine translation model and deploys it on a server: users upload content to be translated and download the translated content over the wireless network, realizing an online translation system based on machine learning and wireless networks.
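The abstract above extends the Transformer with multigranularity position encoding. For orientation, the standard sinusoidal position encoding of Vaswani et al. (2017), the base scheme such variants modify, can be computed as follows (a generic sketch, not the cited paper's multigranularity variant):

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Standard Transformer sinusoidal position encoding:
    even dimensions use sin, odd dimensions use cos, with
    geometrically spaced wavelengths from 2*pi to 10000*2*pi."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # dimension pair i//2 shares one wavelength
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

Because each position gets a unique, smoothly varying vector, the model can attend by relative offset; a multigranularity scheme would add analogous encodings at the subword, word, and phrase levels.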
... For languages with complex morphology, morphological segmentation is an indispensable step in natural language processing (NLP) , which can reduce the word space of models and effectively reduce problems such as data sparseness. Therefore, morphological segmentation has been widely used in NLP downstream tasks such as named entity recognition, speech recognition, and machine translation (Abudubiyaz et al., 2020;Bareket & Tsarfaty, 2021;Ortega et al., 2020). ...
Article
Full-text available
Morphological segmentation is a basic task in agglutinative language processing: dividing words into morphemes, the smallest units of meaning. There are two types of morphological segmentation: canonical segmentation and surface segmentation. For Uyghur, a typical agglutinative language, canonical segmentation has usually relied on statistical methods that require manual feature extraction, while surface segmentation avoids manual feature engineering by using neural networks. To date, however, no model provides both kinds of segmentation for Uyghur without added features. In addition, morphological segmentation is usually treated as a sequence annotation task, so label imbalance easily occurs in datasets. Given this situation, this paper proposes an improved labelling scheme that combines morphological boundary labels and vowel harmony labels to produce both kinds of segmentation simultaneously. A convolutional network and an attention mechanism are added to capture local and global features, respectively, and morphological segmentation is cast as a sequence labeling task over character sequences. To address label imbalance and noise in the dataset, a focal loss function with label smoothing is used. Experimental results show that the proposed method achieves the best F1 scores for both canonical and surface segmentation.
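The focal loss with label smoothing mentioned above is a standard remedy for label imbalance in sequence labeling. A generic formulation for a single token is sketched below (the cited paper's exact hyperparameters and implementation may differ):

```python
import math

def focal_loss_with_smoothing(logits, target, gamma=2.0, eps=0.1):
    """Focal loss for one token's label prediction, with label smoothing.

    logits: raw scores per label; target: gold label index.
    gamma down-weights well-classified (easy) labels so training
    focuses on hard ones; eps moves probability mass off the gold
    label to soften noisy annotations.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]              # softmax over labels
    n = len(logits)
    smoothed = [eps / (n - 1)] * n             # smoothed one-hot target
    smoothed[target] = 1.0 - eps
    # focal weight (1 - p)^gamma on each cross-entropy term
    return -sum(q * (1 - p) ** gamma * math.log(p)
                for q, p in zip(smoothed, probs))
```

A confident, correct prediction incurs a much smaller loss than a confident, wrong one, which is exactly the behavior that counteracts dominant majority labels.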
... Nevertheless, applying BPEs on top of morphology-based segmentation for Turkish-English, Uyghur-Chinese, and Arabic-English has shown to bring improvements over solely using BPEs or morphology-based segmentation for neural MT task (Pan et al., 2020;Tawfik et al., 2019). A similar result was achieved by (Ortega et al., 2020), using a morphological guided BPE for polysynthetic languages. However, Oudah et al. (2019) show that such an approach is beneficial in the case of Statistical Machine Translation (SMT), and does not improve results for Neural Machine Translation (NMT). ...
Preprint
Full-text available
Data sparsity is one of the main challenges posed by code-switching (CS) and is further exacerbated in morphologically rich languages. For machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English and provide detailed analysis over a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best on segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of segmentation setup for MT is highly dependent on data size: for extreme low-resource scenarios, a combination of frequency- and morphology-based segmentation performs best, while for better-resourced settings such a combination brings no significant improvement over frequency-based segmentation alone.
... One low-resource language from South America, called Quechua, is spoken by nearly 8 million people yet still does not have enough resources to compete effectively with high-resource languages, as previous research has shown (Ebrahimi et al., 2021; Ortega et al., 2020). Oftentimes, due to insufficient resources, scores such as BLEU (Papineni et al., 2002) and accuracy are more than three times lower. ...
Conference Paper
Full-text available
In the effort to minimize the risk of a language's extinction, linguistic resources are fundamental. Quechua, a low-resource language from South America, is spoken by millions but, despite several past efforts, still lacks the resources necessary to build high-performance computational systems. In this article, we present WordNet-QU, the inclusion of Quechua in the well-known lexical database WordNet. We propose WordNet-QU as an extension to WordNet after demonstrating a manually curated collection of multiple digital resources for lexical use in Quechua. Our work uses a synset alignment algorithm to compare Quechua to its geographically nearest high-resource language, Spanish. Altogether, we propose a total of 28,582 unique synset IDs divided by region: 20,510 for Southern Quechua, 5,993 for Central Quechua, 1,121 for Northern Quechua, and 958 for Amazonian Quechua.
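A toy illustration of the synset-alignment idea: Quechua lemmas are attached to WordNet synset IDs by pivoting through Spanish entries in an existing Spanish wordnet. All data below is invented for illustration; WordNet-QU's actual resources and algorithm are more involved:

```python
# Hypothetical miniature resources. The synset IDs are placeholders,
# not verified WordNet offsets; the Quechua-Spanish pairs are real
# words (allqu = dog, wasi = house) used purely as an example.
spanish_wordnet = {"perro": ["02084071-n"], "casa": ["03544360-n"]}
qu_es_dictionary = {"allqu": ["perro"], "wasi": ["casa"]}

# Align: a Quechua lemma inherits every synset ID reachable through
# any of its Spanish translations.
quechua_wordnet = {
    qu: sorted({sid for es in es_words for sid in spanish_wordnet.get(es, [])})
    for qu, es_words in qu_es_dictionary.items()
}
print(quechua_wordnet)  # {'allqu': ['02084071-n'], 'wasi': ['03544360-n']}
```

Pivoting through a high-resource neighbor language is attractive precisely because bilingual Quechua-Spanish dictionaries exist where direct Quechua-English resources do not.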
... examples (9) and (10) in Section 1 above). While modeling such cases in MT is not yet practically possible due to the virtual absence of training data for polysynthetic languages (but see Ortega, Castro Mamani, and Cho (2020) for important first steps in this direction), much effort has gone recently into studying low-resource language directions for which there is some data (say, 100-300K sentence pairs), even if it is two orders of magnitude smaller than the data for English-German or French-English. Low-resource settings can also be recreated by limiting the amount of data otherwise available to the system. ...
Preprint
Full-text available
The success of deep learning in natural language processing raises intriguing questions about the nature of linguistic meaning and ways in which it can be processed by natural and artificial systems. One such question has to do with subword segmentation algorithms widely employed in language modeling, machine translation, and other tasks since 2016. These algorithms often cut words into semantically opaque pieces, such as 'period', 'on', 't', and 'ist' in 'period|on|t|ist'. The system then represents the resulting segments in a dense vector space, which is expected to model grammatical relations among them. This representation may in turn be used to map 'period|on|t|ist' (English) to 'par|od|ont|iste' (French). Thus, instead of being modeled at the lexical level, translation is reformulated more generally as the task of learning the best bilingual mapping between the sequences of subword segments of two languages; and sometimes even between pure character sequences: 'p|e|r|i|o|d|o|n|t|i|s|t' → 'p|a|r|o|d|o|n|t|i|s|t|e'. Such subword segmentations and alignments are at work in highly efficient end-to-end machine translation systems, despite their allegedly opaque nature. The computational value of such processes is unquestionable. But do they have any linguistic or philosophical plausibility? I attempt to cast light on this question by reviewing the relevant details of the subword segmentation algorithms and by relating them to important philosophical and linguistic debates, in the spirit of making artificial intelligence more transparent and explainable.
Article
Full-text available
Machine translation (MT) has been one of the most popular fields in computational linguistics and artificial intelligence (AI). As one of the most promising approaches, MT can potentially break the language barrier between people all over the world. Despite the number of studies on MT, few summarize and compare MT methods. To this end, in this paper we focus on presenting the two mainstream MT schemes, statistical machine translation (SMT) and neural machine translation (NMT), including their basic rationales and development. Detailed translation models are also presented, such as the word-based, syntax-based, and phrase-based models in SMT, and the recurrent neural network-based, attention mechanism-based, and Transformer-based models in NMT. Finally, since evaluation plays an important role in helping developers improve their methods, the prevailing machine translation evaluation methodologies are also presented.
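Among the evaluation methodologies surveyed, BLEU (also cited elsewhere on this page) remains the most common. A minimal, unsmoothed sentence-level sketch follows; production evaluations typically use corpus-level BLEU with smoothing, e.g. via sacreBLEU:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (no smoothing, single
    reference, for clarity only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty punishes candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0; low-resource systems like the Quechua engines discussed on this page typically score far lower, which is why segmentation choices matter so much.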
Article
Speaker Change Detection (SCD) is the problem of splitting an audio recording at its speaker turns. Many real-world problems, such as Speaker Diarization (SD) or automatic speech transcription, are influenced by the quality of speaker-turn estimation. Previous work has shown that auxiliary textual information (for mono-lingual systems) can be of great use for detecting speaker turns and for diarization performance. In this paper, we suggest a framework for speaker-turn estimation and for assigning clustered speaker identities in the SD system, and examine our approach on a multi-lingual dataset consisting of three mono-lingual datasets, in English, French, and Hebrew. We thus propose a generic, language-independent framework for SCD that is learned from textual information using state-of-the-art transformer-based techniques and speech-embedding modules. Comprehensive experimental evaluation shows that (i) our multi-lingual SCD framework is competitive with frameworks trained on mono-lingual datasets, and that (ii) textual information improves solution quality compared to the purely signal-based approach. In addition, we show that our multi-lingual SCD approach does not harm the performance of SD systems.
Chapter
Neural Machine Translation (NMT) is an end-to-end approach to automatic language translation that uses neural models, an effort to bridge the gap between multinational and multilingual communities. NMT systems learn a direct input-output mapping, which has been shown to produce more accurate output, and the technique has made remarkable progress, overcoming weaknesses of conventional translation models. This paper implements RNN-based, attention-based, and Transformer models on Indian-English language pairs. So far there are no specific benchmarks for Indian languages, although companies such as Facebook, Bing, and Google offer translators that support a few Indian languages. In this research, models were trained on two sets of Indian language pairs retrieved from the open-source platform Tatoeba.
Keywords: Attention mechanism, Encoder-Decoder, Machine translation, Recurrent Neural Network (RNN), Transformers
Conference Paper
Full-text available
Languages are disappearing at an alarming rate, and the linguistic rights of speakers of most of the world's 7,000 languages are at risk. ICT plays a key role in the preservation of endangered languages, and natural language processing in particular must be highlighted, since in this century the lack of such support hampers literacy acquisition and prevents the use of the Internet and other electronic means. The first step is building resources for processing; we therefore introduce Siminchik, the first speech corpus of Southern Quechua suitable for training and evaluating speech recognition systems. The corpus consists of 97 hours of spontaneous conversations recorded from radio programs in the southern regions of Peru. The annotation was carried out by native speakers from those regions using the unified written convention. We present initial experiments on speech recognition and language modeling and explain the challenges inherent to the nature and current status of this ancestral language.
Conference Paper
In this paper, we present the implementation of an automatic speech recognition (ASR) system for the southern Quechua language. The software can recognize both continuous speech and isolated words. The ASR system was developed using the Hidden Markov Model Toolkit (HTK) and the corpus collected by Siminchikkunarayku. A dictionary provides the system with a mapping from vocabulary words to sequences of phonemes; the audio files were processed to extract speech feature vectors (MFCCs), and the acoustic model was trained on the MFCC files until convergence. The paper also describes the detailed architecture of an ASR system built with HTK library modules and tools. Tested on audio recorded by volunteers, the ASR system obtained a 12.70% word error rate.
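The 12.70% word error rate reported above is the standard ASR metric: the word-level edit distance between the reference and hypothesis transcripts, normalized by the reference length. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between
    reference and hypothesis, divided by reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # deleting all of r[:i]
    for j in range(len(h) + 1):
        dp[0][j] = j  # inserting all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
            
    return dp[-1][-1] / len(r)

print(wer("a b c d", "a x c d"))  # 0.25: one substitution over four words
```

Because insertions count as errors, WER can exceed 100% on very noisy output, which is common for low-resource ASR.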
Book
Cambridge Core - Linguistic Anthropology - Describing Morphosyntax - by Thomas E. Payne
Chapter
For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. Common practice replaces all such rare or unknown words with a ⟨UNK⟩ token, which limits translation performance to some extent. Most recent work handles this problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one of the successful attempts, shown to be extremely competitive by providing effective subword segmentation for NMT systems. In this paper, we extend BPE-style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that the AV- and DLG-enhanced systems outperform the FRQ baseline in the frequency-weighted schemes at different significance levels.
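The frequency-based (FRQ) merge criterion at the heart of BPE can be sketched in a few lines; the AV and DLG variants described above would replace the pair-ranking statistic. This is a generic sketch, not the chapter's implementation:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges by repeatedly fusing the most frequent
    adjacent symbol pair (the FRQ criterion)."""
    # each word starts as a tuple of characters plus an end marker
    vocab = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite the vocabulary with the chosen pair fused
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest"]
print(learn_bpe(corpus, 3))  # first merge fuses the most frequent pair ('w', 'e')
```

Swapping `max(pairs, key=pairs.get)` for a ranking by accessor variety or description length gain yields the AV and DLG schemes, respectively, without touching the merge loop.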
Chapter
This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm requiring only minor tweaking to adapt it for any particular language, a property which makes it potentially useful for morphologically rich languages with no morphological analysers available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. We conducted experiments with two different neural machine translation systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, they were relatively minor, and our experiments show that machine translation with inflected languages remains challenging, especially with translation direction towards a highly inflected language.