Overcoming Resistance: The Normalization of an Amazonian Tribal Language


Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 1–13, December 04, 2020. ©2020 Association for Computational Linguistics
John E. Ortega
New York University
New York, New York, USA
Richard Alexander Castro-Mamani
Univ. Nacional de San Antonio Abad
Cusco, Perú
Jaime Rafael Montoya Samame
Pontificia Universidad Católica del Perú
Lima, Perú
Abstract

Languages can be considered endangered for many reasons. One of the principal reasons for endangerment is the disappearance of a language's speakers. Another, more identifiable, reason is the lack of written resources. We present an automated sub-segmentation system called AshMorph that deals with the morphology of an Amazonian tribal language called Ashaninka, which is at risk of endangerment due to the lack of availability (or resistance) of native speakers and the absence of written resources. We show that, by using a cross-lingual lexicon and finite-state transducers, we can increase accuracy by more than 30% when compared to other modern sub-segmentation tools. Our results, made freely available on-line, are verified by an Ashaninka speaker and perform well in two distinct domains, everyday literary articles and the bible. This research serves as a first step in helping to preserve Ashaninka by offering a sub-segmentation process that can be used to normalize any Ashaninka text, which can then serve as input to a machine translation system for translation into high-resource languages spoken in more populous neighboring regions, such as Spanish and Portuguese in the case of Peru and Brazil, where Ashaninka is mostly spoken.
1 Introduction
In South America, there are hundreds, if not thousands, of low-resource languages. In Brazil alone, one can quickly find a list1 on-line of languages that range from vulnerable to critically endangered. Some languages, like Quechua, a low-resource language mostly spoken in Peru and Bolivia, have gained more attention in recent work (Cardenas et al., 2018; Cotterell et al., 2018; Ortega and Pillaipakkamnatt, 2018; Hintz and Hintz, 2017) and, while not 100% translatable to their high-resource counterparts, are on the path to digital preservation. Other South American languages are closer to extinction. One such language, called Ashaninka, is spoken by nearly 70,000 people in the Amazon forests shared by Peru and Brazil (see Figure 1). While there are plenty of Ashaninka speakers when compared to other tribal languages like Guarasu, which is spoken by only hundreds, native Ashaninka speakers are generally less willing to cooperate in the digitization of the language.
We have located a near-native Ashaninka speaker to help validate our work, which consists of multiple findings and improvements. Our research ultimately presents a normalization tool that will completely automate the processing of Ashaninka text as input for a machine translation (MT) system. More specifically, we complete three main tasks: (1) alphabet normalization, (2) morphological disambiguation, and (3) a tagging system which includes part-of-speech (POS) and morphological tagging. In addition to creating an automation tool2, called AshMorph, we make a human-verified sub-segmentation development and test corpus freely available.
We present our findings using AshMorph on a development set in one domain (articles from story books, plays, and educational text) and then on a test set in another domain, common biblical translations, often times the only resource available for low-resource languages in South America. We detail our results by comparing them to the latest sub-segmentation techniques, namely Subword-NMT (Sennrich et al., 2015) (byte-pair encoding, or BPE), Morfessor (Smit et al., 2014) and SentencePiece (Kudo and Richardson, 2018) (BPE with a unigram model).
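For readers unfamiliar with these baselines, byte-pair encoding learns sub-segments purely from co-occurrence statistics, with no linguistic knowledge. The following is a minimal pure-Python sketch of the merge-learning loop only (illustrative; the actual tools add vocabulary thresholds and many refinements):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a list of word tokens."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_vocab = Counter()
        for word, count in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges
```

The key point for the comparison below is that no morphological rule is ever consulted: merges are driven entirely by frequency, which is one reason linguistically informed rules can win on very small corpora.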
Figure 1: Ashaninka speakers covering the Amazon region between Peru and Brazil.
The structure of this paper has been created to first provide an overview of our in-depth normalization process and then provide convincing results that serve as reasoning to use AshMorph for future tasks. In Section 2, we compare and contrast related work on Ashaninka's nearest neighboring language, Quechua, and provide citations for other work that led up to this paper. Then, in Section 3, we dive deep into the details of AshMorph to show how it takes advantage of another system currently used to normalize Quechua by extending it to cover morphology and grammatical context from Ashaninka. After that, we cover our experimental settings for the two stages (development and test) of AshMorph's development. Next, we provide detailed results along with sample output from AshMorph in Section 5. Lastly, in Section 6, we give insight into what we plan as our next set of experiments towards a final product for translation.
2 Related Work
Our work is novel due to its overall linguistic coverage of Ashaninka in several ways, serving specifically as a first step for translation by normalization and sub-segmentation. However, there are other works that perform sub-segmentation on low-resource languages, including Ashaninka's Peruvian counterpart, Quechua. Since Quechua and Ashaninka are both polysynthetic languages and, in some ways, have a similar morphological makeup, we present those works along with the biblical corpus translations that are used in our experiments.
Since the bible has been translated into several languages (Christodouloupoulos and Steedman, 2015) and is available on-line in a parallel format from Opus3, we include it here as related work because it has been used in several machine translation tasks on other low-resource languages, including Quechua. Contrastingly, while the bible corpus for Quechua (acronym QUW on Opus) has 12,400 total sentences, it has 19,100 sentences for Ashaninka (acronym CNI on Opus). Nonetheless, Quechua has by far received more attention in experimentation related to MT.
The work most similar to ours performs normalization on Quechua (Rios, 2010). It consists of the development of a morphological analyzer and normalization technique similar to ours, which allows it to analyze several Quechua dialects. Using her work as a guide and the help of a near-native Ashaninka reviewer, we present a normalization pipeline for Ashaninka which, mainly due to resistance and a lack of resources, is still in the early stages of a normalized writing system. Unlike Quechua, which possesses a normalized lexicon for all its southern dialect varieties (Cerrón-Palomino, 1994), Ashaninka possesses only a few candidates for a normalized alphabet, among them the one developed by Elena Mihas (Mihas, 2010) which, based on our observations, is best suited for use as a normalized alphabet for Ashaninka and its dialects.
While there are several works that perform morphological segmentation and normalization, we do not consider them especially related to our work. However, one MT system, Apertium (Forcada et al., 2011), has been used in a similar way to produce rule-based translations based on morphology. As a comparison, Apertium was used for translating Quechua to Spanish in the past (Larico Uchamaco et al., 2013). It has a unique way of rule creation in its “lttoolbox” that allows for the creation of a finite state transducer (FST), similar to our work described in Section 3, that then serves as input to its MT component. But we additionally introduce other tagging strategies, such as an alphabetical mapping device specific to Ashaninka, that are based on its syntactic structure, as was done in another system used to translate Spanish to Quechua based on the same normalization technique used in AshMorph, originally created for Quechua (Rios, 2010).
As for the other systems used for comparison in our results (Subword-NMT (Sennrich et al., 2015), Morfessor (Smit et al., 2014) and SentencePiece (Kudo and Richardson, 2018)), we do not consider them directly related to our work because they are generic segmentation systems that lack normalization techniques based on linguistic knowledge. They are included here as a means of comparison due to the lack of sub-segmentation systems available for Ashaninka.
Other work, while similar in nature, does not dive into the complexities of Ashaninka. In the next section, we describe our implementation of AshMorph in detail with an attempt to show why our work is novel and unlike research in other languages, except for Quechua, upon which this work is based.
3 Methodology
Our system has been developed so that MT systems can use its normalized output as input. In this section, we describe the normalization pipeline, which consists of several steps. We first provide linguistic details about Ashaninka, which include a description of the alphabet, details on the morphology, and parts of speech (POS). Then, we cover the two tag sets that cover the POS and morphology.
3.1 Alphabetical Normalization
Ashaninka has several dialects and is known to be both polysynthetic and agglutinative (Bustamante et al., 2020). The language is highly inflective and contains several suffixes that are added, much like in Quechua, to the end of a root word. The ambiguous nature of its grammatical construct has most likely led to the large number of dialects known as pan-Ashaninka. With that in mind, we decided to use one alphabet, called Asheninka Perene (Mihas, 2010), as the main alphabet for mapping pan-Ashaninka to one alphabet for creating lexicons and other grammatical constructs in AshMorph. To our knowledge, Asheninka Perene is the most recent alphabet available for Ashaninka, and it extends the original alphabet created by Payne (1981) as illustrated in Table 4.
3.2 Morphological Normalization
Ashaninka, much like its Peruvian counterpart (Quechua), is a polysynthetic language whose inflection, which changes the meaning of a word, depends on the head of a phrase. This makes sub-segmentation of Ashaninka different than Quechua despite both being agglutinating languages. Ashaninka typically inflects on transparent noun and verb agreement and, more often than not, one can find a word that combines several stems (noun and verb classifiers) that make up a specific semantic meaning. For example, we denote the incorporation of a noun classifier (tsapya) and a verbal classifier (ha) below:

3.2.1 apaani asheninka isaikatsapyaatziro inkaare
“a man who lived near a lake”

3.2.2 katsinkahari
“cold water”
Examples 3.2.1 and 3.2.2 show how inflection works in Ashaninka. We note that the word heads, or roots, are marked when verbs and nouns agree with properties, such as gender, of their arguments. In Ashaninka, verbs are often times marked by gender, and that property is transferable; this is seen where both the subject and object of a phrase are cross-referenced. The cross-referencing is illustrated clearly in Example 3.2.1, where the subject (apaani asheninka – one man) is masculine and the corresponding verb’s prefix (the first letter i in i-saik-a-tsapya-atz-i-ro) is also masculine. This cross-referencing is furthermore replicated in the direct object of the sentence (inkaare – lake). That means that not only do we see gender agreement in the corresponding verb, but we also see a gender-based dependency with the object of the sentence (inkaare – lake).
One of the interesting phenomena of Ashaninka is that it contains a suffix (–paye) that can inflect the meaning of a word to be plural and nominal. Nonetheless, Ashaninka can still be highly inflective and ambiguous when dealing with plural or singular nouns. One example where there is a nominal root that could refer to more than one entity occurs in a word like koya – woman:

≥ 1 one or more entities, e.g.: koya ‘woman or women’
= 1 exactly one entity, e.g.: aparoni koya ‘one woman’
> 1 more than one entity, e.g.: koyapaye ‘women’
In our morphological analysis, we collected several morphological rules and assigned them operations. The operations generally depend on boundary symbols (marked as in our rule set). We make all rules publicly available in our FST. Here is an example of a morphological rule used to differentiate verbal roots where a personal pronoun prefix is both partially and totally duplicated:

{ (n ‘1SG.S/A’ + oirink ‘to.lower’ + a ‘EP’ –> noinoirinka ‘I get lower and lower’) } (partial duplication of “lower”)

{ (n ‘1SG.S/A’ + ak ‘to.answer.back’ + a ‘EP’ –> nakanaka ‘I answer back constantly’) } (total duplication of “answer back”).
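To make the two operations concrete, here is a toy sketch of ours (not the rule syntax used in AshMorph's FST) of partial and total duplication over an already-inflected form, under the assumption that the partial copy targets the initial consonant-vowel sequence:

```python
import re

VOWELS = "aeiou"

def partial_reduplication(form):
    # Copy the initial (C)V sequence: n+oirink+a -> noirinka -> noi + noirinka.
    head = re.match("[^{0}]*[{0}]+".format(VOWELS), form).group(0)
    return head + form

def total_reduplication(form):
    # Repeat the whole inflected form: n+ak+a -> naka -> nakanaka.
    return form + form
```

Both functions reproduce the two examples above (noinoirinka and nakanaka); the real FST rules additionally decide which operation applies to which verbal root.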
3.3 Grammatical Tagging
The normalization of most languages where there are not enough resources to learn linguistic features using an automated method, such as deep learning, typically requires a tagging approach. One of the most common tagging approaches is POS tagging; however, for languages like Quechua and Ashaninka, initial morphological tagging is needed before POS tagging can be performed. The implementation of a dual-faceted approach to cover both POS and morphology is what makes AshMorph unique.
While earlier tagging efforts (Payne, 1981) were originally based on three types of morphology (nouns, verbs, and adverbs), our work takes advantage of two more main classes (adjectives and pronouns). The original argument (Payne, 1981) for having only three classes was that adjectives and pronouns were indistinguishable; however, we have found the contrary by taking into account more seminal work (Mihas, 2010) that points towards the advantage of having major word classes (nouns and verbs) along with several smaller classes of adjectives, adverbs, pronouns, and more. Our work uses that work (Mihas, 2010) to assist in the creation of a normalization technique that covers both POS and low-level morphological anomalies. Many of the final rules (described in further detail below) were obtained during the development stage in several trial-and-error experiments to expand initial lexicons.
Part of Speech Tagging One of the more interesting approaches for our tag set development is the use of rules developed using the Czech language, specifically those mentioned in previous work (Hlaváčová, 2017). We adapt those rules, as done by others who worked on similar Peruvian languages (Pereira-Noriega et al., 2017; Cardenas et al., 2018), for matching Ashaninka in a quick fashion as described in the following. First, our tag set, as in (Hlaváčová, 2017), consists of POS tags, i.e. labels used to indicate the POS and other grammatical categories, such as case or tense, for each token; a token is defined as a sequence of non-blank characters between blanks, with punctuation handled as separate tokens. Second, we group multi-token units, such as proper names or numbers, at the structural level as done previously (Brants et al., 2003, p. 77). Table 1 provides a more in-depth illustration of the Ashaninka tag set that we developed for AshMorph.
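The token definition above is easy to operationalize; a minimal sketch (not AshMorph's actual tokenizer) that splits on whitespace and separates each punctuation mark into its own token:

```python
import re

def tokenize(text):
    # A token is a maximal run of word characters; each punctuation
    # mark becomes a separate token, matching the tag-set definition.
    return re.findall(r"\w+|[^\w\s]", text)
```

For example, tokenize("apaani asheninka, isaikatsapyaatziro inkaare.") keeps the comma and period as separate tokens.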
Deep Morphological Tagging Our morphological tagger has been created to perform a deep analysis on Ashaninka, deeper than POS tagging alone. It takes several lexical classes into account that have nothing to do with POS. We use strict tagging for word forms which takes into account inflection, tense, gender, and more. Additionally, we convert our tagging output into a comprehensive, human-readable tag set similar to another morphological analyzer created for Quechua.
The word forms that we analyze in Ashaninka are formed, much like in English, by the concatenation of letters. For example, the English words “write”, “writes”, “sisters”, and “where” in Ashaninka are sankenataantsi, isankenate, choenipaeni, and tsika, respectively. The heads, or nuclei, of Ashaninka words are created in AshMorph by using their corresponding lemmas. For the preceding example, the heads are: sankena, sankena, choeni, and tsika. By using the lemma, we are able to expand the understanding of inflection into what we call a paradigm – a set of word forms that can be created by means of “inflection” from their lemma. We have found that lemmas in Ashaninka are often times different according to the dialect; so, in future iterations, some input may differ for words where lemmas are distinct, much like would occur in English with the words “color” and “colour”. Nonetheless, in our corpora (described further in Section 4), there were no major variants found despite the corpora contents, which consisted of several different dialects.

Category (Abbrev.)   Type                  Abbrev.
Verb (V)             A-class               A
                     I-class               I
                     Copula                COP
Noun (N)
Adjective (Adj)      Underived Adjective
                     Derived Adjective
Adverb (Adv)
Pronoun (Prn)        Personal              Pers
                     Possessive            Poss
                     Demonstrative         Dem
                     Interrogative         Wh
                     Indefinite Forms      Indef
Numeral (NUM)
Particle (C)         Connective            Conn
                     Interjection          Interj
Ideophone (IDEO)
Unknown (UNK)
Punctuation          $., $", $’, $-, $?, $!

Table 1: AshMorph’s POS Tag Set – originally based on structural information from the Czech language.
Our morphological tagging system is based on two main concepts: category and value. A morphological category in our system refers to a property of a word such as gender (masculine – +m, or non-masculine – n.m.). The tense, on the other hand, is based on what is known as the reality status system (RSS), a binary verbal distinction between “realized” and “unrealized” situations established in previous work (Michael, 2014). We deem a morphological value to be the actual value of a morphological category. For example, the morphological values for the morphological category “number” could be singular (SG) or plural (PL). In our system, the final morphological tag that AshMorph outputs is a sequence of (X, Y) pairs annotated with the format +X@Y, denoting that Y is the morphological value of the morphological category X. The example below shows how AshMorph tags the verbal root tash (“to be hungry” in English) followed by the suffixes agantsi and antsi, which denote the infinitive tense. When lemmas, or heads, take on different morphological values for the same morphological category (X, Y1), (X, Y2), ..., (X, Yn), we annotate them as +X@Y1;Y2...;Yn.. The lemma tash- (‘to be hungry’) may take either of the infinitive ‘INF’ suffixes, -agantsi or -antsi, resulting in the annotation [+INF@agantsi;antsi.] as seen in Section 7.1.1 (Supplemental Material).
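The +X@Y annotation scheme is mechanical enough to sketch directly. The helper below is hypothetical (the released AshMorph tooling is FST-based, not Python), but it reproduces the single- and multi-value formats described above:

```python
def morph_tag(category, values):
    # Single value: +X@Y ; several values: +X@Y1;Y2;...;Yn.
    tag = "+{}@{}".format(category, ";".join(values))
    # Multi-value tags carry a trailing period, as in [+INF@agantsi;antsi.]
    return tag + "." if len(values) > 1 else tag
```

Calling morph_tag("INF", ["agantsi", "antsi"]) reproduces the annotation from the tash- example, while morph_tag("number", ["SG"]) yields a plain single-value tag.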
Since South American languages with complex morphology like Quechua and Ashaninka contain a lot of information inside their suffixes, we use a maximal set of morphological values to cover what can be considered a complex RSS (Mihas, 2010), which provides a lot of contextual information by using what are known as realis, or ‘REAL’, and irrealis, or ‘IRR’, suffixes. AshMorph’s maximal set of morphological values can be defined as the set that is sufficient for the morphological description of a single word form given its lemma, which typically depends on the POS classification of the lemma. For the REAL suffixes, the annotation model is +X@Y1Y2Y3.; and, for the IRR suffixes, the annotation model is +X@Y1Y2Y3Y4Y5.. In order to group the morphological values in said format, we use what is known as the EAGLES4 format which, amongst other formats, is used for those cases that need to take the position of a sequence of values into account.
EAGLES output can be difficult to read; so, we prepared a method to modify the output in such a way that a beginner, not trained in Ashaninka, can use it. Our method follows previous work (Hulden and Francom, 2012; Rios, 2010, p. 2115) on lemmas and suffixes and is nearly identical to the Quechua normalizer from that work, which produces human-readable labels based on a lexicon, as seen in Sections 7.1.1 - 7.1.4 (Supplemental Material).
3.4 Combined Normalization
The output of AshMorph is a combination of alphabetical and morphological normalization that produces a readable tag set that explains the POS, morphology, and grammatical construct for each word. It allows us to separate each word from the Ashaninka input into parsed informational chunks. The process explained in this section can be considered somewhat more complex than others; but it is advantageous at this stage because it provides deep insight into how Ashaninka is linguistically structured. The finalized output is straightforward to read from other software packages. We provide an example in Figure 2 (“OUTPUT WITH …”).
Figure 2 shows how a sample input (labelled “PRIMARY INPUT”) is first parsed using alphabetical and morphological normalization. Since the three words (amõyasatzi, amoyasatzi, amoñasatzi) are actually the same word but from different dialects, we normalize them to one word (amonyasatzi). After that, an FST is used to transform the normalized input into a statistical model using the method described in Section 3.2. The model is then applied to the word using a method similar to previous work (Rios, 2010), which produces output that contains POS tags along with other morphological tags such as gender.
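The dialect-collapsing step can be pictured as a lookup from attested variants to the normalized form. A word-level sketch using the three variants above (the real system maps at the grapheme level via the Asheninka Perene alphabet, not via a word list):

```python
# The three dialect spellings from Figure 2, all collapsed to one
# normalized form before the FST analysis step.
VARIANTS = {
    "amõyasatzi": "amonyasatzi",
    "amoyasatzi": "amonyasatzi",
    "amoñasatzi": "amonyasatzi",
}

def normalize(token):
    # Unknown tokens pass through unchanged.
    return VARIANTS.get(token, token)
```

The normalized token is then what the morphological FST analyzes, so downstream rules only ever see one spelling per word.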
In the next section, we cover the data used for creating our models and validation based on the methodology explained.
4 Experimental Settings
Our experiments consist of two stages, a development stage and a test stage. While it is not customary to include the development stage in experiments, we add it here due to the corpus domain difference. During the development stage, extra attention was given to the different dialects and suffixes to construct a generic system that will work with any type of Ashaninka input.
The AshMorph system is made publicly available5, as are the development and test corpora. As mentioned in Section 3, it consists of a morphological analyzer and FST model similar to those (Rios, 2010) created for another language, Quechua, most commonly spoken in Peru. Our contribution lies in the modification of the system from Quechua to Ashaninka and the application of alphabetical, grammatical, and POS normalization, along with the development and test corpora, validated by a reviewer.
The development stage corpus consists of a collection of various stories, plays, and educational text extracted from ‘Ñaantsipeta asháninkaki birakochaki’ (Cushimariano Romano and Sebastián Q., 2008), made publicly available6 with permission from the authors. In total, 745 sentences were extracted by analyzing each sentence and converting them to their inflected forms, which resulted in a vocabulary of 2,389 words. For creating language models during development, we used 50 randomly selected sentences for evaluation by a near-native Ashaninka speaker. Evaluations were done by first providing the evaluator with the output, which contained several sub-segments marked by our system (AshMorph). Rules were modified based on the evaluation in order to fine-tune AshMorph during development, which helped the system gain optimum results during the test stage.
In order to show that AshMorph works well with texts not found in the development stage, we chose a corpus from a reliable, well-known, on-line resource, Opus7. We randomly chose 50 of the 7,774 total bible-based sentences and executed AshMorph on them using the model based on rules obtained from the development stage. Again, a near-native reviewer was asked to validate the AshMorph sub-segmentation, and accuracy along with a confusion matrix was used to show how well the systems performed.
Three main systems were used to compare off-the-shelf sub-segmentation systems with AshMorph. We decided to use systems that have been previously used in MT tasks because our final goal, left for future work, is to use this work for translating local Ashaninka texts to Spanish using MT systems. The first system used for comparison was Subword-NMT8 (Version 0.3.7). We used the default byte-pair encoding mechanism in Subword-NMT; in our experiments, the vocabulary was 2,389 words in size. The second system used for comparison was Morfessor9 (Version 2.0.6). Morfessor is a classifier-based system that uses statistical rules based on an N-best Viterbi algorithm. It has also been used in other MT projects (Liu et al., 2017; Zuters and Strazds, 2019) for translation on languages similar to Quechua or Ashaninka. The third system, SentencePiece (Kudo and Richardson, 2018), is commonly used in high-grade MT systems for tasks related to neural machine translation and rapid evaluation (Bérard et al., 2019; Neubig and Hu, 2018). With SentencePiece, we used the default byte-pair encoding with a unigram model for segmentation. For all three systems, we used a corpus split similar to the development stage used in AshMorph, consisting of 696 train/tune sentences and 50 test sentences.
In the next section, we provide details on how well our system performed by providing both accuracy and a detailed precision score.
5 Results and Conclusion
AshMorph performs well considering the amount of resources available. When compared to other current sub-segmentation systems based on statistical models of some sort, it outperforms them on the order of 37 to 59% during the development stage and 26 to 45% during the test stage. The change of corpus from localized stories, plays and educational texts during development to a biblical domain does not seem to affect it greatly (a difference on average of 13% between systems). We believe that the decrease in performance is mostly due to new words introduced by the biblical corpus that were hard to discern even by the near-native reviewer.
The normalization rules introduced during the development stage were created from various consultations with linguists trained in Quechua, Ashaninka, and other South American native languages. In order to develop AshMorph, a cluster of dialects with a wide range of writing systems, initially without a normalized writing system, was gathered and served as a pivot to provide higher vocabulary coverage and reduce human effort. The normalization system presented here provides evidence that, for a low-resource language that consists of several dialects and rarities and is in its infancy, a system based on linguistic consensus can outperform other, more modern, statistical modeling systems.
The finalized results from AshMorph yield a robust normalization and sub-segmentation system that displays a clear path from input to output; such clarity should be provided when developing a system of this nature. By providing clear rule-based boundaries (Input –> Morpheme –> Suffix –> Morphological Tag –> Essential Translation) as shown in Table 2, one can trace and modify rules accordingly to build a more powerful system that, later, combined with other MT techniques, can achieve better translations.
In order to show that a sub-segmentation system developed with linguistic knowledge can be powerful for Ashaninka, we provide a detailed review of our findings during both the development and test stages in Table 3. We measured accuracy, true positives (TP), false positives (FP) and false negatives (FN). For our evaluation, accuracy is considered to be only those sub-segments where there was complete agreement between AshMorph and the Ashaninka reviewer, similar to a TP in the following sentence. A TP is considered to be the number of sub-segments where AshMorph agreed with the Ashaninka reviewer. An FP is considered to be the number of sub-segments marked by AshMorph but not marked by the Ashaninka reviewer. Lastly, an FN is considered to be the number of sub-segments that AshMorph did not mark but the human annotator did.
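Under one plausible reading of these definitions, the counts in Table 3 can be computed per word as follows (a sketch of the scoring, assuming each word is compared as a set of sub-segments; the paper's exact tallying procedure may differ):

```python
def score(system, reviewer):
    """system/reviewer: one list of sub-segments per word."""
    tp = fp = fn = exact = 0
    for sys_segs, ref_segs in zip(system, reviewer):
        s, r = set(sys_segs), set(ref_segs)
        tp += len(s & r)               # sub-segments both marked
        fp += len(s - r)               # marked by the system only
        fn += len(r - s)               # marked by the reviewer only
        exact += sys_segs == ref_segs  # complete agreement -> accuracy
    return {"ACC": exact / len(system), "TP": tp, "FP": fp, "FN": fn}
```

For example, a word segmented identically by system and reviewer contributes to both ACC and TP, while a disagreement contributes FP and FN counts but no accuracy credit.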
From Table 3, we clearly establish that, for Ashaninka, much like previous work (Rios, 2010) on Quechua, a completely linguistic-based approach performs better than other baseline sub-segmentation systems. The addition of morphological and grammatical rules helps AshMorph outperform the other systems tested by nearly 30% in most cases. The overall result can be considered a first step for MT or other tasks that may want to include the Ashaninka language. Here, we overcome the initial drawbacks of resistance from native speakers using previous work on a similar language that is spoken in a nearby region, Quechua.

We believe that the publicly-available rules and sub-segmentation system presented here, along with the reviewed corpora and texts, should be considered a principal step for low-resource languages, specifically Ashaninka. The results have shown that a more robust system with more resources could take advantage of AshMorph’s ability to separate words into their morphological and grammatical constructs by extending the system to match its needs.
The idea of using an initial Quechua-based normalization system for Ashaninka was based on several similarities in the two languages, mainly agglutination and polysemy. Apart from that, the possibility of finding Ashaninka reviewers that had some background in Quechua morphology was higher. That led us to believe that, by creating a system for segmentation based on morphological analysis and linguistic knowledge, we could achieve high performance, better than those used in most MT tasks. AshMorph is a first step in low-resource translation and could be used for other low-resource languages.

Input: amõyasatzi

         Morpheme (1)      Suffix Class (2)  Morphology Tag (3)  Essential Translation (4)
Output:  [=amonya/amõya]   [NRoot:CPB]                           [=Amonia.river]
         [-satsi+m.]       [NS:CPB]          [+CL:provenance]    [=human.provenance, …]

English: ‘from the Amonia river’

Table 2: Sample input and output from AshMorph that shows how tracing the system’s decision path can be helpful when developing finite-state transducer rules for input to a machine translation system.

        dev                                               test
        subword-nmt  morfessor  sentencepiece  AshMorph   subword-nmt  morfessor  sentencepiece  AshMorph
ACC     16.3 %       38.94 %    38.08 %        74.99 %    29.99 %      47.21 %    48.4 %         74.42 %
TP      82           131        135            305        668          923        907            1489
FP      152          172        131            179        1382         1270       1163           344
FN      321          265        261            91         1376         1121       1137           555

Table 3: A side-by-side comparison of accuracy (ACC), true positives (TP), false positives (FP) and false negatives (FN) for AshMorph and other modern sub-segmentation tools.
6 Future Work
We believe that the “sky is the limit” for
Ashaninka. We have somewhat overcome ini-
tial resistance by native speakers to help estab-
lish the first step of translating Ashaninka, sub-
segmentation and language tagging. With im-
provements of up to 30% in accuracy as a base-
line for our initial experiments, future work could
include more linguists and/or native Ashaninka
speakers as the margin for improvement stands
around 20% (our current highest accuracy score
in the test stage is 74.42%). There is one exper-
iment in particular for which resources (time and money) fell short: the training of an unsupervised segmentation technique over the entire set of OPUS sentences for which we do not have annotations. The experiment would eliminate the need to constrain the amount of text avail-
able for learning segmentation. We also believe
that by analyzing various combinations of hyperparameters for the statistical systems, we may achieve better results than those published. In our opinion, MT
systems can now easily use the output from Ash-
Morph to produce an initial translation set; or, at a
minimum, reproduce results from other languages
with extremely low resources as has been done re-
cently (Karakanta et al.,2018). We plan on first
using AshMorph to normalize Ashankinka text
as input to a rule-based MT system like Apertium
(Forcada et al.,2011) where AshMorph FST rules
could be compared and evaluated for performance.
The hope is to create translations that perform as
well as its neighboring language, Quechua, as pre-
sented in previous work (Rios,2015).
Acknowledgments
We would like to thank Liliana Fernández-Fabián
(Univ. Nacional Mayor de San Marcos) for lin-
guistic assistance. Also, Rubén Cushimariano Romano and Richer Sebastián Quinticuari, authors of the “Asháninka – Spanish Dictionary”.
Lastly, a special thanks to Elena Mihas for her re-
search on Perene Asheninka which helped us de-
velop our morphological analyzer.
References
Alexandre Bérard, Ioan Calapodescu, and Claude
Roux. 2019. Naver labs europe’s systems for the
wmt19 machine translation robustness task. arXiv
preprint arXiv:1907.06488.
Thorsten Brants, Wojciech Skut, and Hans Uszkoreit.
2003. Syntactic annotation of a german newspaper
corpus. In Treebanks, pages 73–87. Springer.
Gina Bustamante, Arturo Oncevay, and Roberto
Zariquiey. 2020. No data to crawl? monolingual
corpus creation from pdf files of truly low-resource
languages in peru. In Proceedings of The 12th Lan-
guage Resources and Evaluation Conference, pages
Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquer-
izo, and Luis Camacho. 2018. Siminchik: A speech
corpus for preservation of southern quechua. In Pro-
ceedings of the Eleventh International Conference
on Language Resources and Evaluation (LREC’18).
Rodolfo Cerrón-Palomino. 1994. Quechua sureño. dic-
cionario unificado. Biblioteca Básica Perúana, Bib-
lioteca Nacionál del Peru.
Christos Christodouloupoulos and Mark Steedman.
2015. A massively parallel corpus: the bible in
100 languages. Language resources and evaluation,
Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
Géraldine Walther, Ekaterina Vylomova, Arya D.
McCarthy, Katharina Kann, S. J. Mielke, Garrett
Nicolai, Miikka Silfverberg, David Yarowsky, Ja-
son Eisner, and Mans Hulden. 2018. The conll-
sigmorphon 2018 shared task: Universal morpho-
logical reinflection. CoRR, abs/1810.07125.
Rubén Cushimariano Romano and Richer C. Se-
bastián Q. 2008. Ñaantsipeta asháninkaki bi-
rakochaki. diccionario asháninka-castellano. ver-
sión preliminar.
publicaciones/diccionarios/. Visitado:
Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nord-
falk, Jim O’Regan, Sergio Ortiz-Rojas, Juan An-
tonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema
Ramírez-Sánchez, and Francis M Tyers. 2011.
Apertium: a free/open-source platform for rule-
based machine translation. Machine translation,
Daniel J Hintz and Diane M Hintz. 2017. The eviden-
tial category of mutual knowledge in quechua. Lin-
gua, 186:88–109.
Jaroslava Hlaváčová. 2017. Golden rule of morphology and variants of word forms. Jazykovedný časopis (Journal of Linguistics), 68(2):136–144.
Mans Hulden and Jerid Francom. 2012. Boosting
statistical tagger accuracy with simple rule-based
grammars. In LREC, pages 2114–2117.
Alina Karakanta, Jon Dehdari, and Josef van Genabith.
2018. Neural machine translation for low-resource
languages without parallel corpora. Machine Trans-
lation, 32(1-2):167–189.
Taku Kudo and John Richardson. 2018. Sentencepiece:
A simple and language independent subword tok-
enizer and detokenizer for neural text processing.
arXiv preprint arXiv:1808.06226.
Guido-Raúl Larico Uchamaco, Hugo David
Calderón Vilca, and Flor Cagniy Cárdenas Mariño.
2013. Incubation system machine translation
spanish to quechua, based on free and open source
platform apertium. CEPROSIMAD. Online
Chao-Hong Liu and Qun Liu. 2017.
Introduction to the shared tasks on cross-lingual
word segmentation and morpheme segmentation.
Proceedings of MLP, pages 71–74.
Lev Michael. 2014. The nanti reality status sys-
tem: Implications for the typological validity of
the realis/irrealis contrast. Linguistic Typology,
Elena Mihas. 2010. Essentials of Ashéninka Perené
Grammar. Ph.D. thesis, The University of Wisconsin-Milwaukee.
Graham Neubig and Junjie Hu. 2018. Rapid adapta-
tion of neural machine translation to new languages.
arXiv preprint arXiv:1808.04189.
John Ortega and Krishnan Pillaipakkamnatt. 2018. Us-
ing morphemes from agglutinative languages like
quechua and finnish to aid in low-resource transla-
tion. In Proceedings of the AMTA 2018 Workshop
on Technologies for MT of Low Resource Languages
(LoResMT 2018), pages 1–11.
David Lawrence Payne. 1981. The phonology and
morphology of Axininca Campa, volume 66. Sum-
mer Institute of Linguistics, Arlington, Texas.
José Pereira-Noriega, Rodolfo Mercado-Gonzales, An-
drés Melgar, Marco Sobrevilla-Cabezudo, and Ar-
turo Oncevay-Marcos. 2017. Ship-lemmatagger:
Building an nlp toolkit for a peruvian native lan-
guage. In International Conference on Text, Speech,
and Dialogue, pages 473–481. Springer.
Annette Rios. 2010. Applying finite-state techniques
to a native american language: Quechua. Institut für
Computerlinguistik, Universität Zürich.
Annette Rios. 2015. A basic language technology
toolkit for Quechua. Ph.D. thesis, University of Zurich.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2015. Neural machine translation of rare words with
subword units. arXiv preprint arXiv:1508.07909.
Peter Smit, Sami Virpioja, Stig-Arne Grönroos, Mikko
Kurimo, et al. 2014. Morfessor 2.0: Toolkit for sta-
tistical morphological segmentation. In The 14th
Conference of the European Chapter of the Associa-
tion for Computational Linguistics (EACL), Gothen-
burg, Sweden, April 26-30, 2014. Aalto University.
Stefano Varese. 2004. Salt of the mountain: Campa
Asháninka history and resistance in the Peruvian
jungle. University of Oklahoma Press.
Jānis Zuters and Gus Strazds. 2019. Subword seg-
mentation for machine translation based on grouping
words by potential roots. Baltic Journal of Modern
Computing, 7(4):500–509.
7 Supplemental Material
In this section we provide examples of suffixes
in Section 7.1 to help understand the Ashaninka
language. The examples presented are based on previous work on lemmas and suffixes (Hulden and Francom, 2012, p. 2115; Rios, 2010) and are nearly identical to the Quechua normalizer from that work, which produces human-readable labels based on a lexicon. In the second section (Section 7.2), a table is presented that shows the difference in alphabetical characters used as part of our normalization technique; it compares the alphabets of two authors, with the final normalization by Mihas (2010) serving as the character set used in this paper. Lastly, there is a finalized view of the AshMorph normalization process on a few words (amõyasatzi, amoyasatzi, amoñasatzi) which all end up being normalized to amonyasatzi.
7.1 Suffix Examples
In this section, we present several tagging exam-
ples to help understand the strategy of normaliza-
tion in Ashaninka. These are provided with English and Spanish counterparts to show the differences that one would encounter when attempting to normalize these words into Ashaninka’s neighboring high-resource language, Spanish.
7.1.1 Verbal root:
(EN: have/feel.hungry)]" : {tash}
7.1.2 Verbal root:
| "[=ameet+INF@aantsi.][VRoot][’, to.shave
(ES: cortar.(el.pelo), afeitar)]" : {ameet}
7.1.3 Verbal suffix:
| "[--][-eNpa~empaa+tns@fut.+C.A@mid.v.+RSS@00000]
[MODALITY][+IRR]" : "@EP"[{empaa}]
| "[=kishi+gndr@n.m.+inal@tsi.+mrph.phon@LEN:^k^Ø.][NRoot]
[=hair (ES: cabello, pelo; PT: pelo; QU: chukcha)]" : {kishi}
7.1.4 Nominal suffix:
| "[--][-satsi+gndr@m.][NS:CPB][+CL:provenance][=human.provenance;
inhabitant.of.swh., one.who.dwells.swh.]" : {satsi}
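Lexicon entries like the nominal root in Section 7.1.3 carry cross-lingual glosses (ES, PT, QU) in parentheses, which is what makes the cross-lingual lexicon usable for translation. A hedged sketch of pulling those glosses out (`cross_lingual_glosses` is our own helper, and the entry string is copied from the example above, not from the actual lexicon file format):

```python
import re

# Nominal root entry copied from the example in Section 7.1.3
ENTRY = ('[=kishi+gndr@n.m.+inal@tsi.+mrph.phon@LEN:^k^Ø.][NRoot]'
         '[=hair (ES: cabello, pelo; PT: pelo; QU: chukcha)]')

def cross_lingual_glosses(entry: str) -> dict:
    """Collect language-tagged glosses (ES/PT/QU) from a lexicon entry."""
    return {lang: gloss.strip()
            for lang, gloss in re.findall(r"(ES|PT|QU):\s*([^;)]+)", entry)}

glosses = cross_lingual_glosses(ENTRY)
print(glosses)  # {'ES': 'cabello, pelo', 'PT': 'pelo', 'QU': 'chukcha'}
```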
7.2 Alphabet Comparison
Here, we present a comparison of two main alpha-
bets that have been derived from various written
Ashaninka dialects. The final normalization (Mi-
has,2010) has made it easier for AshMorph to in-
corporate Ashananika’s dialects into one language
for normalization purposes.
Phoneme   Payne (1981)       Mihas (2010)
tʲ        tj, č              ty
ʃ         s+Vi, ç+Va,i,o     sh
ts        c                  tz
w         w, w̄               w (or …)
N         N                  n, m (or n)

Table 4: The first found alphabetical list of Pan-Ashaninka phonemes (Payne, 1981) along with the most current (Mihas, 2010).
7.3 Input Normalization
In this section, we provide a visualization of how
a word that is written in various ways due to
Ashaninka’s complexity is normalized into one
word. These normalization rules are part of the
finite state transducer (FST) model introduced in
this paper.
PRIMARY INPUT: amõyasatzi, amoyasatzi, amoñasatzi
        |  (Normalization rules)
        v
NORMALIZED INPUT: amonyasatzi
        |  (Ashaninka FST model)
        v
OUTPUT: [=amonya+gndr@n.m.+sem@plant][NRoot][=plant.sp. (ES:][--][-satzi+gndr@m.][NS][+CL:provenance][=human.provenance; inhabitant.of.swh., one.who.dwells.swh.]

Figure 2: Normalization in the LOOKUP of the Ashaninka FST model which illustrates distinct input forms of a word being first converted into their normalized input and then into their resulting output.
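The collapsing of variant spellings shown above can be imitated with a few string-rewrite rules. The rule list below is a toy invention of ours covering only the three example words; AshMorph's real normalization is implemented as FST rules, not Python string replacement:

```python
# Toy rewrite rules (our own illustration, not AshMorph's actual rule set)
RULES = [
    ("õya", "onya"),  # nasalized-vowel spelling
    ("oña", "onya"),  # ñ spelling
    ("oya", "onya"),  # plain spelling
]

def normalize(word: str) -> str:
    """Apply the first matching rewrite rule, mimicking variant collapse."""
    for pattern, replacement in RULES:
        if pattern in word:
            return word.replace(pattern, replacement)
    return word

for w in ("amõyasatzi", "amoyasatzi", "amoñasatzi"):
    print(w, "->", normalize(w))  # all three map to amonyasatzi
```

In the real system the analogous rewrites run inside the FST's LOOKUP stage, so normalization and morphological analysis happen in one composed transducer.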