
Allin Qillqay! A Free Online Web Spell Checking Service for Quechua


Richard A. Castro Mamani (1), Annette Rios Gonzales (2)
(1) Computer Science Department, Universidad Nacional de San Antonio Abad del Cuzco
(2) Institute of Computational Linguistics, University of Zurich
Abstract: In this paper we analyze the advantages and disadvantages of porting the currently available spell checking
technologies in their primary form (that is, without speed and efficiency improvements) to the Internet in the form of
Web services, taking the existing Quechua spell checkers as a case study. For this purpose we used the CKEditor, a
well-known HTML text processor, and its spell-check-as-you-type (SCAYT) add-on on the client side. Furthermore, we
built our own compatible server side application called "Allin Qillqay!" ('Correct Writing/Spelling!').
Key words: spellchecker traffic, spell checking parameters, HTML Editor, Quechua.
1 Introduction
This is a paper about the current spell-checking
technologies and is based on two premises. The first is
that the Internet is becoming increasingly important
(The Mozilla Manifesto). During the
past few years, several new JavaScript applications have
appeared that provide the user with functionalities on the
web comparable to desktop programs. One of the main
reasons behind this development is that the slow page
requests every time a user interacts with a web application
are gone, as JavaScript engines are now sufficiently
powerful to keep part of the processing on the client side.
Among the most well-known rich JavaScript productivity
applications figure iWork for iCloud and, more
recently, Microsoft Office 365, Google Docs, GMail
and also the CKEditor, a free open source HTML text
editor which brings common word processor features
directly to web pages.
Most of the web applications listed above are constantly
being enhanced with new features, yet some important
features, such as spell checking, have been neglected or
are not integrated as web services but instead depend
heavily on the web browser language configuration, or the
spell checking plug-ins installed.
In 2012, we decided to implement our own productivity
application using HTML, JavaScript and the state-of-the-art
spell checking technology available for Quechua. The
goal was to create an application with a user friendly
interface similar to what users can expect from desktop
applications. The integration of the spell checkers into the
web application provides a comfortable and easy way to
test the quality of the Quechua spelling correction.
Note: The term productivity software or productivity
application refers to programs used to create or modify a
document, image, audio or video clip.
The outline of this paper is as follows: Section 2 presents
the basic concepts in spell checking. In section 3 we
describe related work regarding the advancements in the
field of on-line spell checking. Section 4 gives a general
overview of the Quechua language family. Section 5 lists
all the publicly available spell checkers for Quechua. The
overall description of the system is given in section 6, and
in section 7 we describe some of the recent
experiments and improvements.
2 Spell Checking
Liang [Liang2009] describes the overall spell checking
task as follows: given some text
encoded by some encoding (such as ASCII, UNICODE,
etc.), identify the words that are valid in some language,
as well as the words that are invalid in the language (i.e.
misspelled words) and, in that case, suggest one or more
corrections.
The spell checking process can generally be divided into
three steps (see Figure 1):
Figure 1: Spell checking process.
2.1 Error Detection
Error detection is a crucial task in spelling correction. In
order to detect invalid words, the spell checker usually
performs some kind of a dictionary lookup. There are
three main formats for machine readable dictionaries used
in spelling correction:
1. a list of fully-fledged word forms
2. a separate word (.dic) file and an affix (.aff) file
3. a data structure called 'finite state transducer' that
encodes the morphology of the language
(i.e. the rules of word formation). This approach
is generally used in spelling correction for
languages with complex morphology, where one
word (or root) may appear in thousands of
different word forms, such as Quechua. As an
illustration of Quechua word formation, see
Example 1 with the parts (i.e. morphemes)
contained in the Quechua word
ñaqch’aykuchkarqaykiñachum.
(1) ñaqch’a -yku -chka -rqa -yki -ña -chu -m
    comb +Aff +Prog +Pst +1.Sg.Subj_2.Sg.Obj +Disc +Intr +DirE
    'Was I already combing you?'
Further details about finite state transducers applied to
spell checking are omitted here for reasons of space
and can be found in Beesley & Karttunen [BeesleyKarttunen03].
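The difference between the first two dictionary formats can be sketched in a few lines of Python. This is only a toy illustration: the word list, roots and suffixes are invented for the example, and the suffix stripping is a rough stand-in for what a real .dic/.aff pair does, not an implementation of Hunspell or MySpell.

```python
# Toy data, invented for illustration; real dictionaries are far larger.
WORD_LIST = {"rimani", "rimanki", "wasi"}
ROOTS = {"rima", "wasi"}
SUFFIXES = ["chka", "nki", "ni", "kuna"]


def in_word_list(word: str) -> bool:
    """Format 1: the word must appear verbatim in a list of full forms."""
    return word.lower() in WORD_LIST


def in_root_plus_affixes(word: str) -> bool:
    """Format 2 (simplified): strip known suffixes until a root remains."""
    word = word.lower()
    if word in ROOTS:
        return True
    return any(
        word.endswith(suffix) and in_root_plus_affixes(word[: -len(suffix)])
        for suffix in SUFFIXES
    )


def detect_errors(words, is_valid):
    """Return the words that the given lookup function does not accept."""
    return [w for w in words if not is_valid(w)]
```

With the full-form list, a long but regular form such as rimachkanki is flagged as an error unless it was listed explicitly, whereas the root-plus-suffix lookup accepts it. The toy suffix stripping also accepts impossible combinations, which is one reason why a transducer that models the permitted morpheme order is preferable for Quechua.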
2.2 Error Correction
There are two main approaches for the correction of
misspelled words: isolated-word error correction or
context-dependent error correction. With the former
approach, each word is treated separately disregarding the
context, whereas with the latter approach, the textual
context of a word is taken into consideration as well.
The Error Model produces the list of suggestions for a given
misspelling, using different algorithms and strategies
depending on the characteristics of the misspelled word.
A Typo is a small mistake in a typed or printed text.
A Real Word Error is an error that accidentally results
in a valid word which is not the word intended in the sentence.
Only a context-dependent corrector can correct real-word
errors, as the isolated-word approach will not detect this
kind of mistake.
2.3 Suggestion Ranking
Ranking is the ordering of suggested corrections
according to the likelihood that the suggestion is the
originally intended word.
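Isolated-word correction followed by ranking can be sketched as below. The sketch uses the similarity ratio from Python's standard difflib module for ranking, which is a different measure from the minimum edit distance used later in this paper, and the lexicon is a hypothetical five-word list chosen for the example.

```python
import difflib

# Hypothetical mini-lexicon; a real spell checker would query a full
# dictionary or a finite-state transducer instead.
LEXICON = ["rimani", "rimanki", "riman", "wasi", "wasikuna"]


def suggest(misspelling: str, limit: int = 3) -> list[str]:
    """Isolated-word correction: candidates come from the lexicon only and
    are ranked by difflib's similarity ratio, most similar first."""
    return difflib.get_close_matches(misspelling, LEXICON, n=limit, cutoff=0.6)
```

Because only the misspelled word itself is inspected, a real-word error would pass through this function unnoticed; catching it would require looking at the surrounding words.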
3 Related Work
In recent years there have been some advancements
regarding online spell checking, mainly the incorporation
of spell-check-as-you-type (SCAYT) technology, which allows
users to have a much more responsive and natural
experience. SCAYT is based purely on JavaScript and
asynchronous requests from the client to the server.
It is not uncommon for a spell checker to start as a web
application and then move to the more traditional desktop
version. Dembitz et al. [Dembitz2011] developed
Hascheck, an online spellchecker for Croatian, an under-resourced
language with a relatively rich morphology
which is spoken by approximately 4.5 million people in
Croatia. The dictionary used for this system is a list of
fully-fledged word forms. What sets this spell checker
apart from others is its ability to learn from the texts it
spellchecks. With this approach they achieve a quality
comparable to English spell checkers. As a consequence,
Hascheck was crucial during the development of other
applications for NLP tasks.
Abbreviations used in Example 1: +Aff: affective, +Prog: progressive,
+Pst: past, Sg: singular, Obj: object, +Disc:
discontinuative ('already'), +Intr: interrogative, DirE:
direct evidentiality
Hulden et al. [Hulden2013] developed jsft, a free open-source
JavaScript library which provides means to access
finite-state machines. This API is used to build a spell
checking dictionary on the client side of a web application,
with good results. Although we did not use this API
as part of the current version of our system, we believe
that jsft is clearly a very important development in the
evolution of spell-checking on the web.
WebSpellChecker is a non-free spell checking service
for a wide range of languages; it can be integrated in the
form of a plug-in into the major open-source HTML text
editors. WebSpellChecker was used as a model for our
project, although ours is open-source and freely available.
4 Quechua
Quechua [Rios2011] is a language family spoken in the
Andes by 8-10 million people in Peru, Bolivia, Ecuador,
Southern Colombia and the North-West of Argentina.
Although Quechua is often referred to as a language and
its local varieties as dialects, Quechua is a language
family, comparable in depth to the Romance or Slavic
languages [AdelaarMuysken04, 168]. Mutual
intelligibility, especially between speakers of distant
our experiments are designed for different Quechua
5 A case study: Quechua Spell Checkers
When it comes to building a spell checker, Hunspell
and MySpell are the most well-known technologies.
Nevertheless, these formalisms have serious
disadvantages concerning the suggestion quality for
morphologically complex agglutinative languages such as
Quechua. In order to overcome the problems of Hunspell,
several spell checkers for agglutinative languages rely on
finite-state methods, as these are better suited to capture
complex word formation strategies. An example of such a
finite-state spelling corrector is part of Voikko
for Finnish. The Quechua spell checkers used in our
experiments also make use of this approach.
These are the spell checkers used in our web application:
- Cuzco Quechua spell checker (3 vowels),
  implemented with the Foma Toolkit [Rios2011].
  The orthography used as standard in this
  corrector adheres to the local Cuzco dialect. We
  will use the abbreviation cuz_simple_foma to
  refer to this spell check engine.
- Normalized Southern Quechua spell checker,
  implemented in Foma as well. The orthography
  in this spell checker is the official writing
  standard in Peru and Bolivia, as proposed by
  the Peruvian linguist R. Cerrón Palomino
  [Cerrón-Palomino94]. We will use the
  abbreviation uni_simple_foma for this spell
  checker.
- Southern Unified Quechua, with an extended
  Spanish lexicon and a large set of correction
  rules. This spelling corrector is also implemented
  in Foma, and it uses the same orthography as
  uni_simple_foma. The Spanish lexicon permits
  the correction of loan words consisting of a
  Spanish root combined with Quechua suffixes.
  The additional set of rules, on the other hand,
  rewrites common spelling errors directly to the
  correct form. By this procedure, the quality of
  the suggestions improves considerably. We will
  use the abbreviation uni_extended_foma to
  refer to this spell checker.
- Bolivian Quechua spell checker (5 vowels) by
  Amos Batto, built using MySpell. In the
  following we will use the abbreviation
  bol_myspell for this spell checker.
- Ecuadorian Unified Kichwa (from Spanish,
  Kichwa Ecuatoriano Unificado) spell checker,
  implemented in Hunspell by Arno Teigseth. We
  will use the abbreviation ec_hunspell for this
  spell checker.
6 Our spell checking web service: Allin Qillqay!
The system is an on-line spell checking service which
offers a demo version of all the different spell checkers
for Quechua that have been built so far in a user friendly
HTML text editor. The system operates interactively,
preserving the original formatting of the document that
the user is proofreading. The most important advantage of
online spell checking lies in the community of users (see
Figure 2). Unlike conventional spell checking in a
desktop environment, where the user-application relation
is one-to-one, in on-line spell checking there is a many-to-one
relation. This circumstance has been beneficial for
the enhancement of the spell checker dictionary: unlike
the user-defined customized dictionary in a desktop
program, which stores the false positives of only one
user, all of the false positives that occur in on-line spell
checking are stored in a single dictionary and thus benefit
the entire community. Hence, our on-line spell checking
service is constantly improving its functionality through
interaction with the community of users.
Note on the official writing standard in Peru and Bolivia (Section 5):
There is one small difference: Bolivia uses the letter
<j> to write /h/, whereas Peru uses <h>, e.g. Peru: hatun
vs. Bolivia: jatun.
Note: A false positive refers to a word that is correctly
spelled, but unknown to the spell checker. In this case, the
user can add the word to the dictionary.
6.1 Client side application
This section describes the different resources we use for
the client side of the web service and how they interact
with each other.
6.1.1 CKEditor
The CKEditor is an open source HTML text editor
designed to simplify web content creation. This program
is a WYSIWYG editor that brings common word
processor features to web pages.
6.1.2 Dojo Toolkit
The Dojo Toolkit is an open-source JavaScript library
used for rapid development of robust, scalable, rich web
projects and fast applications across diverse browsers. It
is dual licensed under the BSD and AFL licenses.
6.1.3 SpellCheckAsYouType (SCAYT) Plug-in
The Spell Check As You Type (SCAYT) plug-in for the
CKEditor is implemented using the Dojo Toolkit
JavaScript libraries. By default it only provides access to
the spell checking web services of WebSpellChecker.
The SCAYT product allows users to see and correct
misspellings while typing; the misspelled words are
underlined. If a user right-clicks one of those underlined
words, a list of suggestions to replace
the word is offered, see Figure 3. Furthermore, SCAYT allows the
creation of custom user dictionaries. SCAYT is available
as a plug-in for CKEditor, FCKEditor and TinyMCE. The
plug-in is compatible with the latest versions of Internet
Explorer, Firefox, Chrome and Safari, but not with the
Opera Browser.
Figure 3: SCAYT working with our spell checking web
service and the cuz_simple_foma spell-checking engine
in the same manner as it works with the default service.
Note: WYSIWYG stands for What You See Is What You Get.
6.1.4 Client Side Pipeline
The encoding used throughout the processing chain is
UTF-8, since the Quechua alphabet contains non-ASCII
characters.
CS01 Submitting tokenized text:
The tokenization process is done entirely by the SCAYT
plug-in. For instance, if the original text written inside the
CKEditor textbox is:
Kuraq runaqa erqekunan karanku chaypaspisillan
The data submitted to our web server is the tokenized
input text:
Kuraq, runaqa, erqekunan, karanku, chaypaspisillan,
Note that each word is separated by a comma and none of
the format properties such as Bold or Italic are sent to the
server. There are, however, other parameters that can be
included in the data sent to the server, such as the
language, the type of operation, or whether or not the
word should be added to the user dictionary.
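The tokenization step can be illustrated with the following Python sketch. It is not the SCAYT code, which is written in JavaScript; the function name and the exact separator format are assumptions made for the example. It only shows the idea of dropping the markup and joining the remaining words with commas.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops tags such as <b> or <i>."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize_for_server(editor_html: str) -> str:
    """Hypothetical helper: strip formatting and return comma-separated words."""
    parser = _TextOnly()
    parser.feed(editor_html)
    parser.close()
    # Simplification: text nodes are joined with a space, so a word that is
    # only partly bold would be split in two.
    text = " ".join(parser.chunks)
    # Unicode letters, optionally with word-internal apostrophes (e.g. ñaqch'a).
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)
    return ",".join(words)
```

Applied to the example sentence above, with some of its words marked as bold or italic, the function returns the five words separated by commas and without any trace of the formatting.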
CS02 CGI program's response:
JSON is the format of the response data from our CGI
program. The response data is processed and rendered by the
SCAYT plug-in.
6.2 Server side application
In summary, the server side implementation is an
interface which interacts with the spell checkers for
Quechua, as well as with the user dictionary and the error
corpora, see Figure 4.
We developed a server side application that is compatible
with the SCAYT add-on. The application makes it
possible to use state-of-the-art spell checking software,
such as finite-state transducers, in a web service. Our web
server runs on a Linux Ubuntu Server 12.04 x64 operating
system. We did not test our server-side application on a server
running Windows Server.
The user dictionary is stored in a MySQL
database. More specific information (classification, type,
language, language variety, etc.) concerning misspellings
and unknown words is stored in an object oriented
database, in XML format, using BaseX.
Figure 2: Online Web Spell checking Client/Server CGI System Diagram; every step is explained in section 6
(notice the codes CS** for the client side and SS** for the server side with their corresponding step numbers).
Figure 4: Server Side Application: This robustness
diagram is a simplified version of the
communication/collaboration between the entities of our
system.
6.2.1 Server side pipeline
SS01 Call CGI:
The web server calls and uses CGI as an interface to the
programs that generate the spell-checking responses.
SS02 Spell check input terms:
The CGI program splits the comma separated words, and
checks their correctness using the corresponding spell
checker back end.
If a word is not recognized as correct, a request for
suggestions is sent to the spell checking back end and the
received suggestions are then included into the JSON
response string.
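Step SS02 can be sketched as follows. The back end is replaced here by two invented lookup tables, and the JSON field names ("incorrect", "word", "suggestions") are hypothetical; the schema that SCAYT actually expects is not described in this paper.

```python
import json

# Stand-ins for a real back end (foma, Hunspell or MySpell); invented data.
TOY_LEXICON = {"kuraq", "runaqa", "karanku"}
TOY_SUGGESTIONS = {"erqekunan": ["irqikunan"]}


def check_word(word: str) -> bool:
    """Ask the (toy) back end whether the word is spelled correctly."""
    return word.lower() in TOY_LEXICON


def suggest(word: str) -> list[str]:
    """Ask the (toy) back end for suggestions for a rejected word."""
    return TOY_SUGGESTIONS.get(word.lower(), [])


def build_response(comma_separated: str) -> str:
    """Split the submitted terms, check each one, and collect suggestions
    for the rejected ones in a JSON string."""
    words = [w.strip() for w in comma_separated.split(",") if w.strip()]
    incorrect = [
        {"word": w, "suggestions": suggest(w)}
        for w in words
        if not check_word(w)
    ]
    return json.dumps({"incorrect": incorrect}, ensure_ascii=False)
```

Suggestions are requested only for the words that fail the check, which keeps the number of expensive back end calls proportional to the number of errors and not to the length of the text.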
We used two different approaches for the interaction with
the spell checking back end:
The first approach consisted in a re-implementation of
foma's flookup for the processing chain of the different
finite state transducers used for spell checking in
uni_simple_foma. This module can process text in batch
mode, but it has to load the finite state transducers into the
memory with every new call. As the finite state
transducers, especially with the improved version
uni_extended_foma, are quite large, loading those
transducers takes a few seconds, which in turn makes the
text editing through CKEditor noticeably slower.
For this reason, we implemented a TCP server-client back
end for spell checking: the server loads the finite state
transducers into memory at start-up and can later be
accessed through the client. As the transducers are already
loaded, the response time is much shorter, see Section 7.3.
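The idea behind the TCP back end, paying the loading cost once and then answering many short requests, can be sketched with Python's standard socketserver module. The line-based protocol ("OK" / "UNKNOWN"), the function names and the toy lexicon are assumptions for this example; the actual implementation described here wraps foma transducers and is not reproduced.

```python
import socket
import socketserver
import threading


def load_backend():
    """Placeholder for the expensive step (loading the transducers)."""
    return {"rimani", "rimanki"}  # invented toy lexicon


class LookupHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Hypothetical protocol: one word per line in, one verdict per line out.
        for raw in self.rfile:
            word = raw.decode("utf-8").strip()
            verdict = "OK" if word in self.server.backend else "UNKNOWN"
            self.wfile.write((verdict + "\n").encode("utf-8"))


class LookupServer(socketserver.ThreadingTCPServer):
    daemon_threads = True
    allow_reuse_address = True


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = LookupServer((host, port), LookupHandler)
    server.backend = load_backend()  # loaded once, at start-up
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def lookup(address, word):
    """Client side: send one word and read back the verdict."""
    with socket.create_connection(address) as conn, \
            conn.makefile("r", encoding="utf-8") as reader:
        conn.sendall((word + "\n").encode("utf-8"))
        return reader.readline().strip()
```

A CGI program would call something like lookup(server.server_address, word) for each request, so that only the short-lived client is started per request while the loaded data stays in the long-running server process.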
SS03 Save relevant data into the database:
The misspellings are saved in our MySQL database in the
form of custom user dictionaries and a list of incorrect
terms to be analyzed. More information about the
misspellings is saved in our
XML object oriented database in BaseX, since these
misspellings will form our error corpus.
Two linguists from the UNMSM are currently analyzing
and categorizing those misspellings according to the type
of error. This information will be used as feedback to
improve our spell-checking engines (lexicons, suggestion
7 Experiments and Results
7.1 Evaluating Suggestion Accuracy from
each Spell Checking Engine
In this section, we present a comparison between the
different approaches used in the spell checking back end,
and we hope to answer the following question: Does
finite-state spell checking with foma give more reliable
suggestions than MySpell and HunSpell for an
agglutinative language?
Our online application makes it possible to group all the
available spell checking engines in one place, which in
turn allows for an easy comparison.
7.1.1 Minimum Edit Distance as a Metric for
Spell Checking Suggestion Quality
We used the Natural Language Toolkit (NLTK), a
publicly available software package, to calculate the edit distance;
we chose this toolkit so that the tests we present here are easy
to replicate. Suggestion Edit Rate (SER) reports the ratio of the
number of edits incurred to the total number of characters
in the reference word.
Misspelled term:
Suggestions by uni_simple_foma (number of edits):
- Rimacharankiraqchusina (0.04)
- rimacharankiraqchusina (0.08)
- rimacharankiraqchusuna (0.13)
- Rimacharankiraqchusuna (0.08)
- rimacharankitaqchusina (0.13)
- Rimacharankitaqchusina (0.08)
Reference word:
7.1.2 Evaluation of Spell Checkers using
Minimum Edit Distance
Table 1 contains the suggestions produced by
uni_simple_foma, ec_hunspell and bol_myspell.
The first column of Table 1 contains the word forms for
testing, taken from Paredes-Cusi [Paredes-Cusi2009]. All
of these words have the same root (rima-),
which is a highly used word across dialects and is
contained in the lexicons of each of the spell checking engines
we presented in Section 5. The test words are written in
the standard proposed by the AMLQ, and are spelled
correctly according to the cuz_simple_foma spell
checking engine.
The columns on the right contain the suggestions
provided by the spell checking engines
uni_simple_foma, ec_hunspell and bol_myspell. Note that
we additionally provide the SER value for each
suggestion, and we signal a correct suggestion by
pointing it out with an expected flag and by
highlighting it; otherwise, we do not signal anything.
Note: UNMSM stands for Universidad Nacional Mayor de San Marcos.
Note: AMLQ stands for Academia Mayor de la Lengua Quechua in Cusco.
A glance at the suggestions by ec_hunspell and
bol_myspell reveals that the quality varies according to
the complexity of the word: the more suffixes the
misspelled word has, the less adequate and more distorted
the suggestions become (see Table 1).
The suggestions offered by each one of the spell checker
engines, especially by Ecuadorian Kichwa and Bolivian
Quechua do not cope adequately with the rich
morphology of this language, as some of their suggestions
do not even share the same root as the misspellings in the
first column of Table 1.
Table 1: Comparing the suggestions from each spell checker engine.
Misspelled Term
(Cuzco Quechua)
Suggestions with their corresponding SER values
Ecuadorian Kichwa
Hunspell (ec_hunspell)
Bolivian Quechua
MySpell (bol_myspell)
Rimashkani (0.18),
Imashinashi (0.55)
Imashashunchik (0.57)
Rimashawanki (0.08),
Khashkarimunki (0.54),
Kimsancharinki (0.38),
Rimarichinki (0.54),
Rankhayarimunki (0.69)
Imashashunchik (0.57)
Kimsancharin (0.47),
Rankhayarin (0.47)
Imashashunchik (0.71)
Rankhayarimuy (0.63)
Imashashunchik (0.86)
Marankiru (0.56)
Imashashunchik (0.86)
Charancharimuychu (0.63)
Imashashunchik (1)
Wariwiraqocharunasina (0.61)
Suggestion Edit Rate (SER) measures the amount of
editing that a human would have to perform to change a
system output (a spell checking suggestion) so it exactly
matches a reference word. We calculated the value using
equation 2.
$$\mathrm{SER} = \frac{\mathrm{edit\_distance}(\mathit{original}, \mathit{suggestion})}{\mathrm{length}(\mathit{expected})} \qquad (2)$$
Where original is the word to be spell checked,
suggestion is the output from the spell checking engine
and expected is the referenced word.
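Equation 2, read as the edit distance between the original word and the suggestion divided by the length of the expected word, can be sketched in Python as follows. The paper computed edit distances with NLTK; the sketch substitutes a small hand-written Levenshtein function so that it runs without extra packages, and the function names are chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def suggestion_edit_rate(original: str, suggestion: str, expected: str) -> float:
    """SER as in equation 2: edits between the original word and the
    suggestion, divided by the length of the expected word."""
    return edit_distance(original, suggestion) / len(expected)
```

As an illustrative case that is not taken from Table 1, the original form rimashani with the suggestion rimachani and the expected form rimachkani gives one edit over ten characters, that is, an SER of 0.1. The comparison is case-sensitive, so a suggestion that differs only in capitalization already counts as one edit.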
In Figure 5 we present SER values for each misspelling
(we calculated the average SER value when there is
more than one suggestion). If the quality of the suggestions
is good, the SER value ought to be low; otherwise it is
high. It becomes evident that the quality of the suggestions
by ec_hunspell and bol_myspell (Hunspell and MySpell
respectively) is poor, because they do not cope well with
complex words.
Figure 5: Suggestion Edit Rate.
7.2 Improving (Error Model) spell
checking quality
7.2.1 Improving Spell Checking Suggestion
The misspelled morpheme in the test words (rimasha-) in
Table 1 is the suffix -sha, which should be spelled -chka in
the unified standard. The edit distance between sha and
chka is 2 (substitute s with c, insert k). As the spell
checker uni_simple_foma relies on Minimum Edit
Distance as the only error metric, it will first suggest
Quechua words with a smaller edit distance, e.g. with the
suffixes -sqa or -cha (edit distance to -sha is 1).
From the results in Table 1 it becomes clear that using
edit distance as the only algorithm to find the correct
suggestions is not good enough. For this reason, we built
the improved version of the spell checker
uni_extended_foma: This back end uses several
cascaded finite state transducers that employ a set of
rewrite rules to produce more useful suggestions. For
instance, the suffix -sha will be rewritten to the
corresponding form in the standard, -chka. Furthermore,
we included a Spanish lexicon of nouns/adjectives and
verbs into the spell checker. This allows the correction of
words with Spanish roots and Quechua suffixes (very
frequent in Quechua texts).
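The effect of such a rewrite rule can be illustrated with a regular-expression stand-in. This is not the cascade of finite state transducers used in uni_extended_foma: only the -sha to -chka rule comes from the text, while the rule format, the toy lexicon and the function names are assumptions for the example.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs that map frequent
# non-standard spellings to the unified standard.
REWRITE_RULES = [
    (re.compile(r"sha"), "chka"),  # progressive suffix, as described above
]

# Invented toy lexicon of standard forms.
TOY_LEXICON = {"rimachkani", "rimachkanki"}


def rewrite(word: str) -> str:
    """Apply every rewrite rule in order."""
    for pattern, replacement in REWRITE_RULES:
        word = pattern.sub(replacement, word)
    return word


def suggest_with_rules(misspelling: str) -> list[str]:
    """Offer the rewritten form if it is a known standard word."""
    candidate = rewrite(misspelling.lower())
    return [candidate] if candidate in TOY_LEXICON else []
```

The rewritten form is proposed directly, even though its edit distance to the input is larger than that of competing forms with -sqa or -cha. A real rule would be restricted to the suffix position, which the transducer can express and this plain substitution cannot.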
Table 2 illustrates the quality of the suggestions with this
improved approach. The results are encouraging, as the SER
values are low (see Figure 6) and much better than those of
its counterparts (see Figure 7).
Moreover, uni_extended_foma presents us with the
correct alternatives for every test word (see Table 2).
Table 2: The suggestions and SER values provided by
uni_extended_foma.
Term (Cuzco Quechua)
Figure 6: Graphical interpretation of the SER values for
the suggestions provided by uni_extended_foma.
Figure 7: Suggestion Edit Rate for uni_extended_foma
in contrast with the others.
Note: The Spanish lexicon has been built with part of
FreeLing, an open source library for language processing.
7.3 Improving CGI Program's Speed
The first implementation of our application (see Section
6.2.1, SS02: Spell check input terms) was fast enough for
the web service: the lookup tool could load the spell
checker, consisting of only one transducer of
approximately 2MB, very quickly.
However, this is not the case for the cascaded transducers
of the improved version uni_extended_foma, for which
the same lookup takes 40 to 60 seconds for a group of 6 to
10 words. This results in an unacceptably slow web
application. In order to overcome the slow response with
the extended spell checker, we re-implemented the lookup
module as a TCP server-client application.
We measured the time with both approaches on our server
for a single word. The standard lookup took 4.434
seconds, whereas with the TCP server-client, the lookup
took only 0.021 seconds.
Figure 8: Speed response (measured in seconds)
comparison between the two implementations Command
Line - Batch Mode and TCP Server.
As illustrated in Figure 8, the response time of the TCP
service is 0.021 seconds, as compared to 4.434 seconds
with the regular lookup. Using the TCP server-client thus
solves the problem for the web service.
8 Conclusions and Future Work
We integrated existing spell checkers for Quechua into an
easy to use web application with functionalities
comparable to a desktop program. Furthermore, we
improved the spell checker back end by using a more
fine-grained set of rules to predict the correct suggestion
for a given word form.
Additionally, we implemented a TCP server-client lookup
for finite state transducers written in Foma, in order to
mitigate the slow response time of the enhanced spell
checker.
In order to further improve our spell checker, we collect
the unknown words from the web service in an error
corpus, which gives us an indication for missing lexicon
entries or missing morpheme combinations.
The Foma spell checkers described in this paper are
already available as plug-ins to OpenOffice and
LibreOffice, and we are currently working on a version
for MS Office programs.
References
[AdelaarMuysken04] Adelaar, W. F. H. and Muysken, P.
(2004). The Languages of the Andes. Cambridge
Language Surveys. Cambridge University Press.
[BeesleyKarttunen03] Beesley, K. R. and Karttunen, L.
(2003). Finite-state morphology: Xerox tools and
techniques. CSLI, Stanford.
[Cerrón-Palomino94] Cerrón-Palomino, R. (1994).
Quechua sureño, diccionario unificado quechua-castellano,
castellano-quechua. Biblioteca Nacional
del Perú, Lima.
[Dembitz2011] Dembitz, Š., Randić, M., and Gledec, G.
(2011). Advantages of online spellchecking: a
Croatian example. Software: Practice and Experience.
[Hulden2013] Hulden, M., Silfverberg, M., and Francom,
J. (2013). Finite state applications with Javascript. In
Proceedings of the 19th Nordic Conference of
Computational Linguistics (NODALIDA 2013);
Linköping Electronic Conference Proceedings,
volume 85, pages 441-446.
[Liang2009] Liang, H. L. (2009). Spell checkers and
correctors: a unified treatment. Master's thesis,
Universiteit van Pretoria.
[MacCaw2011] MacCaw, A. (2011). JavaScript Web
Applications. O'Reilly Media, Inc.
[Paredes-Cusi2009] Paredes Cusi, B. (2009). Qheswa
Simi, Lengua Quechua. Editorial Pantigozo, Cuzco,
Perú.
[Rios2011] Rios, A. (2011). Spell checking an
agglutinative language: Quechua. In Proceedings of
the 5th Language and Technology Conference:
Human Language Technologies as a Challenge for
Computer Science and Linguistics, Poznań, Poland.