Rules Ruling Neural Networks – Neural vs. Rule-Based Grammar Checking for a Low Resource Language

Proceedings of Recent Advances in Natural Language Processing, pages 1526–1535
Sep 1–3, 2021.
https://doi.org/10.26615/978-954-452-072-4_171
Linda Wiechetek
UiT Norgga árktalaš
universitehta
Norway
Flammie A Pirinen
UiT Norgga árktalaš
universitehta
Norway
Mika Hämäläinen
University of Helsinki,
Rootroo Ltd
Finland
Chiara Argese
UiT Norgga árktalaš
universitehta
Norway
Abstract
We investigate both rule-based and machine
learning methods for the task of compound er-
ror correction and evaluate their efficiency for
North Sámi, a low resource language. The lack
of error-free data needed for a neural approach
is a challenge to the development of these
tools, which is not shared by bigger languages.
In order to compensate for that, we used a rule-
based grammar checker to remove erroneous
sentences and insert compound errors by split-
ting correct compounds. We describe how we
set up the error detection rules, and how we
train a bi-RNN based neural network. The pre-
cision of the rule-based model tested on a cor-
pus with real errors (81.0%) is slightly better
than the neural model (79.4%). The rule-based
model is also more flexible with regard to fix-
ing specific errors requested by the user com-
munity. However, the neural model has a bet-
ter recall (98%). The results suggest that an
approach that combines the advantages of both
models would be desirable in the future. Our
tools and data sets are open-source and freely
available on GitHub and Zenodo.
1 Introduction
This paper presents our work on automatically cor-
recting compound errors in real world text of North
Sámi and exploring both rule-based and neural net-
work methods. We chose this error type as it is
the most frequent grammatical error type (after
spelling and punctuation errors) and twice as fre-
quent as the second most frequent grammatical er-
ror (agreement error). It also regards both spelling
and grammar as the error is a space between two
words, but its correction requires grammatical con-
text.
A grammar checker is a writer’s tool and particularly relevant for improving
writing skills in a minority language in a bilingual context, as is the case
for North Sámi. According to UNESCO (Moseley, 2010), North Sámi, spoken in the
north of Norway, Sweden and Finland, has around 30,000 speakers. It is a
low-resource language in a bilingual setting, and language users frequently
face bigger challenges to writing proficiency as there is always a competing
language (Outakoski, 2013). Developing a reliable grammar checker with high
precision that at the same time covers many errors has therefore been our main
focus. Good precision (i.e. avoiding false alarms) is a priority because users
are easily frustrated if a grammar checker gives false alarms and underlines
correct sentences.
In this paper we focus on the correction of compound errors. This type of error is easy to generate
artificially in the absence of large amounts of error
marked-up text, and we have a good amount of
manually marked-up corpus for evaluation for this
error type. Compound errors (i.e. one-word com-
pounds that are erroneously written as two words)
can be automatically inserted by using a rule-based
morphological analyser on the corpus and splitting
the word wherever we get a compound analysis.
Unlike other error types (like e.g. real word errors)
they are easily inserted, and existing compounds
are seldom errors. In addition, they are interesting
from a linguistic point of view as they are proper
(complex) syntactic errors and not just spelling er-
rors and serve as an example for higher level tools.
Two adjacent words can either be syntactically re-
lated or erroneous compounds, depending on the
syntax. In North Sámi orthography, as in the ma-
jority languages spoken in the region (Norwegian,
Swedish and Finnish), nouns that form a new con-
cept are usually written together. For example,
the North Sámi word boazodoalloguovlu ‘reindeer
herding area’ consists of three words boazu ‘rein-
deer’, doallu ‘industry’ and guovlu ‘area’, and thus
it is written together as a single compound. The
task of our methods is to correct spellings such as boazodoallu guovlu into
boazodoalloguovlu in case the words have been written separately in error.
We develop both a rule-based and a neural model
for the correction of compound errors. The rule-
based model (GramDivvun) is based on finite-state
technology and Constraint Grammar. The neural
model is bi-directional recurrent (BiRNN). While
the rule-based model has earlier produced good
precision, it did not handle unknown compounds
well, which is why we were interested in a neural
approach. However, neural models depend on large
amounts of ‘clean’ data and synthetic error genera-
tion (or alternatively marked-up data). Typical for
low-resource languages and also North Sámi, the
corpora are not clean and contain a fair amount of a
variety of different spelling and grammatical errors
(see Antonsen 2013). Therefore, efficiently preparing data so that it can be
used for neural model training is an important part of this paper. In our
case, we make use of the existing rule-based tools both to generate synthetic
error data and to clean the original data for training. For evaluation, on the
other hand, we use real-world error data.
Our free and open-source rule-based tools can be found on GiellaLT GitHub.¹
The training data and the neural models are freely available on Zenodo.² We
hereby want to promote a wider academic interest in conducting NLP research
for North Sámi.
2 Background
Sámi open source rule-based language tools have a long and successful
tradition (nearly 20 years) (Trosterud, 2004; Moshagen, 2011; Antonsen and
Trosterud, 2011; Rueter and Hämäläinen, 2020).
North Sámi is a low-resource language in terms
of available corpus data (32.24M tokens raw data).
Although there is a fair amount of data, it contains
many real errors and only a small amount is marked
up for errors. Applying neural approaches for high-level language tasks to
low-resource languages is an interesting research question due to the various
limitations of minority language corpora, in contrast to the existing research
on the topic for well-resourced majority languages and artificially
constrained setups (Nekoto et al., 2020). Rule-based methods have been, and
still are, in widespread use in the context of endangered Uralic languages.
There is recent work on grammar checking for North Sámi (Wiechetek et al.,
2019a) and spell checking for Skolt Sámi (Trosterud and Moshagen, 2021). Other
rule-based approaches to grammar checking are extensively described in
Wiechetek (2017).
¹ https://github.com/giellalt/
² https://zenodo.org/record/5172095
Before the era of neural models, it was common to use statistical machine
translation (SMT) as a method for grammar error correction (Behera and
Bhattacharyya, 2013; Kunchukuttan et al., 2014; Hoang et al., 2016). Many
recent papers on grammar checking use bi-directional LSTM models that are
trained to tag errors in an input sentence. Such methods have been proposed
for Latvian (Deksne, 2019), English (Rei and Yannakoudakis, 2016) and Chinese
(Huang and Wang, 2016). Similar LSTM-based approaches have also been applied
to error correction (Yuan and Briscoe, 2016; Ge et al., 2019; Jahan et al.,
2021). Other recent approaches (Kantor et al., 2019; Omelianchuk et al., 2020)
use methods that take advantage of BERT (Devlin et al., 2019) and other
data-hungry models. While such rich sentence embeddings can be used for
English and a few other languages with a large amount of data, their use is
not viable for North Sámi.
3 Data
For evaluation and for training the neural model we use SIKOR (2018), the Sámi
International KORpus, which is a collection of texts in different Sámi
languages compiled by UiT The Arctic University of Norway and the Norwegian
Sámi Parliament. It consists of two subcorpora: GT-Bound³ (texts restricted by
copyright, which are available only by request) and GT-Free⁴ (the publicly
available texts). As a preprocessing step, we run a rule-based grammar checker
(Wiechetek, 2012) and remove sentences with potential compound errors, as we
cannot automatically determine whether these errors are real or not. This is
needed because we want this data to be fully free of compound errors, as it
serves as the target side of our neural model.
Thereafter, we take each sentence in this error-free data and analyse it with
a rule-based morphological analyser⁵. When the analyser sees a potential
compound word, it indicates the word boundary with a compound (+Cmp#) tag. We
use this information to automatically split all compounds identified by the
rule-based analyser. This results in a parallel corpus of the original
sentences as the prediction target and their corresponding versions with
synthetically introduced compound errors. Many of the compound boundaries are
ambiguous, and the algorithm decides which one is used in the training data
based on a heuristic: the maximum number of compound boundaries where the
splitting will not cause any other modifications of the word stems or other
content.
³ https://gtsvn.uit.no/boundcorpus/orig/sme/
⁴ https://gtsvn.uit.no/freecorpus/orig/sme/
⁵ https://github.com/giellalt/lang-sme
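The splitting of compounds into synthetic errors can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in the experiments: `segment` is a hypothetical stand-in for the morphological analyser that returns the surface parts of a word at its compound boundaries, and the toy lexicon is invented for the example. A word is split only when its parts concatenate back to the original surface form, which approximates the heuristic of avoiding other modifications of the word stems.

```python
from typing import Callable, Dict, List, Optional, Tuple

def split_compounds(sentence: str,
                    segment: Callable[[str], Optional[List[str]]]) -> Tuple[str, str]:
    """Return (source_with_errors, target) for one error-free sentence."""
    out = []
    for word in sentence.split():
        parts = segment(word)
        # Split only if the parts concatenate back to the surface form,
        # i.e. the split changes nothing but the spacing.
        if parts and len(parts) > 1 and "".join(parts) == word:
            out.append(" ".join(parts))
        else:
            out.append(word)
    return " ".join(out), sentence

# Hypothetical toy segmentations standing in for real analyser output.
TOY_SEGMENTS: Dict[str, List[str]] = {
    "geahččaladdanprošeaktan": ["geahččaladdan", "prošeaktan"],
}

def toy_segment(word: str) -> Optional[List[str]]:
    return TOY_SEGMENTS.get(word)

source, target = split_compounds("geahččaladdanprošeaktan jagi", toy_segment)
print(source)  # sentence with a synthetic compound error
print(target)  # original sentence, used as the prediction target
```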
As an additional data source, we use the North Sámi Universal Dependencies
treebank (Tyers and Sheyanova, 2017). We parse the corpus with UralicNLP
(Hämäläinen, 2019) and split the compounds
the rule-based morphological analyser identifies as
consisting of two or more words in order to synthet-
ically introduce errors. We also run the rule-based
morphological analyser and morpho-syntactic dis-
ambiguator to add part-of-speech (POS) informa-
tion to produce an additional data set with POS
tags. For the Universal Dependencies data, we use
the POS tags provided in the data set.
We then make sure that all sentences have at
least one generated compound error and that the
only type of error the sentences have is the com-
pound error (no other changes introduced by the
rule-based models). We shuffle this data randomly
and split it on a sentence level into 70% training,
15% validation and 15% testing. The size of the data set can be seen in
Table 1. The sentences were tokenized based on punctuation marks.
                    Sentences   Source tokens
Train                  43,658         388,167
Test                    9,356          83,107
Validation              9,355          82,566
Real-world errors       3,291          26,565
Table 1: Training, testing and validation sizes for the neural model (corpus with synthetic errors)
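The filtering and splitting of the parallel data could look roughly like the sketch below. The function names, the seed and the order in which the validation and test portions are cut are assumptions made for illustration; the text only states the 70/15/15 proportions and that every pair must contain a compound error and no other change.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source with synthetic errors, error-free target)

def keep_pair(src: str, tgt: str) -> bool:
    """Keep pairs that contain an error and differ from the target only in spacing."""
    return src != tgt and src.replace(" ", "") == tgt.replace(" ", "")

def split_dataset(pairs: List[Pair], seed: int = 0):
    """Shuffle at sentence level and split into 70% train, 15% validation, rest test."""
    kept = [p for p in pairs if keep_pair(*p)]
    random.Random(seed).shuffle(kept)
    n_train = int(0.70 * len(kept))
    n_val = int(0.15 * len(kept))
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]
    return train, val, test
```

Applied to 62,369 sentence pairs, this rounding gives 43,658 training, 9,355 validation and 9,356 test pairs, which agrees with the sizes in Table 1, although the exact rounding used in the experiments is not stated.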
For the rule-based model GramDivvun we do not
generate synthetic errors. We have hand-selected a
large corpus for rule development and as regression
tests, consisting of representative sentences from
GT-Free. The current selection for syntactic com-
pound errors includes 3,291 sentences with real
world compound errors (and possibly other errors
in addition).
4 Methods
We use a neural model and a rule-based model for
compound error correction.
4.1 Neural Model
We model the problem as character-level rather than word-level neural machine
translation (NMT). The reason for using a character-level model is that it can
work better with out-of-vocabulary words. This is important due to the
low-resourced nature of North Sámi, although there are other deep learning
methods for endangered languages that do not utilize character-level models
(Alnajjar, 2021).
In practice, we split words into characters separated
by white spaces and mark actual spaces between
words with an underscore (_). We train the model
to predict from text with compound errors into text
without compound errors. As previous research (Partanen et al., 2019; Alnajjar
et al., 2020) has found that using chunks of words instead of full sentences
at a time improves the results of character-level models, we train different
models with different chunk sizes: one model predicts two words at a time,
another three words at a time, and so on up to five words at a time.
We train the models with and without POS tags.
For the models with POS tags, we surround each
word with a token indicating the beginning and the
end of the POS tag. The POS tags are included only
on the source side, not on the target side. They are
separated from the word with a white space.
An example of the data can be seen in Table 2.
Even though every sentence in the training data has
a compound error, this does not mean that every in-
put chunk the model sees would have a compound
error. This way, the model will also learn to leave
the input unchanged if no compound errors are
detected.
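The input encoding described above can be illustrated with a short Python sketch. The function names are hypothetical, and the chunking scheme (consecutive, non-overlapping chunks) is an assumption, since the text does not specify how chunks are cut from a sentence.

```python
from typing import List, Optional

def encode(words: List[str], pos: Optional[List[str]] = None) -> str:
    """Characters separated by spaces; spaces between words marked with '_'."""
    encoded = []
    for i, word in enumerate(words):
        chars = list(word)
        if pos is not None:
            # Surround the word with tokens marking the start and end of its POS tag.
            chars = [f"{pos[i]}>"] + chars + [f"<{pos[i]}"]
        encoded.append(" ".join(chars))
    return " _ ".join(encoded)

def chunks(words: List[str], n: int) -> List[List[str]]:
    """Consecutive, non-overlapping chunks of n words (assumed chunking scheme)."""
    return [words[i:i + n] for i in range(0, len(words), n)]

# Source side with POS tags; the target side is encoded without them.
print(encode(["geahččaladdan", "prošeaktan"], ["V", "N"]))
print(encode(["geahččaladdanprošeaktan"]))
```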
We train all models using a bi-directional long short-term memory (LSTM) based
model (Hochreiter and Schmidhuber, 1997) with OpenNMT-py (Klein et al., 2017),
using the default settings except for the encoder, where we use a BiRNN
(Schuster and Paliwal, 1997) instead of the default RNN (recurrent neural
network), since BiRNN-based models have been shown to provide better results
in character-level models (Hämäläinen et al., 2019). We use the default of two
layers for both the encoder and the decoder and the default attention model,
which is the general global attention presented by Luong et al. (2015). The
models are trained for the default of 100,000 steps. All models are trained
with the same random seed (3435) to ensure reproducibility.
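For orientation, the settings named in this paragraph can be collected as in the following Python sketch, which only assembles a command line and does not run any training. The option names follow OpenNMT-py 1.x conventions and are an assumption about the version used; data and model paths are deliberately omitted, and all values except the encoder type are the defaults mentioned in the text.

```python
# Settings named in the text, keyed by OpenNMT-py 1.x style option names
# (the exact names may differ between OpenNMT-py versions).
settings = {
    "encoder_type": "brnn",         # bi-directional RNN encoder instead of the default
    "rnn_type": "LSTM",             # LSTM units
    "layers": 2,                    # two layers for encoder and decoder
    "global_attention": "general",  # Luong-style general global attention
    "train_steps": 100000,
    "seed": 3435,
}

def to_cli_args(opts):
    """Turn the settings into a flat list of command-line arguments."""
    args = []
    for name, value in opts.items():
        args += [f"-{name}", str(value)]
    return args

print(" ".join(["onmt_train"] + to_cli_args(settings)))
```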
n  Input                                       Output
2  geahččaladdan_prošeaktan                    geahččaladdanprošeaktan
3  geahččaladdan_prošeaktan_prošeaktan         geahččaladdanprošeaktan_prošeaktan
2  V>geahččaladdan<V_N>prošeaktan<N            geahččaladdanprošeaktan
3  V>geahččaladdan<V_N>prošeaktan<N_N>jagi<N   geahččaladdanprošeaktan_prošeaktan
Table 2: Examples of the character-level input and output, where n indicates the chunk size. The first examples are
without POS tags and the last with POS tags
During the training of the neural models, we eval-
uate the models using simple sentence level scores.
There we look only at full-sentence matches and
evaluate their accuracy, precision and recall, as op-
posed to the evaluations in Section 5, where we
study them more carefully at the word level. The results of the neural models
on the generated corpus (where errors were introduced by splitting compounds)
can be seen in Table 3. The results indicate that both of the models receiving
a chunk of two words at a time reached the highest accuracy, and the model
without POS tags also reached the highest precision.
Chunk POS Accuracy Precision Recall
2 no 0.925 0.949 0.974
3 no 0.847 0.883 0.955
4 no 0.852 0.892 0.950
5 no 0.869 0.909 0.952
2 yes 0.925 0.948 0.976
3 yes 0.906 0.934 0.968
4 yes 0.856 0.896 0.951
5 yes 0.857 0.895 0.953
Table 3: Sentence level scores for different neural mod-
els tested on a corpus with artificially introduced errors
The POS tags were not important for the models, as the results with and
without them are fairly similar. The largest gain from POS tags occurred when
the compound error correction was done for three words at a time (accuracy
0.906 with POS tags versus 0.847 without). As this performance gain only
occurred for that specific model, it is more likely an artefact of the
training data and how it is fed into the model than an actual improvement.
4.2 Rule-based Model
The rule-based grammar checker GramDivvun is a full-fledged grammar checker
that fixes spelling errors, (morpho-)syntactic errors (including real word
spelling errors⁶, inflection errors, and compounding errors) and punctuation
and spacing errors.
It passes the output of the finite-state transducer (FST) on to a number of
other modules, the core of which are several Constraint Grammar modules for
tokenization disambiguation and morpho-syntactic disambiguation, and a module
for error detection and correction. The full modular structure (Figure 1) is
described in Wiechetek (2019b).
⁶ Real word errors are spelling errors where the outcome is an actual word
that does not fit the context.
This work predominantly concerns the modification of the disambiguation and
error detection modules mwe-dis.cg3, grc-disambiguator.cg3, and
grammerchecker-release.cg3. We use finite-state morphology (Beesley and
Karttunen, 2003) to model word formation processes. The technology behind our
FSTs is described in Pirinen (2014). Constraint Grammar is a rule-based
formalism for writing disambiguation and syntactic annotation grammars
(Karlsson, 1990; Karlsson et al., 1995). In our work, we use the free
open-source implementation VISL CG-3 (Bick and Didriksen, 2015). All
components are compiled and built using the GiellaLT infrastructure (Moshagen
et al., 2013). The code and data for the model are available for download⁷,
with a specific version tagged for reproducibility.
The syntactic context is specified in hand-written Constraint Grammar rules.
The REMOVE rule below removes the compound error reading (identified by the
tag Err/SpaceCmp) if the head is a 3rd person singular verb (cf. l.2) and the
first element of the potential compound is a noun in nominative case (cf.
l.3). The context condition further specifies that there should be a finite
verb (VFIN) somewhere in the sentence (cf. l.4) for the rule to apply.
REMOVE (Err/SpaceCmp)
(0/0 (V Sg3))
(0/1 (N Sg Nom))
(*0 VFIN);
All possible compounds written apart are con-
sidered to be errors by default, unless the lexicon
specifies a two or several word compound or a syn-
tactic rule removes the error reading.
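The interplay of the default error reading, the lexicon and a REMOVE rule can be mimicked in a few lines of Python. This is a toy analogue for illustration only, not how VISL CG-3 evaluates rules: tags are plain strings, VFIN is treated as an ordinary tag rather than a set, and the function and parameter names are hypothetical.

```python
from typing import List, Set

ERR = "Err/SpaceCmp"

def still_flagged(first_tags: Set[str], head_tags: Set[str],
                  sentence: List[Set[str]], listed_as_mwe: bool = False) -> bool:
    """Return True if a two-word sequence keeps its compound error reading."""
    # Default: a possible compound written apart is considered an error.
    readings = {ERR}
    # The lexicon may list the sequence as a legitimate multi-word expression.
    if listed_as_mwe:
        readings.discard(ERR)
    # Toy analogue of the REMOVE rule: head is V Sg3, first part is N Sg Nom,
    # and there is a finite verb somewhere in the sentence.
    has_finite_verb = any("VFIN" in tags for tags in sentence)
    if ({"V", "Sg3"} <= head_tags
            and {"N", "Sg", "Nom"} <= first_tags
            and has_finite_verb):
        readings.discard(ERR)
    return ERR in readings

noun = {"N", "Sg", "Nom"}
verb = {"V", "Sg3", "VFIN"}
print(still_flagged(noun, verb, [noun, verb]))  # the rule removes the error reading
```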
⁷ https://github.com/giellalt/lang-sme/releases/tag/naacl-2021-ws
Figure 1: System architecture of the North Sámi grammar checker (GramDivvun)
The process of rule writing includes several consecutive steps and, like
neural network models, it requires data. The process is as follows:
1. Modelling an error detection rule based on at least one actual sentence
containing the error
2. Adding constraints based on the linguist’s knowledge of possible contexts
(remembered data)
3. A corpus search for sentences containing similar forms/errors, testing of
the rule and reporting rule mistakes
4. Modification of constraints in the rule based on this data and testing
against regression tests, so that unfit constraints are revised depending on
the results for precision and recall (focus on precision)
The basis of rule development is continuous in-
tegration. Typical shortcomings and bad errors can
be fixed right away with added conditions. Neural
models are not usually trained in this way.
The frequent experience of false alarms can decrease the users’ trust in the
grammar checker. Typically, full-fledged user-oriented grammar checkers, e.g.
DanProof, focus on keeping false alarms low and precision high (Bick, 2015),
because users’ experiences have shown that false alarms frustrate users and
stop them from using the application.
For rule development, regression tests are used. These consist of
error-specific YAML⁸ tests and are manually marked up. The regression test set
for compound errors contains 3,291 sentences (1,368 compound errors, used for
development and regression) and gives the results shown in Table 4.
⁸ https://yaml.org/spec/1.2/spec.html
Precision   Recall   F1 score
94.95       86.22    90.80
Table 4: The rule-based model tested on the developer’s corpus (regression tests)
5 Results
We evaluate the models both quantitatively and qualitatively. We evaluate on
accuracy, precision and recall, and do a linguistic evaluation. The
measurements are defined in this article as follows: accuracy
$A = \frac{C}{S}$, where $C$ is the number of correct sentences (1:1 string
match) and $S$ is the corpus size in sentences; precision
$P = \frac{tp}{tp + fp}$ and recall $R = \frac{tp}{tp + fn}$, where $tp$ is
the number of true positives, $fp$ the number of false positives and $fn$ the
number of false negatives. The $F_1$ score is the harmonic mean of precision
and recall, $F_1 = 2 \times \frac{P \times R}{P + R}$. The accuracy is thus
the sentence-level correctness rate (as used in the method section to probe
model qualities), whereas precision measures how often corrections were right
and recall measures how many errors we found. The word-level errors are
counted once per error in the marked-up corpus. Thus, if a three-part compound
contains two compounding errors, it is counted towards the total as one error,
but if a sentence has three separate compounds with a wrong split each, we
count three errors.
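The definitions above translate directly into code. The following Python sketch uses hypothetical function names and returns 0.0 for empty denominators, which is a convention chosen here rather than something stated in the text.

```python
from typing import Sequence

def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of sentences whose output matches the gold sentence exactly."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Precision 0.794 and recall 0.980 combine to an F1 of roughly 0.877.
print(round(f1(0.794, 0.980), 3))
```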
The error marked-up corpus we used includes
140 syntactic compound errors (there are other
compound errors that can be discovered by the
spellchecker as they are word internal) and is from
GT-Bound. We chose GT-Bound to make sure that
the sentences had not been used to develop rules. It
is part of our error-marked up corpus, which makes
it possible to run an automatic analysis. This error corpus contains only
real-world (as opposed to synthetic) errors.
Chunk POS Accuracy Precision Recall
2 no 0.781 0.794 0.980
3 no 0.707 0.720 0.974
4 no 0.726 0.747 0.963
5 no 0.727 0.757 0.950
2 yes 0.777 0.788 0.982
3 yes 0.761 0.775 0.976
4 yes 0.720 0.744 0.958
5 yes 0.751 0.765 0.976
Table 5: Sentence level scores for the neural models
tested on a real world error corpus
Table 5 shows the results for the neural models
on this corpus. The drop in results is expected as
the models were trained on synthetic data, whereas
this data consists of real world errors. However,
the results stay relatively good, given that synthetic
data was the only way to produce enough training
data for North Sámi.
We ran the neural and rule-based models on two different corpora of compound
error materials, i.e. synthetic and real world. Table 6 shows the evaluation
on a real-world error corpus.
Model Precision Recall F1
Rule-based model 81.0 60.7 69.3
Neural model 79.4 98.0 87.7
Table 6: Results for both models based on a manually
marked-up evaluation corpus
The neural network performs well in terms of numbers, but has the following
shortcomings that are problematic for end users. It introduces new (types of)
errors unrelated to compounding, such as changing km² seemingly at random to
either kmy or km, which is unforgivable (because it is incomprehensible) for
the end user. It also introduces compounds like Statoileamiálbmogiid ‘Statoil
(national oil company and gas station) indigenous people’, as in ex. (1). The
rule-based grammar checker presupposes that the compound is listed in the
lexicon, which is why such corrections can easily be avoided.
(1) Statoil  eamiálbmogiid             eatnamiid    billisteami      birra
    Statoil  indigenous.people.ACC.PL  land.ACC.PL  destruction.GEN  about
    ‘about the destruction of the indigenous peoples’ territories by Statoil’
It also produces atypically long nonsense words like
NorggaSámiidRiidRiidRiidRiidRiidRiidRiikasearvvi. In addition, there are false
positives for certain grammatical combinations that are systematically avoided
by the rule-based grammar checker. These are combinations of attributive
adjectives and nouns (17 occurrences), like boares eallinoainnuid in ex. (2),
and genitive modifier and noun combinations (11 occurrences), like
njealjehaskilomehtera eatnamat in ex. (3).
(2) boares  eallinoainnuid    ja   modearna  servodaga    váikkuhusaid   gaskii.
    old     life.view.ACC.PL  and  modern    society.GEN  impact.ACC.PL  between
    ‘between old philosophies and the impact of modern society’
(3) Dasalassin   137000  njealjehaskilomehtera  eatnamat  biđgejuvvojit   seismalaš  linnjáid
    in.addition  137000  square.kilometre.GEN   land.PL   split.PASS.PL3  seismic    line.ACC.PL
    ‘In addition, 137,000 square kilometres of land are split by seismic lines’
The rule-based model, on the other hand, typically suggests compounding where
both compounding and a two-word combination would be adequate, for example
when the first part of the compound has homonymous genitive and nominative
analyses. The suggested compound is not an error, but the written form is
grammatically correct as well, so these suggestions still count as false
positives. Other typical errors are cases where there are two accepted ways of
spelling a compound/MWE, as in ex. (4), where both Riddu Riđđu and Riddu-Riđđu
are correct spellings, and the latter is suggested as a correction of the
former.
(4) ovdanbuktojuvvojit    omd.  jahkásaš  Riddu Riđđu  festiválas.
    present.PASS.PRS.PL3  e.g.  annual    Riddu Riđđu  festival.LOC
    ‘they are presented at the annual Riddu Riđđu festival’
The rule-based model mainly struggles with false negatives, like njunuš olbmot
‘leading people’, that are due to missing entries in the lexicon, as in
ex. (5).
(5) Sii   leat  gieldda           njunuš   olbmot.
    they  are   municipality.GEN  leading  people
    ‘They are the leading people of the municipality’
6 Discussion
In the future, we would like to look into hybrid
grammar checking of other error types and other
(Sámi) languages.
The neural approach gives us relatively high re-
call in the real world situation with lower precision,
whereas the rule-based model is designed to give
us high precision even at the cost of lower recall
(user experience), which is why hybrid approaches
that combine the best of two worlds are interesting.
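One conceivable way of combining the two models, given here purely as an illustrative Python sketch and not as a method evaluated in this paper, is to filter neural suggestions with simple checks. The function name and the optional `known_compounds` set (standing in for a lexicon lookup) are hypothetical. Requiring lexicon membership trades recall for precision, since unknown but correct compounds would be rejected.

```python
from typing import Optional, Set

def filter_suggestion(original: str, suggestion: str,
                      known_compounds: Optional[Set[str]] = None) -> str:
    """Return the suggestion if it passes the checks, otherwise the original text."""
    # A compound correction may only remove spaces; any other change is rejected.
    if original.replace(" ", "") != suggestion.replace(" ", ""):
        return original
    if known_compounds is not None:
        # Optionally require every newly joined word to be a known compound.
        joined = set(suggestion.split()) - set(original.split())
        if not joined <= known_compounds:
            return original
    return suggestion

# A change unrelated to compounding is rejected.
print(filter_suggestion("10 km²", "10 kmy"))
# A pure joining of two words is accepted when no lexicon check is requested.
print(filter_suggestion("geahččaladdan prošeaktan", "geahččaladdanprošeaktan"))
```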
Noisy data is to be expected in any endangered
language context, as the language norms are to a
lesser degree internalized. We will therefore need a
way of preparing the data to train neural networks,
which can either consist in creating synthetic data
or automatically fixing errors and creating a parallel
corpus.
When creating synthetic data for neural networks, the amount of data is hardly
the main issue. Many generative systems are capable of over-generating data.
The main question that arises is the quality and representativeness
(Hämäläinen and Alnajjar 2019) of the generated data. If the rules used to
generate the data are not in line with the real-world phenomenon the neural
model is meant to solve, we cannot expect very high quality results on
real-world data.
Generated sentences can easily be less complex
‘text book examples’ that are not representative of
real world examples. In the case of agreement er-
rors between subjects and verbs, for example, there
are long distance relationships and complex coor-
dinated subjects including personal pronouns that
can change the structure of a seemingly straight-
forward relation. Therefore, we advocate the use
of high quality rule-based tools to prepare the data,
i.e. fix the errors and create a parallel corpus.
While synthetic error data generation for compound errors is fairly
straightforward, as it only affects adjacent words, the generation of
synthetic error corpora for other error types is less so, in part because
generating synthetic errors of other kinds can create valid and grammatically
correct sentences with different meanings. We therefore predict that (hybrid)
neural network approaches for other error types that involve either specific
morphological forms (of which there are many in North Sámi) or changes in word
order will be more difficult to develop.
7 Conclusion
In this paper, we have developed both a neural network and a rule-based
grammar checker module for compound errors in North Sámi. We have shown that a
neural compound corrector for a low-resource language can be built from
synthetic error data, with the compound errors introduced by high-level
rule-based grammar models. It relies on the rule-based tools both to generate
errors and to clean the data, using part-of-speech analysis, disambiguation
and even the error detector.
The rule-based module is embedded in the
full-fledged GramDivvun grammar checker and
achieves a good precision of 81% and a lower recall
of 61%. A higher precision, even at the cost of a
lower recall, is in line with our objective of keeping
false alarms low, so users will be comfortable using
our language tools. The neural network achieves a
slightly lower precision of 79% and a much higher
recall of 98%.
However, the rule-based model has more user-
friendly suggestions and some false positives are
simply other correct alternatives to the ones in
the text, while the neural network’s false positives
sometimes introduce new and unrelated errors. On-
the-fly fixes that avoid false positives are an ad-
vantage of rule-based models. Rule-based models,
on the other hand, are not so good at recognizing
unknown combinations. Hybrid models that com-
bine the benefits of both approaches are therefore
desirable for efficient compound error correction
in the future.
Acknowledgments
Thanks to Børre Gaup for his work on the evalu-
ation script. Some computations were performed
on resources provided by UNINETT Sigma2 – the
National Infrastructure for High Performance Com-
puting and Data Storage in Norway.
References
Khalid Alnajjar. 2021. When word embeddings be-
come endangered. In Mika Hämäläinen, Niko Par-
tanen, and Khalid Alnajjar, editors, Multilingual Fa-
cilitation, pages 275–288. Rootroo Ltd, Finland.
Khalid Alnajjar, Mika Hämäläinen, Niko Partanen,
and Jack Rueter. 2020. Automated prediction
of medieval Arabic diacritics. arXiv preprint
arXiv:2010.05269.
Lene Antonsen. 2013. Čállinmeattáhusaid guorran.
University of Tromsø. [English summary: Tracking
misspellings.]
Lene Antonsen and Trond Trosterud. 2011. Next to
nothing – a cheap South Saami disambiguator. In
Proceedings of the NODALIDA 2011 Workshop
Constraint Grammar Applications, pages 1–7.
Kenneth R. Beesley and Lauri Karttunen. 2003. Finite
State Morphology. CSLI Studies in Computational
Linguistics. CSLI Publications, Stanford.
Bibek Behera and Pushpak Bhattacharyya. 2013. Auto-
mated grammar correction using hierarchical phrase-
based statistical machine translation. In Proceed-
ings of the Sixth International Joint Conference on
Natural Language Processing, pages 937–941.
Eckhard Bick. 2015. DanProof: Pedagogical spell and
grammar checking for Danish. In Proceedings of the
10th International Conference Recent Advances in
Natural Language Processing (RANLP 2015), pages
55–62, Hissar, Bulgaria. INCOMA Ltd.
Eckhard Bick and Tino Didriksen. 2015. CG-3 – be-
yond classical Constraint Grammar. In Proceedings
of the 20th Nordic Conference of Computational Lin-
guistics (NoDaLiDa 2015), pages 31–39. Linköping
University Electronic Press, Linköpings universitet.
Daiga Deksne. 2019. Bidirectional LSTM tagger for
Latvian grammatical error detection. In International
Conference on Text, Speech, and Dialogue, pages
58–68. Springer.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics.
Tao Ge, Xingxing Zhang, Furu Wei, and Ming Zhou.
2019. Automatic grammatical error correction for
sequence-to-sequence text generation: An empiri-
cal study. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 6059–6064.
Mika Hämäläinen and Khalid Alnajjar. 2019. A
template based approach for training NMT for
low-resource Uralic languages - a pilot with Finnish. In
Proceedings of the 2019 2nd International Conference
on Algorithms, Computing and Artificial Intelligence,
ACAI 2019, pages 520–525, New York, NY,
USA. Association for Computing Machinery.
Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann,
and Eetu Mäkelä. 2019. Revisiting NMT for
normalization of early English letters. In Proceedings
of the 3rd Joint SIGHUM Workshop on Computational
Linguistics for Cultural Heritage, Social
Sciences, Humanities and Literature, pages 71–75.
Duc Tam Hoang, Shamil Chollampatt, and Hwee Tou
Ng. 2016. Exploiting n-best hypotheses to improve
an SMT approach to grammatical error correction. In
Proceedings of the Twenty-Fifth International Joint
Conference on Artificial Intelligence, pages
2803–2809.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural computation,
9(8):1735–1780.
Shen Huang and Houfeng Wang. 2016. Bi-LSTM neural
networks for Chinese grammatical error diagnosis.
In Proceedings of the 3rd Workshop on Natural Lan-
guage Processing Techniques for Educational Appli-
cations (NLPTEA2016), pages 148–154.
Mika Hämäläinen. 2019. UralicNLP: An NLP library
for Uralic languages. Journal of Open Source Software,
4(37):1345.
Mir Noshin Jahan, Anik Sarker, Shubra Tanchangya,
and Mohammad Abu Yousuf. 2021. Bangla real-word
error detection and correction using bidirectional
LSTM and bigram hybrid model. In Proceedings
of International Conference on Trends in Computational
and Cognitive Engineering, pages 3–13.
Springer.
Yoav Kantor, Yoav Katz, Leshem Choshen, Edo Cohen-
Karlik, Naftali Liberman, Assaf Toledo, Amir
Menczel, and Noam Slonim. 2019. Learning to com-
bine grammatical error corrections. In Proceedings
of the Fourteenth Workshop on Innovative Use of
NLP for Building Educational Applications, pages
139–148.
Fred Karlsson. 1990. Constraint Grammar as a Frame-
work for Parsing Running Text. In Proceedings
of the 13th Conference on Computational Linguis-
tics (COLING 1990), volume 3, pages 168–173,
Helsinki, Finland. Association for Computational
Linguistics.
Fred Karlsson, Atro Voutilainen, Juha Heikkilä, and
Arto Anttila. 1995. Constraint Grammar: A
Language-Independent System for Parsing Unre-
stricted Text. Mouton de Gruyter, Berlin.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel-
lart, and Alexander M. Rush. 2017. OpenNMT:
Open-Source Toolkit for Neural Machine Transla-
tion. In Proc. ACL.
Anoop Kunchukuttan, Sriram Chaudhury, and Pushpak
Bhattacharyya. 2014. Tuning a grammar correction
system for increased precision. In Proceedings of
the Eighteenth Conference on Computational Natu-
ral Language Learning: Shared Task, pages 60–64.
Minh-Thang Luong, Hieu Pham, and Christopher D
Manning. 2015. Effective approaches to attention-
based neural machine translation. arXiv preprint
arXiv:1508.04025.
Christopher Moseley, editor. 2010. Atlas of
the World's Languages in Danger, 3rd edition.
UNESCO Publishing. Online version:
http://www.unesco.org/languages-atlas/.
Sjur Moshagen. 2011. Tilgjengelegheit for samisk og
andre nasjonale minoritetsspråk. In Språkteknologi
för ökad tillgänglighet.
Sjur N. Moshagen, Tommi A. Pirinen, and Trond
Trosterud. 2013. Building an open-source develop-
ment infrastructure for language technology projects.
In NODALIDA.
Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa
Matsila, Timi Fasubaa, Taiwo Fagbohungbe,
Solomon Oluwole Akinola, Shamsuddeen Muham-
mad, Salomon Kabongo Kabenamualu, Salomey
Osei, Freshia Sackey, Rubungo Andre Niyongabo,
Ricky Macharm, Perez Ogayo, Orevaoghene Ahia,
Musie Meressa Berhe, Mofetoluwa Adeyemi,
Masabata Mokgesi-Selinga, Lawrence Okegbemi,
Laura Martinus, Kolawole Tajudeen, Kevin Degila,
Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer,
Jason Webster, Jamiil Toure Ali, Jade Abbott,
Iroro Orife, Ignatius Ezeani, Idris Abdulkadir
Dangana, Herman Kamper, Hady Elsahar, Good-
ness Duru, Ghollah Kioko, Murhabazi Espoir,
Elan van Biljon, Daniel Whitenack, Christopher
Onyefuluchi, Chris Chinenye Emezue, Bonaventure
F. P. Dossou, Blessing Sibanda, Blessing Bassey,
Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem,
Adewale Akinfaderin, and Abdallah Bashir. 2020.
Participatory research for low-resourced machine
translation: A case study in African languages.
In Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 2144–2160,
Online. Association for Computational Linguistics.
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem
Chernodub, and Oleksandr Skurzhanskyi. 2020.
GECToR – grammatical error correction: Tag, not
rewrite. In Proceedings of the Fifteenth Workshop
on Innovative Use of NLP for Building Educational
Applications, pages 163–170.
Hanna Outakoski. 2013. Davvisámegielat čálamáhtu
konteaksta [The context of North Sámi literacy].
Sámi dieđalaš áigečála, 1/2015:29–59.
Niko Partanen, Mika Hämäläinen, and Khalid Alnaj-
jar. 2019. Dialect text normalization to normative
standard Finnish. In The Fifth Workshop on Noisy
User-generated Text (W-NUT 2019), pages 141–146,
United States. The Association for Computational
Linguistics.
Tommi A. Pirinen and Krister Lindén. 2014. State-
of-the-art in weighted finite-state spell-checking. In
Proceedings of the 15th International Conference on
Computational Linguistics and Intelligent Text Pro-
cessing - Volume 8404, CICLing 2014, pages 519–
532, Berlin, Heidelberg. Springer-Verlag.
Marek Rei and Helen Yannakoudakis. 2016. Composi-
tional sequence labeling models for error detection
in learner writing. In Proceedings of the 54th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1181–
1191, Berlin, Germany. Association for Computa-
tional Linguistics.
Jack Rueter and Mika Hämäläinen. 2020. FST morphology
for the endangered Skolt Sami language. In
Proceedings of the 1st Joint Workshop on Spoken
Language Technologies for Under-resourced lan-
guages (SLTU) and Collaboration and Computing
for Under-Resourced Languages (CCURL), pages
250–257.
Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-
tional recurrent neural networks. IEEE transactions
on Signal Processing, 45(11):2673–2681.
SIKOR. 2018. SIKOR UiT Norgga árktalaš universitehta
ja Norgga Sámedikki sámi teakstačoakkáldat,
veršuvdna 06.11.2018. http://gtweb.uit.no/korp.
Accessed: 2018-11-06.
Trond Trosterud. 2004. Porting morphological analysis
and disambiguation to new languages. In SALTMIL
Workshop at LREC 2004: First Steps in Language
Documentation for Minority Languages, pages 90–
92. Citeseer.
Trond Trosterud and Sjur Moshagen. 2021. Soft on
errors? the correcting mechanism of a Skolt Sami
speller. In Mika Hämäläinen, Niko Partanen, and
Khalid Alnajjar, editors, Multilingual Facilitation,
pages 197–207. Rootroo Ltd.
Francis M. Tyers and Mariya Sheyanova. 2017. Anno-
tation schemes in North Sámi dependency parsing.
In Proceedings of the Third Workshop on Computa-
tional Linguistics for Uralic Languages, pages 66–
75, St. Petersburg, Russia. Association for Computa-
tional Linguistics.
Linda Wiechetek. 2012. Constraint Grammar based
correction of grammatical errors for North Sámi. In
Proceedings of the Workshop on Language Technol-
ogy for Normalisation of Less-Resourced Languages
(SALTMIL 8/AFLAT 2012), pages 35–40, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Linda Wiechetek. 2017. When grammar can’t be
trusted – Valency and semantic categories in North
Sámi syntactic analysis and error detection. PhD
thesis, UiT The Arctic University of Norway.
Linda Wiechetek, Sjur Moshagen, and Kevin Brubeck
Unhammer. 2019a. Seeing more than whitespace –
tokenisation and disambiguation in a North
Sámi grammar checker. In Proceedings of the 3rd
Workshop on the Use of Computational Methods in
the Study of Endangered Languages Volume 1 (Pa-
pers), pages 46–55.
Linda Wiechetek, Sjur Nørstebø Moshagen, Børre
Gaup, and Thomas Omma. 2019b. Many shades of
grammar checking – launching a constraint grammar
tool for North Sámi. In Proceedings of the NoDaL-
iDa 2019 Workshop on Constraint Grammar - Meth-
ods, Tools and Applications, NEALT Proceedings
Series 33:8, pages 35–44.
Zheng Yuan and Ted Briscoe. 2016. Grammatical error
correction using neural machine translation. In Pro-
ceedings of the 2016 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
380–386.