Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899,
Singapore, 6-7 August 2009. ©2009 ACL and AFNLP
Using the Web for Language Independent Spellchecking and
Autocorrection
Casey Whitelaw and Ben Hutchinson and Grace Y Chung and Gerard Ellis
Google Inc.
Level 5, 48 Pirrama Rd, Pyrmont NSW 2009, Australia
{whitelaw,benhutch,gracec,ged}@google.com
Abstract
We have designed, implemented and evaluated an end-to-end spellchecking and autocorrection system that does not require any manually annotated training data. The World Wide Web is used
as a large noisy corpus from which we
infer knowledge about misspellings and
word usage. This is used to build an er-
ror model and an n-gram language model.
A small secondary set of news texts with artificially inserted misspellings is used to tune confidence classifiers. Because
no manual annotation is required, our sys-
tem can easily be instantiated for new lan-
guages. When evaluated on human typed
data with real misspellings in English and
German, our web-based systems outper-
form baselines which use candidate cor-
rections based on hand-curated dictionar-
ies. Our system achieves 3.8% total error
rate in English. We show similar improve-
ments in preliminary results on artificial
data for Russian and Arabic.
1 Introduction
Spellchecking is the task of predicting which
words in a document are misspelled. These pre-
dictions might be presented to a user by under-
lining the misspelled words. Correction is the
task of substituting the well-spelled hypotheses
for misspellings. Spellchecking and autocorrec-
tion are widely applicable for tasks such as word-
processing and postprocessing Optical Character
Recognition. We have designed, implemented
and evaluated an end-to-end system that performs
spellchecking and autocorrection.
The key novelty of our work is that the sys-
tem was developed entirely without the use of
manually annotated resources or any explicitly
compiled dictionaries of well-spelled words. Our
multi-stage system integrates knowledge from sta-
tistical error models and language models (LMs)
with a statistical machine learning classifier. At
each stage, data are required for training models
and determining weights on the classifiers. The
models and classifiers are all automatically trained
from frequency counts derived from the Web and
from news data. System performance has been
validated on a set of human typed data. We have
also shown that the system can be rapidly ported
across languages with very little manual effort.
Most spelling systems today require some hand-
crafted language-specific resources, such as lex-
ica, lists of misspellings, or rule bases. Sys-
tems using statistical models require large anno-
tated corpora of spelling errors for training. Our
statistical models require no annotated data. In-
stead, we rely on the Web as a large noisy corpus
in the following ways. 1) We infer information
about misspellings from term usage observed on
the Web, and use this to build an error model. 2)
The most frequently observed terms are taken as
a noisy list of potential candidate corrections. 3)
Token n-grams are used to build an LM, which
we use to make context-appropriate corrections.
Because our error model is based on scoring sub-
strings, there is no fixed lexicon of well-spelled
words to determine misspellings. Hence, both
novel misspelled or well-spelled words are allow-
able. Moreover, in combination with an n-gram
LM component, our system can detect and correct
real-word substitutions, i.e., word usage and grammatical errors.
Confidence classifiers determine the thresholds
for spelling error detection and autocorrection,
given error and LM scores. In order to train these
classifiers, we require some textual content with
some misspellings and corresponding well-spelled
words. A small subset of the Web data from news
pages is used because we assume they contain
relatively few misspellings. We show that con-
fidence classifiers can be adequately trained and
tuned without real-world spelling errors, but rather
with clean news data injected with artificial mis-
spellings.
This paper will proceed as follows. In Section 2,
we survey related prior research. Section 3 de-
scribes our approach, and how we use data at each
stage of the spelling system. In experiments (Sec-
tion 4), we first verify our system on data with ar-
tificial misspellings. Then we report performance
on data with real typing errors in English and Ger-
man. We also show preliminary results from port-
ing our system to Russian and Arabic.
2 Related Work
Spellchecking and correction are among the oldest
text processing problems, and many different so-
lutions have been proposed (Kukich, 1992). Most
approaches are based upon the use of one or more
manually compiled resources. Like most areas
of natural language processing, spelling systems
have been increasingly empirical, a trend that our
system continues.
The most direct approach is to model the
causes of spelling errors directly, and encode them
in an algorithm or an error model. Damerau-
Levenshtein edit distance was introduced as a
way to detect spelling errors (Damerau, 1964).
Phonetic indexing algorithms such as Metaphone,
used by GNU Aspell (Atkinson, 2009), represent
words by their approximate ‘soundslike’ pronun-
ciation, and allow correction of words that ap-
pear orthographically dissimilar. Metaphone relies
upon data files containing phonetic information.
Linguistic intuition about the different causes of
spelling errors can also be represented explicitly in
the spelling system (Deorowicz and Ciura, 2005).
Almost every spelling system to date makes use
of a lexicon: a list of terms which are treated as
‘well-spelled’. Lexicons are used as a source of
corrections, and also to filter words that should
be ignored by the system. Using lexicons in-
troduces the distinction between ‘non-word’ and
‘real-word’ errors, where the misspelled word is
another word in the lexicon. This has led to
the two sub-tasks being approached separately
(Golding and Schabes, 1996). Lexicon-based ap-
proaches have trouble handling terms that do not
appear in the lexicon, such as proper nouns, for-
eign terms, and neologisms, which can account for
a large proportion of ‘non-dictionary’ terms (Ah-
mad and Kondrak, 2005).
A word’s context provides useful evidence as
to its correctness. Contextual information can be
represented by rules (Mangu and Brill, 1997) or
more commonly in an n-gram LM. Mays et al.
(1991) used a trigram LM and a lexicon, which
was shown to be competitive despite only allow-
ing for a single correction per sentence (Wilcox-
O’Hearn et al., 2008). Cucerzan and Brill (2004)
claim that an LM is much more important than
the channel model when correcting Web search
queries. In place of an error-free corpus, the Web
has been successfully used to correct real-word
errors using bigram features (Lapata and Keller,
2004). This work uses pre-defined confusion sets.
The largest step towards an automatically train-
able spelling system was the statistical model for
spelling errors (Brill and Moore, 2000). This re-
places intuition or linguistic knowledge with a
training corpus of misspelling errors, which was
compiled by hand. This approach has also been
extended to incorporate a pronunciation model
(Toutanova and Moore, 2002).
There has been recent attention on using Web
search query data as a source of training data, and
as a target for spelling correction (Zhang et al., 2007; Cucerzan and Brill, 2004). While query
data is a rich source of misspelling information in
the form of query-revision pairs, it is not available
for general use, and is not used in our approach.
The dependence upon manual resources has
created a bottleneck in the development of
spelling systems. There have been few language-
independent, multi-lingual systems, or even sys-
tems for languages other than English. Language-
independent systems have been evaluated on Per-
sian (Barari and QasemiZadeh, 2005) and on Ara-
bic and English (Hassan et al., 2008). To our
knowledge, there are no previous evaluations of
a language-independent system across many lan-
guages, for the full spelling correction task, and
indeed, there are no pre-existing standard test sets
for typed data with real errors and language con-
text.
3 Approach
Our spelling system follows a noisy channel
model of spelling errors (Kernighan et al., 1990).
For an observed word w and a candidate correction s, we compute P(s|w) as P(w|s) × P(s).
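As a concrete illustration of this noisy-channel decision rule, the sketch below ranks candidate corrections by log P(w|s) + log P(s). The candidate list and the toy probability tables are invented for the example and are not the paper's trained models.

```python
import math

def rank_candidates(word, candidates, error_prob, lm_prob):
    """Rank candidate corrections s for an observed word w by
    P(s|w) proportional to P(w|s) * P(s), per the noisy channel model.
    error_prob(w, s) returns P(w|s); lm_prob(s) returns P(s)."""
    scored = [(math.log(error_prob(word, s)) + math.log(lm_prob(s)), s)
              for s in candidates]
    return [s for _, s in sorted(scored, reverse=True)]

# Toy probabilities, purely illustrative:
error_prob = lambda w, s: 0.9 if w == s else 0.05
lm_prob = lambda s: {"receive": 1e-4, "recieve": 1e-7}.get(s, 1e-9)
print(rank_candidates("recieve", ["recieve", "receive"], error_prob, lm_prob))
# -> ['receive', 'recieve']
```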
Figure 1: Spelling process, and knowledge sources used.
The text processing workflow and the data used
in building the system are outlined in Figure 1 and
detailed in this section. For each token in the in-
put text, candidate suggestions are drawn from the
term list (Section 3.1), and scored using an error
model (Section 3.2). These candidates are eval-
uated in context using an LM (Section 3.3) and
re-ranked. For each token, we use classifiers (Sec-
tion 3.4) to determine our confidence in whether
a word has been misspelled and if so, whether it
should be autocorrected to the best-scoring sug-
gestion available.
3.1 Term List
We require a list of terms to use as candidate cor-
rections. Rather than attempt to build a lexicon
of words that are well-spelled, we instead take the
most frequent tokens observed on the Web. We
used a large (>1 billion) sample of Web pages,
tokenized them, and took the most frequently oc-
curring ten million tokens, with very simple filters
for non-words (too much punctuation, too short or
long). This term list is so large that it should con-
tain most well-spelled words, but also a large num-
ber of non-words or misspellings.
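A minimal sketch of this term-list construction follows. The frequency counting and the ten-million cap match the description above; the specific length and punctuation thresholds are assumptions, since the paper only says the filters are "very simple".

```python
from collections import Counter

def build_term_list(documents, top_n=10_000_000, min_len=2, max_len=40):
    """Most frequent tokens from a noisy Web corpus, with crude non-word
    filters (too short/long, or mostly non-alphabetic)."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            if not (min_len <= len(token) <= max_len):
                continue
            if sum(ch.isalpha() for ch in token) < len(token) / 2:
                continue  # too much punctuation or too many digits
            counts[token] += 1
    return [term for term, _ in counts.most_common(top_n)]

print(build_term_list(["the cat sat on teh mat", "the dog !!! 123 ate the cat"],
                      top_n=5))
```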
3.2 Error Model
We use a substring error model to estimate P(w|s). To derive the error model, let R be a partitioning of s into adjacent substrings, and similarly let T be a partitioning of w, such that |T| = |R|. The partitions are thus in one-to-one alignment, and by allowing partitions to be empty, the alignment models insertions and deletions of substrings. Brill and Moore estimate P(w|s) as follows:

$$P(w|s) \approx \max_{R,\,T\ \mathrm{s.t.}\ |T|=|R|}\ \prod_{i=1}^{|R|} P(T_i \mid R_i) \qquad (1)$$

Our system restricts partitionings to substrings of length at most 2.

To train the error model, we require triples of (intended word, observed word, count), which are described below. We use maximum likelihood estimates of P(T_i | R_i).
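The following sketch scores P(w|s) in the spirit of Equation (1): it searches over aligned partitionings with substrings of length at most 2 (allowing empty substrings for insertions and deletions) and takes the best product of substring probabilities. The toy probability table and backoff values are invented; real estimates come from the (intended word, observed word, count) triples described next.

```python
from functools import lru_cache

def error_model_prob(observed, intended, sub_prob, max_len=2):
    """Approximate P(observed | intended) as the max over partitionings
    into substrings of length <= max_len (possibly empty) of the product
    of substring probabilities P(T_i | R_i), as in Equation (1)."""
    @lru_cache(maxsize=None)
    def best(i, j):
        # best score for generating observed[i:] from intended[j:]
        if i == len(observed) and j == len(intended):
            return 1.0
        score = 0.0
        for di in range(max_len + 1):
            for dj in range(max_len + 1):
                if di == dj == 0:
                    continue  # skip the empty-to-empty alignment
                if i + di > len(observed) or j + dj > len(intended):
                    continue
                p = sub_prob(observed[i:i + di], intended[j:j + dj])
                if p > 0.0:
                    score = max(score, p * best(i + di, j + dj))
        return score
    return best(0, 0)

# Toy substring probabilities (illustrative, not trained values):
table = {("ie", "ei"): 0.01}
sub_prob = lambda t, r: table.get((t, r), 0.9 if t == r else 1e-4)
print(error_model_prob("recieve", "receive", sub_prob))
```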
3.2.1 Using the Web to Infer Misspellings
To build the error model, we require as train-
ing data a set of (intended word, observed word,
count) triples, which is compiled from the World
Wide Web. Essentially the triples are built by start-
ing with the term list, and a process that auto-
matically discovers, from that list, putative pairs
of spelled and misspelled words, along with their
counts.
We believe the Web is ideal for compiling this
set of triples because with a vast amount of user-
generated content, we believe that the Web con-
tains a representative sample of both well-spelled
and misspelled text. The triples are not used di-
rectly for proposing corrections, and since we have
a substring model, they do not need to be an ex-
haustive list of spelling mistakes.
The procedure for finding and updating counts
for these triples also assumes that 1) misspellings
tend to be orthographically similar to the intended
word; Mays et al. (1991) observed that 80% of
misspellings derived from single instances of in-
sertion, deletion, or substitution; and 2) words are
usually spelled as intended.
For the error model, we use a large corpus (up to
3.7 × 10^8 pages) of crawled public Web pages. An
automatic language-identification system is used
to identify and filter pages for the desired lan-
guage. As we only require a small window of con-
text, it would also be possible to use an n-gram
collection such as the Google Web 1T dataset.
Finding Close Words. For each term in the
term list (defined in Section 3.1), we find all
other terms in the list that are “close” to it. We
define closeness using Levenshtein-Damerau edit
distance, with a conservative upper bound that in-
creases with word length (one edit for words of
up to four characters, two edits for up to twelve
characters, and three for longer words). We com-
pile the term list into a trie-based data structure
which allows for efficient searching for all terms
within a maximum edit distance. The computation is ‘embarrassingly parallel’ and hence easily distributable. In practice, we find that this stage takes tens to hundreds of CPU-hours.
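A brute-force sketch of the close-word search is shown below, using the length-dependent edit bound from the paragraph above. The production system compiles the term list into a trie and distributes the work; a standard Damerau-Levenshtein implementation and a linear scan are used here for clarity.

```python
def edit_bound(word):
    """Length-dependent bound: 1 edit for words up to 4 characters,
    2 for up to 12 characters, 3 for longer words."""
    if len(word) <= 4:
        return 1
    if len(word) <= 12:
        return 2
    return 3

def damerau_levenshtein(a, b):
    """Edit distance with insertions, deletions, substitutions and
    adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def close_words(term, term_list):
    """All terms within the length-dependent edit bound of `term`."""
    bound = edit_bound(term)
    return [t for t in term_list if t != term and damerau_levenshtein(term, t) <= bound]

print(close_words("recieve", ["receive", "deceive", "recipe", "cat"]))
# -> ['receive', 'deceive', 'recipe']
```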
Filtering Triples. At this stage, for each
term we have a cluster of orthographically similar
terms, which we posit are potential misspellings.
The set of pairs is reflexive and symmetric, e.g. it
contains both (recieve,receive) and (receive,re-
cieve). The pairs will also include e.g. (deceive,
receive). On the assumption that words are spelled
correctly more often than they are misspelled, we
next filter the set such that the first term’s fre-
quency is at least 10 times that of the second term.
This ratio was chosen as a conservative heuristic
filter.
Using Language Context. Finally, we use the contexts in which a term occurs to gather directional weightings for misspellings. Consider a term w; from our source corpus, we collect the set of contexts {c_i} in which w occurs. The definition of a context is relatively arbitrary; we chose to use a single word on each side, discarding contexts with fewer than a total of ten observed occurrences. For each context c_i, candidate "intended" terms are w and w's close terms (which are at least 10 times as frequent as w). The candidate which appears in context c_i the greatest number of times is deemed to be the term intended by the user in that context.
The resulting dataset consists of triples of the
original observed term, one of the “intended”
terms as determined by the above algorithm, and
the number of times this term was intended. For
a single term, it is possible (and common) to have
multiple possible triples, due to the context-based
assignment.
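A compressed sketch of this context-based assignment follows. It takes a token stream, a map from each term to its close terms that passed the 10x frequency filter, and the minimum context count; the function and variable names, and the toy example with a lowered threshold, are illustrative.

```python
from collections import Counter, defaultdict

def build_triples(tokens, close_map, min_context_count=10):
    """Assign an 'intended' term to each (context, observed term) pair.
    close_map[w] lists w's close terms that are at least 10 times as
    frequent as w; within each one-word-left/one-word-right context, the
    candidate seen most often in that context is deemed the intended term.
    Returns (observed, intended, count) triples."""
    context_counts = defaultdict(Counter)  # context -> term -> count
    for i in range(1, len(tokens) - 1):
        context_counts[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1

    triples = Counter()
    for context, counts in context_counts.items():
        if sum(counts.values()) < min_context_count:
            continue  # discard rarely observed contexts
        for observed, n in counts.items():
            candidates = [observed] + close_map.get(observed, [])
            intended = max(candidates, key=lambda c: counts.get(c, 0))
            triples[(observed, intended)] += n
    return [(obs, intd, n) for (obs, intd), n in triples.items()]

# Toy example with a lowered context threshold:
tokens = ("i will recieve the prize i will receive the prize "
          "i will receive the prize").split()
print(build_triples(tokens, {"recieve": ["receive"]}, min_context_count=1))
```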
Inspecting the output of this training process
shows some interesting patterns. Overall, the
dataset is still noisy; there are many instances
where an obviously misspelled word is not as-
signed a correction, or only some of its instances
are. The dataset contains around 100 million
triples, orders of magnitude larger than any man-
ually compiled list of misspellings. The kinds of
errors captured in the dataset include stereotypi-
cal spelling errors, such as acomodation, but also
OCR-style errors. computationaUy was detected
as a misspelling of computationally where the ‘U’
is an OCR error for ‘ll’; similarly, Postmodem was
detected as a misspelling of Postmodern (an exam-
ple of ‘keming’).
The data also includes examples of ‘real-word’
errors. For example, 13% of occurrences of
occidental are considered misspellings of acci-
dental; contrasting with 89% of occurrences of
the non-word accidential. There are many ex-
amples of terms that would not be in a normal
lexicon, including neologisms (mulitplayer for
multiplayer), companies and products (Playsta-
ton for Playstation), proper nouns (Schwarznegger
for Schwarzenegger) and internet domain names
(mysapce.com for myspace.com).
3.3 Language Model
We estimate P(s) using n-gram LMs trained on
data from the Web, using Stupid Backoff (Brants
et al., 2007). We use both forward and back-
ward context, when available. Contrary to Brill
and Moore (2000), we observe that user edits of-
ten have both left and right context, when editing
a document.
When combining the error model scores with the LM scores, we weight the latter by taking their λ'th power, that is

$$P(w|s)\,P(s)^{\lambda} \qquad (2)$$

The parameter λ reflects the relative degrees to which the LM and the error model should be trusted. The parameter λ also plays the additional role of correcting our error model's misestimation of the rate at which people make errors. For example, if errors are common then by increasing λ we can reduce the value of P(w|w) P(w)^λ relative to $\sum_{s \neq w} P(s|w)\,P(s)^{\lambda}$.
We train λ by optimizing the average inverse rank of the correct word on our training corpus, where the rank is calculated over all suggestions that we have for each token.
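A sketch of this training step is given below, under the assumption that a simple grid search over λ is an acceptable optimizer (the paper does not say which optimizer was used).

```python
def combined_score(error_logprob, lm_logprob, lam):
    """log of P(w|s) * P(s)^lambda, the re-ranking score of Equation (2)."""
    return error_logprob + lam * lm_logprob

def average_inverse_rank(corpus, lam):
    """Mean of 1/rank of the correct word among each token's suggestions.
    corpus items are (correct_word, [(suggestion, error_logprob, lm_logprob)])."""
    total = 0.0
    for correct, suggestions in corpus:
        ranked = sorted(suggestions, reverse=True,
                        key=lambda s: combined_score(s[1], s[2], lam))
        words = [s[0] for s in ranked]
        if correct in words:
            total += 1.0 / (words.index(correct) + 1)
    return total / len(corpus)

def fit_lambda(corpus, grid=(0.5, 1.0, 2.0, 4.0, 6.0, 8.0)):
    """Pick the lambda on the grid that maximizes average inverse rank."""
    return max(grid, key=lambda lam: average_inverse_rank(corpus, lam))
```

The same routine can be run separately on each bin of training tokens to obtain the context-conditioned weights λ_{i,j} described below.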
During initial experimentation, it was noticed
that our system predicted many spurious autocor-
rections at the beginnings and ends of sentences
(or in the case of sentence fragments, the end of
the fragment). We hypothesized that we were
weighting the LM scores too highly in such cases.
We therefore conditioned λ on how much context was available, obtaining values λ_{i,j} where i, j represent the amount of context available to the LM to the left and right of the current word. i and j are capped at n, the order of the LM.
While conditioning λ in this way might at first
appear ad hoc, it has a natural interpretation in
terms of our confidence in the LM. When there is
no context to either side of a word, the LM simply
uses unigram probabilities, and this is a less trust-
worthy signal than when more context is available.
To train λ_{i,j} we partition our data into bins corresponding to pairs i, j and optimize each λ_{i,j} independently.
When we trained a single constant λ, a value of 5.77 was obtained. The conditioned weights λ_{i,j} increased with the values of i and j, ranging from λ_{0,0} = 0.82 to λ_{4,4} = 6.89. This confirmed our hypothesis that the greater the available context, the more confident our system should be in using the LM scores.
3.4 Confidence Classifiers for Checking and
Correction
Spellchecking and autocorrection were imple-
mented as a three stage process. These em-
ploy confidence classifiers whereby precision-
recall tradeoffs could be tuned to desirable levels
for both spellchecking and autocorrection.
First, all suggestions s for a word w are ranked according to their P(s|w) scores. Second, a spellchecking classifier is used to predict whether w is misspelled. Third, if w is predicted to be misspelled and the suggestion list is non-empty, an autocorrection classifier is used to predict whether the top-ranked suggestion is correct.
The spellchecking classifier is implemented us-
ing two embedded classifiers, one of which is used
when the suggestion list is empty, and the other when it is non-
empty. This design was chosen because the use-
ful signals for predicting whether a word is mis-
spelled might be quite different when there are no
suggestions available, and because certain features
are only applicable when there are suggestions.
Our experiments will compare two classifier
types. Both rely on training data to determine
threshold values and training weights.
A “simple” classifier which compares the value of log(P(s|w)) − log(P(w|w)), for the original word w and the top-ranked suggestion s, with a threshold value. If there are no suggestions other than w, then the log(P(s|w)) term is ignored.
A logistic regression classifier that uses five feature sets. The first set is a scores feature that combines the following scoring information: (i) log(P(s|w)) − log(P(w|w)) for the top-ranked suggestion s; (ii) the LM score difference between the original word w and the top suggestion s; (iii) log(P(s|w)) − log(P(w|w)) for the second top-ranked suggestion s; (iv) the LM score difference between w and the second top-ranked s. The other four feature sets encode information about case signatures, the number of suggestions available, the token length, and the amount of left and right context.
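The sketch below assembles a feature dictionary along the lines described above for the suggestions-present case. The feature names, the case-signature encoding, and the way the original word's own score log P(w|w) and the LM differences are passed in are all assumptions made for illustration.

```python
def confidence_features(word, logp_orig, suggestions, lm_delta,
                        n_left_context, n_right_context):
    """Features for the spellchecking/autocorrection classifiers when
    suggestions exist. `suggestions` is a list of (candidate, log P(cand|word))
    sorted best-first; `logp_orig` is log P(word|word); `lm_delta[c]` is the
    LM score difference between candidate c and the original word."""
    feats = {}
    top = suggestions[0]
    second = suggestions[1] if len(suggestions) > 1 else None
    # scores feature set
    feats["score_top_vs_orig"] = top[1] - logp_orig
    feats["lm_top_vs_orig"] = lm_delta[top[0]]
    if second is not None:
        feats["score_2nd_vs_orig"] = second[1] - logp_orig
        feats["lm_2nd_vs_orig"] = lm_delta[second[0]]
    # case signature, suggestion count, token length, context availability
    feats["is_title"] = float(word.istitle())
    feats["is_upper"] = float(word.isupper())
    feats["num_suggestions"] = float(len(suggestions))
    feats["token_length"] = float(len(word))
    feats["left_context"] = float(n_left_context)
    feats["right_context"] = float(n_right_context)
    return feats

print(confidence_features("recieve", -0.5,
                          [("receive", -0.1), ("relieve", -3.0)],
                          {"receive": 4.2, "relieve": -1.0}, 2, 2))
```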
Certain categories of tokens are blacklisted, and
so never predicted to be misspelled. These are
numbers, punctuation and symbols, and single-
character tokens.
The training process has three stages. (1) The
context score weighting is trained, as described
in Section 3.3. (2) The spellchecking classifier is
trained, and tuned on held-out development data.
(3) The autocorrection classifier is trained on the
instances with suggestions that the spellchecking
classifier predicts to be misspelled, and it too is
tuned on held-out development data.
In the experiments reported in this paper, we
trained classifiers so as to maximize the F1-score
on the development data. We note that the desired
behaviour of the spellchecking and autocorrection
classifiers will differ depending upon the applica-
tion, and that it is a strength of our system that
these can be tuned independently.
3.4.1 Training Using Artificial Data
Training and tuning the confidence classifiers re-
quire supervised data, in the form of pairs of mis-
spelled and well-spelled documents. And indeed
we posit that relatively noiseless data are needed
to train robust classifiers. Since these data are
Language   Sentences (Train)   Sentences (Test)
English    116k                58k
German     87k                 44k
Arabic     8k                  4k
Russian    8k                  4k

Table 1: Artificial data set sizes. The development set is approximately the same size as the training set.
not generally available, we instead use a clean
corpus into which we artificially introduce mis-
spellings. While this data is not ideal, we show
that in practice it is sufficient, and removes the
need for manually-annotated gold-standard data.
We chose data from news pages crawled from
the Web as the original, well-spelled documents.
We chose news pages as an easily identifiable
source of text which we assume is almost entirely
well-spelled. Any source of clean text could be
used. For each language the news data were di-
vided into three non-overlapping data sets: the
training and development sets were used for train-
ing and tuning the confidence classifiers, and a test
set was used to report evaluation results. The data
set sizes, for the languages used in this paper, are
summarized in Table 1.
Misspelled documents were created by artifi-
cially introducing misspelling errors into the well-
spelled text. For all data sets, spelling errors
were randomly inserted at an average rate of 2 per
hundred characters, resulting in an average word
misspelling rate of 9.2%. With equal likelihood,
errors were either character deletions, transposi-
tions, or insertions of randomly selected charac-
ters from within the same document.
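A sketch of this corruption procedure is shown below, assuming the per-character error probability is applied independently at each position (the paper specifies only the average rate and the three equally likely error types, not the exact sampling scheme).

```python
import random

def corrupt(text, char_error_rate=0.02, rng=None):
    """Inject artificial misspellings: with probability `char_error_rate`
    per character, apply a deletion, an adjacent transposition, or an
    insertion of a character sampled from the same document, each with
    equal likelihood."""
    rng = rng or random.Random(0)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < char_error_rate:
            op = rng.choice(["delete", "transpose", "insert"])
            if op == "delete":
                i += 1
                continue
            if op == "transpose" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            if op == "insert":
                out.append(rng.choice(chars))  # character sampled from the document
        out.append(chars[i])
        i += 1
    return "".join(out)

print(corrupt("Spelling errors were randomly inserted at an average rate "
              "of two per hundred characters."))
```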
4 Experiments
4.1 Typed Data with Real Errors
In the absence of user data from a real application,
we attempted our initial evaluation with typed data
via a data collection process. Typed data with real
errors produced by humans were collected. We
recruited subjects from our coworkers, and asked
them to use an online tool customized for data
collection. Subjects were asked to randomly se-
lect a Wikipedia article, copy and paste several
text-only paragraphs into a form, and retype those
paragraphs into a subsequent form field. The sub-
jects were asked to pick an article about a favorite
city or town. The subjects were asked to type
at a normal pace avoiding the use of backspace
or delete buttons. The data were tokenized, au-
tomatically segmented into sentences, and manu-
ally preprocessed to remove certain gross typing
errors. For instance, if the typist omitted entire
phrases/sentences by mistake, the sentence was re-
moved. We collected data for English from 25
subjects, resulting in a test set of 11.6k tokens, and
495 sentences. There were 1251 misspelled tokens
(10.8% misspelling rate).
Data were collected for German Wikipedia arti-
cles. We asked 5 coworkers who were German na-
tive speakers to each select a German article about
a favorite city or town, and use the same online
tool to input their typing. Some typists who used English keyboards typed ASCII equivalents of non-ASCII characters in the articles. This
was accounted for in the preprocessing of the ar-
ticles to prevent misalignment. Our German test
set contains 118 sentences, 2306 tokens with 288
misspelled tokens (12.5% misspelling rate).
4.2 System Configurations
We compare several system configurations to in-
vestigate each component’s contribution.
4.2.1 Baseline Systems Using Aspell
Systems 1 to 4 have been implemented as base-
lines. These use GNU Aspell, an open source spell
checker (Atkinson, 2009), as a suggester compo-
nent plugged into our system instead of our own
Web-based suggester. Thus, with Aspell, the sug-
gestions and error scores proposed by the system
would all derive from Aspell’s handcrafted custom
dictionary and error model. (We report results us-
ing the best combination of Aspell’s parameters
that we found.)
System 1 uses Aspell tuned with the logistic
regression classifier. System 2 adds a context-
weighted LM, as per Section 3.3, and uses the
“simple” classifier described in Section 3.4. Sys-
tem 3 replaces the simple classifier with the logis-
tic regression classifier. System 4 is the same but
does not perform blacklisting.
4.2.2 Systems Using Web-based Suggestions
The Web-based suggester proposes suggestions
and error scores from among the ten million most
frequent terms on the Web. It suggests the 20
terms with the highest values of P(w|s) × f(s), where f(s) is the frequency of s on the Web, using the Web-derived error model.
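A sketch of this ranking rule follows. A full scan of the term list is shown for clarity; a real implementation would restrict candidates first, for example with the trie-based close-word search from Section 3.2.1. The toy frequencies and error probabilities are invented.

```python
import heapq

def web_suggestions(word, term_list, term_freq, error_model_prob, k=20):
    """Return the k terms s with the highest P(word|s) * f(s), where f(s)
    is the term's Web frequency and P(word|s) comes from the substring
    error model."""
    scored = ((error_model_prob(word, s) * term_freq[s], s) for s in term_list)
    return [s for _, s in heapq.nlargest(k, scored)]

freq = {"receive": 90_000, "recieve": 400, "recipe": 120_000}
p = lambda w, s: 0.9 if w == s else (0.01 if s == "receive" else 1e-6)
print(web_suggestions("recieve", list(freq), freq, p, k=2))
# -> ['receive', 'recieve']
```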
Systems 5 to 8 correspond with Systems 1 to
4, but use the Web-based suggestions instead of
Aspell.
4.3 Evaluation Metrics
In our evaluation, we aimed to select metrics that
we hypothesize would correlate well with real per-
formance in a word-processing application. In
our intended system, misspelled words are auto-
corrected when confidence is high and misspelled
words are flagged when a highly confident sug-
gestion is absent. This could be cast as a simple
classification or retrieval task (Reynaert, 2008),
where traditional measures of precision, recall and
F metrics are used. However, we wanted to fo-
cus on metrics that reflect the quality of end-to-
end behavior, that account for the combined ef-
fects of flagging and automatic correction. Es-
sentially, there are three states: a word could be
unchanged, flagged or corrected to a suggested
word. Hence, we report on error rates that mea-
sure the errors that a user would encounter if the
spellchecking/autocorrection were deployed in a
word-processor. We have identified 5 types of er-
rors that a system could produce:
1. E1: A misspelled word is wrongly corrected.
2. E2: A misspelled word is not corrected but is
flagged.
3. E3: A misspelled word is not corrected or
flagged.
4. E4: A well spelled word is wrongly cor-
rected.
5. E5: A well spelled word is wrongly flagged.
It can be argued that these errors have varying
impact on user experience. For instance, a well
spelled word that is wrongly corrected is more
frustrating than a misspelled word that is not cor-
rected but is flagged. However, in this paper, we
treat each error equally.
E1, E2, E3 and E4 pertain to the correction task. Hence we can define Correction Error Rate (CER):

$$\mathrm{CER} = \frac{E_1 + E_2 + E_3 + E_4}{T}$$

where T is the total number of tokens. E3 and E5 pertain to the nature of flagging. We define Flagging Error Rate (FER) and Total Error Rate (TER):

$$\mathrm{FER} = \frac{E_3 + E_5}{T} \qquad\qquad \mathrm{TER} = \frac{E_1 + E_2 + E_3 + E_4 + E_5}{T}$$
For each system, we computed a No Good Sugges-
tion Rate (NGS) which represents the proportion
of misspelled words for which the suggestions list
did not contain the correct word.
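In code, these definitions amount to the following; the counts in the example call are invented, not taken from the paper's test sets.

```python
def error_rates(e1, e2, e3, e4, e5, total_tokens):
    """Correction (CER), flagging (FER) and total (TER) error rates."""
    return {
        "CER": (e1 + e2 + e3 + e4) / total_tokens,
        "FER": (e3 + e5) / total_tokens,
        "TER": (e1 + e2 + e3 + e4 + e5) / total_tokens,
    }

# Illustrative counts only:
print(error_rates(e1=10, e2=20, e3=15, e4=5, e5=30, total_tokens=10_000))
```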
5 Results and Discussion
5.1 Experiments with Artificial Errors
System                            TER     CER     FER    NGS
1. Aspell, no LM, LR             17.65    6.38   12.35   18.3
2. Aspell, LM, Sim                4.82    2.98    2.86   18.3
3. Aspell, LM, LR                 4.83    2.87    2.84   18.3
4. Aspell, LM, LR (no blacklist) 22.23    2.79   19.89   16.3
5. WS, no LM, LR                  9.06    7.64    6.09   10.1
6. WS, LM, Sim                    2.62    2.26    1.43   10.1
7. WS, LM, LR                     2.55    2.21    1.29   10.1
8. WS, LM, LR (no blacklist)     21.48    2.21   19.75    8.9

Table 2: Results for English news data on an independent test set with artificial spelling errors. Numbers are given in percentages. LM: Language Model, Sim: Simple, LR: Logistic Regression, WS: Web-based suggestions, NGS: No good suggestion rate.
Results on English news data with artificial
spelling errors are displayed in Table 2. The sys-
tems which do not employ the LM scores perform substantially worse than the ones with LM
scores. The Aspell system yields a total error rate
of 17.65% and our system with Web-based sug-
gestions yields TER of 9.06%.
When comparing the simple scorer with the lo-
gistic regression classifier, the Aspell Systems 2
and 3 generate similar performances while the
confidence classifier afforded some gains in our
Web-based suggestions system, with total error re-
duced from 2.62% to 2.55%. The ability to tune
each phase during development has so far proven
more useful than the specific features or classifier
used. Blacklisting is crucial as seen by our results
for Systems 4 and 8. When the blacklisting mech-
anism is not used, performance steeply declines.
When comparing overall performance for the
data between the Aspell systems and the Web-
based suggestions systems, our Web-based sug-
gestions fare better across the board for the news
data with artificial misspellings. Performance
gains are evident for each error metric that was ex-
amined. Total error rate for our best system (Sys-
tem 7) reduces the error of the best Aspell sys-
tem (System 3) by 45.7% (from 4.83% to 2.62%).
In addition, our no good suggestion rate is only
10% compared to 18% in the Aspell system. Even
where no LM scores are used, our Web-based sug-
gestions system outperforms the Aspell system.
The above results suggest that the Web-based
suggestions system performs at least as well as
the Aspell system. However, it must be high-
lighted that results on the test set with artificial errors do not guarantee similar performance on
real user data. The artificial errors were generated
at a systematically uniform rate, and are not mod-
eled after real human errors made in real word-
processing applications. We attempt to consider
the impact of real human errors on our systems in
the next section.
5.2 Experiments with Human Errors
System            TER     CER     FER    NGS
English Aspell    4.58    3.33    2.86   23.0
English WS        3.80    3.41    2.24   17.2
German Aspell    14.09   10.23    5.94   44.4
German WS         9.80    7.89    4.55   32.3

Table 3: Results for data with real errors in English and German.
Results for our system evaluated on data with
real misspellings in English and in German are
shown in Table 3. We used the systems that per-
formed best on the artificial data (System 3 for As-
pell, and System 7 for Web suggestions). The mis-
spelling error rates of the test sets were 10.8% and
12.5% respectively, higher than those of the arti-
ficial data which were used during development.
For English, the Web-based suggestions resulted
in a 17% improvement (from 4.58% to 3.80%) in
total error rate, but the correction error rate was
slightly (2.4%) higher.
By contrast, in German our system improved to-
tal error by 30%, from 14.09% to 9.80%. Correc-
tion error rate was also much lower in our Ger-
man system, comparing 7.89% with 10.23% for
the Aspell system. The no good suggestion rates
for the real misspelling data are also higher than
that of the news data. Our suggestions are lim-
ited to an edit distance of 2 with the original, and
it was found that in real human errors, the aver-
age edit distance of misspelled words is 1.38 but
for our small data, the maximum edit distance is
4 in English and 7 in German. Nonetheless, our
no good suggestion rates (17.2% and 32.3%) are
much lower than those of the Aspell system (23%
and 44%), highlighting the advantage of not using
a hand-crafted lexicon.
Our results on real typed data were slightly
worse than those for the news data. Several fac-
tors may account for this. (1) While the news data
test set does not overlap with the classifier train-
ing set, the nature of the content is similar to the
train and dev sets in that they are all news articles
from a one week period. This differs substantially
from Wikipedia article topics that were generally
about the history and sights of a city. (2) The method for inserting character errors (random
generation) was the same for the news data sets
while the real typed test set differed from the ar-
tificial errors in the training set. Typed errors are
less consistent and error rates differed across sub-
jects. More in-depth study is needed to understand
the nature of real typed errors.
Figure 2: Effect of corpus size used to train the error model. (Error rate plotted against the number of Web pages used, from 100 to 10^9; curves show TER, CER and FER on the typed and artificial test sets.)
5.3 Effect of Web Corpus Size
To determine the effects of the corpus size on our
automated training, we evaluated System 7 using
error models trained on different corpus sizes. We used corpora containing 10^3, 10^4, ..., 10^9 Web pages. We evaluated on the data set with real errors. On average, about 37% of the pages in our corpus were in English. So the number of pages we used ranged from about 370 to about 3.7 × 10^8. As shown in Figure 2, the gains are small after about 10^6 documents.
5.4 Correlation across data sets
We wanted to establish that performance improvements on the news data with artificial errors are likely to lead to improvements on typed data with
real errors. The seventeen English systems re-
ported in Table 3, Table 2 and Figure 2 were each
evaluated on both English test sets. The rank cor-
relation coefficient between total error rates on the
two data sets was high (τ = 0.92; p < 5 × 10^-6).
That is, if one system performs better than another
on our artificial spelling errors, then the first sys-
tem is very likely to also perform better on real
typing errors.
5.5 Experiments with More Languages
System            TER     CER     FER    NGS
German Aspell     8.64    4.28    5.25   29.4
German WS         4.62    3.35    2.27   16.5
Arabic Aspell    11.67    4.66    8.51   25.3
Arabic WS         4.64    3.97    2.30   15.9
Russian Aspell   16.75    4.40   13.11   40.5
Russian WS        3.53    2.45    1.93   15.2

Table 4: Results for German, Russian and Arabic news data.
Our system can be trained on many languages
with almost no manual effort. Results for German,
Arabic and Russian news data are shown in Ta-
ble 4. Performance improvements by the Web sug-
gester over Aspell are greater for these languages
than for English. Relative performance improve-
ments in total error rates are 47% in German, 60%
in Arabic and 79% in Russian. Differences in no
good suggestion rates are also very pronounced
between Aspell and the Web suggester.
It cannot be assumed that the Arabic and Rus-
sian systems would perform as well on real data.
However the correlation between data sets re-
ported in Section 5.4 leads us to hypothesize that
a comparison between the Web suggester and As-
pell on real data would be favourable.
6 Conclusions
We have implemented a spellchecking and au-
tocorrection system and evaluated it on typed
data. The main contribution of our work is that
while this system incorporates several knowledge
sources, an error model, LM and confidence clas-
sifiers, it does not require any manually annotated
resources, and infers its linguistic knowledge en-
tirely from the Web. Our approach begins with a
very large term list that is noisy, containing both
well-spelled and misspelled words, and derived auto-
matically with no human checking for whether
words are valid or not.
We believe this is the first published system
to obviate the need for any hand-labeled data.
We have shown that system performance improves over a system that embeds handcrafted knowl-
edge, yielding a 3.8% total error rate on human
typed data that originally had a 10.8% error rate.
News data with artificially inserted misspellings were
sufficient to train confidence classifiers to a sat-
isfactory level. This was shown for both Ger-
man and English. These innovations enable the
rapid development of a spellchecking and correc-
tion system for any language for which tokeniz-
ers exist and string edit distances make sense. We
have done so for Arabic and Russian.
In this paper, our results were obtained without
any optimization of the parameters used in the pro-
cess of gathering data from the Web. We wanted to
minimize manual tweaking particularly if it were
necessary for every language. Thus heuristics such
as the number of terms in the term list, the criteria
for filtering triples, and the edit distance for defin-
ing close words were crude, and could easily be
improved upon. It may be beneficial to perform
more tuning in future. Furthermore, future work
will involve evaluating the performance of the sys-
tem for these languages on real typed data.
7 Acknowledgment
We would like to thank the anonymous reviewers
for their useful feedback and suggestions. We also
thank our colleagues who participated in the data
collection.
References
Farooq Ahmad and Grzegorz Kondrak. 2005. Learn-
ing a spelling error model from search query logs.
In HLT ’05: Proceedings of the conference on Hu-
man Language Technology and Empirical Methods
in Natural Language Processing, pages 955–962,
Morristown, NJ, USA. Association for Computa-
tional Linguistics.
K. Atkinson. 2009. GNU Aspell. Available at http://aspell.net.
Loghman Barari and Behrang QasemiZadeh. 2005.
Clonizer spell checker adaptive, language indepen-
dent spell checker. In Ashraf Aboshosha et al., ed-
itor, Proc. of the first ICGST International Confer-
ence on Artificial Intelligence and Machine Learn-
ing AIML 05, volume 05, pages 65–71, Cairo, Egypt,
Dec. ICGST, ICGST.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J.
Och, and Jeffrey Dean. 2007. Large language
models in machine translation. In Proceedings
of the 2007 Joint Conference on Empirical Meth-
ods in Natural Language Processing and Com-
putational Natural Language Learning (EMNLP-
CoNLL), pages 858–867.
Eric Brill and Robert C. Moore. 2000. An improved
error model for noisy channel spelling correction.
In ACL ’00: Proceedings of the 38th Annual Meet-
ing on Association for Computational Linguistics,
pages 286–293. Association for Computational Lin-
guistics.
S. Cucerzan and E. Brill. 2004. Spelling correction
as an iterative process that exploits the collective
knowledge of web users. In Proceedings of EMNLP
2004, pages 293–300.
F.J. Damerau. 1964. A technique for computer detec-
tion and correction of spelling errors. Communica-
tions of the ACM 7, pages 171–176.
S. Deorowicz and M.G. Ciura. 2005. Correcting
spelling errors by modelling their causes. Interna-
tional Journal of Applied Mathematics and Com-
puter Science, 15(2):275–285.
Andrew R. Golding and Yves Schabes. 1996. Com-
bining trigram-based and feature-based methods for
context-sensitive spelling correction. In In Proceed-
ings of the 34th Annual Meeting of the Association
for Computational Linguistics, pages 71–78.
Ahmed Hassan, Sara Noeman, and Hany Hassan.
2008. Language independent text correction using
finite state automata. In Proceedings of the 2008 In-
ternational Joint Conference on Natural Language
Processing (IJCNLP, 2008).
Mark D. Kernighan, Kenneth W. Church, and
William A. Gale. 1990. A spelling correction pro-
gram based on a noisy channel model. In Proceed-
ings of the 13th conference on Computational lin-
guistics, pages 205–210. Association for Computa-
tional Linguistics.
K. Kukich. 1992. Techniques for automatically cor-
recting words in texts. ACM Computing Surveys 24,
pages 377–439.
Mirella Lapata and Frank Keller. 2004. The web as
a baseline: Evaluating the performance of unsuper-
vised web-based models for a range of nlp tasks. In
Daniel Marcu Susan Dumais and Salim Roukos, ed-
itors, HLT-NAACL 2004: Main Proceedings, pages
121–128, Boston, Massachusetts, USA, May 2 -
May 7. Association for Computational Linguistics.
Lidia Mangu and Eric Brill. 1997. Automatic rule
acquisition for spelling correction. In Douglas H.
Fisher, editor, ICML, pages 187–194. Morgan Kauf-
mann.
Eric Mays, Fred J. Damerau, and Robert L. Mercer.
1991. Context based spelling correction. Informa-
tion Processing and Management, 27(5):517.
M.W.C. Reynaert. 2008. All, and only, the errors:
More complete and consistent spelling and ocr-error
correction evaluation. In Proceedings of the sixth
international language resources and evaluation.
Kristina Toutanova and Robert Moore. 2002. Pronun-
ciation modeling for improved spelling correction.
In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL),
pages 144–151.
L. Amber Wilcox-O’Hearn, Graeme Hirst, and Alexan-
der Budanitsky. 2008. Real-word spelling cor-
rection with trigrams: A reconsideration of the
mays, damerau, and mercer model. In Alexan-
der F. Gelbukh, editor, CICLing, volume 4919 of
Lecture Notes in Computer Science, pages 605–616.
Springer.
Yang Zhang, Pilian He, Wei Xiang, and Mu Li. 2007.
Discriminative reranking for spelling correction. In
The 20th Pacific Asia Conference on Language, In-
formation and Computation.