Categorising spelling errors to assess L2 writing
Yves Bestgen
Centre for English Corpus Linguistics,
Université catholique de Louvain,
Place du Cardinal Mercier 10, B-1348 Louvain-la-Neuve, Belgium
E-mail: yves.bestgen@uclouvain.be
Sylviane Granger*
Centre for English Corpus Linguistics,
Université catholique de Louvain,
Place Blaise Pascal 1, B-1348 Louvain-la-Neuve, Belgium
E-mail: sylviane.granger@uclouvain.be
*Corresponding author
Abstract: Based on a corpus of 223 argumentative essays written by English as
a foreign language learners, this study shows that spelling errors, whether
detected manually or automatically, are a reliable predictor of the quality of L2
texts and that reliability is further improved by sub-categorising errors.
However, the benefit derived from sub-categorisation is much lower in the case
of errors automatically detected by means of the Microsoft Word 2007 spell
checker, a situation which results from Word’s limited success in detecting and
correcting some specific categories of L2 learner errors.
Keywords: English as a foreign language; learner corpus; automatic scoring;
automatic error detection; misspellings; International Corpus of Learner
English; ICLE; Common European Framework of Reference for Languages;
CEF; spell checker; spelling error sub-categorisation; L1 transfer; letter
doubling; word segmentation errors.
Reference to this paper should be made as follows: Bestgen, Y. and
Granger, S. (xxxx) ‘Categorising spelling errors to assess L2 writing’, Int. J.
Continuing Engineering Education and Life-Long Learning, Vol. X, No. Y,
pp.000–000.
Biographical notes: Yves Bestgen is a Research Associate of the Belgian
National Fund for Scientific Research (F.R.S.-FNRS) and a Part-time
Professor at University of Louvain, Belgium, where he teaches courses in
Psycholinguistics and Statistics. He is a member of the Centre for English
Corpus Linguistics. His main research interests focus on text production and
comprehension by native and L2 learners and on the development of techniques
for automatic text analysis.
Sylviane Granger is a Professor of English Language and Linguistics at the
University of Louvain, Belgium. She is the Director of the Centre for English
Corpus Linguistics where research activity is focused on the compilation and
exploitation of learner corpora and multilingual corpora. In 1990, she launched
the International Corpus of Learner English project, which has grown to
contain learner writing by learners of English from 16 different mother tongue
backgrounds. Her current research interests focus on the integration of learner
corpus data into a range of pedagogical tools (electronic dictionaries, writing
aids, spell checkers and essay scoring tools).
1 Introduction
Research into spelling correction has long focused on the needs of native speakers. As a
result, “spelling correction tools designed for language learners are rare” [Leacock et al.,
(2010), p.79] and foreign language learners (henceforth L2 learners) have no other choice
but to use existing spell checkers which have been trained on native (L1) data and are
therefore much less successful in dealing with learner errors. Rimrott and Heift (2005)
evaluated the Microsoft Word 2003 spelling corrector on learner German texts and found
that only 52% of the errors were detected, a substantially lower figure than that for texts
written by German native speakers. This study and a few other recent ones (Al-Jarf,
2009a; Botley and Dillah, 2007; Hovermale, 2008; Hovermale and Martin, 2008; Mitton
and Okada, 2007; Okada, 2004; Rimrott and Heift, 2008) bear witness to a surge of
interest in L2 spelling errors. The main objective of these studies is to assess the
reliability of spell checkers for L2 users and suggest avenues for improvement (Heift and
Rimrott, 2008). The research focuses mainly on the specificity of L2 vs. L1 errors and the
impact of the foreign language learner’s L1 on the types of errors produced. Results
highlight the following characteristics of L2 misspellings:
1 there is a wide diversity of error types
2 there is a larger number of errors than in L1 writing
3 a large proportion of errors result from a lack of knowledge of the target language.
In the field of automatic text assessment, the presence of a large number of spelling errors
is often viewed as a factor that hinders effective assessment. This is typically the case for
systems that score open-ended questions by comparing students’ answers to some
gold standard (Landauer et al., 2003; Pérez et al., 2004). However, errors – whether
orthographic or grammatical – have also been proved to make a positive contribution to
the grading of texts (Chodorow and Burstein, 2004; Lonsdale and Strong-Krause, 2003),
especially within the framework of L2 assessment. Judging from the literature, this
research trend has been relatively marginal and results are mixed. Some automatic
scoring systems, primarily aimed at native users, take spelling errors into account and
most of the systems provide feedback on these errors (see reviews in Chung and O’Neil,
1997; Dikli, 2006; Warschauer and Ware, 2006). However, this information does not
play a key role in text assessment. In addition, spelling errors are treated as one
undifferentiated category rather than being broken down into subtypes (Burstein et al.,
2004). The impact of spelling errors is also usually neglected in recent L2 research. Many
studies rely on lexical variation to predict text quality (Bestgen et al., 2010; Crossley
et al., 2008; Yu, 2010) and disregard the possible influence of misspellings, which
Granger and Wynne (1999) have shown to be very strong.
This study has two main objectives: first, to assess whether it is possible to predict
learners’ proficiency automatically on the basis of spelling errors and second, to examine
the possibility that sub-categorising these errors might improve the prediction of both
manual and automatic detection. The data used are authentic texts produced by L2
learners from three different mother tongue backgrounds. The analysis, which involves a
comparison between a manual gold standard (i.e., a set of manually detected errors) and the
errors flagged automatically by Microsoft Word 2007, provides a valuable platform from which to
assess Word’s efficiency in detecting and correcting errors in authentic L2 texts. This
information could be used to improve not only automatic spell checkers, but also free-text
assessment tools by allowing them to remove spelling errors more efficiently before the
assessment step is undertaken.
The article is structured as follows: In Section 2, we describe the corpus data, the
rating procedure and the methodology used to annotate the spelling errors both manually
and automatically. Section 3 investigates the efficiency of manually tagged errors in
predicting the quality of an essay while Section 4 assesses the success rate of MS Word
2007 in terms of error detection and correction as well as prediction of essay quality. In
Section 5 we tackle the possible impact of L1 differences on automatic error detection
and prediction efficiency. The last section discusses some limitations of the study and
suggests avenues for future research.
2 Data and methodology
2.1 Learner corpus data
Our study is based on data extracted from a large computer learner corpus, i.e., an
electronic collection of texts written by foreign language learners. The corpus, the
International Corpus of Learner English (ICLE), is a 3.7 million word corpus of essays
written by intermediate to advanced learners of English as a foreign language from 16
mother tongue backgrounds (Granger et al., 2009). Two hundred and twenty-three
argumentative essays were extracted from three ICLE sub-corpora, i.e., 74 essays were
taken from the French (FR) component, 71 from the German (GE) component and 78
from the Spanish (SP) component of the learner corpus. These texts vary between 500
and 900 words in length. The detailed breakdown of the learner corpus sample used is
presented in Table 1.
Table 1 ICLE corpus sample
L1 background Number of learner essays Total tokens
FR 74 50,195
GE 71 49,856
SP 78 51,397
Total 223 151,448
2.2 Rating procedure
Although the essays in the ICLE can be broadly described as intermediate to advanced,
this still leaves room for considerable differences in proficiency between the texts. With a
view to assigning a more precise proficiency rating to each text, the 223 essays were
assessed by two professional raters, who assigned a score based on the descriptors of the
Common European Framework of Reference for Languages (CEF) (Council of Europe,
2001) (for a detailed description of the rating procedure, see Bestgen et al., 2010).
The CEF includes six proficiency levels which can be broken down into three groups
of two, viz. A1 and A2 (= basic users; elementary proficiency learners), B1 and B2
(= independent users; intermediate proficiency learners) and C1 and C2 (= proficient users;
advanced proficiency learners). Raters could also use + or – signs to further
specify quality within each proficiency level (e.g., B2– or C1+). We started the rating
procedure at level B1 because the CEF specifies that learners need a B1 proficiency level
to attempt essay writing. In order to attribute one final CEF score to each text, we
computed the mean of the holistic scores given by the raters (each holistic score had been
given a numerical value, as shown in Table 2).
Table 2 11-point numerical scale
Holistic CEF score B1– B1 B1+ B2– B2 B2+ C1– C1 C1+ C2– C2
Numerical value 0.67 1 1.33 1.67 2 2.33 2.67 3 3.33 3.67 4
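As a minimal illustration of this conversion, the sketch below maps the holistic labels of Table 2 to their numerical values and averages the two raters' scores; the averaging function is our own sketch, not the authors' script, and ASCII hyphens stand in for the minus signs in the labels.

```python
# Numerical values of the holistic CEF scores (Table 2).
CEF_SCALE = {"B1-": 0.67, "B1": 1.00, "B1+": 1.33, "B2-": 1.67, "B2": 2.00, "B2+": 2.33,
             "C1-": 2.67, "C1": 3.00, "C1+": 3.33, "C2-": 3.67, "C2": 4.00}

def final_cef_score(rater1_label, rater2_label):
    """Final CEF score of a text: mean of the two raters' numerical values."""
    return (CEF_SCALE[rater1_label] + CEF_SCALE[rater2_label]) / 2

print(final_cef_score("B2", "C1-"))  # 2.335
```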
The mean CEF score is 2.35 with a standard deviation of 0.99. However, as appears from
Table 3, there are marked differences between the three sub-corpora. A one-way
between-groups analysis of variance (ANOVA) indicated that there are highly significant
differences in mean CEF scores between the groups (F(2, 220) = 96.14; p < 0.0001). The
Student-Newman-Keuls post-hoc test revealed that all three group means actually differ
significantly from each other. As can be noted, however, the difference between the SP
group and the FR/GE groups is much more marked than that between the FR and GE
groups themselves.
Table 3 Mean CEF scores in the three L1 groups
N Mean Std. dev.
SP 78 1.44 0.64
FR 74 2.64 0.67
GE 71 3.03 0.87
2.3 Error detection and correction
Manual detection and correction of the errors was carried out by a native speaker of
English with considerable expertise in English language and linguistics. The same analyst
subsequently tagged each error in accordance with the Louvain error tagging system
(Dagneaux et al., 1998)1 and inserted the tags into the text files with the help of the
Université catholique de Louvain Error Editor (UCLEE), a menu-driven editor by means
of which error tags and corrections are inserted into the text files with the appropriate
markup. The Louvain error tagging system includes eight major error domains (formal,
grammatical, lexical, lexico-grammatical, punctuation, word redundant/missing/order,
style and infelicity), which are further broken down into 56 error categories. For the
present study, only the errors pertaining to the form of words have been selected. The
majority of these errors are tagged F and further broken down into pure spelling errors
(FS) and morphological errors (FM). While the majority of FS errors result in
non-existing word forms, the category also includes words with similar forms which are
easily confused, i.e., pairs such as their-there or to-too. We also included erroneous forms
such as the plural forms of uncountable nouns (advices) or adjectives (responsibles)
which are categorised as grammatical or lexico-grammatical errors according to the
Louvain error tagging system. Examples 1 to 4 illustrate the different categories with the
appropriate markup, viz. the error tag written between brackets in front of the erroneous
word and the correction presented between dollar signs after the error.
1 the fast spread of television can transform it into a double-edged (FS) wheapon
$weapon$
2 today (FS) its $it’s$ the Germans who have turned to killing foreigners living in
Germany
3 there are just as many weak, emotional, (FM) unpractical $impractical$ men as
there are strong and practical women
4 the experiments of breeding hundreds of (XNUC) cattles $cattle$ out of one cell
have raised a storm of protest.
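For illustration, error annotations in the format shown in examples 1 to 4 (an error tag between brackets before the erroneous form, the correction between dollar signs after it) can be extracted with a short regular-expression sketch; this is merely an assumed reading of the markup shown above, not the UCLEE tool itself.

```python
import re

# Matches e.g. "(FS) wheapon $weapon$": a parenthesised tag, the erroneous
# form(s), and the correction between dollar signs.
ERROR_PATTERN = re.compile(r"\((?P<tag>[A-Z]+)\)\s*(?P<error>.+?)\s*\$(?P<correction>[^$]+)\$")

def extract_errors(annotated_text):
    """Return (tag, erroneous form, correction) triples found in the text."""
    return [(m.group("tag"), m.group("error"), m.group("correction"))
            for m in ERROR_PATTERN.finditer(annotated_text)]

sample = ("the fast spread of television can transform it into a "
          "double-edged (FS) wheapon $weapon$")
print(extract_errors(sample))  # [('FS', 'wheapon', 'weapon')]
```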
Alongside the manual error detection process, a process of automatic error detection was
carried out on the texts, using Microsoft Word 2007 with only the spell checker turned
on. We chose to ignore words in upper case and words with numbers to avoid Word
flagging abbreviations and words like B170 or 54 m2. The spellchecking was carried out
twice: first with the UK English dictionary and then with the US English dictionary. Only
the words that were flagged by the two analyses were considered as errors for Word. The
first correction suggested by Word was consistently accepted. When no suggestion was
made, a # was inserted instead.
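A minimal sketch of this flagging procedure is given below. It only mimics the logic described above: plain word lists (with placeholder file names) stand in for Word 2007's UK and US dictionaries, and difflib's closest match stands in for Word's first suggestion.

```python
import difflib
import re

def load_wordlist(path):
    """Load a one-word-per-line dictionary file into a set of lower-case forms."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def flag_misspellings(text, uk_words, us_words):
    """Return (token, first suggestion) pairs for tokens flagged by BOTH
    dictionaries, i.e. absent from the UK and the US word list alike."""
    flags = []
    for token in re.findall(r"[\w']+", text):
        # Mimic "ignore words in upper case and words with numbers".
        if token.isupper() or any(ch.isdigit() for ch in token):
            continue
        low = token.lower()
        if low in uk_words or low in us_words:
            continue  # accepted by at least one dictionary
        # Stand-in for Word's first suggestion; '#' when nothing close is found.
        close = difflib.get_close_matches(low, uk_words | us_words, n=1)
        flags.append((token, close[0] if close else "#"))
    return flags

# Hypothetical usage with placeholder dictionary files:
# uk = load_wordlist("en_GB.txt"); us = load_wordlist("en_US.txt")
# print(flag_misspellings("The wheapon was never found.", uk, us))
```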
2.4 Spelling error sub-categorisation
As stated above, one of the objectives of this study was to investigate the potential of
breaking the errors down into sub-categories on the basis of the differences between the
forms produced by the learners and the correct forms proposed by the analyst or the spell
checker. The variables underlying the categorisation are: the element that carries the error
(letter, word boundary, apostrophe) and the error type (single letter addition, omission,
substitution, or transposition and multiple letter errors). Capitalisation errors have been
disregarded. Table 4 lists the nine categories together with their codes and illustrates
them with examples extracted from the corpus.
Table 4 List of sub-categories of spelling errors

Code     Description                                 Examples (erroneous form –> correction)
0X       Omission of a letter                        completly –> completely; concious –> conscious; distinc –> distinct;
                                                     eople –> people; mecanisms –> mechanisms; throghout –> throughout
X0       Addition of a letter                        develope –> develop; youngs –> young; alledged –> alleged;
                                                     eightheen –> eighteen; envolves –> evolves; ridicoulous –> ridiculous
Doub12   Single letter instead of double letter      especialy –> especially; robed –> robbed; adicts –> addicts;
                                                     carots –> carrots; ocurred –> occurred; occuring –> occurring
Doub21   Double letter instead of single letter      appartments –> apartments; allmighty –> almighty; detailled –> detailed;
                                                     loosing –> losing; proffessors’ –> professors’
XY       Substitution of one letter                  lifes –> lives; dependend –> dependent; consecuently –> consequently;
                                                     confortable –> comfortable; engeneering –> engineering; uncredible –> incredible
Swap     Interchange of two adjacent letters         concieved –> conceived; birht –> birth; lfie –> life;
                                                     peopels –> peoples; entreprises –> enterprises
Apost    Error involving an apostrophe               its –> it’s; womans –> woman’s; childrens’ –> children’s
SplitW   Erroneous splitting or joining of words     business_man –> businessman; every_one –> everyone; free-time –> free time;
         (word segmentation error)                   everyday –> every_day; airpollution –> air_pollution; eventhough –> even_though
Many     Two or more errors of the same type         unbalance –> imbalance; politic –> political; payed –> paid; weter –> whether;
         or of different types                       dustbinman –> dustman; theirselves –> themselves; beggining –> beginning;
                                                     configurating –> configuring; divorcion –> divorce; hitted –> hit

Notes: The underscore represents a space character. Doub12 is a special case of 0X and
Doub21 of X0. As shown below, the letters involved in both categories are
predominantly consonants.
Besides traditional categories like omission, addition, substitution, swap and Many
(Pollock and Zamora, 1984), our categorisation also contains categories that are less often
used in the literature such as the erroneous doubling (or lack of doubling) of letters or
erroneous splitting (or joining) of words. These categories have been added because they
seemed to be particularly recurrent features of learner spelling (cf. Al-Jarf, 2009b;
Bebout, 1985; Cook, 1997).
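A rough approximation of this nine-way scheme can be expressed in code. The sketch below classifies an (error, correction) pair from the category descriptions in Table 4; it is our own simplification for illustration, not the procedure actually used in the study, and capitalisation is ignored, as in the study.

```python
def categorise(error, correction):
    """Approximate the nine-way scheme of Table 4 for one (error, correction) pair."""
    e, c = error.lower(), correction.lower()
    if e == c:
        return None
    # Apost: the two forms are identical once apostrophes are removed.
    if e.replace("'", "") == c.replace("'", ""):
        return "Apost"
    # SplitW: identical once spaces, hyphens and underscores are removed.
    def strip_seps(s):
        return s.replace(" ", "").replace("-", "").replace("_", "")
    if strip_seps(e) == strip_seps(c):
        return "SplitW"
    # One letter missing from the error: 0X, or Doub12 if a double letter was reduced.
    if len(e) + 1 == len(c):
        for i in range(len(c)):
            if e == c[:i] + c[i + 1:]:
                doubled = (i > 0 and c[i] == c[i - 1]) or (i + 1 < len(c) and c[i] == c[i + 1])
                return "Doub12" if doubled else "0X"
    # One letter too many in the error: X0, or Doub21 if a single letter was doubled.
    if len(e) == len(c) + 1:
        for i in range(len(e)):
            if c == e[:i] + e[i + 1:]:
                doubled = (i > 0 and e[i] == e[i - 1]) or (i + 1 < len(e) and e[i] == e[i + 1])
                return "Doub21" if doubled else "X0"
    # Same length: a single substitution (XY) or a swap of two adjacent letters.
    if len(e) == len(c):
        diffs = [i for i in range(len(e)) if e[i] != c[i]]
        if len(diffs) == 1:
            return "XY"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and e[diffs[0]] == c[diffs[1]] and e[diffs[1]] == c[diffs[0]]):
            return "Swap"
    return "Many"

print(categorise("robed", "robbed"), categorise("birht", "birth"),
      categorise("payed", "paid"))  # Doub12 Swap Many
```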
2.5 Statistical technique
The first question we were addressing in this study was the extent to which the total
number of spelling errors and the different categories of misspellings can be used to
predict the CEF scores attributed to each learner text. As is customary in the literature on
automated writing evaluation (see for review, Warschauer and Ware, 2006; Yang et al.,
2002), we used the statistical technique of linear regression to assess the predictive power
of misspellings. In these analyses, the variable to be predicted is the CEF score of a text
and the predictors are the relative error frequencies in that same text. We used relative
frequencies to take into account the variable length of the texts. The frequencies in each
different category were divided by the total number of words in the text and multiplied by
1,000. For the prediction based on categories of misspellings, the best predictors were
selected by means of the Stepwise procedure (with a significance level to enter and to
stay of 0.05). The quality of the prediction was measured by the R2 coefficient which
corresponds to the part of the variance that can be explained by the predictor(s). Its value
ranges from 0 to 1, where 1 represents perfect prediction. As the present study compares
models based on different numbers of predictors, it was preferable to use the adjusted R2
which takes this factor into account.
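A bare-bones version of this analysis might look as follows in Python with statsmodels. The data layout (arrays of raw counts, text lengths and CEF scores) is assumed, and only forward selection is implemented, whereas the study's stepwise procedure also allows entered predictors to be dropped again.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_adjusted_r2(counts, lengths, cef_scores, alpha=0.05):
    """Forward selection of error-category predictors of the CEF score.

    counts:      (n_texts, n_categories) raw error counts per text
    lengths:     (n_texts,) number of words per text
    cef_scores:  (n_texts,) mean CEF score per text
    """
    # Relative frequencies per 1,000 words, as in the study.
    X = counts / lengths[:, None] * 1000.0
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate predictor when added to the current model.
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            pvals[j] = sm.OLS(cef_scores, design).fit().pvalues[-1]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:
            break
        selected.append(best)
        remaining.remove(best)
    if not selected:
        return selected, float("nan")
    final = sm.OLS(cef_scores, sm.add_constant(X[:, selected])).fit()
    return selected, final.rsquared_adj
```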
Admittedly, spelling errors are but one variable among many that can be used to
predict the quality of a text and considered in isolation, their predictive power is
necessarily limited. However, our analyses provide an initial indication of the role played
by spelling errors in general and make it possible to assess whether sub-categorisation
improves the prediction. The comparison between manual and automatic error detection
and correction also gives us an idea of the challenges faced in attempting to fully
automate the process of text evaluation.
3 CEF score prediction: manual approach
3.1 Total number of errors
A total of 1,614 errors were initially identified manually (10.7 errors per 1,000 words).
The analysis of the distribution of the number of errors per text showed that one of the
Spanish L2 texts contained 65 spelling errors, meaning that this text alone contained 4%
of the total number of errors in the whole corpus. As the error frequency was markedly
higher in this text than in all the other error-dense texts (which contained in the region of
30 errors each), the decision was made to discard this ‘rogue’ text from subsequent
analyses, which reduced the number of errors to 1,549 for 222 texts.
Results of the regression analysis demonstrate that the relative frequency of spelling
errors in a text can significantly predict (p < 0.0001) the CEF score of that text with an
adjusted R2 of 0.34. According to Cohen’s (1988) classical guidelines for effect size, this
value can be considered as large.
3.2 Error categories
The second stage in the analysis involved comparing the results obtained for the total
number of spelling errors with those obtained when sub-categorisation is introduced. The
frequency of the different error categories in the learner corpus is shown in Figure 1
(N = 1,549).
Figure 1 Frequency of spelling error categories (pie chart; see online version for colours): 0X 13%, X0 9%, Doub12 7%, Doub21 5%, XY 21%, Swap 2%, Apost 1%, SplitW 18%, Many 24%
The figure shows that Many represents more than 23% of the errors and SplitW more
than 18%. These two types of errors tend to be neglected by classic automatic detection
systems which mainly focus on single letter errors which are predominant in L1 writing
(Pollock and Zamora, 1984). The frequency of multi-letter errors has also been
underlined by Rimrott and Heift (2008) for learners of German. Figure 1 also brings out a
sizeable proportion of doubling errors (Doub12 and Doub21). Although this is a classic
spelling difficulty in English which affects both native and L2 writers (cf. Cook, 1997), a
recent study based on the LOCNESS corpus, a corpus of essays written by US
undergraduates (cf. Granger and Wynne, 1999), gives frequencies below 0.1 per 1,000
words for Doub12 and Doub21 errors, while the frequencies in our L2 corpus are above
0.7 and 0.5 for Doub12 and Doub21 respectively.
Multiple regression analysis based on the different categories of errors yields a better
prediction than that achieved for the total number of undifferentiated errors. The adjusted
R2 based on the six predictors selected by the stepwise procedure was equal to 0.43,
compared to 0.34 for the previous analysis. The six selected predictors are given in
Table 5. They are ordered according to their importance in predicting the CEF score, the
importance being measured by the squared semi-partial correlation between the predictor
and the CEF score. It corresponds to the decrement in R2 that would result from the
elimination of this predictor from the model [Howell, (2007), p.526].
In order to facilitate comparison between the different analyses, we present in Table 5
an overview of all the results. The discussion of some of these results is deferred to the
next section, however.
Table 5 CEF score prediction: manual vs. automatic detection

                            Manual                           Word
Analyses                    Adj. R2   Predictors   SSPC      Adj. R2   Predictors   SSPC
Total number of errors      0.34                             0.27
Error categories            0.43      X0           0.054     0.32      Doub12       0.070
                                      XY           0.036               0X           0.038
                                      Doub12       0.036               X0           0.033
                                      Many         0.030               XY           0.023
                                      0X           0.014
                                      SplitW       0.014

Notes: SSPC stands for squared semi-partial correlation. Where there is only one
predictor, SSPC is equal to the model R2.
4 CEF score prediction: automatic approach
For the automatic detection stage of our analysis we used Word 2007, a commercial,
widely-distributed software tool frequently used by L2 writers to correct their spelling.
Investigating Word’s success at predicting CEF scores also gave us the opportunity to
assess its accuracy in L2 error detection and correction. Rimrott and Heift (2008), the
most in-depth of the small number of studies of L2 spelling errors carried out to date,
indicates that Word 2003 is not highly successful. When interpreting their
results, it is important to bear in mind that their study only assesses whether Word detects
errors that are present in the texts. It does not investigate whether Word flags correct
words as errors. In a study like ours which aims to predict text quality, incorrect flags
need to be included.
4.1 Total number of errors
4.1.1 Precision and recall rates of Word 2007
Word’s spell checker flagged 1,513 words in the corpus, which seems at first sight to be
close to the number of manually detected errors (1,549). However, subsequent analysis
revealed that these two figures did not cover the same sets of errors. Of the 1,513 words
flagged by Word, 267 were overflags (OFs), i.e., words that were not detected as errors in
the manual analysis. Only 1,246 words could therefore be considered as correct flags
(CFs). In addition, Word missed 303 manually detected spelling errors, henceforth
referred to as underflags (UFs).
To assess Word’s success at error detection we used the measures of precision and
recall recommended by Leacock et al. (2010, p.38) to evaluate error detection systems in
the language learner’s context. Salton (1989, p.248) defines these measures as follows:
“Two main parameters of retrieval effectiveness have been used over the years, defined
as the proportion of relevant materials retrieved, or recall (R), and the proportion of
retrieved materials that are relevant, or precision (P)”. Applied to our study, this produces
the following results:
Recall rate = (number of relevant spelling errors flagged by Word 2007) / (total number of manually detected spelling errors) = 1,246 / 1,549 = 80.43%

Precision rate = (number of relevant spelling errors flagged by Word 2007) / (total number of spelling errors flagged by Word 2007) = 1,246 / 1,513 = 82.35%
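In terms of the flag counts used throughout this section (CF = correct flags, OF = overflags, UF = underflags), the two measures reduce to a small computation, sketched here for illustration:

```python
def precision_recall(cf, of, uf):
    """Precision and recall of a spell checker's flags.
    cf: correct flags, of: overflags (false alarms), uf: underflags (misses)."""
    precision = cf / (cf + of)   # proportion of flagged words that are real errors
    recall = cf / (cf + uf)      # proportion of real errors that are flagged
    return precision, recall

p, r = precision_recall(cf=1246, of=267, uf=303)
print(f"precision = {p:.1%}, recall = {r:.1%}")  # precision = 82.4%, recall = 80.4%
```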
An analysis of the OFs shows that the vast majority (84%) is made up of foreign words
(liégeoises, Grüne), abbreviations in lower case or including lower case (a.s.o., CFC’s),
proper names (Garbo, Testarossa) or stylistic effects (GE46: “when she suddenly lisped,
‘I’m terribly thorry, but the litht ith full now. Can you try it again nektht themethter?’”).
While Word can hardly be blamed for flagging these words, the situation is different with
the remaining 16%, which are also more serious in an L2 perspective as they may
mislead L2 learners. Here we find clear detection errors such as “The parents too have a
role to play in the education of their children –> The parents to have...” and English
words that are absent from Word’s dictionary (categorial, hoovering). While the second
category of OFs is particularly confusing for L2 learners, both categories are just as
problematic when it comes to predicting the quality of a text.
4.1.2 Prediction of CEF scores
As shown in Table 5, the relative frequency of errors detected by Word in an L2 text
makes it possible to predict the CEF score significantly with an adjusted R2 of 0.27. This
value is significantly lower than that obtained on the basis of manual detection, which
stands at 0.34 [Z = –2.20; p < 0.05; comparison of correlated correlation coefficients:
Meng et al. (1992)].
4.2 Error categories
4.2.1 Efficiency in error detection: breakdown per category
When Word’s efficiency in error detection is assessed in terms of error categories, it soon
appears that the precision and recall rates given in the preceding section conceal marked
discrepancies between categories. For example, of the 282 errors which fall into the
category of SplitW according to manual annotation, only 88 (31%) are flagged as an error
by Word. All the other instances are UFs. SplitW errors actually account for 64% of the
spell checker’s 303 UFs. Particularly striking is the fact that Word accepts as correct 108
of the 109 hyphenated words which should have been written unhyphenated or solid (e.g.,
honey-moon –> honeymoon). It also misses 80 of the 101 unhyphenated words which
should either have been written hyphenated or solid (e.g., sheep_dog –> sheepdog).
However, it proves more effective at detecting solid words which should have been
written hyphenated or unhyphenated as two words, as only six of the 72 instances are
missed (e.g., goodhearted –> good-hearted, everyday –> every_day). Another
noteworthy category is the Many category. Only 60% of the words flagged by Word were
annotated manually as errors. The other 40% are OFs (cf. Section 4.1.1).
4.2.2 Error correction success rate: breakdown per category
So far we have only investigated Word’s success at detecting spelling errors. Another
particularly important issue for L2 writers is whether it is capable of suggesting the
correctly spelt word as first correction. In this section, we only take into account the cases
where both Word and manual annotation have detected an error and provided a
correction. Our aim is to determine whether automatic and manual analyses suggest the
same word or not. If Word presents the same word as in the manual analysis as first
suggestion, the occurrence is counted as CF. The results are presented in Table 6. On the
left-hand side of the table, the percentage of CFs is calculated according to the error
categories based on manual detection and correction, while on the right-hand side, the
percentage is calculated on the basis of the error categories based on Word 2007. The
difference between the two sides of the table is highlighted clearly by the Swap category.
When there is a Swap error according to manual analysis, Word provides the right
correction in 97% of cases, but when Word says it is a Swap error, the percentage of CFs
is much lower (72%). This difference probably reflects a tendency for Word’s algorithms
to prioritise Swap errors over other error types. This prioritisation might well turn out to
be more appropriate for native texts than L2 texts.
Table 6 Word 2007’s error correction success rate: breakdown per category
                    Categories according to            Categories according to
                    manual annotation                  Word 2007
Category            Total      CF (%)                  Total      CF (%)
0X                  187        86.10                   215        77.67
X0                  140        80.71                   152        74.34
Doub12              107        97.20                   116        91.38
Doub21              76         98.68                   79         97.47
XY                  301        77.08                   322        72.98
Swap                34         97.06                   46         71.74
Apost               14         85.71                   49         24.49
SplitW              88         88.64                   102        76.47
Many                299        26.09                   165        39.39
Note: CF = correct flag (Word 2007 provides the correct spelling as first suggestion)
The table shows marked differences between categories, some (e.g., Doub21) achieving
very high CF rates while others (e.g., XY or X0) fare much less well and would merit
closer investigation of the underlying algorithms. Two categories deserve particular
attention in view of their very low CF rates. The most problematic category is Many, a
result which confirms Rimrott and Heift’s (2008) study of L2 German errors. This result
is hardly surprising as Word is likely to experience more difficulty in finding the correct
word when the original word contains many errors. For example, for sacrifying Word
uses Swap and suggests scarifying when in fact the target word was sacrificing, which
belongs to the Many category (XY + 0X). Another problematic category is Apost (right-
hand side of the table). Word makes more use of this category than it should, mainly
because it frequently introduces an apostrophe when the incorrect word contains ‘s’ as its
last letter. This is the case, for example, for the word lifes, which is corrected as life’s
instead of lives (16 occurrences).
4.2.3 Prediction of CEF scores
The analysis revealed that multiple regression achieves a better prediction rate when it is
based on the different error categories than when it is based on global scores. The
adjusted R2 based on the four predictors selected by the stepwise procedure is equal to
0.32 (vs. 0.27 for the undifferentiated error totals). However, the gain is much lower than
that based on categories of manually detected errors (0.43). The selected predictors and
their squared semi-partial correlations are presented in Table 5. It appears clearly from
the table that Many and SplitW are not selected as predictors in the analysis based on
Word’s flags. This was to be
expected as these are the categories of errors that Word was found to be least good at
detecting (SplitW, see Section 4.2.1) and correcting (Many, see Section 4.2.2).
5 The impact of learners’ L1
Up to now, this article has presented analyses of the learner data as a whole, with no
distinctions made relating to the three different learner populations – FR, GE and SP – it
represents. However, studies such as Bebout (1985), James et al. (1993), Al-Jarf (2009a)
and Mitton and Okada (2007) conclude that the learners’ L1 affects the type of spelling
errors they produce. It therefore seems likely that an automatic detection system which is
better at detecting and correcting some error categories than others – and our study shows
that this is the case for Word – runs the risk of favouring some learner populations and
penalising others and this risk exists whether the analysis is based on undifferentiated or
sub-categorised spelling errors. In this section we investigate the potential impact of the
learners’ mother tongue by providing the L1 breakdown of the error categories and some
examples of L1-related categories of errors.
5.1 Error frequency: L1 breakdown of manually detected errors
The potential impact of the learners’ L1 on the frequency of error types was assessed by
means of ANOVAs, taking the relative frequency of each error category as dependent
variable. In cases where the difference is significant, the means of the three L1 groups
were compared by means of the Student-Newman-Keuls procedure.
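A sketch of the per-category test is given below, assuming the relative frequencies (per 1,000 words) have already been grouped by L1; the Student-Newman-Keuls post-hoc comparison is not shown, as it is not available in scipy.

```python
from scipy.stats import f_oneway

def anova_by_l1(freq_fr, freq_ge, freq_sp):
    """One-way ANOVA over one error category's relative frequencies
    in the FR, GE and SP groups of texts."""
    f_stat, p_value = f_oneway(freq_fr, freq_ge, freq_sp)
    return f_stat, p_value

# Hypothetical usage: arrays of per-text relative frequencies for one category.
# f, p = anova_by_l1(splitw_fr, splitw_ge, splitw_sp)
```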
As appears from Table 7, there is a statistically significant difference between the SP
group and the other two groups for nearly all the error categories. The higher error rate in
the SP texts ties in with the difference in the quality of the essays described above
(Section 2.2), which showed that the difference between the SP group and the FR/GE
groups is much more marked than that between the FR and GE groups.
Table 7 Relative frequency of error categories in FR, GE and SP: results of the ANOVAs

            FR             GE             SP
Category    Mean    SD     Mean    SD     Mean    SD     R2      p      Diff. sign.
0X          0.6     1.1    0.9     1.6    2.5     3.1    0.14    ***    SP > (FR = GE)
Doub12      0.2     0.6    0.2     0.6    1.8     2.1    0.25    ***    SP > (FR = GE)
X0          0.6     1.1    0.7     1.4    1.6     2.0    0.09    ***    SP > (FR = GE)
Doub21      0.4     0.8    0.5     1.1    0.6     1.1    0.01    NS
XY          1.1     1.8    1.5     3.0    3.7     3.4    0.14    ***    SP > (FR = GE)
Swap        0.2     0.6    0.0     0.2    0.4     1.0    0.05    **     SP > GE, FR = GE, FR = SP
Apost       0.0     0.0    0.3     0.6    0.0     0.2    0.08    ***    GE > (FR = SP)
SplitW      1.3     1.9    3.0     3.0    1.4     1.7    0.11    ***    GE > (FR = SP)
Many        1.8     2.1    1.4     1.9    3.9     3.0    0.17    ***    SP > (FR = GE)
Notes: SD = standard deviation, NS = not significant, **p < 0.01 and ***p < 0.0001.
The diff. sign. column presents the statistically significant differences between
the means according to the Student-Newman-Keuls procedure. For example,
SP > (FR = GE) indicates that the mean for the FR group does not differ
significantly from that of the GE group, while both of these means are
significantly lower than the mean for SP.
5.2 Illustrations of L1-related error categories
The SplitW category deserves special attention as it is the only one which is more
prominent in the GE group than in the other two.2 The doubling categories are also
worthy of closer investigation. They display opposite tendencies: Doub12 is significant
and the effect size is large (R2 = 0.25), while Doub21 is non-significant and the effect
size is close to zero. A more fine-grained qualitative analysis will enable us to assess
whether, as suggested in the literature, these differences originate in transfer from the
learners’ L1.
5.2.1 Word segmentation
SplitW errors can be further sub-categorised into three categories:
unhyphenated words that should be hyphenated or solid
(e.g., business man –> businessman)
hyphenated words that should be unhyphenated or solid (e.g., free-time –> free time)
solid words that should be hyphenated or unhyphenated
(e.g., airpollution –> air pollution).
The L1 breakdown of the three categories is given in Table 8.
Table 8 Relative frequency of the SplitW sub-categories in FR, GE and SP: results of the ANOVAs

                FR             GE             SP
Category        Mean    SD     Mean    SD     Mean    SD     R2      p      Diff. sign.
Unhyphenated    0.5     0.9    0.9     1.7    0.7     1.1    0.02    NS
Hyphenated      0.5     1.4    1.4     1.9    0.3     0.9    0.09    ***    GE > (FR = SP)
Solid           0.3     0.9    0.8     1.8    0.4     0.7    0.03    *      GE > (FR = SP)

Note: SD = standard deviation, NS = not significant, *p < 0.05 and ***p < 0.0001
While there is no difference between groups for the unhyphenated category, the other two
categories – hyphenated and solid – are significantly more frequent in GE than in FR or
SP. Examples include honey-moon > honeymoon, lorry-drivers > lorry_drivers,
alarm-clock > alarm_clock, selfesteem > self-esteem, familylife > family_life, eachother
> each_other. This difference is most probably due to transfer from German which is
characterised by a high number of compounds, which are usually written as one word,
i.e., solid (Schulkind, Strassenverkehr, Haustür) but may in some cases be hyphenated
(CD-Laufwerk, Lufthansa-Pressesprecher). Unhyphenated compounds do not exist in
German (e.g., rush hour = Verkehrsspitze; lorry driver = Lastwagenfahrer).3
5.2.2 Letter doubling
The results for letter omission (Doub12) and letter addition (Doub21) are strikingly
different. As appears from Table 9, Doub12 errors are significantly more frequent in SP
than in FR and GE. Over 80% of the Doub12 errors are found in the SP sub-corpus.
Table 9 Means and standard deviations (relative frequency) of Doub12 errors in FR, GE and SP

FR             GE             SP
Mean    SD     Mean    SD     Mean    SD
0.2     0.6    0.2     0.6    1.8     2.1

Note: SD = standard deviation
Here too the influence of the learners’ L1 appears as a likely factor, as shown by the list
in Table 10 that compares the erroneous forms used by the SP learners to the
corresponding words in Spanish. As the table shows, the letters involved in Doub12
errors are predominantly consonants.4 As pointed out by Bebout (1985, p.583), although
consonant doubling is a difficulty for any learner or writer of English, it is particularly
treacherous for native speakers of Spanish who “are less used to paying attention to the
presence or absence of doubled consonants or to making decision about doubling when
writing”.
Conversely, the infrequency of this type of error in FR and GE can be explained by
the lack of difference between German/French and English for this category. To take but
one example, the English word communication corresponds to communication in French
and Kommunikation in German.
Table 10 Examples of Doub12 errors in SP and corresponding Spanish words
Erroneous form   Spanish word        Erroneous form   Spanish word        Erroneous form   Spanish word
comunication     comunicación        imposible        imposible           pesimism         pesimismo
comunity         comunidad           metalic          metálico            posibility       posibilidad
diference        diferencia          ocasionally      ocasionalmente      preocupation     preocupación
dificult         dificil             oficially        oficialmente        profesional      profesional
eficient         eficiente           oportunity       oportunidad         sufrage          sufragio
exagerate        exagerar            oposition        oposición           suposed          suponer
excelent         excelente           opression        opresion            supressing       supresión
As shown in Table 11, no such difference between the three learner groups can be
observed as regards Doub21 errors.
Table 11 Means and standard deviations (relative frequency) of the Doub21 error in FR, GE and SP

FR             GE             SP
Mean    SD     Mean    SD     Mean    SD
0.4     0.8    0.5     1.1    0.6     1.1

Note: SD = standard deviation
The influence of the L1 is mainly felt in the FR group where most of the errors can be
related to the existence of an equivalent word with two letters in French (e.g.,
educationnal – éducationnel; mentionned – mentionné).5 The GE sub-corpus contains a
mixture of potentially L1-related errors (detailled – detailliert) and errors where L1
influence is absent (beeing, neccessary). The SP sub-corpus contains three times fewer
occurrences of Doub21 errors than Doub12 errors and in this case, the errors are clearly not
L1-related (carefull; compells). This confirms Bebout’s (1985, p.579) study of Spanish
learners which revealed significantly more errors involving the failure to double a
consonant than the unnecessary doubling of one.
Cook (1997) gives further illustrations of L1 influence. He shows, for example, that
Japanese learners are the only ones to confuse <l> and <r> as in familiality or grobal.
However, his general conclusion is that “transfer from the L1 is less important than had
been believed”. Our own results are more in keeping with an earlier study by James et al.
(1993) which assigned a large proportion of errors (38.5%) to the learners’ L1. However,
the issue is clearly far from settled and further research is needed to clarify the role of L1.
6 Conclusions
Our study has generated a number of interesting findings relating to L2 spelling errors
and the role they can play in automatic text evaluation.
First, our study shows that spelling errors, whether detected manually or
automatically, are good predictors of the quality of L2 texts and that prediction scores can
be further improved by sub-categorising errors. The benefit derived from
sub-categorisation is high in the case of manually detected but much lower in the case of
automatically detected errors. The analysis shows that this is mainly due to the
difficulties encountered by Word in detecting and correcting errors of the types Many and
SplitW. Of the errors detected by Word that fall into the Many category, 38% are OFs
resulting from its failure to recognise some proper names, abbreviations and foreign words. A
better handling of some of these errors would make it possible to improve prediction
accuracy. Our study also shows that Word encounters difficulties when it comes to
identifying SplitW errors, only 31% of which are detected. The question that remains
unanswered in this connection is whether such errors are more typical of L2 writers than
native writers.
Our analyses also highlight the impact of the learners’ L1 on the success rate of spell
checkers. This has implications for both the automatic grading of L2 texts and the
customisation of spell checkers for different categories of learners. As suggested by
Haggan (1991, p.61), progress will only be made in the field if further studies are
conducted “to add to our stock of cross-language error profiles”.
Overall our study strongly underlines the necessity to adapt spell checkers to L2
learners, a need that has been voiced by other researchers and led to some preliminary
implementation. Mitton and Okada (2007), for example, have adapted a spell checker to
some distinctively Japanese error patterns and report some promising results. Our study
also underlines the need already identified by Granger and Wynne (1999) to take spelling
errors into account when computing other indices of text quality such as lexical variation.
On this point it is interesting to note that spelling errors seem to be a considerably stronger
predictor of text quality than lexical diversity (Yu, 2010). Admittedly, this comparison
can only be tentative in view of differences in corpus and methodology. A more reliable
comparison can however be made with Bestgen et al. (2010), which uses the same corpus
and the same evaluation method. In that study, the indices for lexical variation account at
best for 24% of the variance of the CEF score, which is markedly less than in the
analyses presented here.
One major limitation of our study is that we only used one spell checker, viz. Word
2007. Further research is necessary to ascertain whether other spell checkers encounter
the same difficulties as those highlighted in our study. Other error classifications such as
those based on competence and performance put forward by Rimrott and Heift (2008) or
those based on the sources of error (Cook, 1997; Haggan, 1991) also need to be assessed.
There is no doubt that those classifications are linguistically more meaningful than those
used in our study but it remains to be seen to what extent they can be automated.
References
Al-Jarf, R. (2009a) ‘Spelling error corpora in EFL’, Proceedings of the International Conference
on Multi Development and Application of Language and Linguistics, 15–16 May 2009,
National Cheng Kung University, Tainan City, Taiwan.
Al-Jarf, R. (2009b) ‘Phonological and orthographic problems in EFL college spellers’, TELLIS
Conference Proceedings, Azad Islamic University Roudehen, Iran, available at
http://repository.ksu.edu.sa/jspui/handle/123456789/5618 (accessed on 15 May 2010).
Bebout, L. (1985) ‘An error analysis of misspellings made by learners of English as a first and as a
second language’, Journal of Psycholinguistic Research, Vol. 14, No. 6, pp.569–593.
Bestgen, Y., Lories, G. and Thewissen, J. (2010) ‘Using latent semantic analysis to measure
coherence in essays by foreign language learners?’, in Bolasco, S., Chiari, I. and Giuliano, L.
(Eds.): Proceedings of 10th International Conferences Journées d’Analyse statistique des
Données Textuelles, pp.385–395, LED, Rome.
Botley, S. and Dillah, D. (2007) ‘Investigating spelling errors in a Malaysian learner corpus’,
Malaysian Journal of ELT Research, Vol. 3, pp.74–93.
Burstein, J., Chodorow, M. and Leacock, C. (2004) ‘Automated essay evaluation: the criterion
online writing service’, AI Magazine, Vol. 25, No. 3, pp.27–36.
Chodorow, M. and Burstein, J. (2004) ‘Beyond essay length: evaluating e-rater’s performance on
TOEFL essays’, Research Reports, Report 73, ETS RR-04-04, ETS, Princeton.
Chung, G.K.W.K. and O’Neil, H.F. (1997) ‘Methodological approaches to online scoring of
essays’, CSE Technical Report 461, University of California, Los Angeles.
Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Erlbaum,
Hillsdale.
Cook, V. (1997) ‘L2 users and English spelling’, Journal of Multilingual and Multicultural
Development, Vol. 18, No. 6, pp.474–488.
Council of Europe (2001) Common European Framework of Reference for Languages: Learning,
Teaching, Assessment, Cambridge University Press, Cambridge.
Crossley, S.A., Salsbury, T., McCarthy, P.M. and McNamara, D.S. (2008) ‘Using latent semantic
analysis to explore second language lexical development’, in Wilson, D. and Chad Lane, H.
(Eds.): Proceedings of the 21st International Florida Artificial Intelligence Research Society
Conference, pp.136–141, 15–17 May 2008, Coconut Grove, Florida.
Dagneaux, E., Denness, S. and Granger, S. (1998) ‘Computer-aided error analysis’, System,
Vol. 26, No. 2, pp.163–174.
Dikli, S. (2006) ‘An overview of automated scoring of essays’, Journal of Technology, Learning,
and Assessment, Vol. 5, No. 1, available at http://www.jtla.org (accessed on 10 May 2010).
Granger, S. and Wynne, M. (1999) ‘Optimising measures of lexical variation in EFL learner
corpora’, in Kirk, J. (Ed.): Corpora Galore, pp.249–257, Rodopi, Amsterdam.
Granger, S., Dagneaux, E., Meunier, F. and Paquot, M. (Eds.) (2009) The International Corpus of
Learner English: Handbook and CD-ROM (Version 2), Presses universitaires de Louvain,
Louvain-la-Neuve.
Haggan, M. (1991) ‘Spelling errors in native Arabic speaking English majors: a comparison
between remedial students and fourth year students’, System, Vol. 19, Nos. 1–2, pp.45–61.
Heift, T. and Rimrott, A. (2008) ‘Learner responses to corrective feedback for spelling errors in
CALL’, System, Vol. 36, No. 2, pp.196–213.
Hovermale, D.J. (2008) ‘SCALE: spelling correction adapted for learners of English’, Paper
presented at CALICO 2008 ICALL SIG (Pre-conference Workshop), 18–19 March 2008,
San Francisco, USA.
Hovermale, D.J. and Martin, S. (2008) ‘Developing an annotation scheme for ELL spelling errors’,
Proceedings of MCLC-5 (Midwest Computational Linguistics Colloquium) East Lansing,
10–11 May 2008, Michigan, USA.
Howell, D.C. (2007) Statistical Methods for Psychology, 6th ed., Thomson, Belmont.
James, C., Scholfield, P., Garrett, P. and Griffiths, Y. (1993) ‘Welsh bilinguals’ spelling: an
error analysis’, Journal of Multilingual and Multicultural Development, Vol. 14, No. 4,
pp.287–306.
Landauer, T.K., Laham, D. and Foltz, P. (2003) ‘Automatic essay assessment’, Assessment in
Education, Vol. 10, No. 3, pp.295–308.
Leacock, C., Chodorow, M., Gamon, M. and Tetreault, J. (2010) ‘Automated grammatical error
detection for language learners’, Synthesis Lectures on Human Language Technologies, No. 9,
Morgan & Claypool, Princeton.
Lonsdale, D. and Strong-Krause, D. (2003) ‘Automated rating of ESL essays’, Proceedings of the
NAACL 2003 Workshop, pp.61–67, Association for Computational Linguistics, Morristown.
Meng, X., Rosenthal, R. and Rubin, D.B. (1992) ‘Comparing correlated correlation coefficients’,
Psychological Bulletin, Vol. 111, No. 1, pp.172–175.
Mitton, R. and Okada, T. (2007) The Adaptation of an English Spellchecker for Japanese Writers,
Birkbeck ePrints, London, available at http://eprints.bbk.ac.uk/592/3/592.pdf.
Okada, T. (2004) ‘A corpus analysis of spelling errors made by Japanese EFL writers’, Yamagata
English Studies, Vol. 9, pp.17–36.
Pérez, D., Alfonseca, E. and Rodríguez, P. (2004) ‘Application of the Bleu method for evaluating
free-text answers in an e-learning environment’, Proceedings of the Language Resources and
Evaluation Conference (LREC-2004), Lisbon.
Pollock, J.J. and Zamora, A. (1984) ‘Automatic spelling correction in scientific and scholarly text’,
Communications of the ACM, Vol. 27, No. 4, pp.358–368.
Rimrott, A. and Heift, T. (2005) ‘Language learners and generic spell checkers in CALL’, CALICO
Journal, Vol. 23, No. 1, pp.17–48.
Rimrott, A. and Heift, T. (2008) ‘Evaluating automatic detection of misspellings in German’,
Language Learning & Technology, Vol. 12, No. 3, pp.73–92.
Salton, G. (1989) Automatic Text Processing, Addison-Wesley, Reading.
Warschauer, M. and Ware, P. (2006) ‘Automated writing evaluation: defining the classroom
research agenda’, Language Teaching Research, Vol. 10, No. 2, pp.157–180.
Yang, Y., Buckendahl, C.W., Juszkiewicz, P.J. and Bhola, D.S. (2002) ‘A review of strategies for
validating computer-automated scoring?’, Applied Measurement in Education, Vol. 15, No. 4,
pp.391–412.
Yu, G. (2010) ‘Lexical diversity in writing and speaking task performances’, Applied Linguistics,
Vol. 31, No. 2, pp.236–259.
Notes
1 The error tagging procedure was carried out within the framework of a PhD that is under
completion at the Centre for English Corpus Linguistics (Thewissen, J. ‘Accuracy across
proficiency levels: Insights from an error-tagged EFL learner corpus’, Université catholique de
Louvain: Centre for English Corpus Linguistics).
2 The Apost category has a similar profile but the limited number of errors (15 in the manual
annotation) does not allow for a more fine-grained analysis.
3 We are grateful to Jennifer Thewissen for providing these examples.
4 98% of the Doub12 errors in the whole learner corpus involve consonants. In Doub21 errors,
consonants also predominate but to a lesser extent (78%).
5 Had the proficiency level of the learners been lower, one could have expected many more
instances of this type of error, as is the case for the Doub21 errors in the SP corpus.
... The built-in spell checker is probably one of the most common text editing tools featured in today's word processors. It has transformed and considerably increased the efficiency of spelling error detection and correction (Bestgen & Granger, 2011), constituting a 'proofing tool' that users frequently rely upon (Pan et al., 2021). Generic spell checkers -those designed for native (L1) writers, such as the spell checkers built into Microsoft Word and Google Docs -have become increasingly sophisticated, and are now widely distributed and used across different school subjects. ...
... When the spell check function is turned on, the software automatically detects and displays various spelling errors and, accordingly, provides immediate feedback in the form of corrections and alternative spelling suggestions. However, generic spell checkers are not 'fool proof ', and their use may result in different types of errors, both human initiated and computer initiated (Bestgen & Granger, 2011;Musk, 2016Musk, , 2021. This suggests that the use of spell check software can also sometimes constrain students' writing. ...
... In examining the quality of spelling errors, research points that there are seemingly varying correction rates between different spell checkers (e.g., commercial and non-commercial). Generic spell checkers are better adapted to detect and correct single letter mistyping than lexical misspellings, due to greater target deviation in the latter (Bestgen & Granger, 2011). Generic spell checkers are further limited and less adapted to meet the needs of nonnative writers, as shown for instance among German (Rimrott & Heift, 2008) and Arabic-speaking (Saigh & Schmitt, 2012) L2 (second language) students. ...
Article
Full-text available
This study focuses on the distribution of agency in software-based spell checking in L1 (Language and Literature) teaching. Drawing on video-ethnographic data from a Swedish-medium school in Finland, the research shows that built-in spell checkers can both afford and constrain studentsʼ digital writing. Through examining the micro-dynamics between human and material agency in use of spell checking, the analysis illustrates that the software does not always work as expected from the userʼs perspective, and hence becomes framed as a ‘trouble sourceʼ, assigned ‘linguistic authorityʼ, and held accountable for not meeting human intentionality. We argue that technologyʼs inherent functions and properties play a central role in the co-constitution of agency in digital writing practices, and call for a greater awareness of generic spell checkersʼ opportunities and limitations in teaching and learning.
... Research in the field of second-language acquisition has found evidence of phoneme-shift based misspellings stemming from L1 influence in L2 text for specific language pairs (Ibrahim, 1978;Cook, 1997;Bestgen and Granger, 2011;Sari, 2014;Ogneva, 2018;Motohashi-Saigo and Ishizawa, 2020). Studies in Natural Language Understanding (NLU) have been limited to spelling correction Nagata et al. (2017); Flor et al. (2019) and native language identification Chen et al. (2017); Nicolai et al. (2013) in English learners. ...
... There has also been a fair amount of interest in the second-language acquisition field on the influence of L1 on L2 spelling. Ibrahim (1978); Cook (1997); Bestgen and Granger (2011); Sari (2014); Ogneva (2018); Motohashi-Saigo and Ishizawa (2020) all find evidence of such influence in specific language pairs. These often stem from the lack of certain sounds in L1 leading to difficulty in distinguishing similar sounds in L2. ...
... Research in the field of second-language acquisition has found evidence of phoneme-shift based misspellings stemming from L1 influence in L2 text for specific language pairs (Ibrahim, 1978;Cook, 1997;Bestgen and Granger, 2011;Sari, 2014;Ogneva, 2018;Motohashi-Saigo and Ishizawa, 2020). Studies in Natural Language Understanding (NLU) have been limited to spelling correction Nagata et al. (2017); Flor et al. (2019) and native language identification Chen et al. (2017); Nicolai et al. (2013) in English learners. ...
... There has also been a fair amount of interest in the second-language acquisition field on the influence of L1 on L2 spelling. Ibrahim (1978); Cook (1997); Bestgen and Granger (2011); Sari (2014); Ogneva (2018); Motohashi-Saigo and Ishizawa (2020) all find evidence of such influence in specific language pairs. These often stem from the lack of certain sounds in L1 leading to difficulty in distinguishing similar sounds in L2. ...
Preprint
Full-text available
A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understating models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
... The use of some kind of learner corpus is definitely a must in order to identify typical learner errors that can be used and explained in didactic language tools. These corpora have been used in one way or another to develop numerous writing tools, as discussed by Bestgen andGranger (2011), Paquot (2012), Wanner, Verlinde and Alonso-Ramos (2013), Alonso-Ramos and García-Salido (2019), Frankenberg-García, Lew, Roberts, Rees and Sharma (2019), and Granger and Paquot (2022). The best type of corpus for this purpose is undoubtedly a tagged corpus with parallel correction of the errors detected, such as the Spanish one described by Davidson, Yamada, Fernández-Mira, Carando, Sánchez-Gutiérrez and Sagae (2020). ...
... Vajjala (2018) used two datasets of non-native English essays written in test-taking scenarios and reported a prediction accuracy of 73% in distinguishing between three proficiency levels (low, medium and high). One promising but as yet largely unexplored research direction is that of tailoring the automatic/automated systems to the native language of the writers (Leacock et al., 2015;Bestgen & Granger, 2011). ...
Article
Full-text available
The aim of this article is to survey the field of learner corpus research from its origins to the present day and to provide some future perspectives. Key aspects of the field — learner corpus design and collection, learner corpus methodology, statistical analysis, research focus and links with related fields, in particular SLA, FLT and NLP — are compared in first-generation LCR, which extends from the late 1980s to 2000, and second-generation LCR, which covers the period from the early 2000s until today. The survey shows that the field has undergone major theoretical and methodological changes and considerably extended its range of applications. Future developments that are likely to gain ground are grouped into three categories: increased diversity, increased interdisciplinarity and increased automation.
... Thorough error analyses provide valuable insights that can predict the quality of learner texts (e.g. Bestgen & Granger, 2011) and aid in the development of NLP algorithms designed to detect common learner errors (Higgins et al., 2015: 590). ...
... Words absent from the corpus are highlighted, followed by either a list of alternatives or a correction executed by the software (Mitton, 2010). While this method's detection rate is higher than 80% (Bestgen & Granger, 2011; Blázquez-Carretero & Fan, 2019), it only pertains to single-word errors, preventing GSCs from identifying context-specific mistakes should the misspelling correspond to an existent word (Blázquez-Carretero & Fan, 2019). As GSCs are L1-oriented, their built-in autocorrect and feedback mechanism is grounded on the notion that spelling errors are performance-based and typically involve single-letter violations. ...
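As a concrete illustration of the wordlist-lookup approach this passage describes, here is a minimal sketch that flags tokens absent from a lexicon and ranks close alternatives with difflib; the miniature word list is invented, and a real checker would load a full dictionary. Exactly as the passage notes, this approach only catches non-word errors: a misspelling that happens to form an existing word goes undetected.

```python
import difflib
import re

# Hypothetical miniature lexicon; a real generic spell checker would use a
# large wordlist or corpus-derived frequency list.
LEXICON = {"i", "they", "will", "us", "their", "there", "believe",
           "receive", "accommodate", "which"}

def check(text, lexicon=LEXICON, n_suggestions=3):
    """Flag tokens absent from the lexicon and propose close alternatives."""
    report = []
    for token in re.findall(r"[a-zA-Z]+", text.lower()):
        if token not in lexicon:
            suggestions = difflib.get_close_matches(token, sorted(lexicon),
                                                    n=n_suggestions)
            report.append((token, suggestions))
    return report

print(check("I beleive they will acommodate us"))
# Expected: [('beleive', ['believe']), ('acommodate', ['accommodate'])]
```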
Article
Full-text available
In 2016, Lawley proposed an easy-to-build spellchecker specifically designed to help second language (L2) learners in their writing process by facilitating self-correction. The aim was to overcome the disadvantages to L2 learners posed by generic spellcheckers (GSC), such as that embedded in Microsoft Word. Drawbacks include autocorrection, misdiagnoses, and overlooked errors. With the aim of imparting explicit L2 spelling knowledge, this correcting tool does not merely suggest possible alternatives to the detected error but also provides explanations of any relevant spelling patterns. Following Lawley’s (2016) recommendations, the present study developed a prototype computer-based pedagogic spellchecker (PSC) to aid L2 learners in self-correcting their written production in Spanish. First, a corpus was used to identify frequent spelling errors of Spanish as a foreign language (SFL) learners. Handcrafted feedback was then designed to tackle the commonest misspellings. To subsequently evaluate this PSC’s efficacy in error detection and correction, another learner Spanish corpus was used. Sixty compositions were analysed to determine the PSC’s capacity for error recognition and feedback provision in comparison with that of a GSC. Results indicate that the PSC detected over 90% of the misspellings, significantly outperforming the GSC in error detection. Both provided adequate feedback on two out of three detected errors, but the pedagogic nature of the former has the added advantage of facilitating self-learning (Blázquez-Carretero & Woore, 2021). These findings suggest that it is feasible to develop spellcheckers that provide synchronous feedback, allowing SFL learners to confidently self-correct their writing while saving time and effort on the teacher’s part.
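The pattern-plus-feedback idea described in this abstract can be sketched as a small rule table that pairs an error pattern with a pedagogic explanation. The patterns and messages below are invented examples for Spanish as a foreign language and are not the rules of the actual PSC.

```python
import re

# Hypothetical pattern-based feedback rules for common SFL misspellings.
# Each rule pairs a regex over the misspelled form with a pedagogic note.
FEEDBACK_RULES = [
    (re.compile(r"cion$"), "Words ending in -ción take a written accent: canción, nación."),
    (re.compile(r"nb"), "Before 'b' Spanish writes 'm', not 'n': también, hombre."),
    (re.compile(r"qu[ao]"), "Use 'c' before a/o (casa, cosa); 'qu' appears only before e/i."),
]

def pedagogic_feedback(misspelling):
    """Return the explanatory notes triggered by a misspelled form."""
    return [note for pattern, note in FEEDBACK_RULES if pattern.search(misspelling)]

print(pedagogic_feedback("cancion"))   # accent rule fires
print(pedagogic_feedback("tanbien"))   # n-before-b rule fires
```

In this spirit, the feedback explains the relevant spelling pattern rather than silently replacing the word, which is what distinguishes a pedagogic checker from a generic one.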
Article
This paper reports on an ongoing research project aimed at developing a new type of Spanish learner’s grammar, different from those found in textbooks, grammar books and dictionaries. The new grammar, designed to be displayed in digital writing assistants, will explain problems that occur in written learner texts. The paper first describes the main features and functionalities of this grammar and how it will be presented to Spanish learners. It then discusses the development of a methodology for categorising relevant error types, using a unique combination of existing grammars, dictionaries and ChatGPT, all of it supervised by lexicographers with experience in language teaching. Based on this categorisation, the paper explains how the chatbot is prompted to write explanations of the different error types, which it does very well in fruitful interaction with the human lexicographers. The methodology is described in detail with several examples. Finally, the paper explains how the original Spanish explanations are machine translated into English and Chinese, and provides examples of the final result in each language. Throughout the paper, the complex relationship between generative AI and humans is discussed, and it is concluded that a successful result like the one achieved requires both the ability to handle the chatbot properly and the knowledge of the topic being dealt with.
Chapter
The volume espouses an ecosystemic standpoint on multilingual acquisition and learning, viewing language development and use as both ontogenesis and phylogenesis. Multilingualism is inclusively used to refer to sociolinguistic diversity and pluralism. Whether speech, writing, gesture, or body movement, language is a conduit that carries meaning within a complex, fluid, and context-dependent framework that engages different aspects of the individual, the communicative interaction, communicative acts, and social parameters. Continually modified over the years to better represent its multidisciplinary scope, the sociobiological notion of language has found steady and productive ground within major theoretical frameworks, which, individually or holistically, contribute to a rounded understanding of language acquisition, learning, and use by exploring both system-internal and system-external factors and their interaction. Summoning the work of leading academics, the volume outlines the changing dynamics of multilingualism in children and adults internationally with the latest advances and under-represented coverage that highlight the ecosystemic nature of multilingual acquisition, learning, and use.
Article
Full-text available
Spelling error corpora can be collected from students' written essays, homework, dictations, translations, tests and lecture notes. Spelling errors can be classified into whole-word errors, faulty graphemes and faulty phonemes, in which graphemes are deleted, added, reversed or substituted. They can be used for identifying phonological and orthographic problems, the spelling strategies that EFL students use in spelling English, spelling error causes or sources, and the relationship between spelling and decoding weaknesses. The study gives examples of spelling errors and shows how spelling errors are quantified. Recommendations for remediation are also given.
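One simple way to operationalise the single-grapheme categories mentioned here (deletion, addition, reversal, substitution) is to compare each misspelling with its target form. The heuristics below are a minimal sketch under that assumption, not the coding scheme used in the study.

```python
def classify(misspelling, target):
    """Classify a misspelling relative to its target as a single-letter
    omission, addition, transposition (reversal) or substitution, else 'other'."""
    m, t = misspelling.lower(), target.lower()
    if len(m) == len(t):
        diffs = [i for i in range(len(t)) if m[i] != t[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and m[diffs[0]] == t[diffs[1]] and m[diffs[1]] == t[diffs[0]]):
            return "transposition"
    elif len(m) == len(t) - 1:
        for i in range(len(t)):
            if m == t[:i] + t[i + 1:]:
                return "omission"
    elif len(m) == len(t) + 1:
        for i in range(len(m)):
            if t == m[:i] + m[i + 1:]:
                return "addition"
    return "other"

print(classify("recieve", "receive"))        # transposition
print(classify("wich", "which"))             # omission
print(classify("untill", "until"))           # addition
print(classify("definately", "definitely"))  # substitution
```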
Article
Full-text available
Computational techniques for scoring essays have recently come into use. Their bases and development methods raise both old and new measurement issues. However, coming principally from computer and cognitive sciences, they have received little attention from the educational measurement community. We briefly survey the state of the technology, then describe one such system, the Intelligent Essay Assessor (IEA). IEA is based largely on Latent Semantic Analysis (LSA), a machine-learning model that induces the semantic similarity of words and passages by analysis of large bodies of domain-relevant text. IEA's dominant variables are computed from comparisons with pre-scored essays of highly similar content as measured by LSA. Over many validation studies with a wide variety of topics and test-takers, IEA correlated with human graders as well as they correlated with each other. The technique also supports other educational applications. Critical measurement questions are posed and discussed.
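The comparison-with-pre-scored-essays idea behind IEA can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's TF-IDF plus truncated SVD as a stand-in for LSA, with a handful of invented pre-scored essays, so it illustrates the mechanism rather than the actual IEA pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-scored essays (text, human score); a real system would use
# a large pool of graded essays on the same prompt.
scored = [
    ("Television harms children because it replaces reading and play.", 3),
    ("Watching television can teach children new words and ideas.", 4),
    ("Children should limit screen time and read more books every day.", 5),
    ("I like television it is good and nice and good.", 1),
]
new_essay = "Television can help children learn words but it also replaces reading."

texts = [t for t, _ in scored] + [new_essay]
vectors = TfidfVectorizer().fit_transform(texts)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(vectors)

# Score the new essay as the similarity-weighted average of the human scores
# of the most semantically similar pre-scored essays.
sims = cosine_similarity(reduced[-1:], reduced[:-1])[0]
weights = np.clip(sims, 0, None)
predicted = float(np.dot(weights, [s for _, s in scored]) / max(weights.sum(), 1e-9))
print(round(predicted, 2))
```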
Article
This study examines the relation between essay length and holistic scores assigned to Test of English as a Foreign Language™ (TOEFL®) essays by e-rater®, the automated essay scoring system developed by ETS. Results show that an early version of the system, e-rater99, accounted for little variance in human reader scores beyond that which could be predicted by essay length. A later version of the system, e-rater01, performs significantly better than its predecessor and is less dependent on length due to its greater reliance on measures of topical content and of complexity and diversity of vocabulary. Essay length was also examined as a possible explanation for differences in scores among examinees with native languages of Spanish, Arabic, and Japanese. Human readers and e-rater01 show the same pattern of differences for these groups, even when effects of length are controlled.
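The length-control analysis this abstract reports amounts to comparing the variance in human scores explained by essay length alone against that explained by length plus the machine score. The sketch below reproduces that logic on synthetic data; all numbers are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for (essay length, machine score, human score) triples.
n = 500
length = rng.normal(300, 80, n)                    # words per essay
machine = 0.01 * length + rng.normal(0, 1, n)      # machine score, length-correlated
human = 0.008 * length + 0.5 * machine + rng.normal(0, 1, n)

def r2(X, y):
    return LinearRegression().fit(X, y).score(X, y)

r2_length = r2(length.reshape(-1, 1), human)
r2_both = r2(np.column_stack([length, machine]), human)
print(f"R2 with length only: {r2_length:.3f}")
print(f"R2 with length + machine score: {r2_both:.3f}")
print(f"Variance explained beyond length: {r2_both - r2_length:.3f}")
```

A machine score that adds little beyond the first R2 is largely proxying essay length, which is the concern raised about the earlier version of the system.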
Article
This corpus based analysis discusses Japanese phonological and orthographic interference to English spelling. After establishing that there are idiosyncratic properties in spelling errors generated by Japanese people who use English as a foreign language, I explore the reason why those errors occur. The investigation compares two corpora of English spelling errors, one from native speakers and the other from Japanese writers. A number of Japanese-specific errors observed in the corpora are discussed with special reference to substitution errors. The quantitative evidence indicating that Japanese spellers substitute l with r (as easily anticipated), and vice versa, i.e. r with l, suggests that when Japanese people do not have a reliable phonological (and perhaps visual) clue to select a desirable letter, they are apt to get greatly confused. I claim that 'romazi' plays a deleterious role in Japanese writers' spelling, and that Japanese spellers of English are hampered both by phonological differences between the two languages and by subsequent discrepancies in the way of using Roman letters (not kana).
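The substitution counts at the heart of this kind of analysis can be tallied from aligned (misspelling, target) pairs. The pairs below are invented examples of the l/r confusion discussed in the study, not items from the actual corpora.

```python
from collections import Counter

# Invented (misspelling, target) pairs; a real analysis would draw on the two
# spelling-error corpora compared in the study.
pairs = [("grobal", "global"), ("probrem", "problem"), ("rearning", "learning"),
         ("sarary", "salary"), ("tlain", "train")]

substitutions = Counter()
for wrong, right in pairs:
    if len(wrong) == len(right):                       # same-length pairs only
        for written, intended in zip(wrong, right):
            if written != intended:
                substitutions[(intended, written)] += 1

for (intended, written), count in substitutions.most_common():
    print(f"{intended} -> {written}: {count} substitution(s)")
```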
Article
This paper describes an empirical study of spelling errors using a learner corpus of university-level English. The corpus, known as CALES (Corpus Archive of Learner English in Sabah/Sarawak), consists of argumentative essays collected from university students in three public universities in Sarawak and Sabah. After describing the methodology of the CALES project, the paper outlines how spelling errors can be classified, using a combination of pre-existing categories from the literature and categories observed in the data. The data demonstrate clearly that spelling is still a major issue both for teachers and learners, and that many students make spelling errors that fit into known categories, despite the fact that they have been studying English for at least 10 years in the Malaysian education system. As well as making observations, this paper offers a number of speculations about why students make spelling errors, and proposes some recommendations on how to prevent these errors from appearing in student writing.