Rening and modifying the EFCAMDAT
Lessons from creating a new corpus from an
existing large-scale English learner language
database
Itamar Shatz
University of Cambridge
This report outlines the development of a new corpus, which was created by refining and modifying the largest open-access L2 English learner database – the EFCAMDAT. The extensive data-curation process, which can inform the development and use of other corpora, included procedures such as converting the database from XML to a tabular format, and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, which represents a unique research opportunity when it comes to analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second sample, written by learners representing diverse L1s and a large range of L2 proficiency levels.
Keywords: data curation, corpus cleaning, English as a second language,
EFCAMDAT
1. Introduction
Learner corpora are increasingly being developed from data that originates in large-scale online platforms. This is beneficial, since the growing size of such corpora enables the analysis of large amounts of learner data, in ways that were not possible before (Callies, 2015; McEnery, Brezina, Gablasova, & Banerjee, 2019). However, with these new data sources come new challenges, which require new developments in terms of how researchers curate and analyze learner corpora. One notable challenge is the need to develop approaches to working with data that was originally collected with educational or social goals in mind, rather than research, since such data is often messy and requires substantial processing before
it can be properly analyzed. In addition, because of the increasing size of these new corpora, new approaches to data curation and analysis must be scalable, so they can be applied effectively on large-scale datasets, which puts an emphasis on the use of quantitative and NLP-based approaches.
The present report addresses this topic by discussing the development of
a new derivative corpus from an existing learner language database. The goals
of this report are both to introduce the new derivative source, and to explain
how it was developed, in order to inform future data curation and analysis by
researchers working with other learner corpora, and potentially with other cor-
pora in general.
In particular, the original database used in this report is the EF Cambridge Open Language Database (EFCAMDAT), which is the largest open-access L2 English learner database, with ,, texts written by , learners from various nationalities (Geertzen, Alexopoulou, Baker, Hendriks, Jiang, & Korhonen, ; Huang, Geertzen, Baker, Korhonen, & Alexopoulou, ; Huang, Murakami, Alexopoulou, & Korhonen, 2018). The texts in the EFCAMDAT were submitted by learners to EF's online English school, which spans 16 English proficiency levels aligned with common proficiency standards such as the CEFR. Each level consists of 8 units, and upon completing a unit, learners are tasked with writing a text, which is then graded. If the learner receives a passing grade, they advance to the next unit; otherwise, they repeat the unit. The texts cover a variety of topics, such as reviewing a song for a website or describing one's favorite day.
The EFCAMDAT is pseudo-longitudinal overall, as learners generally complete only parts of the learning program. However, it contains substantial longitudinal data, since many learners complete sequences of tasks across increasing levels of proficiency, and researchers can track individual learners using the learner ID variable. In terms of metadata, the EFCAMDAT lists learners' English proficiency and their nationality, and learners were only added to the database if their nationality matched their country of residence (Alexopoulou, Michel, Murakami, & Meurers, 2017). Prior research on the EFCAMDAT used learners' nationality to estimate their L1, an approach that has been validated empirically (Alexopoulou et al., 2017; Huang et al., 2018; Murakami, ).
I developed the derivative version of the EFCAMDAT because I wanted to conduct a large-scale quantitative study of L2 lexical development, and found that I first needed to make several substantial modifications to the EFCAMDAT. As such, some of the decisions made in the course of creating the derivative corpus may not work well for other types of research. For example, the removal of duplicate texts described below may interfere with analyses that focus on formulaic language. However, researchers can choose to implement only some of the procedures that I outline in this report; to facilitate this, I make the relevant programmatic scripts available, together with partially cleaned versions of the new corpus.
Rening and modifying the EFCAMDAT 
Overall, the outcome of this data-curation process, in terms of the new corpus, led to significant modifications in three key areas:
1. Format. The new corpus is in a tabular format, rather than the EFCAMDAT's original XML format.
2. Content. The new corpus has been cleaned to remove texts containing issues that are likely to interfere with analyses relating to lexical development.
3. Structure. The new corpus is split into two samples, to account for some tasks containing groups of texts written in response to different prompts.
2. Preparing the new corpus
2.1 Selecting the sample
Because the new corpus was created with large-scale quantitative analyses of L2 lexical development in mind, the first step was to ensure that there were sufficient texts available for each combination of nationality and L2 proficiency level. Accordingly, I selected only those nationalities and proficiency levels that had enough texts for my analyses. This is in line with many prior studies that used the EFCAMDAT. For example, Murakami () and Shatz (2019) examined only the top  nationalities with most texts in the EFCAMDAT, while Alexopoulou, Geertzen, Korhonen, and Meurers (2015) and Geertzen, Alexopoulou, and Korhonen (2014) examined only the top five.
In terms of prociency level, texts from levels – were kept, while those from
level  were omitted. There were relatively few texts at the omitted level (,,
only . of the total), which were spread across multiple nationalities and tasks.
In addition, levels – were grouped in bands of  based on EF’s guidelines, while
level  was on its own (Geertzen et al., ). Furthermore, level  was the only
level listed as being above the maximum prociency level set by several pro-
ciency standards, such as the TOEFL.
In terms of nationality, texts from the 11 nationalities with most texts were kept: Brazilian, Chinese, Taiwanese, Russian, Saudi Arabian, Mexican, German, Italian, French, Japanese, and Turkish. These nationalities accounted for the vast majority of texts in the corpus (~), and there were relatively few texts spread out across the other  nationalities. Overall, ,, texts fit these criteria (. of the texts in the EFCAMDAT).
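As a rough illustration of this selection step, the filter below is a minimal sketch in R, assuming the texts and metadata have already been converted into a tabular data frame named efc (as described in the next section). Only the "br" code for Brazilian is attested in the sample shown in Figure 1; the other nationality codes and the level cutoff of 15 are my assumptions, not values taken from the original scripts.

library(dplyr)

# Hypothetical nationality codes for the 11 nationalities that were kept
keep_nationalities <- c("br", "cn", "tw", "ru", "sa", "mx", "de", "it", "fr", "jp", "tr")

efc_sample <- efc %>%
  filter(nationality %in% keep_nationalities,
         as.numeric(level) <= 15)   # assumed: keep proficiency levels 1-15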
2.2 Format: Converting from XML to a tabular format
The EFCAMDAT was originally made available in XML format; a sample text with the original XML formatting appears in Figure 1.
<writing id="" level="" unit="">
<learner id="" nationality="br"/>
<topic id="">Making notes for a visitor</topic>
<date>-- ::.</date>
<grade></grade>
<text>
Welcome to my house. Near the my house there is recreation center. Opposite to the recreation center there is a soccer stadium. Between the recreation center and the soccer stadium there is many restaurants. You guys enjoy!
</text>
</writing>
Figure 1. Sample text from the EFCAMDAT, with original XML formatting
The EFCAMDAT was imported from XML format using R, together with the XML package and a custom function (Lang, ). This converted the texts and all their metadata into tabular xlsx format, where each row represents a single text (a rough sketch of this step appears after the list below). In addition, the following markup tags were modified, to clean the texts for analysis:
– <br/> and <br> tags were replaced with a space.
– &amp;quot; tags were replaced with a single set of quotation marks.
– &amp; tags were replaced with the word and.
– </code> tags were replaced with a space.
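The sketch below illustrates one way to perform this conversion with the XML package's standard parsing functions; it is not the author's original custom function, and the input file name, the resulting object name, and the use of writexl for the xlsx export are my assumptions.

library(XML)

# Parse one EFCAMDAT XML file and flatten each <writing> node into one row
doc      <- xmlParse("efcamdat_sample.xml")   # hypothetical file name
writings <- getNodeSet(doc, "//writing")

efc <- do.call(rbind, lapply(writings, function(w) {
  learner <- getNodeSet(w, "./learner")[[1]]
  data.frame(
    writing_id  = xmlGetAttr(w, "id"),
    level       = xmlGetAttr(w, "level"),
    unit        = xmlGetAttr(w, "unit"),
    learner_id  = xmlGetAttr(learner, "id"),
    nationality = xmlGetAttr(learner, "nationality"),
    topic       = xmlValue(getNodeSet(w, "./topic")[[1]]),
    grade       = xmlValue(getNodeSet(w, "./grade")[[1]]),
    text        = xmlValue(getNodeSet(w, "./text")[[1]]),
    stringsAsFactors = FALSE
  )
}))

# Convert the markup tags listed above (&amp;quot; must be handled before &amp;)
efc$text <- gsub("<br/?>",     " ",   efc$text)
efc$text <- gsub("&amp;quot;", "\"",  efc$text)
efc$text <- gsub("&amp;",      "and", efc$text)
efc$text <- gsub("</code>",    " ",   efc$text)

# e.g. writexl::write_xlsx(efc, "efcamdat_tabular.xlsx")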
2.3 Content: Analyzing and removing texts
2.3.1 Texts with problematic markup tags
A small number of texts contained the &lt; and &gt; markup tags, which stand
for ‘<’ and ‘>’ respectively. Texts containing these tags were removed, because
they were generally accompanied by problematic data, such as improperly formatted error tags provided by teachers, together with suggested corrections.
This included, for example:
&lt;&lt;&lt;&lt;IS&lt;correct&gt;/correct&gt;
&lt;&lt;C, PU&lt;
MY&lt;&lt;x&gt;y&lt;My).
These tags were not supposed to be in the version of the EFCAMDAT used here,
which is meant to be free of annotations, and they would have interfered with
future analyses, for example by inserting words into the text that the learner did
Rening and modifying the EFCAMDAT 
not write. The reason why the full texts were removed is that the tags were inconsistent in terms of structure, so there was no simple scalable way to remove them while preserving the original texts they were in.
There were , &lt; tags and , &gt; tags in the sample, spread across only , texts (. of texts at this stage). To remove them, two R packages were used: stringr to detect the relevant strings in texts (Wickham & RStudio, ), and dplyr (Wickham, François, Henry, Müller, & RStudio, ) to filter texts based on the detected strings. After this removal, ,, texts remained.
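A minimal sketch of this filtering step, assuming the tabular data frame efc produced above (the object names are illustrative):

library(dplyr)
library(stringr)

# Drop any text that still contains the escaped "<" or ">" markup tags
efc_clean <- efc %>%
  filter(!str_detect(text, fixed("&lt;")),
         !str_detect(text, fixed("&gt;")))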
2.3.2 Ultra-short texts
Ultra-short texts were dened as texts with fewer than  words, since such texts
were below the minimal wordcount that learners were instructed to write, even
at the lowest prociency level. These texts oen contained various issues. For
example, many contained just random symbols (e.g. “???,??!??????,????????!” in
text ) or only a few words (e.g. just “Hi,” in text ). Similarly, there
were over , such texts that were close variants of the same sentence (“Good
evening. How are you. I’m ne, thanks. We’re busy. Good night.”).
Wordcounts were calculated using the stringr package (Wickham & RStudio,
) and a custom search pattern. , ultra-short texts were removed from
the sample (. of texts at this stage). Their average length was . words
(median= , standard deviation= .). Most of these texts (,, .) came
from the rst three tasks, with the majority (,) coming from the rst one.
Aer this removal, , texts remained.
2.3.3 Non-English texts
Texts that were not written in English were removed. This included texts that contained gibberish of various forms, texts that were written entirely in a foreign language, and texts that contained substantial portions written in a foreign language. This problematic material often appeared due to technical issues, such as when the L1 instructions were copied into the text.
These texts were identified using the cld2 library in R, which relies on a Bayesian classifier that identifies the language of texts (Ooms, ). The threshold for removal was the maximal proportion of English in the text (.), to ensure that the texts did not contain substantial portions of foreign-language material. Overall, only , texts containing significant levels of non-English text were removed (. of texts at this stage). After this removal, , texts remained.
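A rough sketch of this language check, assuming the cld2 package; note that cld2::detect_language() returns a single language code per text, so this simplified filter only approximates the proportion-based threshold described above rather than reproducing it exactly:

library(dplyr)
library(cld2)

efc_clean <- efc_clean %>%
  mutate(detected_lang = detect_language(text)) %>%
  filter(!is.na(detected_lang), detected_lang == "en")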
2.3.4 Duplicate texts
Duplicate texts were texts that were almost identical to each other in substantial
portions. This generally occurred as a result of reusing source material from the
task almost verbatim. For example, the texts in task  often had the exact same opening in response to the prompt "Claiming back your security deposit": "Dear Sir, I am writing to ask your advice about a problem I have with my landlord and the real estate agent…".
As with the other steps in the cleanup process, there are advantages and disadvantages to this removal. Specifically, the main advantage of removing these texts is that the direct reuse of source material could obscure L1 effects and other linguistic patterns in unpredictable ways. Conversely, the main disadvantage of removing these texts is that this could lead to the removal of some meaningful linguistic patterns, such as the use of formulaic language. However, this concern was mitigated, as this issue appeared to be relatively task dependent, rather than proficiency dependent. For example, the task with the highest proportion of duplicate texts (.) was task , which is relatively advanced. This suggests that the issue of duplicate texts occurred, to a substantial degree, as a result of task effects and idiosyncrasies in the learning situation. In addition, the potential issues with this removal were further mitigated, as texts were removed only when they contained a substantial portion of identical, overlapping phrasing, down to letters and punctuation marks.
To calculate similarity between texts in the database, the stringdist package in R was used (Van der Loo, 2014). Specifically, the analysis used the hamming method, an edit-based algorithm that calculates the number of substitutions required to get from one string to another. To use it, trimmed versions of each text were created, which contained only the first  characters, since this method requires that the compared texts be of identical length. Texts were trimmed specifically to  characters, as the shortest text was  characters, and  represented a close and round number. This is beneficial when determining the similarity threshold later, and provides a proportion that is simple to replicate.
Then, to determine the threshold of similarity at which texts would be considered duplicates, an initial analysis was conducted on a sample of texts from Brazilian and Japanese learners in tasks  and . This sample was chosen as it represents two distinctly different nationalities and tasks, which contain different numbers of texts (, and , for Brazilian,  and  for Japanese, in tasks  and  respectively).
A similarity matrix was calculated for the texts in this sample, and duplicate texts based on a similarity threshold of '5' were extracted. This means that in cases where a trimmed text required fewer than five substitutions to be transformed into a different text in the sample, the two texts were designated as duplicates. Then, duplicate texts based on a similarity threshold of '' were also extracted, and the results between the two thresholds were compared manually by examining the list of new texts that were identified as duplicates, and checking whether they appeared to include true duplicates or false positives. This process was
Rening and modifying the EFCAMDAT 
repeated, each time increasing the threshold by increments of  (, , …). Eventually,  was identified as the optimal threshold, since it appeared to lead to the identification of new duplicates compared to a lower threshold of , and because increasing the threshold to  appeared to lead to a substantial increase in false positives.
Finally, a similarity matrix was calculated on the main sample, using ‘’ as
the threshold. Because each text must be compared against all other texts, this
calculation involves potentially prohibitive computational complexity when run
on large-scale datasets such as the EFCAMDAT. To resolve this, the analysis was
run separately for each combination of nationality and task (for example, texts in
task  among Japanese speakers, texts in task  among German speakers, etc.).
This reduces the complexity of the calculation and is unlikely to have a substantial
impact on its outcome, since within-nationality duplicates are more likely than
between-nationality duplicates, and since between-task duplicates are unlikely.
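An illustrative sketch of the duplicate check within a single nationality-by-task group is shown below; the trim length of 100 characters and the threshold of 5 are placeholder values (the actual digits were lost in this copy of the report), and the helper function is mine rather than the original script:

library(stringdist)

find_duplicates <- function(texts, trim_to = 100, threshold = 5) {
  trimmed <- substr(texts, 1, trim_to)
  # Pairwise Hamming distances between the trimmed texts
  d <- stringdistmatrix(trimmed, trimmed, method = "hamming")
  d[upper.tri(d, diag = TRUE)] <- Inf   # compare each pair once; ignore self-matches
  # Flag a text as a duplicate if it is within the threshold of an earlier text
  which(apply(d, 1, min) < threshold)
}

# e.g. run separately for each nationality-by-task group and drop the flagged rows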
Based on this, , texts were removed (. of the sample at this stage). Certain tasks were more likely to contain duplicate texts; for example, . of texts in task  were removed as duplicates, compared to only . of texts in task . Higher proficiency tasks were generally less likely to have texts marked as duplicates, but there were many cases where higher-level tasks had a higher proportion of duplicates than lower-level tasks (the correlation between proportion of duplicates per task and task number was Spearman's rho = −., p < .). After this removal, , texts remained.
2.3.5 Outlier texts based on wordcount
This step targeted texts that were anomalously short or long. Such texts often suffered from various issues, such as the inclusion of large amounts of irrelevant material, for reasons that are unclear. For example, text  was anomalously long, with  words at task  where the average wordcount was , and contained a letter about a company's logo in response to the prompt "Introducing yourself by email".
Outlier texts in terms of wordcount were identified using Tukey's method. This means that, for each task, outlier texts were those that had a wordcount 1.5 interquartile ranges (IQR) below the 1st quartile or above the 3rd quartile of wordcounts for texts from the same task (Kaliyaperumal, Kuppusamy, Arumugam, Kannan, Manoj, & Arumugam, ). Accordingly, a different set of problematic texts was identified using this method compared to the one for removing ultra-short texts, since this method accounts for differences in wordcounts between tasks. For example, this means that text  was removed in this step, since it had a wordcount of  at task , where the average wordcount was . Note that it would have been insufficient to use only this method without first removing
ultra-short texts, because of the low average wordcount in many of the low-proficiency tasks, especially when ultra-short texts are included.
Based on this analysis, , texts were removed (. of the sample at this stage). Of these, , (.) were short outlier texts and , (.) were long outlier texts. After this removal, , texts remained.
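A sketch of Tukey's fences applied per task follows, using the 1.5 × IQR rule described above (reconstructed here as the conventional Tukey multiplier); the task column name and object names are illustrative assumptions:

library(dplyr)

efc_clean <- efc_clean %>%
  group_by(task) %>%                      # "task" is an assumed column name
  mutate(q1  = quantile(wordcount, 0.25),
         q3  = quantile(wordcount, 0.75),
         iqr = q3 - q1,
         is_outlier = wordcount < q1 - 1.5 * iqr | wordcount > q3 + 1.5 * iqr) %>%
  ungroup() %>%
  filter(!is_outlier)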
2.4 Structure: Classifying texts based on prompt
To explain this process, it helps to first define three terms:
– Task: this is the specific lesson that learners' texts are categorized under (e.g. "task "). Task numbers are listed sequentially in the EFCAMDAT, and range from 1–128, with 8 tasks per proficiency level.
– Prompt: this is the prompt that texts are written in response to (e.g. "Writing a weather guide for your city"). Each task has a corresponding prompt listed in the EFCAMDAT.
– Topic: this is the topic that a text revolves around (e.g. "weather"), as determined by classification software that will be described in this section.
Many texts in the EFCAMDAT did not correspond to their listed task prompt. The reason for this issue was as follows:
– Initially, each task was associated with a certain prompt. For example, task  had the prompt titled "Writing a weather guide for your city".
– At some point, the prompts for some tasks were replaced with new ones. For example, the prompt for task  was changed to something such as "Describe people's favorite sport in your country".
– This change in prompt was not reflected in the database. Accordingly, all texts belonging to the same task number were listed together, regardless of which prompt they were written in response to. For example, task  contained texts written in response to the prompt on writing a weather guide, together with texts written in response to the new prompt on describing people's favorite sport.
Accordingly, it was necessary to do the following:
– Determine which tasks contained groups of texts corresponding to multiple prompts.
– Determine how many prompts were used in such tasks.
– Categorize the texts in such tasks based on the prompt that they corresponded to.
Since no information regarding the different prompts was available in the database, it was necessary to find a scalable way to analyze the topics that texts
Rening and modifying the EFCAMDAT 
revolved around. To do this, I first grouped texts from each task (e.g. task , task …), and used the tm package in R (Feinerer & Hornik, ) to create a document-term matrix, with the term frequencies for each text. Then I used the topicmodels package (Grün & Hornik, 2011) to estimate a latent Dirichlet allocation (LDA) model using Gibbs sampling. For a visual representation of this process, see Figure 2.
Step 1 – Original texts:
• "My city has good weather"
• "It often rains in my city"
• "My favorite sport is basketball"
• "My city is hot in the summer"
• "People's favorite sport is football"
Step 2 – Identifying key terms for each text:
• "My city has good weather" – city, good, weather
• "It often rains in my city" – rain, city
• "My favorite sport is basketball" – favorite, sport, basketball
• "My city is hot in the summer" – city, hot, summer
• "People's main sport is football" – people, main, sport, football
Step 3 – Identifying main topics, based on key terms:
• Topic 1: city
• Topic 2: sport
Step 4 – Categorizing texts, based on topic:
• The city topic group: "My city has good weather"; "It often rains in this city"; "My city is hot in the summer"
• The sport topic group: "My favorite sport is basketball"; "People's main sport here is football"
Figure 2. Rough illustration of the process used to classify texts based on topic
This process requires that the number of topics per task be specified in advance. Accordingly, to determine the appropriate number of topics, I started by
running the process with two topics, and then tried increasing that number to three, while manually inspecting the texts. This revealed that the maximum number of prompts was two, as dividing texts into more than two topics led to groupings that were not based on a difference in prompt. For example, if texts written in response to the prompts "a weather guide for your city" and "people's favorite sport in your country" were divided into more than two topics, then texts written in response to the same prompt would be separated; e.g. texts revolving around a weather guide might be split into those that primarily use keywords such as [winter/cold/rain] and those that use keywords such as [summer/hot/sun]. A single exception was task , where the classification software used the same keyword ('there') to classify texts from both topics. Accordingly, I re-ran the analysis for this task with three topics in the LDA model. I then examined the texts and combined two of the topics (under the keywords 'there' and 'house'), while the third topic (under 'neighborhood') was marked as corresponding to a different prompt.
Next, it was necessary to determine which tasks contained groups of texts corresponding to two prompts, and then classify texts accordingly. An examination showed that, in tasks with texts corresponding to two prompts, texts were initially written in response to the first prompt, until a certain date when the new prompt replaced the first. Accordingly, a sub-sample of the corpus was created, containing only texts submitted before --, which was established as the earliest approximate point when the second prompt was introduced. Then, the topics of the texts in the sub-sample were analyzed separately for each task:
– In cases where most texts (+) before the cutoff date belonged to a single topic, the task was categorized as having two prompts. Essentially, if most texts before the cutoff revolved around a single topic, this indicated that the topic corresponded to an initial prompt, while the less frequent topic corresponded to a second prompt that was introduced only after the cutoff. For example, if almost all of the texts before the cutoff revolved around the topic city, and almost none revolved around the topic sport, then it was likely that texts written about sport were based on a second prompt, which was introduced later.
– In cases where fewer than  of texts before the cutoff belonged to a single topic, the task was treated as having a single prompt. Essentially, if the texts before the cutoff date revolved around two topics in relatively similar proportions, then there was likely only one prompt for the task, since the similarity in proportion indicated that the division into topics was not based on a difference in prompt. For example, if texts before the cutoff revolved around the topics restaurant and food in relatively similar proportions, then it was likely that the texts were written in response to the same prompt, and that they simply used slightly different keywords. (A brief sketch of this decision rule appears after this list.)
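The decision rule can be sketched roughly as follows; the cutoff date, the dominance threshold, and the column names (date, task, and the topic assignment from the LDA step) are placeholders for values that are not recoverable from this copy of the report:

library(dplyr)

cutoff_date <- as.Date("2012-01-01")   # assumed placeholder for the actual cutoff date
dominance   <- 0.90                    # assumed placeholder for the "+%" threshold

prompt_status <- efc_clean %>%
  filter(date < cutoff_date) %>%                       # texts submitted before the cutoff
  group_by(task) %>%
  summarise(top_topic_share = max(table(topic_assignment)) / n()) %>%
  mutate(two_prompts = top_topic_share >= dominance)   # a dominant topic implies two prompts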
Rening and modifying the EFCAMDAT 
One concern was that there might be tasks where one topic was much more common in the full sample. However, this was ruled out, given that the most extreme ratio between topics overall was .: (between the second and the first topics in task ), and the overall mean ratio between the first and second topics was . (median = ., SD = .). Conversely, the cutoff point used to determine whether two prompts were used was the much higher ratio of : (i.e. ). Overall, the procedure used to classify texts is outlined in Figure 3.
Step 1 – Group texts based on the task that they are listed under (e.g. all texts written under task ).
Step 2 – Identify the two main topics that texts in each task revolve around (e.g. "sport" and "restaurant"), and categorize each text based on its topic.
Step 3 – Create a sub-sample of the categorized texts, consisting only of texts written before the cutoff date. This cutoff is the earliest date at which a second prompt was generally introduced into tasks where it was used.
Step 4 – If more than  of the texts in the sub-sample revolve around a single topic, match that topic with the initial prompt listed in the corpus, and match the other topic with a second prompt. Otherwise, match both topics to the same initial prompt.
Figure 3. Outline of the process that was used to identify and classify texts written in response to different prompts
In cases where all of the texts from a given task were established as having
been written under the same prompt, they were all kept in the sample ( tasks,
. of total). Conversely, in cases where texts from a given task were established
as having been written under two prompts, only texts written using the initial
prompt were kept in the main sample ( tasks, .). Accordingly, , texts
(. of texts) were designated as having been written in response to a second
prompt, and were consequently separated into a second sample.
Finally, the texts in the second sample were further cleaned. This involved removing texts that were categorized as having been written in response to the second prompt despite being written before the cutoff date, which was the earliest point when the new prompt was generally introduced. The cutoff date used at this stage was --, which was later than the cutoff used previously. This is because the second prompt was often introduced around this later date, so using it allowed for the removal of more irrelevant texts. This led to the removal of , texts (.), leaving , texts in the first sample and , texts in the second sample.
An important limitation of the second sample is that it does not list the prompts that learners responded to in their texts, since such data is not available in the EFCAMDAT. However, the original prompts from the first sample are still listed in the second sample, to maintain continuity between the samples; this ensures that the two samples share the same data structure, which means that researchers can easily concatenate them into a single sample if they wish. Nevertheless, it is possible to estimate the prompts manually, by reading the texts. Alternatively, it is possible to identify the key topics that the texts revolve around, by using the same keyword-extraction method that was implemented earlier; one such keyword is already listed for each text in the new version of the corpus, based on the earlier extraction process.
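For instance, because the two samples share the same columns, they can be combined with a single row-bind; the object names here are illustrative, not the file names distributed with the corpus:

library(dplyr)

combined_sample <- bind_rows(first_sample, second_sample)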
3. Discussion and conclusion
Overall, this report outlined a comprehensive process used to modify and refine a large-scale English learner database – the EFCAMDAT – in terms of its format, content, and structure. The process used to create the derivative corpus is outlined in Figure 4.
Based on this, from an initial database containing ,, texts, a corpus was created with ~406,000 texts (~,, word tokens) in the first sample and ~317,000 texts (~,, word tokens) in the second sample. These samples cover  and  topics respectively, and contain texts written by learners from 11 nationalities and with a large range of English proficiency levels (CEFR A1–C1). The numbers of texts per nationality and CEFR level in the new samples are listed in Table 1.
Rening and modifying the EFCAMDAT 
Step 1 – Select the initial sample (,, texts fit the relevant criteria; . of ,, texts)
Step 2 – Transform texts from XML into a tabular format, and convert markup tags
Step 3 – Remove texts with problematic markup tags (, texts; . of texts at this stage)
Step 4 – Remove ultra-short texts (, texts; . of texts at this stage)
Step 5 – Remove non-English texts (, texts; . of texts at this stage)
Step 6 – Remove duplicate texts (, texts; . of texts at this stage)
Step 7 – Remove outlier texts, based on wordcount (, texts; . of texts at this stage)
Step 8 – Identify tasks containing groups of texts corresponding to different prompts, and classify texts accordingly
Figure 4. Summary of the preparation process of the corpus
Table . Number of texts in the derivative corpus, per nationality and CEFR level. Nationalities are listed by total number of texts in the rst
sample, in decreasing order
Nationality
Number of texts (first sample) Number of texts (second sample)
Total A A B B C Total A A B B C
Brazilian , , , , , , , , , , , ,
Chinese , , , , ,   , , , ,    
Mexican , , , , ,   , , , , ,  
Russian , , , , ,   , , , , ,  
German , , , , , , , , , , , 
French , , , , ,   , , , , ,  
Italian , , , , ,   , , , , ,  
Saudi Arabian , , , ,    , , , ,   
Taiwanese , , , ,     , , , ,    
Japanese , , , ,    , , , ,    
Turkish , , ,      , , ,     
Total , , , , , , , , , , , ,
Rening and modifying the EFCAMDAT 
As noted earlier, the new corpus was created to facilitate large-scale quantitative analyses of L2 lexical development, using the data available in the EFCAMDAT. Accordingly, some of the procedures in the data-curation process may not be appropriate for other types of analyses; a notable example of this is the removal of duplicate texts, which could be an issue for analyses that focus on formulaic language. As such, to facilitate the use of the EFCAMDAT for other purposes, in addition to making the final version of the new corpus available, I have also made available additional versions of the corpus from different steps of the data-curation process, together with the key R scripts that I used. All these materials, together with other relevant ones, such as a glossary of variables, are available on the official EFCAMDAT site (https://corpus.mml.cam.ac.uk/). They are currently listed there under the "Resources" page, as the "EFCAMDAT Cleaned Subcorpus".
In addition to introducing the new derivative corpus, this report can also inform future work on other learner corpora, by identifying issues that researchers may encounter during data curation and analysis, and by proposing scalable solutions that they may use. This is something that is becoming increasingly necessary, given the growing use of large-scale learner corpora that are based on educational and social platforms, and that were therefore not originally collected with research in mind.
Acknowledgements
I wish to thank: my PhD supervisor, Dora Alexopoulou, for her continuous advice and support;
Carlos Balhana and everyone else involved with the EFCAMDAT’s technical aspects, for their
work and assistance; EF Education First, for making their data openly available; Hughes Hall
for their nancial support; everyone who contributed to the soware used in this research, for
their important work; the editor and the anonymous reviewers, for taking the time to provide
excellent comments.
References
Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D. (2015). Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1(1), –. https://doi.org/10.1075/ijlcr.1.1.04ale
Alexopoulou, T., Michel, M., Murakami, A., & Meurers, D. (2017). Task effects on linguistic complexity and accuracy: A large-scale learner corpus analysis employing natural language processing techniques. Language Learning, 67(S1), –. https://doi.org/10.1111/lang.12232
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. –). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.003
Feinerer, I., & Hornik, K. (). tm: Text Mining Package. Retrieved from https://cran.r-project.org/package=tm
Geertzen, J., Alexopoulou, T., Baker, R., Hendriks, H., Jiang, S., & Korhonen, A. (). The EF Cambridge Open Language Database (EFCAMDAT). User Manual Part I: Written Production. Retrieved from https://corpus.mml.cam.ac.uk/
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R. T. Millar, K. I. Martin, C. M. Eddington, A. Henery, N. M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. –). Somerville, MA: Cascadilla Proceedings Project.
Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), –. https://doi.org/10.18637/jss.v040.i13
Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. –). Retrieved from https://corpus.mml.cam.ac.uk/
Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), –. https://doi.org/10.1075/ijcl.16080.hua
Kaliyaperumal, S. K., Kuppusamy, M., Arumugam, S., Kannan, K. S., Manoj, K., & Arumugam, S. (). Labeling methods for identifying outliers. International Journal of Statistics and Systems, (), –.
Lang, D. T. (). XML: Tools for parsing and generating XML within R and S-Plus. Retrieved from https://cran.r-project.org/package=XML
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual Review of Applied Linguistics, 39, –. https://doi.org/10.1017/S0267190519000096
Murakami, A. (). Individual variation and the role of L1 in the L2 development of English grammatical morphemes: Insights from learner corpora (Unpublished doctoral dissertation). Cambridge University.
Murakami, A. (2016). Modeling systematicity and individuality in nonlinear second language development: The case of English grammatical morphemes. Language Learning, 66(4), –. https://doi.org/10.1111/lang.12166
Ooms, J. (). cld2: Google's compact language detector 2 (Version .). Retrieved from https://cran.r-project.org/package=cld2
Shatz, I. (2019). How native language and L2 proficiency affect EFL learners' capitalisation abilities: A large-scale corpus study. Corpora, 14(2), –. https://doi.org/10.3366/cor.2019.0168
Van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. The R Journal, 6(1), –. Retrieved from https://cran.r-project.org/package=stringdist
Wickham, H., François, R., Henry, L., Müller, K., & RStudio. (). dplyr: A grammar of data manipulation. Retrieved from https://cran.r-project.org/web/packages/dplyr/index.html
Wickham, H., & RStudio. (). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://cran.r-project.org/web/packages/stringr/index.html
Rening and modifying the EFCAMDAT 
Address for correspondence
Itamar Shatz
University of Cambridge
Department of Theoretical and Applied Linguistics
Faculty of Modern and Medieval Languages
Raised Faculty Building, Sidgwick Avenue
Cambridge, CB DA
United Kingdom
is@cam.ac.uk