Using machine learning in learner corpus research:
error annotation and (extra-)linguistic metadata as
proxies for approaching the interlanguage
Alexandros Tantos ( alextantos@lit.auth.gr )
Aristotle University of Thessaloniki
Kosmas Kosmidis
Aristotle University of Thessaloniki
Research Article
Keywords: Machine Learning Methods, Learner Corpora, PCA, Correspondence Analysis, Logistic
Regression, Random Forests
Posted Date: January 31st, 2023
DOI: https://doi.org/10.21203/rs.3.rs-2478290/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Additional Declarations: No competing interests reported.
Using machine learning in learner corpus research: error annotation and (extra-
)linguistic metadata as proxies for approaching the interlanguage
First author: Alexandros Tantos
Department of Linguistics
Aristotle University of Thessaloniki, Thessaloniki, Greece
alextantos@lit.auth.gr
ORCID id: 0000-0001-9917-1878
Second author: Kosmas Kosmidis
Department of Physics
Aristotle University of Thessaloniki, Thessaloniki, Greece
kosmask@auth.gr
ORCID id: 0000-0002-9150-5746
Abstract
This article investigates the potential of using error-annotated L2 utterances and their associated language profile to
more accurately assess the language proficiency of an L2 learner. The main goal of this paper is to demonstrate the
use of machine learning methods for uncovering features of the learner's interlanguage and their proficiency level, as
well as to answer three questions: a) Is it feasible to use error-annotated data along with linguistic profile metadata to trace patterns related to aspects of learners' interlanguage? b) How important may error annotation be in estimating the learner's proficiency level? c) Which of the (extra-)linguistic profile metadata stand out in contributing to the correct classification of the proficiency level? GLCII, the largest freely available online L2-Greek corpus, is utilized as an example to illustrate the approach proposed in this paper, with which central questions for any L2 can potentially be addressed by the relevant machine learning method so that, in a second stage, useful conclusions can be drawn. As a final point, we urge the adoption of ML techniques as a way of expediting research in a variety of related areas in applied linguistics.
Keywords
Machine Learning Methods, Learner Corpora, PCA, Correspondence Analysis, Logistic Regression, Random Forests.
Statements and Declarations
Funding
This study was funded by the Hellenic Foundation for Research & Innovation (Grant number: 3161).
Competing Interests
The authors report no competing interests.
Data availability statement
The datasets and code generated during and/or analysed during the current study are available from the corresponding
author on reasonable request.
1. Introduction
For the past fifteen years, Learner Corpus Research (LCR) has been gradually transitioning towards a research paradigm that more frequently utilizes the rigorous methods of Natural Language Processing (NLP) for the analysis of learners' written and spoken productions (e.g., Ballier et al., 2020; Callies & Paquot, 2015; Jarvis & Paquot, 2015; Vajjala & Lõo, 2014; Leacock et al., 2014). This shift is driven by the usefulness of a subset of Machine Learning
(ML) algorithms in understanding the second language (L2) learning process. One of the most pressing related
questions is whether ML algorithms can be utilized to gain insight into the proficiency level of L2 learners. This paper
attempts to explore a number of ML algorithms which may be able to identify the influential factors in classifying
learners’ productions into different proficiency levels.
Tracking down the factors that determine the global proficiency level of L2 learners is a complex undertaking. For
this purpose, efforts have been made by various institutions to describe the main characteristics of each language
proficiency level in a standardized format. The most renowned of these efforts in Europe is the Common European
Framework of Reference (CEFR), which provides a standardized scale for assessing proficiency in European languages.
The vast majority of L2 corpora are rather small, due both to the difficulty of locating L2 learners, particularly for languages other than English, French or German, and to the complexity of constructing an L2 corpus.1 Even fewer of the existing L2 corpora have been annotated, either manually or automatically; part-of-speech (POS) annotation is the most common type of annotation for such corpora, followed by error annotation. As a result, supervised ML algorithms could potentially be applied to only a handful of L2 corpora. Thus, although supervised ML methods applied to annotated L2 production data can lead to interesting new research projects, only a few studies have the luxury of employing ML methods for approaching the proficiency level of L2 learners.
This paper aims to apply supervised and unsupervised ML methods to the Greek Learner Corpus II (GLCII), the
largest freely available online error-annotated learner corpus of L2 Greek, in order to uncover aspects of L2 Greek
proficiency level. The primary focus is not on finding the most performant machine learning method, but on
discovering which predictors play a significant role and how their contributions to the correct classification compare to one another.
Our model development and analysis stages employ two types of information recorded in the GLCII compilation: a)
the manually-annotated errors of L2 Greek learners in five key grammatical areas for L2 Greek learning and b) the
(extra-)linguistic profiles of the learners whose written and oral productions were collected and included in GLCII.
Our models are distinguished in that they are built on manually-crafted labels related to (a) and (b), rather than on automatically-extracted text features, such as the lexical complexity or readability metrics used in the competition described by Ballier et al. (2020), or the automatically-assigned POS and morphological tags used in Vajjala & Lõo (2014) (for more information on both papers, see section 2).
There are three questions addressed in this paper:
- Is it feasible to use the existing error-annotated L2 productions along with the recorded (extra-)linguistic
profile metadata to trace patterns related to aspects of learners’ interlanguage?
- How important are five key grammatical error types in estimating the proficiency level?
- Which of the (extra-)linguistic profile variables stand out in contributing to the correct classification of the
proficiency level?
1 A multitude of factors play a role in the compilation of an L2 corpus that is representative of, and balanced with respect to, the variety it aims to describe (Granger et al., 2015). However, we do not spell them out here, as they fall outside the focus of the current paper.
The structure of the paper is as follows: The next section offers an overview of the relevant literature, which will serve
as the basis for presenting our line of research. Section 3 describes the dataset used for developing and testing our
models, GLCII, while section 4 outlines the pre-processing steps taken before developing ML models. Section 5
presents the model development stage for each of the unsupervised and supervised ML methods used in this paper for
either tracing patterns in the data or for predicting the L2-Greek learners’ proficiency level. Section 6 elaborates on
the relationship between Error_Type and Proficiency_Level and section 7 summarizes the results and discusses the
findings in this paper.
2. Relevant Work
Our paper resonates with Vajjala & Lõo (2014) and Ballier et al. (2020). The common denominator is that all three studies use ML models that share the same response variable, the language proficiency level. The first difference lies in the type and number of predictor variables, as well as in the types of ML algorithms used for predicting the response
variable. Moreover, our motivation for using ML algorithms for classifying learners’ productions into the different
levels is to address the questions raised in the previous section which can be summarized as follows: to what extent
do implicit linguistic and extralinguistic features, such as the grammatical errors and the (extra-)linguistic profile of
the learners, play a role in the predictive power of an ML algorithm? This leads to the question of whether we can rank the
manually-crafted annotated data according to their relevance for accurately predicting the proficiency level. Before
delving into the details of the dataset and the ML algorithms used, the following two sections provide a summary of
Vajjala & Lõo (2014) and Ballier et al. (2020).
2.1 Vajjala & Lõo (2014)
Vajjala & Lõo (2014) explores the use of machine learning algorithms to automatically evaluate the language skills of Estonian learners and predict the Common European Framework of Reference (CEFR) levels of their learner texts, based on the Estonian Interlanguage Corpus (EIC) (Eslon & Metslang, 2007), which consisted of around 12,000 documents spanning four proficiency levels at the time of the study. The authors investigate the use of supervised machine learning techniques to predict the CEFR level of Estonian learner texts. They compare a classification-based and a regression-based ML model to determine which one is best suited to the task, concluding that classification-based methods lead to slightly better accuracy; the two models successfully predicted the CEFR level in 79% and 76% of the cases, respectively.
Their main motivation is to examine whether, and to what extent, it is possible to use automated means of classifying learners into the different CEFR levels for a morphologically rich language such as Estonian. Although the authors used a variety of lexical and morphological features to train their ML models, they also stress that they ignored syntactic and discourse-related features, as well as learner errors, when evaluating the texts.
The authors concluded that, while there is still a need to improve the accuracy of the predictions, their study
demonstrated the potential of machine learning algorithms to accurately predict the CEFR levels of learner texts and
suggested that more grammatical features may boost the performance of these algorithms and the associated evaluation
metrics.
What is interesting to note is that, even though Vajjala & Lõo (2014) aim at demonstrating the usefulness of ML models for aiding the assessment of learner productions and their classification into CEFR proficiency levels, they also report a ranking of the importance of the features used during the training of their models, indirectly suggesting that such rankings may be useful for shaping new SLA hypotheses or for helping confirm existing ones.
2.2 Ballier et al (2020)
Ballier et al. (2020) reports on the ML models used for the prediction of CEFR levels in a learner corpus within the CAp 2018 ML competition, a classification task over the six CEFR levels, which map linguistic competence in a foreign language onto six reference levels. The dataset, obtained from EFCAMDAT, consisted of 40,966 texts written solely by French L1 learners and included 59 features, such as lexical complexity metrics and readability metrics, together with the CEFR-assigned proficiency level of the learner, which served as the response variable.
One of the distinguishing features of the EFCAMDAT corpus is that its written productions come from a variety of task prompts; as expected, the performance of ML models trained on the data might be affected if a different set of task prompts were used.
What determined the first place in the competition was a greater sensitivity to more linguistically-oriented syntactic and stylistic features. Specifically, as shown in Ballier et al. (2020), a good part of the winning team's superior performance was due to the use of linguistically-informed variables, such as spelling mistakes and syntactic patterns found in the essays, along with POS tagging.
3. The Dataset: The Greek Learner Corpus (GLC) II
GLCII is the largest online, freely available L2-Greek corpus and a follow-up version of the earlier and much smaller GLCI.2 It contains approximately 500,000 words, consists of ~1,200 written essays and 328 spoken productions, and features productions from adult language learners of various (socio)linguistic and cultural backgrounds. In order to facilitate research on second language acquisition (SLA) and second language teaching,
GLCII records a number of SLA-related variables (L1, Proficiency Level, Age, Gender, Education Level) as proposed
in the relevant literature (Granger, 2002; Lozano & Mendikoetxea, 2013; Mendikoetxea, 2014; Myles, 2005; Gilquin, 2015; Paquot & Plonsky, 2017; Tracy-Ventura & Paquot, 2021; Tracy-Ventura et al., 2021).
In comparison to GLCI, which contained only written data from adolescent students, GLCII also includes 242 written
texts and 66,645 word tokens from L1-Greek native speakers, forming the control sub-corpus. It is a cross-sectional
corpus of adult learners of different proficiency levels, ranging from beginner to advanced, and includes a wide range
of metadata with a three-tiered error-annotation system. The majority of the corpus was collected from a classroom
setting, with a minority sourced from official L2-Greek exams.3
The GLCII dataset accommodates an error-annotation scheme developed and used for annotating errors in nodal grammatical categories in L2 development across written and spoken productions in five grammatical domains: Agreement, Case, Gender Assignment, Aspect and Voice.4 The whole dataset, including the written error-annotated text spans along with the error tagset on the grammatical level for most of the written productions, can be accessed through the web application platform that hosts GLCII (https://glc.lit.auth.gr/).5
2 GLCI is an L2 Greek learner corpus which consists of approximately 450 written productions produced by adolescent students (around 33,000 words). GLCI's error annotation scheme was the basis for GLCII's error annotation scheme. Moreover, GLCI is freely available within the CLARIN-EL repository at https://inventory.clarin.gr/corpus/520. GLCI is limited by its size and range of covered genres, topics, and recorded metadata.
3 A limited sample of data was taken from the official language proficiency examinations in Greek as a second language, which are created and conducted annually by the Center for Greek Language (CGL). Note that CGL does not administer or organize any language courses.
4 Although the annotation process is still in progress, since new productions are being added on a regular basis, the error annotation dataset, the error tagset and the guidelines for error annotation across several grammatical categories are accessible through the online platform that hosts GLCII at https://glc.lit.auth.gr/app/GLC_Gateway.
Each grammatical error domain is further broken down into distinct error categories, with each category having a cluster of possible values. Table I displays the three annotation layers for one of the error domains, namely Agreement. In this paper, the outer error annotation layer, namely that of the error domain, has been considered as one of the predictor variables for developing the ML models. The following section provides a detailed overview of the pre-processing steps implemented prior to creating our unsupervised models and training our supervised ones.
Table I. Hierarchical representation of a part of the GLCII error annotation scheme

  Error Domain               Error Category             Error Cluster
  Tag    Short Description   Tag     Short Description  Short Description
  AGR    Agreement           GEN     Gender             Masculine, Feminine, Neuter
                             NUM     Number             Singular, Plural
                             CASE    Case               Nominative, Accusative, Genitive, Vocative
GLCII contains descriptive metadata which accurately reflect a learner's profile (see table II). The thorough documentation of GLCII's metadata has two benefits: it enables users to determine whether GLCII can serve as a source for their research questions (see McEnery et al., 2006), and it increases the validity of corpus-based outcomes and facilitates the generalization of findings to new L2 Greek learner datasets beyond the corpus (Mackey & Gass, 2021). Although it is expensive in terms of resources to record both error annotations and metadata, doing so provides us with a valuable opportunity to use these data as predictor variables in uncovering characteristics of what constitutes a proficiency level. The following section is devoted to the choices we made while preparing our unsupervised models and training our supervised ones.
5 In addition to the error-annotated data and their accompanying (extra-)linguistic metadata, the GLC Gateway also provides all written GLCII productions, which can be downloaded freely.
Table II. Metadata variables recorded during the GLCII compilation

Learner-related Variables

  Linguistic Profile
  Variable name                                       Values
  Learner's L1                                        The language(s) the students report as their L1
  Proficiency level                                   A1, A2, B1, B2, C1, C2 (note 6)
  Knowledge of other languages                        The language(s) reported by the students as L3, Lx
  Total length of instruction in Greek                0-9 months, 9-12 months, 1-2 years, 2-3 years,
                                                      more than 3 years, Other
  Greek language learning setting                     Self-paced learning, Language course (in a class
                                                      or privately), Attendance of Greek School Abroad, Other
  Relationship with a Greek partner and use of        Yes/No
  Greek in communication
  Greek relatives and use of Greek in                 Yes/No
  communication with them
  Greek friends and use of Greek in                   Yes/No
  communication with them

  Demographic Context
  Age                                                 A drop-down menu with a span of age choices from 17 to 80
  Sex                                                 Female, Male, Other
  Country of origin                                   The country the learner reports as the place of origin
  Educational level

Text- and task-related variables

  Text type                                           Description, Narration, Argumentation
  Time needed                                         An upper limit of 60 minutes was set

6 The actual proficiency level of the learners is accurately reflected in the recorded proficiency level, as it was assigned to each student based on a standardised and state-recognised proficiency test that they took part in.
4. Data Pre-processing and Feature Engineering
In this section, we discuss the pre-processing and feature engineering steps used to ready the dataset for the ML techniques demonstrated in the subsequent sections. Our dataset consists of 4,600 entries (rows) featuring two error-related variables derived from the error annotation scheme partly described in the previous section, alongside the recorded metadata related to the (extra-)linguistic profile of the learners in table II. Each row of the dataset corresponds to a distinct error, manually annotated by our team members, that belongs to the annotation scheme of GLCII.
The detailed list of columns is as follows:
Corpus_Filename, Error_Type, Error_Subtype, Target_Form, Sex, Age, Country_Origin, L1, Other_Languages,
Proficiency_Level, Education_Level, Learning_Duration, Learning_Process, GR_Duration_Stay, GR_Partner,
GR_Partner_Communication, GR_Relatives, GR_Relatives_Communication, GR_Friends,
GR_Friends_Communication, Production_Duration_Minutes, Expected_Genre, Actual_Genre, Relation,
Relation_Type.
A column labelled Actual_Expected_Mismatch was created as part of the pre-processing of the dataset, which indicates
whether or not there is agreement between the Expected_Genre and Actual_Genre entries of the dataset.7
Columns from the dataset that are solely for informational or supportive reasons, such as Corpus_Filename,
Error_Text, Text, Grade, Offset, L1_Secondary, Essay_Choice, Sources and Help, were omitted.
Each row of the pre-processed dataset is a vector of 17 coordinates for each of the 4,600 observations in the dataset. The dataset includes entries of all data types: numerical (e.g., Age, Production_Duration_Minutes), ordinal (e.g., Proficiency_Level or GR_Duration_Stay), and categorical (e.g., Sex, L1, Other_Languages). Analyzing the data and applying any type of machine learning method on such datasets can be difficult and requires prior pre-processing of the variable values. For our project, numerical data were left as is, and ordinal data were changed to integer values. For example, Proficiency_Level values (A1, A2, B1, B2, C1, C2) were assigned integer values from 0 to 5. As for categorical values, a technique called one-hot encoding was used instead of integer encoding, since machine learning models may provide inaccurate performance or unexpected results if they are allowed to assume a natural order between categories. As a final pre-processing step, all values were scaled to the range of 0 to 1 via Min-Max Scaling, using the following equation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
An alternative, often-used approach would be to normalize the data using zero-mean, unit-variance scaling. These two are the most commonly used normalization methods; they are needed because variables measured at different scales would otherwise not contribute equally to the model fitting and to the learned model function, and might end up creating a bias.
7 The "Expected_Genre" and "Actual_Genre" columns of the dataset contain the judgements of our annotators regarding the
expected genre of the written piece based on the essay topic and the genre that was eventually produced by the learners.
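As a minimal sketch of these pre-processing steps, the following snippet uses pandas and scikit-learn (Pedregosa et al., 2011); the toy data frame stands in for the actual GLCII dataset, and the selected column subsets are illustrative, not the full pipeline.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the GLCII dataset; column names follow the paper.
df = pd.DataFrame({
    "Proficiency_Level": ["A1", "B2", "C1"],
    "Sex": ["Female", "Male", "Female"],
    "Age": [19, 34, 52],
    "Production_Duration_Minutes": [30, 45, 60],
})

# Ordinal variable: map the six CEFR levels to the integers 0..5.
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
df["Proficiency_Level"] = df["Proficiency_Level"].map(
    {lvl: i for i, lvl in enumerate(levels)})

# Categorical variables: one-hot encoding, so that no order is assumed.
df = pd.get_dummies(df, columns=["Sex"])

# Numerical variables: Min-Max scaling to the range [0, 1].
num_cols = ["Age", "Production_Duration_Minutes"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```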
5. Model Development
In order to answer the three questions posed in section 1, both supervised and unsupervised Machine Learning (ML)
methods were utilized. The first question, pertaining to tracing patterns in the dataset, suggests the use of unsupervised
learning techniques while the second question concerning the importance of the five error types in estimating the
proficiency level can be addressed through both unsupervised and supervised methods. To address these two
questions, Principal Component Analysis (PCA) and Correspondence Analysis (CA), two of the most renowned
unsupervised learning methods, were employed. The second and third questions were reformulated as classification
problems, which were handled with supervised Machine Learning (ML) methods; specifically, Logistic Regression
and Random Forests were chosen and their extracted weights were used to draw meaningful conclusions regarding
the importance of each variable.
5.1 Unsupervised Learning Methods
In the following two sections, Principal Components Analysis (PCA) and Correspondence Analysis (CA), two different but related methods of unsupervised learning, are utilized in order to identify two distinct types of patterns. The first type of pattern is linked directly to the raw dataset observations described in section 4, which
correspond to the annotated erroneous learner productions and the groupings that may be identified with reference to
the recorded (extra-)linguistic metadata. The second type of pattern is related to the potential associations between the
categorical variables that operationalize the (extra-)linguistic metadata documented in GLCII.
5.1.1 Principal Components Analysis
PCA is an unsupervised learning technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Géron, 2022). PCA is by far the most popular feature extraction and dimensionality reduction method. It is also used for exploratory data analysis, for instance in bioinformatics for the analysis of genome data and gene expression levels. Through PCA, patterns in data are identified based on the correlation among the observed features. PCA aims to identify the directions of maximum variance in usually high-dimensional data and to project the data onto a new subspace with fewer or the same number of dimensions as the original one. The axes of the new subspace, i.e. the principal components, can be interpreted as the directions of maximum variance, with the added benefit that the new feature axes are orthogonal to each other. Usually, when the first two principal components account for a high proportion of the variability in the dataset, plotting them together can be used as a form of unsupervised classification, to detect any underlying clusters in the data.
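As a rough illustration of this step, the first two components can be extracted with scikit-learn as follows; the random matrix is a placeholder for the pre-processed GLCII features of section 4, not the actual data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the pre-processed matrix: 4,600 entries x 17 features.
rng = np.random.default_rng(0)
X = rng.random((4600, 17))

pca = PCA(n_components=2)
pcs = pca.fit_transform(X)             # one (PC1, PC2) pair per annotated error
print(pca.explained_variance_ratio_)   # variance explained by PC1 and PC2
```

Colouring the resulting scatter plot by the value of a metadata variable, e.g. GR_Partner, corresponds to the post-processing step applied in figure 1.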
Figure 1: Plot of the first two principal components of the dataset. The colors in the plots map to the four features' values: GR_Partner (top left), GR_Partner_Communication (top right), GR_Relatives (bottom left) and GR_Relatives_Communication (bottom right). Green points indicate a value of "No" while blue points indicate a value of "Yes" for the respective feature.
Figure 1 presents useful graphical representations, called biplots, that depict the first two principal components, PC1 and PC2, extracted through PCA applied to our dataset. Ignoring the colouring of the data points in figure 1 for the
moment and given the multitude of information that is compressed from the recorded metadata variables and the shape
of the data cloud on the plots of figure 1, one might not be particularly encouraged to extract any particular value from
these two newly-established dimensions, PC1 and PC2. However, if a post-processing step is followed, whereby a
coloring scheme is applied to each of the observations based on the value in various variables, one would hope that
new patterns become visible. Recall that each point on the plots in figure 1 represents one entry of the dataset, i.e. one
error out of the 4,600 annotated errors. The coloring of the data points on the top left plot maps to the two possible values of the GR_Partner feature, which corresponds to the learners' answer to the question of whether they have a Greek partner with whom they communicate in Greek; the points of the top right plot map to the GR_Partner_Communication feature values, while the bottom left and bottom right plots map to the GR_Relatives and GR_Relatives_Communication features, respectively. Green-colored data points indicate a value of "No", while blue ones indicate a value of "Yes" for the respective feature.8 In all cases, we observe that values greater than 1 on the first principal component (PC1>1.0) are consistently associated with a value of "No" on each of the four features. This is not a fact of absolute certainty, but, as one can readily observe in the graphs, it occurs with high probability. The effect is amplified if we consider the values of PC2, meaning that an entry with PC1>1 and PC2<0 will almost certainly be characterised by a "No" value on the above-mentioned features. On the other hand, for values of PC1<1 and PC2>0 we obtain the exact opposite effect, specifically L2 Greek learners who regularly communicate in Greek in contexts that are not instructional. We have thoroughly examined whether such clustering is evident for the other features as well; we have, however, not been able to observe any additional visible patterns.
Thus, figure 1 implies that with the knowledge of the first two principal components related to an entry in the GLCII
dataset, one can predict the value of the feature GR_Partner with satisfactory accuracy and this highlights the
importance of communication in L2 learning outside of an instructional setting.
5.1.2 Correspondence Analysis
CA is a technique similar to PCA, but it is used to analyze the relationships between two or more categorical variables as these are expressed through a contingency table (Beh & Lombardo, 2021; Greenacre, 2021). In our case, CA contributes to answering the second question of section 1, namely ranking the relative importance of the five error types with respect to students' proficiency level.
For a two-way contingency table that expresses the relationship between two categorical variables, variance is expressed through the table's so-called total inertia. The total inertia of a contingency table can be interpreted as the amount of information that it contains. Specifically, it is the total variance, which is calculated by dividing the χ² statistic associated with the two-way contingency table by the sample size. This total inertia is further decomposed into row (and column) inertias, which represent the weighted average of the distances between the row (column) profiles and the average row (column) profile. Essentially, CA decomposes the two-way contingency table into orthogonal factors that maximize the separation between rows and columns. These factors capture the relative distances between row and column points, which can be used to measure the correspondence between the categories of the table. That is to say, the closer a row point is to a column point, the greater the proportion of that column category in the row profile. Let us focus on the contingency table in table III, with the Proficiency_Level values on the rows and the Error_Type values on the columns.
Table III: Contingency table of Proficiency_Level and Error_Type.
CA, like PCA, provides biplots, which can also be thought of as perceptual maps of the categorical variables and their associated values that ease the interpretation of their associations. Similarly to the PCA plots above, CA biplots are used to identify a) associations between the values of the categorical variables included in a contingency table and b) trends and patterns in the data that can prove very useful in our case; for instance, CA may serve as a proxy to relate the different values of the error-annotated categories and the students' proficiency level recorded in GLCII. What this means is that the resulting biplot can facilitate the detection of underlying patterns among the variables and their values.
8 The "Yes" and "No" values in the dataset were encoded as 0 and 1 throughout our analyses, respectively.
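For concreteness, here is a compact sketch of CA computed via the singular value decomposition of the standardized residuals; the counts in the table are illustrative placeholders, not the actual GLCII figures.

```python
import numpy as np

# Illustrative counts: rows A1..C2, columns Agreement, Case, Gender, Aspect, Voice.
N = np.array([[120,  40,  60,  20,  10],
              [150,  80,  90,  40,  20],
              [130, 110, 120,  70,  40],
              [ 60, 120, 100,  90,  60],
              [ 30,  70,  60, 100,  80],
              [ 10,  20,  15,  40,  50]], dtype=float)

P = N / N.sum()                                     # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

rows = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of the levels
cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of the error types
total_inertia = (sv ** 2).sum()           # equals the chi-squared statistic / sample size
```

Plotting the first two columns of rows and cols on the same axes yields a biplot like the one in figure 2.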
Figure 2: Biplot of the relationship between Proficiency_Level and Error_Type in the GLCII corpus.
The biplot depicting the relationship between Error_Type and Proficiency_Level values in figure 2 reveals a pattern
in which the blue points, denoting Error_Type values, are spread across the first axis from left to right. This suggests
that the first axis differentiates well between the different types of errors. Additionally, the position of the red points,
which signify Proficiency_Level values, shows that higher values on Dim1 indicate a lower proficiency level, with
A1 on the far right end and C1 at the farthest left of Dim1. However, the highest proficiency level, C2, is in the upper left corner of the biplot and not at the farthest left point, as one would anticipate if only Dim1 were considered. This
visual representation also emphasizes the interconnection between the values of the Error_Type and the
Proficiency_Level variables. One possible interpretation of the biplot in figure 2 is that L2 Greek learners at the A2
proficiency level begin to encounter the intricacy of grammatical agreement in Greek, which is why the red point for A2 lies relatively close to the blue data point for Agreement. The biplot reveals that B1 learners present a higher proportion of
both Agreement and Gender errors, whereas L2 Greek learners at the B2 proficiency level have moved away from
Agreement errors and display a higher frequency of errors related to Aspect, Gender and Case. Moreover, C1 learners
are more likely to produce Aspect and Voice errors, whereas C2 learners display a pattern of errors distinct from the
other proficiency levels, indirectly suggesting that they have already acquired the five key grammatical domains in
L2 Greek. Furthermore, low values (close to 0) on Dim2, i.e., on the vertical axis, may be indicative of the core of the grammar, whereas higher values may be associated with the periphery, more likely related to discourse-related grammatical areas, such as cohesion and coherence, relevant to more advanced L2 Greek learners.9
9 Note, however, that GLCII has not been annotated for cohesion and/or coherence infelicitousness.
5.2 Supervised Learning: Classification
The second and the third questions, concerning the importance of the recorded (extra-)linguistic profile variables, i.e., GLCII's metadata, for estimating learners' proficiency level, may be formulated as classification problems. This means that the proficiency level becomes the dependent/response variable, while the error annotations and the metadata variables are considered our independent variables. Our aim in using classification algorithms is twofold: a) to determine how much of the difference in Proficiency_Level can be attributed to the relevant linguistic profile variables believed to be significant in second language acquisition and b) to rank these variables in terms of their impact on estimating the proficiency level.
The most important factor to consider when selecting a classification algorithm for our needs is its ability to provide us with an understanding of the reasons why it selected one class over the others. However, the majority of supervised learning techniques focus on building accurate and effective predictive models and do not offer a transparent and comprehensible explanation of how influential the independent variables of the model were. For instance, neural networks are notorious for not allowing the estimation of feature importance.
The following two sections discuss the fundamentals of the two techniques, Logistic Regression and Random Forests,
which are useful for addressing the second and third questions posed in this paper.
5.2.1 Logistic Regression
The logistic function dates to the 1830s, when P.F. Verhulst used it to describe population dynamics. Logistic regression, which, despite what its name suggests, is a classification method, fits the logistic function to a dataset to predict the probability that a particular outcome will occur, given the independent variables. The logistic function, sometimes simply referred to as the sigmoid function due to its characteristic S-shape, is given by

$$\varphi(z) = \frac{1}{1 + e^{-z}}$$

In the above equation, z is the net input, i.e., the linear combination of the weights and the inputs (the independent variables) of the training examples. Thus,

$$z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$$

where m is the total number of features.
Logistic Regression can be used for multiclass classification using the One-vs-Rest method (Hosmer, 2013; Hastie et al., 2009). It has the advantage of allowing for the estimation of feature importance, since weights w_i with large absolute values are considered more important due to their contribution to the net input z. Recall that in our case Proficiency_Level is the target variable, i.e., the computed values of φ(z) are used as an estimator of the probability that an individual with a particular profile (given by the model's independent variables) has a certain proficiency level, whereas the error annotations and the metadata variables described in section 3 constitute the model's independent variables.
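A minimal sketch of this set-up with scikit-learn follows; the random arrays are placeholders for the pre-processed GLCII features and labels, so the ranking it prints is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((4600, 17))           # placeholder feature matrix
y = rng.integers(0, 6, size=4600)    # placeholder levels: 0 (A1) .. 5 (C2)

# One-vs-Rest: one binary logistic model per proficiency level.
clf = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X, y)

# Feature importance: mean absolute weight across the six binary models.
importance = np.abs(clf.coef_).mean(axis=0)
print(np.argsort(importance)[::-1])  # feature indices ranked by importance
```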
5.2.2 Random Forests
The popularity of Random Forests has increased in recent years (Géron, 2022). Random Forests are collections of
Decision Trees introduced by Breiman (2001) to reduce overfitting, a common problem in decision tree learning.
Decision tree classifiers are supervised learning models that are very useful and successful in practice. They take a distinct approach in comparison to logistic regression: rather than combining input features linearly to calculate the output, a decision tree works by dividing the data through a set of questions answered at each stage.
By implementing the decision algorithm, we start at the tree root and split the data on the feature that results in the highest information gain, explained in more detail below. In an iterative process, we can then repeat this splitting at each child node until the leaves are pure, meaning that the training examples at each leaf all belong to the same class.
Figure 3: Example of a decision tree used for the prediction of Proficiency_Level as the target variable.
Figure 3 shows an example of a decision tree applied to our dataset, trying to predict Proficiency_Level using the features of our dataset. One can readily see the 'questions' asked by the decision algorithm and the separation of the training dataset into subsets depending on whether the answer is True or False. The root, for example, 'asks' whether Learning_Duration is less than or equal to 0.5. One can also see the value of the Gini impurity (an information-gain measure). It ranges between 0 and 1, and a Gini impurity equal to zero means that the set is completely pure, i.e., all its members belong to the same class.
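For concreteness, the Gini impurity of a node can be computed with a small helper like the following (our own illustration, not code from the GLCII pipeline):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0  -> a completely pure node
print(gini([0, 1, 2, 3]))  # 0.75 -> maximally mixed over four classes
```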
The Random Forest algorithm employs an ensemble of decision trees to reach a final decision, basically taking the view that if one tree is good, then many trees (a.k.a. a forest) should be even better, provided there is enough diversity between them. One of the most captivating aspects of Random Forests is the manner in which the algorithm creates randomness from an ordinary dataset.
The first of the methods that it uses is known as bagging. This technique involves training the trees on slightly altered data, obtained from bootstrap samples. Moreover, it is possible to add randomness by limiting the choices that a decision tree can make. At each node, a random subset of the features is given to the tree, and it can only pick from that subset rather than from the whole set. As well as increasing the randomness in the training of each tree, this also speeds up the training, since there are fewer features to search over at each stage. Of course, it does introduce a new parameter, i.e., the number of features to consider, but the random forest does not seem to be very sensitive to this parameter; in practice, a subset size equal to the square root of the number of features is common. The effect of these two forms of randomness is to reduce the variance without affecting the bias. Another benefit is that there is no need to prune the trees. One further parameter remains to be chosen, namely the number of trees to put into the forest. However, this is easy to pick if we aim at optimal results: we can keep on building trees until the error stops decreasing. Once the set of trees is trained, the output of the forest is the majority vote for classification or the mean response for regression. Random forests are ideal tools for exploring and analyzing learner corpora because they can determine the significance of each feature through the information gain at each stage of the decision algorithm.
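A corresponding sketch with scikit-learn, again on placeholder data, shows how a trained forest exposes per-feature importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((4600, 17))           # placeholder feature matrix
y = rng.integers(0, 6, size=4600)    # placeholder proficiency labels

# Bagged trees; max_features="sqrt" draws a random feature subset per split.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.feature_importances_)   # mean impurity decrease per feature
```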
5.2.3 Feature importance
Our goal in this section is to implement both logistic regression and random forests and estimate the importance of
the various features of our dataset. Such an approach has been successfully used in the past in various scientific
contexts (e.g., Kosmidis et al 2020). As mentioned above, the main reason behind the choice of these two supervised
learning algorithms was precisely that they allow feature importance estimation. Since the algorithmic processes of
the two methods are vastly different, we are confident that features that appear high in the importance ranking of both
methods have high predictive power and merit a closer consideration by both SLA and other related applied linguistic
fields.
Figure 4: The estimation of the proficiency level is treated as a supervised learning problem. Two different classification methods were used and the ten most important features for each method were calculated: feature importance according to Logistic Regression (on the left) and Random Forests (on the right).
Figure 4 shows the results of treating the estimation of Proficiency_Level as a supervised learning problem. Our application of machine learning methods does not aim to develop the optimal classifier. Of course, this is a legitimate practical task, and several contests have been held in the past concentrating on precisely that (see, for example, Ballier et al., 2020). However, our aim is to engineer suitable classifiers, i.e., classifiers with a satisfactory level of performance that allow us to estimate the importance of the different features in the classification process. For this purpose, we utilized the Logistic Regression and Random Forest classifiers. Unlike other classification algorithms, these two fit our purpose of assigning importance to the predictor variables well, albeit in different manners.
We also chose to omit the Age feature, since its predictive value, although high, may be dependent on our particular dataset. Note that the inclusion of the Age feature improves the performance of our classifiers on the test set even more. However, the dataset contains several individuals aged around 18 who are at the B1 proficiency level; thus, any algorithm may memorize this trait and be successful. Such success is not what we aim for; therefore, we decided to exclude Age from our feature list.
We split our dataset into two subsets: the first 75% was devoted to training and the remaining 25% to testing. After training our classifiers on the training subset, we evaluated them on the testing subset. The performance of both classifiers is quite remarkable: Logistic Regression achieved an accuracy score of 0.778, while Random Forests scored 0.938. These scores are much higher than that of the benchmark dummy classifier, a classifier that predicts the most frequent class for every instance and essentially has no predictive power, which in this case achieved an accuracy score of 0.430.
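The evaluation set-up can be sketched as follows; the placeholder data mean that the scores printed by this toy run will not reproduce the figures reported above, which come from the actual GLCII features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((4600, 17)), rng.integers(0, 6, size=4600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

for model in (DummyClassifier(strategy="most_frequent"),  # benchmark baseline
              LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    score = model.fit(X_tr, y_tr).score(X_te, y_te)       # accuracy on the 25% test set
    print(type(model).__name__, round(score, 3))
```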
Figure 4 displays a bar plot of the ten most important features for each classification method. As the two classifiers differ completely on both theoretical and algorithmic grounds, we can be confident that the features that appear to be of high value in both cases are worthy of further examination. For instance, L1, Country_Origin and Other_Languages rank high in both methods. It is possible that a causal relationship exists between these features and the target variable, i.e., Proficiency_Level.10 Noteworthy linguistic features, such as Target_Form, Error_Subtype, Actual_Genre and Expected_Genre, are utilized successfully by both algorithms in predicting Proficiency_Level, since they are ranked highly by both. These findings highlight the usefulness of our strategy of using two interpretable algorithmic processes to point out the important features for proficiency-level estimation. In particular, they suggest that the text type, the target form and the error type are key factors for estimating the proficiency level of an L2 learner.
5.2.4 The confusion matrix of the Random Forest Classifier
Figure 5 presents the confusion matrix of the Random Forest classifier for the prediction of learners' proficiency level. The largest confusion rates, as expected, are observed between consecutive classes, meaning that misclassified students tend to be assigned to a proficiency level immediately above or below their actual one, e.g., a learner whose actual level is B1 may be classified as A2 or B2. There are, however, some interesting cases, e.g., nine cases with true label B2 that were erroneously predicted as A2, and seven cases predicted as B2 instead of the actual level of A2, a distance of two levels. The reason for the discrepancy in these cases calls for further investigation and suggests that the confusion matrix may function as a proxy for tracing irregularities and potentially interesting cases. In particular, studying the confusion matrix may lead to useful conclusions as to why the proficiency level of a learner is misclassified. In general, we have very few costly errors, i.e., mislabeled samples whose true label is several classes apart from the predicted one. Lastly, the limited number of mismatches in the confusion matrix demonstrates the usefulness of the error-annotated dataset and the (extra-)linguistic profile variables for estimating the proficiency level of L2 Greek learners.
10 However, this needs to be investigated further, to determine whether any possible correlations between L1 and Proficiency_Level are specific to the dataset.
Figure 5: Confusion matrix of the Random Forest classifier for the prediction of Proficiency_Level. Darker colors
indicate lower values.
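A confusion matrix like the one in figure 5 can be produced as follows (placeholder data again; scikit-learn places true labels on the rows and predicted labels on the columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((4600, 17)), rng.integers(0, 6, size=4600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, forest.predict(X_te))  # rows: true A1..C2; columns: predicted
print(cm)
```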
6. Focusing on the relationship between Error_Type and Proficiency_Level
Motivated by the preceding outcomes that address the second and third questions of the paper, we were drawn to use
traditional statistical tools to investigate the correlation between Proficiency_Level and Error_Type. Two interesting
quantities that can be further explored are the unconditional probability P(Error_Type) and the conditional probability P(Error_Type | Proficiency_Level) (Bertsekas & Tsitsiklis, 2008; Freedman et al., 2007). In particular, the ratio

$$\frac{P(\text{Error\_Type} \mid \text{Proficiency\_Level})}{P(\text{Error\_Type})}$$

computed for all pairs of values of the two variables in figure 6, may help us gain useful insights into the relationship between specific pairs of values of these two variables. If the two variables, Proficiency_Level and Error_Type, were statistically independent, the ratio would be equal to one everywhere. Focusing on figure 6, which depicts the ratios for the different pairings of values, we observe some quite intriguing findings. Given that an error has been made by someone at level C2, it is much more likely that this error is of the Voice type (2.5 times more likely than expected), while Case or Gender errors are unlikely at the C2 proficiency level. Part of the reason why there is such a high likelihood of C2
learners producing an error in the grammatical domain of voice is that, at this stage, learners start to experiment with Greek passive forms in various contexts. This is in accordance with the intuition of applied linguists and teachers of L2 Greek, since learners of lower proficiency levels rarely utilize the passive voice. Moreover, creating and applying a series of regular expressions to capture passive forms across GLCII strongly confirms the point that L2 Greek learners at C2 use passive forms regularly. Furthermore, given that an error has been made by someone at level A1, it is much more likely to be an Agreement error. While, in principle, the ratio can be calculated to address the relative likelihood of other variable combinations, e.g., between Error_Subtype and L1, the resulting figures are practically hard to visualize and, consequently, to interpret, due to the existence of too many categories; for example, there are 34 error subtype values.
Figure 6: Ratio of the conditional probability P(Error_Type|Proficiency_Level) to the unconditional probability
P(Error_Type).
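The ratio itself is straightforward to compute from the Proficiency_Level × Error_Type contingency table; the sketch below reuses the illustrative counts of the CA example rather than the actual GLCII table.

```python
import numpy as np

# Illustrative counts: rows A1..C2, columns Agreement, Case, Gender, Aspect, Voice.
N = np.array([[120,  40,  60,  20,  10],
              [150,  80,  90,  40,  20],
              [130, 110, 120,  70,  40],
              [ 60, 120, 100,  90,  60],
              [ 30,  70,  60, 100,  80],
              [ 10,  20,  15,  40,  50]], dtype=float)

joint = N / N.sum()              # P(Proficiency_Level, Error_Type)
p_error = joint.sum(axis=0)      # P(Error_Type)
p_level = joint.sum(axis=1)      # P(Proficiency_Level)
cond = joint / p_level[:, None]  # P(Error_Type | Proficiency_Level)
ratio = cond / p_error           # equals 1 everywhere under independence
print(ratio.round(2))
```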
Thus, apart from the machine learning approaches utilized in section 5, which revealed the link between Error_Type and Proficiency_Level (see figure 4), the more traditional approach of using a relative likelihood ratio, illustrated in figure 6, can be employed complementarily to highlight further aspects of the above-mentioned connection. Lastly, such a ratio may also complement the CA biplot displayed in figure 2, which demonstrates the association between the two categorical variables in terms of their distances in the biplot.
7. Results and Discussion
There are several conclusions that can be drawn from the application of ML methods to well-documented and error-
annotated learner corpora. By including information on a) the (extra-)linguistic profile of the learners and b) learners’
grammatical error profile for key grammatical domains, such as aspect, voice, agreement, case and gender, in the
training and test datasets, we were able to better reflect and/or reveal their value for learning, teaching and researching
Greek as a second language. Although the richness and value of the aforementioned information has been indicated both by theoretical hypotheses and experimental results put forward in SLA and by empirical observations made by teachers in class, it is the first time that ML methods have been asked to exploit a manually error-annotated learner corpus so that: a) the above-mentioned prior hypotheses could be either confirmed or falsified, b) useful hidden aspects of the L2-Greek learning process related to learners' linguistic profile could be revealed in a way that was not possible earlier, and c) the relative importance of the five grammatical error domains and of the (extra-)linguistic profile could be traced across different proficiency levels based on the real evidence of learners' error-annotated productions recorded in a learner corpus. The current study strove to show the relevance and usefulness of machine learning methods in these three areas, which helped us gain insights into learners' interlanguage.
From a bird's-eye view, this paper offers two main contributions: the first is of a methodological and the second of a theoretical nature, both relevant to the three questions posed in section 1. Methodologically, both supervised and unsupervised learning methods have been used in the following way: the first two questions were transformed into exploratory questions that could be answered with two unsupervised learning techniques, while the third question was translated into a classification problem, for which two ML classifiers, logistic regression and random forests, were used to evaluate the importance of the independent variables in estimating the proficiency level of L2 Greek learners. Regarding the use of supervised learning methods, our strategy was to initially build a randomized model and compare it with our error-informed model, in order to track the effectiveness of including these specific grammatical error domains of a learner corpus in predicting the proficiency level of learners. It is our conviction that similar questions of a theoretical nature pertaining to SLA and learner corpus research may be translated into either clustering or classification problems and addressed using ML techniques.
Summarizing the theoretical findings of this paper:
- It was determined that it is feasible to use the error-annotated data in combination with the (extra-)linguistic profile metadata to detect tendencies associated with aspects of the learner's interlanguage. Specifically, our paper suggests that L2 Greek usage in communication stands out as a hidden factor of the GLCII dataset. Figure 1 shows that learners who were assigned a value greater than 1 on PC1 fit very well with a "No" response on all four communication-related questions, while the combination of values PC1>1 and PC2<0 amplifies the effect, namely that the L2 Greek learner does not use Greek regularly in any of the controlled communication circumstances. On the contrary, a combination of values PC1<1 and PC2>0 leads to exactly the opposite effect, namely that of L2 Greek learners who communicate in Greek regularly in contexts outside of an instructional setting. This finding emphasizes the strength of the effect of communication in L2 during second language acquisition.
- Using CA, a revealing perceptual map was created that uncovered hidden correlations between the values of the two crucial categorical variables, Proficiency_Level and Error_Type. In particular, the first dimension, Dim1, is aligned with the learner's proficiency level, as higher values on Dim1 point to a lower proficiency level and lower values to a higher one, implying that the majority of the variability in the data is described by the proficiency level. Moreover, visualizing the values of Error_Type on the biplot, and their positions in relation to the different proficiency levels, further accentuates the relationship between the two variables.
- The two classifiers that have been used, Logistic Regression and Random Forests, provide the power to rank the independent variables in terms of feature importance for estimating the proficiency level. Notably, the two algorithms are in agreement on the majority of the most highly ranked variables, namely Target_Form, Error_Subtype, Actual_Genre and Expected_Genre.
- Studying the confusion matrix of the Random Forest classifier provides an indirect but strong indicator of the value of the error-annotated dataset, along with the recorded (extra-)linguistic profile variables, for correctly predicting learners' proficiency level. Misclassifications may be employed for tracing irregularities and potentially interesting cases.
- Lastly, the relative likelihood ratio between the different error types and proficiency levels identifies the probabilities that specific error types occur given the learner's proficiency level, and may lead second language teachers to design new targeted teaching interventions and applied linguists, such as psycholinguists, to exploit them for further experimentation.
One of the tenets of any machine learning algorithm is that the quality of the results is directly proportional to the quality of the dataset that we feed it. All the above findings are reliable to the degree that GLCII complies with high standards of corpus compilation. It is our strong conviction that domain experts in SLA and other applied linguistic fields who aim at revealing deep properties of the L2 learning/teaching process should take seriously into consideration the possibility of adding annotation layers to qualitative learner corpora and should tailor their analyses with creative ways of using ML methods, in a similar spirit to the one described in our paper.
Literature
1. Ballier, N., Canu, S., Petitjean, C., Gasso, G., & Balhana, C. (2020). Machine learning for learner English: A plea for creating learner data challenges. International Journal of Learner Corpus Research, 6(1), 72-103. https://doi.org/10.1075/ijlcr.18012.bal
2. Beh, E. J., & Lombardo, R. (2021). An Introduction to Correspondence Analysis. Hoboken, NJ: Wiley.
3. Bertsekas, D. & Tsitsiklis, J. N. (2008). Introduction to probability (Vol. 1). Athena Scientific.
4. Callies, M., & Paquot, M. (2015). Learner Corpus Research: An interdisciplinary field on the move. International Journal of Learner Corpus Research, 1(1), 1-6.
5. Eslon, P., & Metslang, H. (2007). Learner language and the Estonian Interlanguage Corpus. Eesti Rakenduslingvistika Ühingu aastaraamat / Estonian Papers in Applied Linguistics.
6. Jarvis, S., & Paquot, M. (2015). Learner corpora and native language identification. In S. Granger, G. Gilquin,
& F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (Cambridge Handbooks in
Language and Linguistics, pp. 605-628). Cambridge: Cambridge University Press.
doi:10.1017/CBO9781139649414.027.
7. Géron, A. (2022). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media,
Inc.
8. Greenacre, M. (2021). Correspondence Analysis in Practice (third ed.). London: CRC PRESS.
9. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
10. Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (fourth ed.). W. W. Norton & Company.
11. Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier
(Eds.), The Cambridge Handbook of Learner Corpus Research (Cambridge Handbooks in Language and
Linguistics, pp. 9-34). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139649414.002.
12. Granger, S. (2009). The contribution of learner corpora to second language acquisition and foreign language
teaching: a critical evaluation. In Aijmer, K. (Ed.), Corpora and language teaching (pp. 13-32). Amsterdam:
John Benjamins.
13. Granger, S., Gilquin, G., & Meunier, F. (Eds.). (2015). The Cambridge Handbook of Learner Corpus
Research (Cambridge Handbooks in Language and Linguistics). Cambridge: Cambridge University Press.
doi:10.1017/CBO9781139649414
14. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (secnd ed.). New York:
Springer.
15. Hosmer, D. (2013). Applied logistic regression. Hoboken, New Jersey: Wiley.
16. Kosmidis, K., Jablonski, K. P., Muskhelishvili, G., & Hütt, M. T. (2020). Chromosomal origin of replication
coordinates logically distinct types of bacterial genetic regulation. NPJ systems biology and applications,
6(1), 1-9.
17. Leacock, C., Chodorow, M., Gamon, M. & Tetreault J. (2014). Automated Grammatical Error Detection for
Language Learners (second ed.). Morgan & Claypool Publishers.
18. Lozano, C. & Mendikoetxea, A. (2013). Learner corpora and second language acquisition: The design and
collection of CEDEL2. In A. Díaz-Negrillo, N. Ballier & P. Thompson (Eds.), Automatic treatment and
analysis of learner corpus data (pp. 65-100). Amsterdam: John Benjamins.
19. Mackey, A., & Gass, S.M. (2021). Second Language Research: Methodology and Design (third ed.).
Routledge. https://doi.org/10.4324/9781315750606
20. McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based Language Studies: An Advanced Resource Book.
London: Taylor & Francis.
21. Mendikoetxea., A. (2014). Corpus-based research in second language Spanish. In Geeslin KL (Ed.) The
handbook of Spanish second language acquisition. Oxford: Wiley-Blackwell, pp. 1129.
22. Myles, F. (2015). Second language acquisition theory and learner corpus research. In Granger, S., Gilquin,
G., & Meunier, F. (Eds.), The Cambridge Handbook of Learner Corpus Research (pp. 309332). Cambridge:
Cambridge University Press.
23. Paquot, M. & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus research.
International Journal of Learner Corpus Research, 3 (1), 6194.
24. Pedregosa, F., Gaël V., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss
R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay, E. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
25. Tracy-Ventura, N & Paquot, M. (2021). SLA and Corpora: An Overview. In N. Tracy-Ventura & M. Paquot
(Eds.), The Routledge handbook of second language acquisition and corpora (pp. 408-422). Abingdon:
Routledge.
26. Tracy-Ventura, N., Paquot, M., & Myles, F. (2021). The future of corpora in SLA. In N. Tracy -Ventura &
M. Paquot (Eds.), The Routledge handbook of second language acquisition and corpora (pp. 1-7). Routledge.
27. Vajjala, S. & Loo, K. (2014). Automatic CEFR level prediction for Estonian learner text. NEALT Proceedings
Series (22:9, pp. 113128). Linköping Electronic Conference Proceedings.