All content in this area was uploaded by Fanny Meunier on Dec 21, 2020
Fanny Meunier Introduction to Learner Corpus Research
Introduction
Work in Learner Corpus Research (LCR) started around the 1980s as “an offshoot of corpus linguistics” (Granger et al., 2015, p. 1). Corpus linguistics and LCR share a set of common features, among
which is the use of corpora and corpus tools to analyze language. A corpus is defined by McEnery et
al. (2006, p. 5) as a “collection of machine-readable authentic texts (including transcripts of spoken
data) which is sampled to be representative of a particular language or language variety”. A learner
corpus is thus a specific type of corpus which, to follow up on McEnery et al.’s definition, can
broadly be defined as a collection of machine-readable texts consisting of representative samples
of the language written and/or spoken by learners of an additional language (viz. not their mother
tongue, but a foreign/second/nth target language). LCR uses learner corpus data as its main data
source. As for the results of learner corpus studies, they typically serve two main purposes:
• inform SLA research,
• provide useful input for applied projects (including the creation or improvement of teaching
materials/approaches, or the training/development of Natural Language Processing tools).
A comparison of the oft-cited definitions of LCR (see Gilquin, 2015 for more details) reveals
that one of the key features of learner corpora is that the language they contain is meant to be
as authentic as possible and is often defined as (near-) natural. As explained by Granger (2008,
p. 337), “the term near-natural is used to highlight the ‘need for data that reflects as closely as
possible ‘natural’ language use (i.e. language that is situationally and interactionally authentic)
while recognizing that the limitations facing the collection of such data often obligate researchers
to resort to clinically elicited data (for example, by using pedagogic tasks (Ellis & Barkhuizen,
2005, p. 7))”. As cases of purely spontaneous oral or written learners’ productions are rare¹ – or,
when they take place, cannot easily be ‘spontaneously collected’ – pedagogic tasks serve as the
main prompts to (near)-natural learner language productions.
Another key feature is that the texts² included in learner corpora have been selected on the
basis of a number of criteria or variables related to, among others:
• the learners themselves (e.g. target language, mother tongue, proficiency level),
• the type of communicative task (e.g. written/oral communication, descriptive/persuasive/narrative/expository writing/narrative, informal/formal level),
• the contextual conditions of language production or task setting (e.g. interactive tasks, computer-mediated communication, use of reference tools or not).
The criteria/variables listed above are typically used as metadata to organize the electronic stor-
age of the data in large databases that can later be queried. Researchers can, for instance, select
sub-sections of the data collected (e.g. only texts spoken by female learners of German as a
foreign language/from a lower beginner level/collected during an informal discussion). The vari-
ables can later serve as dependent/independent/predictive variables in the linguistic analyses car-
ried out (see section 3 for more details).
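The metadata-based selection described above can be sketched in a few lines of Python. This is a minimal illustration only: the `LearnerText` record, its field names, the `select_subcorpus` helper, and the toy records are all assumptions made for the example and do not reflect the storage format of any actual learner corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerText:
    """One text plus the metadata recorded at collection time (illustrative fields)."""
    text_id: str
    target_language: str
    mother_tongue: str
    proficiency: str
    gender: str
    mode: str   # "spoken" or "written"
    task: str
    text: str


def select_subcorpus(corpus, **criteria):
    """Keep only the texts whose metadata match every criterion."""
    return [
        t for t in corpus
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented toy records, for illustration only.
corpus = [
    LearnerText("t001", "German", "French", "lower beginner", "female",
                "spoken", "informal discussion", "..."),
    LearnerText("t002", "German", "French", "advanced", "female",
                "written", "argumentative essay", "..."),
    LearnerText("t003", "German", "Dutch", "lower beginner", "male",
                "spoken", "informal discussion", "..."),
]

subcorpus = select_subcorpus(
    corpus,
    target_language="German",
    gender="female",
    proficiency="lower beginner",
    task="informal discussion",
)
print([t.text_id for t in subcorpus])
```

The same metadata fields that drive the selection can later be exported alongside the linguistic counts, so that they are available as variables in the statistical analysis.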
The learners’ initial productions (often called ‘raw’ texts) are often further annotated to enable
researchers to access richly annotated data. The texts can be annotated automatically with the
help of natural language processing tools, edited with the help of semi-automated tools, or anno-
tated fully manually. Some examples of typical linguistic annotations include:
• automatic part-of-speech tagging: each word in the corpus is attributed a part-of-speech
(noun, verb, adjective, etc.) thanks to the help of fully automatic part-of-speech tagging
software (see Chapters 5 and 6, this volume, for more information);
• computer-aided error annotation (CEA): as learner corpora are produced by learners, some
researchers may be interested in spotting areas of difficulty that learners have in producing
an L2. Annotating these aspects makes it possible to subsequently focus on them to help
foster learners’ proficiency. Errors/infelicities in the corpus are first spotted by researchers
who then use an editor to insert codes in the corpus (e.g. a plural determiner followed by a
singular common noun can receive a ‘noun number’ error code). More details on CEA can
be found in Chapter 7.
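Once error codes have been inserted in the text, they can be retrieved and counted automatically. The sketch below assumes a purely hypothetical inline format, `<err code="...">...</err>`, and invented codes (`GNN` for noun number, `GVT` for verb tense); actual error-tagging systems use their own tagsets and markup.

```python
import re
from collections import Counter

# Hypothetical inline markup: <err code="GNN">erroneous span</err>
ERROR_TAG = re.compile(r'<err code="(?P<code>[A-Z]+)">(?P<span>.*?)</err>')


def count_error_codes(annotated_text):
    """Count how often each error code occurs in an annotated text."""
    return Counter(m.group("code") for m in ERROR_TAG.finditer(annotated_text))


def error_spans(annotated_text, code):
    """Return the learner wording tagged with a given error code."""
    return [m.group("span") for m in ERROR_TAG.finditer(annotated_text)
            if m.group("code") == code]


# Invented learner sentence, for illustration only.
sample = ('I bought <err code="GNN">these book</err> yesterday and '
          '<err code="GVT">I read</err> them tomorrow with '
          '<err code="GNN">many friend</err>.')

error_counts = count_error_codes(sample)
print(error_counts)
print(error_spans(sample, "GNN"))
```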
As the two examples provided above show, annotations may include fully automatic tools
(such as part-of-speech taggers, semantic taggers, or syntactic parsers), but also semi-auto-
matic annotation tools requiring human intervention before the analysis can be done. Some
annotations can also be done fully manually by researchers (by inserting codes in the text
using text processing tools, for example) when the analysis cannot be automated, as would
be the case for the inclusion of non-verbal comments in transcripts of videoed interactions
([laughs], [unfilled pauses], [gestures], [contextual comments], etc.). This type of tagging is
often referred to as problem-oriented annotation/tagging, viz. the manual annotation by the
researcher of any feature of interest.
A last key feature is that learner corpora, like any other type of corpus, can be queried using
corpus tools such as concordancers (see Chapter 6, this volume), which can be used to:
• extract word lists, word combinations, tags, keywords or annotations,
• display occurrences of words/phrases/tags in the selected corpus,
• compare different subcorpora in terms of keywords, frequency distribution of items, etc.
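To make the first two functions concrete, here is a minimal sketch of a frequency list and a keyword-in-context (KWIC) display. The tokenizer is deliberately naive and the sample sentence is invented; dedicated concordancers handle tokenization, annotation layers, and sorting far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase alphabetic strings and apostrophes only."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens):
    """Frequency list in decreasing order of frequency."""
    return Counter(tokens).most_common()


def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Invented sample text, for illustration only.
text = "I think that the essay is good. I think the topic is hard."
tokens = tokenize(text)

for left, node, right in kwic(tokens, "think", window=2):
    print(f"{left:>12} [{node}] {right}")
```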
Given the space limitations of an introductory chapter, it is not really possible to describe in
detail all the tools that can be used to annotate or query learner corpora. I thus warmly recom-
mend the Tools for Corpus Linguistics webpage to readers (see https://corpus-analysis.com/) as
it offers an impressive list of 228 corpus tools, each described in terms of the following aspects:
name, description, categories, platform, and pricing, plus a link to the tool. I also refer readers
to the software index of the Handbook of Learner Corpus Research (Granger et al., 2015) as it
contains a list of over 80 tools (annotation tools, DDL tools, statistical packages, text retrieval
tools, iCALL and CALL packages, etc.) whose concrete use, description and illustration can be
found in the handbook.
As can be seen from the list of key features above, technology is clearly part and parcel of
LCR. Thanks to giant strides in computer technology in the last quarter of the 20th century, it
became possible to collect data from much larger cohorts of learners and to use computer software to assist researchers in the annotation and analysis of the data. The affordances of technology
also made it possible to perform data analyses that were either previously not conceivable or,
at least, not feasible in a reasonable timeframe and at a reasonable cost. As the second section of
this chapter will show, LCR has evolved remarkably through its three decades of existence. Just
as learners typically go through stages of development in their learning of an additional language,
LCR also evolved from a novice field (filled with the excitement that usually goes with novelty)
towards more competent and reflective practices. This evolution has impacted most of the core
features and issues in LCR, as will be shown in the next section.
Core Issues
Size, Collection, Variables, and Analysis: Limits and Strengths
Access to large (for the time) electronic learner corpora in the 1990s led to a revolution in the
way learner language was analyzed and described. The first learner corpora that exceeded one
million word tokens in size were collected, and the new options offered by automatic corpus
analysis tools (word lists ordered in decreasing order of frequency, retrieval of words in contexts through concordancers, automatic part-of-speech tagging, etc.) offered unprecedented
insights into learner language. The urge to get access to previously inaccessible frequency
information led to a ‘descriptive fever’ (analysis of productions by numerous learners; lists of
the top n words in a corpus; frequency of errors; most frequently used verbs; overused and/
or underused linguistic items, often in relation to an L1 corpus etc.). The term ‘fever’ is not
used here in any derogatory way but simply points to the focus of interest at the time, even if
cautionary tales were already given. Granger (1994, p. 27) warned readers that “quantitative
data should not be regarded as an end in itself” but rather “as a springboard for a qualitative
investigation of the data” and of its patterns of use. Such cautionary tales notwithstanding, it
must be acknowledged that numerous publications back then were essentially descriptive with
frequency lists being provided and compared, with – in many cases – no clear reference to SLA
theories, except for the sometimes simplistic reference to transfer. This led some researchers to consider learner corpus linguistics as synonymous with distributional number crunching,
which – despite the limitations mentioned above – also constituted an unfair shortcut. Granger
(2009) responded to criticism levelled against LCR and the lack of collaboration between LCR
and SLA by pointing out that one of the main assets of the former is that it brings to the SLA field
a much wider empirical basis than previously available. She also explained that learner corpora
which have been collected on the basis of strict, well-described criteria and which have been
stored in easily queryable databases contain data from hundreds and sometimes thousands of
learners, which greatly enhances the representativeness of the data. It also makes controlling the many
variables that affect learner production possible.
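The overuse/underuse comparisons mentioned above rest on normalized frequencies and a test of whether the difference between two corpora is larger than chance would suggest. The sketch below uses the log-likelihood statistic, which is widely used in corpus linguistics for this purpose; its use here is an assumption made for illustration, not a method prescribed in this chapter, and all counts are invented.

```python
import math


def relative_freq(count, corpus_size, per=1_000_000):
    """Frequency normalized per `per` tokens (per million by default)."""
    return count / corpus_size * per


def log_likelihood(count_learner, size_learner, count_ref, size_ref):
    """Log-likelihood (G2) for one item in a learner vs. a reference corpus."""
    total = count_learner + count_ref
    expected_learner = size_learner * total / (size_learner + size_ref)
    expected_ref = size_ref * total / (size_learner + size_ref)
    ll = 0.0
    for observed, expected in ((count_learner, expected_learner),
                               (count_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def usage_direction(count_learner, size_learner, count_ref, size_ref):
    """Label the learner frequency relative to the reference corpus."""
    learner = relative_freq(count_learner, size_learner)
    reference = relative_freq(count_ref, size_ref)
    if learner > reference:
        return "overuse"
    if learner < reference:
        return "underuse"
    return "equal"


# Invented counts: an item occurring 150 times in 100,000 learner tokens
# and 100 times in 100,000 reference tokens.
ll = log_likelihood(150, 100_000, 100, 100_000)
print(usage_direction(150, 100_000, 100, 100_000), round(ll, 2))
```

A value above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom. As Granger's warning quoted above makes clear, such a figure is a starting point for qualitative inspection of the data rather than a result in itself.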
Over the years, practices in LCR have also evolved significantly, moving from a focus on one
main variable (mother tongue background) to studies analyzing the effects of and/or relationships
between a much wider range of variables. Examples include planning time (Ädel, 2008), time of
exposure/learning (Meunier & Littré, 2013), genre (Gentil & Meunier, 2018), or a combination
of variables such as learning context and emotional aspects (De Smet et al., 2018).
Overall, the initial criticism levelled against LCR – be it fair or not – proved very fruitful
as it prompted learner corpus researchers to explicitly verbalize the numerous advantages of
LCR and move the field further. Gries (2009, p. 2), for instance, argued that corpus linguistic
methods are “a method just as acceptability judgments, experimental data, etc.” are and that
“linguists of every theoretical persuasion can use corpus data”. He also explained that usage-
based cognitive-linguistic theories are particularly compatible with corpus linguistics methods,
thereby throwing the spotlight on some of the specific strengths of LCR. The constant questioning and reassessment of LCR led to a more reflective and competent practice in LCR, also
prompting the collection of a much larger variety of learner corpus types, which subsequently
opened up new avenues for analysis. Whilst the first learner corpora were mainly targeting
written L2 English by relatively advanced learners, typically university students, a much larger
range of target languages and text types has since been collected. The ‘Learner corpora around
the world’ webpage³ maintained by the Centre for English Corpus Linguistics at the Université
catholique de Louvain pays tribute to this variety of target languages (Arabic, French, German,
Korean, Spanish, etc.), text types and production conditions (exam essays, argumentative and
literary essays, letters, diaries, picture descriptions, book reviews, monologues, dialogues,
computer-mediated communication, mails, translations, etc.). Other welcome advances have
been made in terms of:
• proficiency levels (covering the whole range of proficiency levels, from beginners to
advanced) and types of learners (children, teenagers, adults, non-native ‘learners’ but also
non-native ‘users’ including teachers, heritage speakers, translators, etc.);
• variety of research designs (cross-sectional, quasi-longitudinal, longitudinal).
The publication of the first handbook of Learner Corpus Research (Granger et al., 2015) and the
launch of the first journal entirely devoted to LCR, the International Journal of Learner Corpus
Research (IJLCR) pay tribute to the variety of current LCR studies addressing areas as diverse
as interdisciplinarity (Callies & Paquot, 2015), linguistic innovations and creativity in non-native
Englishes (Deshors et al., 2018), and study quality (Paquot & Plonsky, 2017).
Other developments include the use of more complex statistical techniques to interpret quan-
titative data (see e.g. Gries, 2013) and the popularization of mixed-methods designs to comple-
ment LCR methods and studies (see Gilquin & Gries, 2009; Meunier & Littré, 2013). One of the
limitations of LCR is that some of the language features studied by researchers may not naturally
occur frequently enough in unconstrained, open-ended (semi-) authentic production. The collection and analysis of other data types to triangulate research results and offer converging or
diverging evidence is then particularly useful. Such data types may include experimental data,
questionnaires, semi-guided interviews, think-aloud protocols or ethnographic approaches (also
see Chapter 10, this volume).
The (Native Speaker) Norm/Myth?
Native corpus studies have demonstrated their added value in making it possible to compare dif-
ferent varieties of the same language, both synchronically and diachronically, and in providing
a more balanced/refined description of languages. For instance, books like Brief Grammar for
English (attributed to William Bullokar in 1586 and which aimed to show that English was as
rule-bound as Latin) were replaced by thick and detailed accounts like the Longman Grammar of
Spoken and Written English (Biber et al., 1999) where the grammatical specificities of various
text types/registers were minutely described.
The power of corpus data for comparing different language varieties is also a central asset
of LCR. As Granger (2015, p. 8) explains, two types of comparison appeared to be particularly
worthwhile in LCR:
• a comparison with native language (NL), seen as the ultimate attainment of learning a for-
eign/second language;
• a comparison of one sample of learner language (IL, for interlanguage) with other samples of
learner language, particularly from learners with different mother tongue backgrounds, for
example, E2F (the English produced by learners with French as an L1) vs E2G (the English
produced by learners with German as an L1) in Figure 3.1.
This double entry approach to LCR, conceptualized by Granger in 1996, was labelled Contrastive
Interlanguage Analysis, or CIA.
Whilst the IL vs IL approach has always been promoted and accepted, “CIA has been sub-
jected to a range of criticism, most targeted at the L1/L2 branch” (Granger, 2015, p. 13), which
prompted a new version of the methodology, abbreviated as CIA² (see Figure 3.2 for a visual
representation). Put briefly, the reference to native speaker language was interpreted as the recognition of one idealized native speaker norm and even labelled as “imperialistic assumptions about
the ownership of English” (Tan, 2005, p. 128). This was an unfair criticism, according to Granger
(2015, p. 15), as plenty of L1 standards (such as British, American, Australian, Canadian, Hong
Kong, India, Singapore, Sri Lanka, etc.) have been used as reference corpora for CIA studies.
In CIA², new terms have been proposed to avoid misunderstandings: RLV (for Reference
Language Varieties) and ILV (for Interlanguage Varieties). The use of RLV points to the large
number of different reference points against which learner data can be set (inner circle varieties
such as British or American English, outer circle varieties such as Indian or Singapore English),
as well as corpora of competent L2 user data, English as a Lingua Franca. As for ILVs, they refer
to learner language varieties, given the “highly variable nature of interlanguage” (Granger 2015,
p. 18). Comparing an ILV with an RLV makes it possible to better understand the processes at
play in the acquisition of that specific ILV. A comparison of various ILVs (e.g. learners of L2
English whose mother tongues are French, Dutch, Italian, Greek, or Finnish) can help detect
potential universal paths of acquisition versus L1-induced phenomena. A comparison of the language produced in various modes, genres or registers by the same learners (e.g. written
and oral productions by the same learners) can help researchers discover mode/genre/register-specific features (also see Chapter 8, this volume, for more details on comparing learner corpora).
Figure 3.1 Contrastive Interlanguage Analysis (Granger 1996) [diagram: CIA branches into NL vs IL (E1, E2) and IL vs IL (E2F, E2G, E2S, E2J)]
Figure 3.2 CIA² (Granger 2015, p. 17)
Applied Perspectives in LCR: The Continuous/Contextualized
Text Paradox in LCR
As explained in the introductory section, the applied perspectives of LCR are numerous.
Mukherjee (2009, p. 212) states that “[l]earner corpus analyses always, at least implicitly, raise
the question of what the language-pedagogical implications and applications might be” and
numerous publications have addressed the links between (learner) corpora and pedagogy (see
for instance Burnard & McEnery, 2000; Granger et al., 2002b; Granger, 2008; Aijmer, 2009;
Meunier, 2010). Learner corpora have been used to inform lexicography, syllabus design, materi-
als design, computer-aided language learning and pedagogical approaches such as data-driven
learning. Some learners’ dictionaries (e.g. the Longman Dictionary of Contemporary English
(2009) or the Cambridge Advanced Learner’s Dictionary (2008)) contain error notes intended
to help learners avoid common mistakes. Learner corpora have also been used by textbook writ-
ers to inform the design of tasks addressing typical problems that learners face (see the error
correction/rewriting exercises in the Grammar and Beyond textbook series⁴). Some large-scale
initiatives like the English Profile Project (see http://www.englishprofile.org/ for more details
and related publications) rely on learner corpus data to help teachers and educators understand
what aspects of English are typically learned/acquired at each level of the Common European
Framework of Reference for Languages (Council of Europe, 2001). Learner corpora can also
be used to create data-driven learning activities, i.e. the use of corpora and concordances (typi-
cally keywords presented in their context of production) so that learners can work as language
researchers in awareness-raising activities. Learners can check specific patterns of the use of
keywords in native speaker texts and then compare that use to learners’ productions.
Despite numerous publications on the pedagogical value of learner corpora, a lack of uptake of
corpus-informed pedagogy has been noted (Granger, 2009; McCarthy, 2008; Shirato & Stapleton,
2007; Römer, 2009; Wilson, 2013; Meunier, 2018). Besides the technical problems that are often
put forward, another reason that may explain the lack of uptake of corpus-informed pedagogy can
be found in – what I would label – the contextual/continuous text paradox in LCR. Proponents
of (learner) corpus studies lay strong emphasis on the fact that corpus data is unique in that it
contains continuous stretches of discourse (not single words, phrases or sentences) and consists
of contextualized data (i.e. data not produced in isolation but in the context of a meaningful, set
task). And yet, when it comes to pedagogical applications, the use of learner corpus data rarely
goes beyond the sentence level. This paradox would surely need to be addressed in the future to
pay better tribute to the uniqueness of corpus data and maybe also ensure a clearer understand-
ing on the part of learners and/or teachers of the usefulness of corpus-informed pedagogy (see
Section 6 for some suggestions).
Main Research Methods
As some aspects related to research methods also constitute core issues in LCR, they have been
addressed in the previous section (this includes contrastive interlanguage analysis and its evolution over time, as well as a number of issues related to annotation and corpus analysis tools). This
third section will thus focus on only two aspects: the basic types of corpus approaches and the
three main study designs typically used in LCR.
Corpus-based and Corpus-driven Approaches
Two basic approaches can be used to analyze a (learner) corpus. The corpus-based approach
uses corpora as a source of information to explore a theory or hypothesis, aiming to validate it,
refute it, or refine it. One concrete example would be the study of grammatical variation in terms
of dative alternation, as speakers have a choice between the prepositional dative construction
(e.g. give something to someone) and the double object construction (e.g. give someone something). SLA studies put forward different hypotheses when it comes to the dative alternation in
L2 English (result of lexicalized verbal preferences, order of acquisition of the two constructions,
etc.). As explained by Jäschke (2016, p. 19) “very few studies explored whether the learners’ use
and judgments of the two variants are governed by the same linguistic factors which have been
found to be predictive for English native speakers”. A corpus-based approach can be used in such
cases to explore learners’ actual use of dative constructions (as was done by Deshors, 2014).
In corpus-driven approaches, the corpus is viewed as a source of inspiration to formulate
hypotheses about language (Tognini-Bonelli, 2001, p. 84-5). “The role of the researcher is to for-
mulate questions and to draw conclusions derived from what corpus data reveal when subjected
to statistical analysis rather than using the data to test a research hypothesis by approaching a
corpus with a number of preconceived ideas” (Callies, 2015, p. 36). One concrete example of
such an approach can be found in Belz and Vyatkina (2008) who investigated the pedagogical
application of a learner corpus study in language teaching and in the developmental analysis of
language learning in an instructed setting. The authors used L1 German data as a baseline against
which learner German data was compared. Using a corpus-driven approach (thanks to a care-
ful qualitative – usage in context – follow-up analysis of frequency lists), they spotted learners’
emerging use of some focal features. These included the use of fixed and creative constructions of
the German modal particles ja, denn, doch, and mal. Such studies contribute to second language
acquisition research via dense documentation of micro-changes in learners’ language use over
time and to the formation of new hypotheses for future research.
Cross-sectional, Quasi-longitudinal, and Longitudinal Research Designs
Earlier studies in LCR were mostly cross-sectional, which means that they examined the lan-
guage behavior of a group or groups of language learners at a single point in their development.
Those studies usually compared one ILV with one or more other ILVs or with an RLV.
With a view to addressing developmental paths in SLA, researchers decided to carry out a
comparison of cross-sectional studies of different groups of learners at different developmental
stages, thereby adopting what Huat (2012, p. 197) calls a pseudo-longitudinal approach. The
learners’ productions are not from the same learners, hence the use of the ‘pseudo’ prefix, and
the ‘time’ variable is somehow measured by proxies such as age or proficiency level. In such
pseudo-longitudinal designs, researchers compare several groups of learners at different levels
of proficiency.
Cross-sectional and pseudo-longitudinal designs do not allow for the analysis of individual
development. Individual variation within each group or sub-group can however be analyzed – as
can group development in pseudo-longitudinal designs.
Longitudinal study designs, in contrast to the two previous types, follow the same individual(s)
over time. Longitudinal research is defined as ‘emphasizing the study of change and containing
at minimum three repeated observations on at least one of the substantive constructs of interest’
(Ployhart & Vandenberg, 2010, p. 97). As explained in Meunier (2015), the collection of longi-
tudinal learner corpus data is time- and cost-consuming, and the analysis can only start when the
entire data collection is over. Other issues include attrition (i.e. the sometimes significant number
of participants dropping out before data-collection points). Such obstacles probably account for
the scarcity of longitudinal studies in the early days of LCR.
In longitudinal study designs, group progress, individual variation within groups and individual trajectories can be analyzed. This requires the use of, for instance, multi-level modelling
– also referred to as hierarchical linear modelling or mixed-effects models (see Raudenbush
& Bryk, 2002; Baayen et al., 2008; Cunnings, 2012; Gries, 2015). “Multi-level modelling allows
a variety of predictors to be analyzed, with ‘time’ being a key predictor in longitudinal studies: do
participants become more proficient as time goes by and, if so, how strong is the effect of time?
Such statistical modelling can be applied to individuals within groups as well as to individuals as
individuals, by analyzing both endpoints and trajectories” (Meunier, 2015, p. 382).
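The distinction between individual trajectories and group progress can be illustrated with a deliberately simplified sketch. It is not a multi-level model: it fits a separate least-squares slope of score on time for each learner and then averages the slopes, whereas a mixed-effects model estimates group and individual effects jointly and would normally be fitted with dedicated statistical software. The learner labels and scores are invented, with three observations per learner in line with the minimum quoted above.

```python
from statistics import mean


def ols_slope(times, scores):
    """Least-squares slope of scores regressed on time for one learner."""
    t_bar, s_bar = mean(times), mean(scores)
    numerator = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, scores))
    denominator = sum((t - t_bar) ** 2 for t in times)
    return numerator / denominator


# Invented (time point, proficiency score) observations per learner.
observations = {
    "learner_A": [(0, 50.0), (1, 55.0), (2, 60.0)],
    "learner_B": [(0, 40.0), (1, 42.0), (2, 44.0)],
    "learner_C": [(0, 60.0), (1, 59.0), (2, 58.0)],
}

# Individual trajectories: one slope per learner.
slopes = {learner: ols_slope(*zip(*points))
          for learner, points in observations.items()}

# Group progress: the average change per time unit across learners.
group_slope = mean(slopes.values())

print(slopes)
print(group_slope)
```

In this toy data set the group improves on average, yet one learner's scores decline, which is exactly the kind of individual variation that aggregate cross-sectional figures conceal.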
With the benefit of hindsight, it can be argued that there has been a true qualitative evolution
over time in the research methods used in LCR. The field broadly evolved from the descriptive
analysis of aggregate data in cross-sectional designs to the use of inferential statistics and a focus
on intra- and inter-learner variability in more complex types of designs (including longitudinal
studies and mixed-methods approaches).
Representative Corpora and Research
The International Corpus of Learner English (ICLE) is probably the exemplar of first-generation learner corpora. It has been and is still being used massively by learner corpus researchers.
Its first edition (Granger et al., 2002a) – resulting from ten years of international collaboration
between numerous universities – contained 2.5 million words of English (mostly argumentative essays by university students of English) written by learners from eleven different mother
tongue backgrounds and was released in CD-ROM format including an interface to compile
tailor-made subcorpora on the basis of a set of predefined learner or task variables. The second
version (Granger et al., 2009) differed from the first one in scope (larger amount and greater
diversity of the learner data included) and in functionalities. It included a built-in concordancer
and direct links between learner profile information and search results. A third extended and
web-based version of ICLE (Granger et al., 2020) will soon be available. Ädel’s (2008) study,
presented below, is based on ICLE v1.
Ädel (2008) examines variables related to task effects on language use. The research question
addresses how the variable of ‘task setting’ (time and reference sources available) affects the
learners’ writing styles on the written/spoken continuum. She uses the concepts of ‘involvement’
and ‘detachment’ typically used in variationist corpus-based approaches to language, with informal speech typically characterized as involved (first-person reference, emphatic particles, etc.)
and formal writing as detached (passive constructions, inanimate subjects, etc.). The learners
were university students of English with L1 Swedish who wrote argumentative and expository
essays for the Swedish subcorpus of the International Corpus of Learner English (SWICLE:
Granger et al., 2009) and the Uppsala Student English Corpus (USE; Axelsson, 2000). The over-
all results of the study show that learners exhibit more involvement features in timed than in
untimed essays but less if they have access to source texts. In addition to a possible lack of regis-
ter awareness, the study reveals that the extreme rate of involvement found in SWICLE is rather
linked to both the lack of time that writers have to make the text more written-like and the lack
of model texts to rely on.
The Longitudinal Database of Learner English (LONGDALE) (Meunier, 2016) is one rep-
resentative example of truly longitudinal learner corpora. It currently contains data collected by
five teams (Radboud University Nijmegen (the Netherlands), University of Hannover (Germany),
University of Louvain (Belgium), University of Padua (Italy), and University Paris-Diderot
(France)). The same students are followed over a period of at least three years and data collections are organized at least once per year, with some teams organizing several data collections
per year. The term ‘database’ (rather than ‘corpus’) has been used from the onset of the project
as LONGDALE includes a wide range of data types including argumentative essays, narratives,
and informal interviews, but also more guided types of productions (such as picture descriptions).
Experimental data is also included for some of the subcorpora. The database also includes comprehensive learner profile information which is gathered during each data collection. The study
presented below focuses on the acquisition of phonology/pronunciation.
Méli (2013) analyzes the segmental realizations of French learners of English with a view to
checking whether ‘perceived dissimilarity’ is a hindrance or an advantage for the L2 acquisition
of sounds that do not exist in the learner’s L1. He focusses on the realizations of the interdental
fricative, as well as some of the phonemic vowel length asymmetries for vowels, {/i/ in French,
/i/-/i:/ in English} and {/u/ in French, /u/-/u:/ in English}. Eighteen students were recorded longitudinally over three years (the data of years two and four are used in the present study). The
acoustic characteristics of some features of native speech were compared – using the PRAAT
software (Boersma, 2001) – to learners’ realizations of the same sounds using the Bark Difference
Metric method⁵. The analysis of the interdental fricatives stresses the importance of phonotactics
(i.e. the syntax of phonemes) and of lexical frequency. It also mentions possible ‘islands of reli-
ability’ (for expressions such as I think) which might help oral production in that learners may
use these as ‘buying time’ devices or structuring features. The analysis also indicates different
learning patterns for some sets of phonemes (with /u/~/u:/ being acquired later than /i/~/i:/). An
analysis of the perception and the categorization of the phonemic realizations by learners them-
selves was also carried out. The paper tests how accurately the data found is predicted by known
Second Language Acquisition (SLA) theoretical frameworks such as Flege’s Speech Learning
Model (1995). The results of the study show that the assumptions fail to predict differences in
learning patterns.
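For readers unfamiliar with Bark-based vowel normalization, the sketch below shows one common way of computing it. The details are assumptions rather than a description of Méli's procedure: formant values in Hz are converted to the Bark scale with Traunmüller's approximation, and the three differences often used in vowel normalization (Z3 − Z1, Z3 − Z2, Z2 − Z1) are then taken. The formant values in the example are invented.

```python
def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark (Traunmueller's approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_differences(f1, f2, f3):
    """Bark-converted formant differences for one vowel token."""
    z1, z2, z3 = (hz_to_bark(f) for f in (f1, f2, f3))
    return {"Z3-Z1": z3 - z1, "Z3-Z2": z3 - z2, "Z2-Z1": z2 - z1}


# Invented formant values (Hz) for a high front vowel token.
token = bark_differences(f1=300.0, f2=2300.0, f3=3000.0)
print({name: round(value, 2) for name, value in token.items()})
```

Because the measure relies on differences between formants of the same token, it reduces the effect of speaker-specific vocal tract size, which makes learner and native realizations more directly comparable.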
The last corpus presented in this section is rather innovative. The Multilingual, Traditional,
Immersion, and Native Corpus (MulTINCo: Meunier, Hendrikx, Bulon, Van Goethem, & Naets,
accepted) includes both learner and native data types. It contains:
• learner data for two target languages: Dutch and English;
• learners’ spoken and written, longitudinal data collected in two different educational settings: Content and Language Integrated Learning – CLIL – and traditional foreign language
classes;
• data produced by the same learners in their L1 (on similar task types);
• comparable native data from native speakers of the learners’ L2 of about the same age;
• a variety of background variables (age, gender, home language, amount of L2 curricular and
extracurricular input, etc.).
Van Mensel et al. (2020) explore the impact of formal and informal input on learners’
variability in writing. The study compares two target language conditions (Dutch and English) in two
different instructed settings, namely Content and Language Integrated Learning (CLIL)
and traditional foreign language learning classes (non-CLIL) in French-speaking Belgium. The
study is part of a large project whose main objective is to investigate the influence of CLIL – and
other educational, motivational, and cognitive factors – on the acquisition of a foreign language.
Over 900 French-speaking primary and secondary school pupils learning English or Dutch in
CLIL and non-CLIL settings were followed longitudinally for two consecutive school years
(2015–2016 and 2016–2017) and various data types were collected. Using regression models to
check whether the type and amount of input that learners are exposed to (see note 6) correlate with
proficiency levels, the study shows that CLIL is a significant predictor of L2 outcomes for both target
languages, but that the relative impact of formal and informal input differs depending on the
target language. The results also highlight the importance of the L2 status in research on CLIL,
because different L2s can yield different results.
Future Directions
As illustrated in section 2, LCR has constantly questioned its role, methods, and goals, and has, as a
result, evolved remarkably over the last 30 years. It is almost impossible to accurately predict what
lies ahead of us in, say, the next 30 years, but I have identified two promising areas.
The first one is related to the very status of LCR, which has always been considered as a
product-oriented approach and which actually has the potential to combine both process and product
orientations in the future. Mäntylä et al. (2018), for instance, show how the use of keystroke
logging software (Strömqvist et al., 2006) can not only help researchers better understand the writing
process but also – and perhaps even more importantly here – lead to a reconsideration of what is
actually perceived or stored in the learner’s mind as a formulaic sequence (Wray, 2002). Previous
LCR research on formulaic language focused on ‘learner-external’ sequences (viz. the linguistic
patterns produced). The recording of keyboard activities during the writing process on computers
gives researchers unprecedented access to ‘learner-internal’ patterns (Myles & Cordier, 2017). A
careful analysis of the pauses between words, for instance, can reveal difficulties in accessing a
formulaic sequence but also the fact that the string of words considered as a formulaic sequence on
the basis of learner-external patterns may not have psycholinguistic reality in the learner’s internally
stored holistic lexicon (Durrant, 2013). Studies like this one are only the first steps towards studies
that integrate both product- and process-oriented approaches. New technologies and digital tools
make it possible to record processing ‘moves’ which will, hopefully, be integrated in LCR in the
future and hence help researchers revisit some of the LCR findings from a new perspective.
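The kind of pause analysis described above can be sketched as follows. This is a minimal illustration, not the procedure used by Mäntylä et al. (2018) or by any particular keystroke logging tool. The log format (one character plus a timestamp in milliseconds per key press), the function names and the 2000 ms threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Keystroke:
    char: str      # character produced by the key press
    time_ms: int   # timestamp of the key press, in milliseconds


def pauses_before_words(log: List[Keystroke]) -> List[Tuple[str, int]]:
    """Return (word, pause in ms between the previous key press and the word's first key)."""
    results = []
    current, pause, prev_time = "", 0, None
    for key in log:
        if key.char == " ":
            if current:
                results.append((current, pause))
                current = ""
        else:
            if not current:  # first character of a new word
                pause = 0 if prev_time is None else key.time_ms - prev_time
            current += key.char
        prev_time = key.time_ms
    if current:
        results.append((current, pause))
    return results


def has_internal_pause(word_pauses, sequence, threshold_ms=2000):
    """Check whether the first occurrence of `sequence` was typed with a long pause inside it.

    The threshold is an arbitrary, hypothetical cut-off.
    """
    words = [w for w, _ in word_pauses]
    n = len(sequence)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == list(sequence):
            # Only pauses before the non-initial words fall inside the sequence.
            return any(p >= threshold_ms for _, p in word_pauses[i + 1:i + n])
    raise ValueError("sequence not found in the log")


# Invented toy log: the learner types "on the" with a long hesitation before "the".
toy_log = [Keystroke(c, t) for c, t in
           [("o", 0), ("n", 100), (" ", 200), ("t", 2700), ("h", 2800), ("e", 2900)]]
print(has_internal_pause(pauses_before_words(toy_log), ["on", "the"]))
```

A sequence that looks formulaic in the finished text but is repeatedly typed with long internal pauses would be a candidate for the ‘learner-external only’ category discussed above.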
The second promising area is the interest in – and need for – collecting more interactive data
types. Whilst some learner corpora already include samples of language in interaction (such as
Telekorp: Belz & Vyatkina, 2008), they still constitute a minority of the data types collected. The
interactive nature of communication is being increasingly stressed in SLA circles, with a focus
on ecological approaches (Kramsch & Vork Steffensen, 2008; Thorne, 2013) and
multilingualism (May, 2013; Ortega, 2013). As explained by Van Lier (2010, p. 2), “Ecological approaches
focus primarily on the quality of learning opportunities, of classroom interaction and of
educational experience in general. Important pedagogical principles in an ecological approach are the
creation of ecologically valid contexts, relationships, agency, motivation and identity”. Such an
approach calls for more attention to be paid to the ecological value of tasks given to learners. As
for the multilingual turn, it puts multilingualism at the forefront, thereby opening up new avenues
for intrinsically multilingual corpora where, for instance, learners with different mother tongues
can interact and translanguage. Instead of the rather homogeneous corpora of L2‘x’ with only
one ‘x’, one could collect learner corpora of L2‘xs’. This focus on interaction is also found in
pedagogical circles where official curricular documents clearly distinguish the features of spoken
and written competences with or without interaction and also insist on the non-verbal strategies
that are key to interactive competence. Learner corpora like the Giessen-Long Beach Chaplin
Corpus (GLBCC: Jucker, Müller, & Smith, 2006) or the Multimedia Adult ESL Learner Corpus
(MAELC: Reder, Harris, & Setzler, 2003) should inspire future learner corpus collections. The
GLBCC consists of transcribed interactions between native English, ESL, and EFL speakers.
As for MAELC, it contains videotaped classroom interactions associated with written materials
(copies of classroom written materials, student work, teacher logs, and teacher reflections). The
corpus includes materials from four years of adult ESL classes ranging from beginning to
upper-intermediate proficiency with over 3600 hours of classroom interaction recorded by six cameras
and multiple microphones. The corpus has been partly coded for participation pattern and
activity, and portions of these classes have been transcribed, targeting student language during pair
work. As explained on the MAELC website, “examinations of dyadic interaction can focus on
interactions between students from different first language backgrounds as well as on
developmental studies of individual students who are recorded throughout several terms of study”. It is
also of primary importance to reach out and collect data from less favored groups of learners
(such as migrants) in order for LCR to be representative of all types of learners both in formal
(instructed) and informal (non-instructed) contexts.
Further Reading
Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research.
Cambridge: Cambridge University Press.
This volume is the first handbook entirely devoted to LCR. It offers a detailed overview of the field and
of the affordances of learner corpora.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149.
This article focusses on the formulaic/phraseological nature of language, one of the key aspects of
language that corpus-linguistic methodology has helped reveal.
Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus
research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163.
This special issue presents five studies addressing a topic that has received much attention in SLA, viz.
tense and aspect. It provides a fresh LCR perspective on tense and aspect issues.
Related Topics
Chapters 2, 6, 7, 8, 9, and 11.
Notes
1 Whilst instances of informal interactions may be more likely, few learners spontaneously decide to write
an argumentative or literary essay.
2 Corpus data are ideally continuous (i.e. consisting of longer stretches of discourse, not single words,
phrases or sentences) and contextualized (i.e. not produced in isolation but in the context of a meaning-
ful, set task).
3 Centre for English Corpus Linguistics (date of access 24th September 2018): Learner Corpora around the
World. Louvain-la-Neuve: Université Catholique de Louvain.
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
4 See https://www.cambridge.org/us/cambridgeenglish/catalog/english-academic-purposes/grammar-and-beyond/
5 Put simply, the Bark Difference Metric Method improves acoustic measurements by making it possible
to filter out physiological differences in pronunciation while retaining sociolinguistic differences.
6 Computed thanks to a proxy gathering various types of information collected through questionnaires on
input type and frequency.
References
Ädel, A. (2008). Involvement features in writing: Do time and interaction trump register awareness? In G.
Gilquin, S. Papp, & M. B. Díez-Bedmar (Eds.), Linking up contrastive and learner corpus research (pp.
35–53). Amsterdam: Rodopi.
Aijmer, K. (2009). Corpora and language teaching. Amsterdam: John Benjamins.
Axelsson, M. W. (2000). USE – The Uppsala student English corpus: An instrument for needs analysis.
ICAME Journal, 24, 155–157.
Baayen, H., Davidson, D., & Bates, D. (2008). Mixed-effects modeling with crossed random effects for
subjects and items. Journal of Memory and Language, 59(4), 390–412.
Belz, J., & Vyatkina, N. (2008). The pedagogical mediation of a developmental learner corpus for classroom-
based language instruction. Language Learning and Technology, 12(3), 33–52.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and
written English. Harlow: Pearson Education Limited.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341–345.
Burnard, L., & McEnery, T. (Eds.) (2000). Rethinking language pedagogy from a corpus perspective:
Papers from the third international conference on teaching and language corpora. Frankfurt: Peter Lang
Publishing.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The
Cambridge handbook of learner corpus research (pp. 35–56). Cambridge: Cambridge University Press.
doi:10.1017/CBO9781139649414.003.
Callies, M., & Paquot, M. (2015). Learner corpus research: An interdisciplinary field on the move.
International Journal of Learner Corpus Research, 1(1), 1–6. doi:10.1075/ijlcr.1.1.00edi
McIntosh, C. (2008). Cambridge advanced learner's dictionary. Cambridge: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching,
assessment. Cambridge: Cambridge University Press.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language researchers.
Second Language Research, 28(3), 369–382.
Deshors, S. C. (2014). A case for a unified treatment of EFL and ESL: A multifactorial approach. English
World-Wide, 35(3), 277–305.
Deshors, S. C., Götz, S., & Laporte, S. (Eds.) (2018). Rethinking linguistic creativity in non-native Englishes
(Volume 98). John Benjamins Publishing Company.
De Smet, A., Mettewie, L., Galand, B., Hiligsmann, P., & Van Mensel, L. (2018). Classroom anxiety and
enjoyment in CLIL and non-CLIL: Does the target language matter? Studies in Second Language
Learning and Teaching, 8(1), 47–71. doi:10.14746/ssllt.2018.8.1.3
Durrant, P. (2013). Formulaicity in an agglutinating language: The case of Turkish. Corpus Linguistics and
Linguistic Theory, 9(1), 1–38.
Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford: Oxford University Press.
Flege, J. E. (1995). Second-language speech learning: Theory, findings and problems. In W. Strange (Ed.),
Speech perception and linguistic experience: Theoretical and methodological issues (pp. 229–273).
Timonium: York Press.
Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus
research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163
Gentil, G., & Meunier, F. (2018). A systemic functional linguistic approach to usage-based research and
instruction: The case of nominalization in L2 academic writing. In A. E. Tyler, L. Ortega, M. Uno, & H.
I. Park (Eds.), Usage-inspired L2 instruction. Researched pedagogy (pp. 267–289). Amsterdam: John
Benjamins.
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier
(Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge: Cambridge
University Press.
Gilquin, G., & Gries, S. (2009). Corpora and experimental methods: A state-of-the-art review. In G. Gilquin
(Ed.), Corpora and experimental methods [Special Issue]. Corpus Linguistics and Linguistic Theory,
5(1), 1–26.
Granger, S. (1994). The learner corpus: A revolution in applied linguistics. English Today, 10(3), 25–33.
doi:10.1017/S0266078400007665
Granger, S. (1996). From CA to CIA and back: An integrated approach to computerized bilingual and
learner corpora. In K. Aijmer, B. Altenberg, & M. Johansson (Eds.), Languages in contrast: Text-
based cross-linguistic studies. Lund studies in English (vol. 88, pp. 37–51). Lund: Lund University
Press.
Granger, S. (2008). Learner corpora in foreign language education. In N. Van Deusen-Scholl & N. H.
Hornberger (Eds.), Encyclopedia of language and education (vol. 4, pp. 337–351). Boston: Springer.
Granger, S. (2009). The contribution of learner corpora to second language acquisition and foreign language
teaching: A critical evaluation. In K. Aijmer (Ed.), Corpora and language teaching (pp. 13–32).
Amsterdam: John Benjamins.
Granger, S. (2015). Contrastive interlanguage analysis: A reappraisal. International Journal of Learner
Corpus Research, 1(1), 7–24. doi: 10.1075/ijlcr.1.1.01gra.
Granger, S., Dagneaux, E., & Meunier, F. (2002a). International corpus of learner English. Handbook and
CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (2009). International corpus of learner English.
Version 2 (Handbook + CD-ROM). Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). International corpus of learner
English. Version 3 (Handbook + web interface). Louvain-la-Neuve: Presses Universitaires de Louvain.
Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research.
Cambridge: Cambridge University Press.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.) (2002b). Computer learner corpora, second language
acquisition and foreign language teaching. Amsterdam: John Benjamins.
Gries, S. (2009). What is corpus linguistics? Language and Linguistics Compass, 3(5), 1225–1241.
doi:10.1111/j.1749-818X.2009.00149.x
Gries, S. (2013). Statistical tests for the analysis of learner corpus data. In A. Díaz-Negrillo, N. Ballier, & P.
Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 287–310). Amsterdam:
John Benjamins.
Gries, S. (2015). Statistics for learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.),
The Cambridge handbook of learner corpus research (pp. 159–182). Cambridge: Cambridge University
Press. doi:10.1017/CBO9781139649414.008
Huat, C. M. (2012). Learner corpora and second language acquisition. In K. Hyland, C. M. Huat, & M.
Handford (Eds.), Corpus applications in applied linguistics (pp. 191–207). London: Continuum.
Jäschke, K. (2016). The dative alternation in English as a second language. PhD dissertation. Düsseldorf:
Heinrich-Heine-Universität. Retrieved from https://d-nb.info/1135382433/34
Jucker, A., Müller, S., & Smith, S. (2006). GLBCC (Giessen - Long Beach Chaplin Corpus). Oxford text
archive. Retrieved from http://hdl.handle.net/20.500.12024/2506. See also http://ota.ox.ac.uk/desc/2506.
Kramsch, C., & Vork Steffensen, S. (2008). Ecological perspectives on second language acquisition and
socialization. In N. H. Hornberger (Ed.), Encyclopedia of language and education (pp. 2595–2606).
Boston: Springer.
Longman Dictionary of Contemporary English (Fifth edition). (2009). Harlow: Pearson Education Limited.
Mäntylä, K., Lahtinen, S., Vaakanainen, V., & Mäkilä, M. (2018). Using keystroke logging to analyse the
writing process – tools for teaching writing. EuroCALL 2018 (book of abstracts, p. 27), Jyväskylä,
August 22.
May, S. (Ed.) (2013). The multilingual turn: Implications for SLA, TESOL, and bilingual education. London:
Routledge.
McCarthy, M. (2008). Accessing and interpreting corpus information in the teacher education context.
Language Teaching, 41(4), 563–574.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book.
London: Routledge.
Méli, A. (2013). Phonological acquisition in the French–English interlanguage: Rising above the phoneme.
In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner
corpus data (pp. 207–226). Amsterdam: John Benjamins.
Meunier, F. (2010). Learner corpora and English language teaching: Checkup time. Anglistik: International
Journal of English Studies, 21(1), 209–220.
Meunier, F. (2015). Developmental patterns in learner corpora. In S. Granger, G. Gilquin, & F. Meunier
(Eds.), The Cambridge handbook of learner corpus research (pp. 379–400). Cambridge: Cambridge
University Press.
Meunier, F. (2016). Introduction to the LONGDALE project. In E. Castello, K. Ackerley, & F. Coccetta
(Eds.), Studies in learner corpus linguistics: Research and applications for foreign language teaching
and assessment (pp. 123–126). Berlin: Peter Lang Publishing. Retrieved from
https://uclouvain.be/en/research-institutes/ilc/cecl/longdale.html
Meunier, F. (2018). Promoting TPACK and professional learning communities: Focus on teaching and
learning multiword units. EuroCALL, Jyväskylä, August 23. doi:10.13140/RG.2.2.26823.14244.
Retrieved from https://www.researchgate.net/publication/327237628_Promoting_TPACK_and_professional_learning_communities_focus_on_teaching_and_learning_multiword_units_EuroCALL_conference_paper_Jyvaskyla_Finland_23_August_2018
Meunier, F., Hendrikx, I., Bulon, A., Van Goethem, K., & Naets, H. (2020). MulTINCo: Multilingual traditional
immersion and native corpus. Better-documented multi-literacy practices for more refined SLA studies.
International Journal of Bilingual Education and Bilingualism. doi:10.1080/13670050.2020.1786494
Meunier, F., & Littré, D. (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum experimental
data’ approach. The Modern Language Journal, 97(1), 61–76.
Mukherjee, J. (2009). The grammar of conversation in advanced spoken learner English: Learner corpus
data and language-pedagogical implications. In K. Aijmer (Ed.), Corpora and language teaching (pp.
203–230). Amsterdam: John Benjamins.
Myles, F., & Cordier, C. (2017). Formulaic sequences (FS) cannot be an umbrella term in SLA: Focusing
on psycholinguistic FSs and their identification. Studies in Second Language Acquisition, 39(1), 3–28.
doi:10.1017/S027226311600036X
Ortega, L. (2013). SLA for the 21st century: Disciplinary progress, transdisciplinary relevance, and the bi/
multilingual turn. Language Learning, 63(1), 1–24.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149.
Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus
research. International Journal of Learner Corpus Research, 3(1), 61–94.
Ployhart, R. E., & Vandenberg, R. J. (2010). Longitudinal research: The theory, design, and analysis of
change. Journal of Management, 36(1), 94–120.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks: Sage Publications, Inc.
Reder, S., Harris, K., & Setzler, K. (2003). A multimedia adult learner corpus. TESOL Quarterly, 37(3),
65–78. Retrieved from http://www.labschool.pdx.edu/maelc_access.html.
Römer, U. (2009). Corpus research and practice: What help do teachers need and what can we offer? In K.
Aijmer (Ed.), Corpora and language teaching (pp. 83–98). Amsterdam: John Benjamins.
Shirato, J., & Stapleton, P. (2007). Comparing English vocabulary in a spoken learner corpus with a native
speaker corpus: Pedagogical implications arising from an empirical study in Japan. Language Teaching
Research, 1(4), 393–412.
Strömqvist, S., Holmqvist, K., Johansson, V., Karlsson, H., & Wengelin, Å. (2006). What keystroke logging
can reveal about writing. In K. P. H. Sullivan & E. Lindgren (Eds.), Computer keystroke logging and
writing: Methods and applications (pp. 45–71). Amsterdam: Elsevier.
Tan, M. (2005). Authentic language or language errors? Lessons from a learner corpus. ELT Journal, 59(2),
126–134.
Thorne, S. (2013). Language learning, ecological validity, and innovation under conditions of superdiversity.
Bellaterra Journal of Teaching and Learning Language and Literature, 6(2), 1–27.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. Amsterdam: John Benjamins.
Van Lier, L. (2010). The ecology of language learning: Practice to theory, theory to practice. Social and
Behavioral Sciences, 3, 2–6.
Van Mensel, L., Bulon, A., Hendrikx, I., Meunier, F., & Van Goethem, K. (2020). Effects of input on L2
writing skills in English and Dutch: CLIL and non-CLIL learners in French-speaking Belgium. Journal
of Immersion and Content-Based Language Education.
Wilson, J. (2013). Technology, pedagogy and promotion. How can we make the most of corpora and
data-driven learning (DDL) in language learning and teaching? The Higher Education Academy. Retrieved
from https://www.heacademy.ac.uk/system/files/corpus_technology_pedagogy_promotion2.pdf
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.