
Introduction to Learner Corpus Research

Fanny Meunier
Introduction
Work in Learner Corpus Research (LCR) started around the 1980s as “an offshoot of corpus linguistics”
(Granger et al., 2015, p. 1). Corpus linguistics and LCR share a set of common features, among
which is the use of corpora and corpus tools to analyze language. A corpus is defined by McEnery et
al. (2006, p. 5) as a “collection of machine-readable authentic texts (including transcripts of spoken
data) which is sampled to be representative of a particular language or language variety”. A learner
corpus is thus a specific type of corpus which, to follow up on McEnery et al.’s definition, can
broadly be defined as a collection of machine-readable texts consisting of representative samples
of the language written and/or spoken by learners of an additional language (viz. not their mother
tongue, but a foreign/second/nth target language). LCR uses learner corpus data as its main data
source. The results of learner corpus studies typically serve two main purposes:
• inform SLA research,
• provide useful input for applied projects (including the creation or improvement of teaching
  materials/approaches, or the training/development of Natural Language Processing tools).
A comparison of the oft-cited definitions of LCR (see Gilquin, 2015 for more details) reveals
that one of the key features of learner corpora is that the language they contain is meant to be
as authentic as possible and is often defined as (near-) natural. As explained by Granger (2008,
p. 337), “the term near-natural is used to highlight the ‘need for data that reflects as closely as
possible ‘natural’ language use (i.e. language that is situationally and interactionally authentic)
while recognizing that the limitations facing the collection of such data often obligate researchers
to resort to clinically elicited data (for example, by using pedagogic tasks (Ellis & Barkhuizen,
2005, p. 7))”. As cases of purely spontaneous oral or written learners’ productions are rare¹ – or,
when they take place, cannot easily be ‘spontaneously collected’ – pedagogic tasks serve as the
main prompts to (near)-natural learner language productions.
Another key feature is that the texts² included in learner corpora have been selected on the
basis of a number of criteria or variables related to, among others:
• the learners themselves (e.g. target language, mother tongue, proficiency level),
• the type of communicative task (e.g. written/oral communication, descriptive/persuasive/
  narrative/expository writing/narrative, informal/formal level),
• the contextual conditions of language production or task setting (e.g. interactive tasks,
  computer-mediated communication, use of reference tools or not).
The criteria/variables listed above are typically used as metadata to organize the electronic stor-
age of the data in large databases that can later be queried. Researchers can, for instance, select
sub-sections of the data collected (e.g. only texts spoken by female learners of German as a
foreign language/from a lower beginner level/collected during an informal discussion). The vari-
ables can later serve as dependent/independent/predictive variables in the linguistic analyses car-
ried out (see section 3 for more details).
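The metadata-based selection described above can be sketched in a few lines of Python. This is a minimal illustration only: the field names, the `select_subcorpus` helper and the four sample records are invented for the example and do not correspond to the storage format of any particular learner corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerText:
    """Metadata for one learner production (all field names are hypothetical)."""
    text_id: str
    target_language: str
    l1: str
    gender: str
    proficiency: str
    task: str
    mode: str  # "spoken" or "written"


def select_subcorpus(corpus, **criteria):
    """Return the texts whose metadata match every keyword criterion."""
    return [
        text for text in corpus
        if all(getattr(text, field) == value for field, value in criteria.items())
    ]


# Invented sample records, for illustration only.
corpus = [
    LearnerText("t001", "German", "French", "female", "lower beginner", "informal discussion", "spoken"),
    LearnerText("t002", "German", "French", "male", "lower beginner", "informal discussion", "spoken"),
    LearnerText("t003", "German", "Dutch", "female", "advanced", "argumentative essay", "written"),
    LearnerText("t004", "English", "French", "female", "lower beginner", "informal discussion", "spoken"),
]

# Spoken texts by female lower-beginner learners of German in informal discussions.
subcorpus = select_subcorpus(
    corpus,
    target_language="German",
    gender="female",
    proficiency="lower beginner",
    task="informal discussion",
    mode="spoken",
)
print([text.text_id for text in subcorpus])
```

The same metadata fields could later be exported as columns of a table and used as variables in a statistical analysis.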
The learners’ initial productions (often called ‘raw’ texts) are frequently annotated further to
give researchers access to richer data. The texts can be annotated automatically with the
help of natural language processing tools, edited with the help of semi-automated tools, or
annotated fully manually. Some examples of typical linguistic annotations include:
• automatic part-of-speech tagging: each word in the corpus is attributed a part-of-speech
  (noun, verb, adjective, etc.) by fully automatic part-of-speech tagging
  software (see Chapters 5 and 6, this volume, for more information);
• computer-aided error annotation (CEA): as learner corpora are produced by learners, some
  researchers may be interested in spotting areas of difficulty that learners have in producing
  an L2. Annotating these aspects makes it possible to subsequently focus on them to help
  foster learners’ proficiency. Errors/infelicities in the corpus are first spotted by researchers
  who then use an editor to insert codes in the corpus (e.g. a plural determiner followed by a
  singular common noun can receive a ‘noun number’ error code). More details on CEA can
  be found in Chapter 7.
As the two examples provided above show, annotations may include fully automatic tools
(such as part-of-speech taggers, semantic taggers, or syntactic parsers), but also semi-auto-
matic annotation tools requiring human intervention before the analysis can be done. Some
annotations can also be done fully manually by researchers (by inserting codes in the text
using text processing tools, for example) when the analysis cannot be automated, as would
be the case for the inclusion of non-verbal comments in transcripts of videoed interactions
([laughs], [unfilled pauses], [gestures], [contextual comments], etc.). This type of tagging is
often referred to as problem-oriented annotation/tagging, viz. the manual annotation by the
researcher of any feature of interest.
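To make the idea of inserted error codes more concrete, the sketch below shows how such codes might be retrieved once they are in the text. The inline format `{CODE|erroneous form|suggested correction}` and the codes `NN` (noun number) and `SP` (spelling) are assumptions made for this illustration; they are not the tagset or syntax of any existing error annotation system.

```python
import re
from collections import Counter

# Hypothetical inline format: {CODE|erroneous form|suggested correction}
ERROR_TAG = re.compile(
    r"\{(?P<code>[A-Z]+)\|(?P<form>[^|{}]*)\|(?P<correction>[^|{}]*)\}"
)


def extract_errors(annotated_text):
    """Return (code, erroneous form, correction) triples in order of appearance."""
    return [
        (m["code"], m["form"], m["correction"])
        for m in ERROR_TAG.finditer(annotated_text)
    ]


def strip_annotations(annotated_text):
    """Recover the learner's original wording by keeping only the erroneous form."""
    return ERROR_TAG.sub(lambda m: m["form"], annotated_text)


# Invented example sentence with two annotated errors.
sample = "I bought {NN|these book|these books} and did not {SP|recieve|receive} a bag."

errors = extract_errors(sample)
code_counts = Counter(code for code, _, _ in errors)

print(errors)
print(code_counts)
print(strip_annotations(sample))
```

Keeping the erroneous form and the correction side by side means that both the raw learner text and a corrected version can be regenerated from the same annotated file.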
A last key feature is that learner corpora, like any other type of corpus, can be queried using
corpus tools such as concordancers (see Chapter 6, this volume), which can be used to:
• extract word lists, word combinations, tags, keywords or annotations,
• display occurrences of words/phrases/tags in the selected corpus,
• compare different subcorpora in terms of keywords, frequency distribution of items, etc.
Given the space limitations of an introductory chapter, it is not really possible to describe in
detail all the tools that can be used to annotate or query learner corpora. I thus warmly recom-
mend the Tools for Corpus Linguistics webpage to readers (see https://corpus-analysis.com/) as
it oers an impressive list of 228 corpus tools, each described in terms of the following aspects:
name, description, categories, platform, and pricing, plus a link to the tool. I also refer readers
to the software index of the Handbook of Learner Corpus Research (Granger et al., 2015) as it
contains a list of over 80 tools (annotation tools, DDL tools, statistical packages, text retrieval
tools, iCALL and CALL packages, etc.) whose concrete use, description and illustration can be
found in the handbook.
As can be seen from the list of key features above, technology is clearly part and parcel of
LCR. Thanks to giant strides in computer technology in the last quarter of the 20th century, it
became possible to collect data from much larger cohorts of learners and to use computer software
to assist researchers in the annotation and analysis of the data. The affordances of technology
also made it possible to perform data analyses that were either previously not conceivable or,
at least, not feasible in a reasonable timeframe and at a reasonable cost. As the second section of
this chapter will show, LCR has evolved remarkably through its three decades of existence. Just
as learners typically go through stages of development in their learning of an additional language,
LCR also evolved from a novice field (filled with the excitement that usually goes with novelty)
towards more competent and reflective practices. This evolution has impacted most of the core
features and issues in LCR, as will be shown in the next section.
Core Issues
Size, Collection, Variables, and Analysis: Limits and Strengths
Access to large (for the time) electronic learner corpora in the 1990s led to a revolution in the
way learner language was analyzed and described. The first learner corpora that exceeded one
million word tokens in size were collected, and the new options offered by automatic corpus
analysis tools (word lists ordered in decreasing order of frequency, retrieval of words in contexts
through concordancers, automatic part-of-speech tagging, etc.) offered unprecedented
insights into learner language. The urge to get access to previously inaccessible frequency
information led to a ‘descriptive fever’ (analysis of productions by numerous learners; lists of
the top n words in a corpus; frequency of errors; most frequently used verbs; overused and/
or underused linguistic items, often in relation to an L1 corpus etc.). The term ‘fever’ is not
used here in any derogatory way but simply points to the focus of interest at the time, even if
cautionary tales were already given. Granger (1994, p. 27) warned readers that “quantitative
data should not be regarded as an end in itself” but rather “as a springboard for a qualitative
investigation of the data” and of its patterns of use. Such cautionary tales notwithstanding, it
must be acknowledged that numerous publications back then were essentially descriptive with
frequency lists being provided and compared, with – in many cases – no clear reference to SLA
theories, except for the sometimes simplistic reference to transfer. This led some researchers
to consider learner corpus linguistics as synonymous with distributional number crunching,
which – despite the limitations mentioned above – also constituted an unfair shortcut. Granger
(2009) responded to criticism levelled against LCR and the lack of collaboration between LCR
and SLA by pointing out that one of the main assets of the former is that it brings to the SLA field
a much wider empirical basis than previously available. She also explained that learner corpora
which have been collected on the basis of strict, well-described criteria and which have been
stored in easily queryable databases contain data from hundreds and sometimes thousands of
learners, which greatly enhances the representativeness of the data. It also makes it possible to
control the many variables that affect learner production.
Over the years, practices in LCR have also evolved significantly, moving from a focus on one
main variable (mother tongue background) to studies analyzing the effects of and/or relationships
between a much wider range of variables. Examples include planning time (Ädel, 2008), time of
exposure/learning (Meunier & Littré, 2013), genre (Gentil & Meunier, 2018), or a combination
of variables such as learning context and emotional aspects (De Smet et al., 2018).
Overall, the initial criticism levelled against LCR – be it fair or not – proved very fruitful
as it prompted learner corpus researchers to explicitly verbalize the numerous advantages of
LCR and move the field further. Gries (2009, p. 2), for instance, argued that corpus linguistic
methods are “a method just as acceptability judgments, experimental data, etc.” are and that
“linguists of every theoretical persuasion can use corpus data”. He also explained that usage-
based cognitive-linguistic theories are particularly compatible with corpus linguistics methods,
thereby throwing the spotlight on some of the specific strengths of LCR. The constant questioning
and reassessment of LCR led to a more reflective and competent practice in LCR, also
prompting the collection of a much larger variety of learner corpus types, which subsequently
opened up new avenues for analysis. Whilst the first learner corpora were mainly targeting
written L2 English by relatively advanced learners, typically university students, a much larger
range of target languages and text types has since been collected. The ‘Learner corpora around
the world’ webpage³ maintained by the Centre for English Corpus Linguistics at the Université
catholique de Louvain pays tribute to this variety of target languages (Arabic, French, German,
Korean, Spanish, etc.), text types and production conditions (exam essays, argumentative and
literary essays, letters, diaries, picture descriptions, book reviews, monologues, dialogues,
computer-mediated communication, mails, translations, etc.). Other welcome advances have
been made in terms of:
prociency levels (covering the whole range of prociency levels, from beginners to
advanced) and types of learners (children, teenagers, adults, non-native ‘learners’ but also
non-native ‘users’ including teachers, heritage speakers, translators, etc.);
variety of research designs (cross-sectional, quasi-longitudinal, longitudinal),
The publication of the first handbook of Learner Corpus Research (Granger et al., 2015) and the
launch of the first journal entirely devoted to LCR, the International Journal of Learner Corpus
Research (IJLCR) pay tribute to the variety of current LCR studies addressing areas as diverse
as interdisciplinarity (Callies & Paquot, 2015), linguistic innovations and creativity in non-native
Englishes (Deshors et al., 2018), and study quality (Paquot & Plonsky, 2017).
Other developments include the use of more complex statistical techniques to interpret quan-
titative data (see e.g. Gries, 2013) and the popularization of mixed-methods designs to comple-
ment LCR methods and studies (see Gilquin & Gries, 2009; Meunier & Littré, 2013). One of the
limitations of LCR is that some of the language features studied by researchers may not naturally
occur frequently enough in unconstrained, open-ended (semi-) authentic production. The collection
and analysis of other data types to triangulate research results and offer converging or
diverging evidence is then particularly useful. Such data types may include experimental data,
questionnaires, semi-guided interviews, think-aloud protocols or ethnographic approaches (also
see Chapter 10, this volume).
The (Native Speaker) Norm/Myth?
Native corpus studies have demonstrated their added value in making it possible to compare dif-
ferent varieties of the same language, both synchronically and diachronically, and in providing
a more balanced/refined description of languages. For instance, books like Brief Grammar for
English (attributed to William Bullokar in 1586 and which aimed to show that English was as
rule-bound as Latin) were replaced by thick and detailed accounts like the Longman Grammar of
Spoken and Written English (Biber et al., 1999) where the grammatical specificities of various
text types/registers were minutely described.
The power of corpus data for comparing different language varieties is also a central asset
of LCR. As Granger (2015, p. 8) explains, two types of comparison appeared to be particularly
worthwhile in LCR:
• a comparison with native language (NL), seen as the ultimate attainment of learning a
  foreign/second language;
• a comparison of one sample of learner language (IL, for interlanguage) with other samples of
  learner language, particularly from learners with different mother tongue backgrounds, for
  example, E2F (the English produced by learners with French as an L1) vs E2G (the English
  produced by learners with German as an L1) in Figure 3.1.
This double entry approach to LCR, conceptualized by Granger in 1996, was labelled Contrastive
Interlanguage Analysis, or CIA.
Whilst the IL vs IL approach has always been promoted and accepted, “CIA has been sub-
jected to a range of criticism, most targeted at the L1/L2 branch” (Granger, 2015, p. 13), which
prompted a new version of the methodology, abbreviated as CIA² (see Figure 3.2 for a visual
representation). Put briefly, the reference to native speaker language was interpreted as the recognition
of one idealized native speaker norm and even labelled as “imperialistic assumptions about
the ownership of English” (Tan, 2005, p. 128). This was an unfair criticism, according to Granger
(2015, p. 15), as plenty of L1 standards (such as British, American, Australian, Canadian, Hong
Kong, India, Singapore, Sri Lanka, etc.) have been used as reference corpora for CIA studies.
In CIA², new terms have been proposed to avoid misunderstandings: RLV (for Reference
Language Varieties) and ILV (for Interlanguage Varieties). The use of RLV points to the large
number of dierent reference points against which learner data can be set (inner circle varieties
such as British or American English, outer circle varieties such as Indian or Singapore English),
as well as corpora of competent L2 user data, English as a Lingua Franca. As for ILVs, they refer
to learner language varieties, given the “highly variable nature of interlanguage” (Granger 2015,
p. 18). Comparing an ILV with an RLV makes it possible to better understand the processes at
play in the acquisition of that specific ILV. A comparison of various ILVs (e.g. learners of L2
English whose mother tongues are French, Dutch, Italian, Greek, or Finnish) can help detect
potential universal paths of acquisition versus L1 induced phenomena. A comparison of the language
produced in various modes, genres or registers by the same learners (e.g. written
and oral productions by the same learners) can help researchers discover mode/genre/register
specific features (also see Chapter 8, this volume, for more details on comparing learner corpora).

Figure 3.1 Contrastive Interlanguage Analysis (Granger 1996). CIA branches into NL vs IL (E1 vs E2) and IL vs IL (E2F, E2G, E2S, E2J).
Figure 3.2 CIA² (Granger 2015, p. 17)
Applied Perspectives in LCR: The Continuous/Contextualized
Text Paradox in LCR
As explained in the introductory section, the applied perspectives of LCR are numerous.
Mukherjee (2009, p. 212) states that “[l]earner corpus analyses always, at least implicitly, raise
the question of what the language-pedagogical implications and applications might be” and
numerous publications have addressed the links between (learner) corpora and pedagogy (see
for instance Burnard & McEnery, 2000; Granger et al., 2002b; Granger, 2008; Aijmer, 2009;
Meunier, 2010). Learner corpora have been used to inform lexicography, syllabus design, materi-
als design, computer-aided language learning and pedagogical approaches such as data-driven
learning. Some learners’ dictionaries (e.g. the Longman Dictionary of Contemporary English
(2009) or the Cambridge Advanced Learner’s Dictionary (2008)) contain error notes intended
to help learners avoid common mistakes. Learner corpora have also been used by textbook writ-
ers to inform the design of tasks addressing typical problems that learners face (see the error
correction/rewriting exercises in the Grammar and Beyond textbook series⁴). Some large-scale
initiatives like the English Profile Project (see http://www.englishprofile.org/ for more details
and related publications) rely on learner corpus data to help teachers and educators understand
what aspects of English are typically learned/acquired at each level of the Common European
Framework of Reference for Languages (Council of Europe, 2001). Learner corpora can also
be used to create data-driven learning activities, i.e. the use of corpora and concordances (typi-
cally keywords presented in their context of production) so that learners can work as language
researchers in awareness-raising activities. Learners can check specific patterns of the use of
keywords in native speaker texts and then compare that use to learners’ productions.
Despite numerous publications on the pedagogical value of learner corpora, a lack of uptake of
corpus-informed pedagogy has been noted (Granger, 2009; McCarthy, 2008; Shirato & Stapleton,
2007; Römer, 2009; Wilson, 2013; Meunier, 2018). Besides the technical problems that are often
put forward, another reason that may explain the lack of uptake of corpus-informed pedagogy can
be found in – what I would label – the contextual/continuous text paradox in LCR. Proponents
of (learner) corpus studies lay strong emphasis on the fact that corpus data is unique in that it
contains continuous stretches of discourse (not single words, phrases or sentences) and consists
of contextualized data (i.e. data not produced in isolation but in the context of a meaningful, set
task). And yet, when it comes to pedagogical applications, the use of learner corpus data rarely
goes beyond the sentence level. This paradox would surely need to be addressed in the future to
pay better tribute to the uniqueness of corpus data and maybe also ensure a clearer understand-
ing on the part of learners and/or teachers of the usefulness of corpus-informed pedagogy (see
Section 6 for some suggestions).
Main Research Methods
As some aspects related to research methods also constitute core issues in LCR, they have been
addressed in the previous section (this includes contrastive interlanguage analysis and its evolution
over time, as well as a number of issues related to annotation and corpus analysis tools). This
third section will thus focus on only two aspects: the basic types of corpus approaches and the
three main study designs typically used in LCR.
Corpus-based and Corpus-driven Approaches
Two basic approaches can be used to analyze a (learner) corpus. The corpus-based approach
uses corpora as a source of information to explore a theory or hypothesis, aiming to validate it,
refute it, or rene it. One concrete example would be the study of grammatical variation in terms
of dative alternation, as speakers have a choice between the prepositional dative construction
(e.g. give something to someone) and the double object construction (e.g. give someone some-
thing). SLA studies put forward dierent hypotheses when it comes to the dative alternation in
L2 English (result of lexicalized verbal preferences, order of acquisition of the two constructions,
etc.). As explained by Jäschke (2016, p. 19) “very few studies explored whether the learners’ use
and judgments of the two variants are governed by the same linguistic factors which have been
found to be predictive for English native speakers”. A corpus-based approach can be used in such
cases to explore learners’ actual use of dative constructions (as was done by Deshors, 2014).
In corpus-driven approaches, the corpus is viewed as a source of inspiration to formulate
hypotheses about language (Tognini-Bonelli, 2001, p. 84-5). “The role of the researcher is to for-
mulate questions and to draw conclusions derived from what corpus data reveal when subjected
to statistical analysis rather than using the data to test a research hypothesis by approaching a
corpus with a number of preconceived ideas” (Callies, 2015, p. 36). One concrete example of
such an approach can be found in Belz and Vyatkina (2008) who investigated the pedagogical
application of a learner corpus study in language teaching and in the developmental analysis of
language learning in an instructed setting. The authors used L1 German data as a baseline against
which learner German data was compared. Using a corpus-driven approach (thanks to a care-
ful qualitative – usage in context – follow-up analysis of frequency lists), they spotted learners’
emerging use of some focal features. These included the use of fixed and creative constructions of
the German modal particles ja, denn, doch, and mal. Such studies contribute to second language
acquisition research via dense documentation of micro-changes in learners’ language use over
time and to the formation of new hypotheses for future research.
Cross-sectional, Quasi-longitudinal, and Longitudinal Research Designs
Earlier studies in LCR were mostly cross-sectional, which means that they examined the lan-
guage behavior of a group or groups of language learners at a single point in their development.
Those studies usually compared one ILV with one or more other ILVs or with an RLV.
With a view to addressing developmental paths in SLA, researchers decided to carry out a
comparison of cross-sectional studies of different groups of learners at different developmental
stages, thereby adopting what Huat (2012, p. 197) calls a pseudo-longitudinal approach. The
learners’ productions are not from the same learners, hence the use of the ‘pseudo’ prefix, and
the ‘time’ variable is somehow measured by proxies such as age or proficiency level. In such
pseudo-longitudinal designs, researchers compare several groups of learners at different levels
of proficiency.
Cross-sectional and pseudo-longitudinal designs do not allow for the analysis of individual
development. Individual variation within each group or sub-group can however be analyzed – as
can group development in pseudo-longitudinal designs.
Longitudinal study designs, in contrast to the two previous types, follow the same individual(s)
over time. Longitudinal research is defined as ‘emphasizing the study of change and containing
at minimum three repeated observations on at least one of the substantive constructs of interest’
(Ployhart & Vandenberg, 2010, p. 97). As explained in Meunier (2015), the collection of longi-
tudinal learner corpus data is time- and cost-consuming, and the analysis can only start when the
entire data collection is over. Other issues include attrition (i.e. the sometimes significant number
of participants dropping out before data-collection points). Such obstacles probably account for
the scarcity of longitudinal studies in the early days of LCR.
In longitudinal study designs, group progress, individual variation within groups and individual
trajectories can be analyzed. This requires the use of, for instance, multi-level modelling
– also referred to as hierarchical linear modelling or mixed-effects models (see Raudenbush
& Bryk, 2002; Baayen et al., 2008; Cunnings, 2012; Gries, 2015). “Multi-level modelling allows
a variety of predictors to be analyzed, with ‘time’ being a key predictor in longitudinal studies: do
participants become more proficient as time goes by and, if so, how strong is the effect of time?
Such statistical modelling can be applied to individuals within groups as well as to individuals as
individuals, by analyzing both endpoints and trajectories” (Meunier, 2015, p. 382).
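The distinction between group progress and individual trajectories can be illustrated with a deliberately simplified sketch. The code below does not fit a multi-level model. It only computes one least-squares slope per learner and then averages the slopes, which is a descriptive two-step approximation of what a mixed-effects model estimates jointly. The learner labels and scores are invented, and each learner has three observations, the minimum mentioned in the definition of longitudinal research quoted above.

```python
from statistics import mean


def slope(times, scores):
    """Ordinary least-squares slope of scores regressed on time for one learner."""
    t_bar, s_bar = mean(times), mean(scores)
    numerator = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, scores))
    denominator = sum((t - t_bar) ** 2 for t in times)
    return numerator / denominator


# Invented data: (collection point, proficiency score) pairs per learner.
observations = {
    "learner_A": [(0, 50), (1, 55), (2, 60)],
    "learner_B": [(0, 40), (1, 40), (2, 40)],
    "learner_C": [(0, 60), (1, 58), (2, 62)],
}

# Individual trajectories: one slope per learner.
individual_slopes = {
    learner: slope([t for t, _ in points], [s for _, s in points])
    for learner, points in observations.items()
}

# Group progress: the average of the individual slopes.
group_slope = mean(individual_slopes.values())

print(individual_slopes)
print(group_slope)
```

A genuine multi-level analysis would estimate the group-level effect of time and the learner-specific deviations in a single model, and would handle missing collection points more gracefully than this per-learner computation.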
With the benet of hindsight, it can be argued that there has been a true qualitative evolution
over time in the research methods used in LCR. The eld broadly evolved from the descriptive
analysis of aggregate data in cross-sectional designs to the use of inferential statistics and a focus
on intra- and inter-learner variability in more complex types of designs (including longitudinal
studies and mixed-methods approaches).
Representative Corpora and Research
The International Corpus of Learner English (ICLE) is probably the exemplar of first-generation
learner corpora. It has been and is still being used massively by learner corpus researchers.
Its first edition (Granger et al., 2002a), resulting from ten years of international collaboration
between numerous universities, contained 2.5 million words of English (mostly argumentative
essays by university students of English) written by learners from eleven different mother
tongue backgrounds and was released in CD-ROM format including an interface to compile
tailor-made subcorpora on the basis of a set of predefined learner or task variables. The second
version (Granger et al., 2009) differed from the first one in scope (larger amount and greater
diversity of the learner data included) and in functionalities. It included a built-in concordancer
and direct links between learner profile information and search results. A third extended and
web-based version of ICLE (Granger et al., 2020) will soon be available. Ädel’s (2008) study,
presented below, is based on ICLE v1.
Ädel (2008) examines variables related to task effects on language use. The research question
addresses how the variable of ‘task setting’ (time and reference sources available) affects the
learners’ writing styles on the written/spoken continuum. She uses the concepts of ‘involvement’
and ‘detachment’ typically used in variationist corpus-based approaches to language, with informal
speech typically characterized as involved (first-person reference, emphatic particles, etc.)
and formal writing as detached (passive constructions, inanimate subjects, etc.). The learners
were university students of English with L1 Swedish who wrote argumentative and expository
essays for the Swedish subcorpus of the International Corpus of Learner English (SWICLE:
Granger et al., 2009) and the Uppsala Student English Corpus (USE; Axelsson, 2000). The over-
all results of the study show that learners exhibit more involvement features in timed than in
untimed essays but less if they have access to source texts. In addition to a possible lack of regis-
ter awareness, the study reveals that the extreme rate of involvement found in SWICLE is rather
linked to both the lack of time that writers have to make the text more written-like and the lack
of model texts to rely on.
The Longitudinal Database of Learner English (LONGDALE) (Meunier, 2016) is one rep-
resentative example of truly longitudinal learner corpora. It currently contains data collected by
five teams: Radboud University Nijmegen (the Netherlands), University of Hannover (Germany),
University of Louvain (Belgium), University of Padua (Italy), and University Paris-Diderot
(France). The same students are followed over a period of at least three years and data collections
are organized at least once per year, with some teams organizing several data collections
per year. The term ‘database’ (rather than ‘corpus’) has been used from the onset of the project
as LONGDALE includes a wide range of data types including argumentative essays, narratives,
and informal interviews, but also more guided types of productions (such as picture descriptions).
Experimental data is also included for some of the subcorpora. The database also includes comprehensive
learner profile information which is gathered during each data collection. The study
presented below focuses on the acquisition of phonology/pronunciation.
Méli (2013) analyzes the segmental realizations of French learners of English with a view to
checking whether ‘perceived dissimilarity’ is a hindrance or an advantage for the L2 acquisition
of sounds that do not exist in the learner’s L1. He focusses on the realizations of the interdental
fricative, as well as some of the phonemic vowel length asymmetries for vowels, {/i/ in French,
/i/-/i:/ in English} and {/u/ in French, /u/-/u:/ in English}. Eighteen students were recorded
longitudinally over three years (the data from years two and four are used in the present study). The
acoustic characteristics of some features of native speech were compared – using the PRAAT
software (Boersma, 2001) – to learners’ realizations of the same sounds using the Bark Difference
Metric method⁵. The analysis of the interdental fricatives stresses the importance of phonotactics
(i.e. the syntax of phonemes) and of lexical frequency. It also mentions possible ‘islands of reli-
ability’ (for expressions such as I think) which might help oral production in that learners may
use these as ‘buying time’ devices or structuring features. The analysis also indicates different
learning patterns for some sets of phonemes (with /u/~/u:/ being acquired later than /i/~/i:/). An
analysis of the perception and the categorization of the phonemic realizations by learners them-
selves was also carried out. The paper tests how accurately the data found is predicted by known
Second Language Acquisition (SLA) theoretical frameworks such as Flege’s Speech Learning
Model (1995). The results of the study show that the assumptions fail to predict differences in
learning patterns.
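For readers unfamiliar with Bark-based vowel normalization, the sketch below shows one common way of computing Bark differences from formant measurements. It is not a reconstruction of Méli's procedure, which is not detailed here. It assumes that formant frequencies in Hz (as can be measured in PRAAT) are converted to the Bark scale with Traunmüller's (1990) formula, without its corrections for very low and very high values, and that the three differences Z3-Z1, Z3-Z2 and Z2-Z1 are then computed. All formant values are invented.

```python
def hz_to_bark(freq_hz):
    """Convert a frequency in Hz to Bark (Traunmüller's formula, assumed here)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


def bark_differences(f1, f2, f3):
    """Return the three Bark differences for one vowel token."""
    z1, z2, z3 = (hz_to_bark(f) for f in (f1, f2, f3))
    return {"Z3-Z1": z3 - z1, "Z3-Z2": z3 - z2, "Z2-Z1": z2 - z1}


# Invented formant values (Hz) for one native and one learner vowel token.
native = bark_differences(300, 2300, 3000)
learner = bark_differences(350, 2100, 2900)

# How far the learner's realization lies from the native one on each dimension.
gap = {name: learner[name] - native[name] for name in native}

print(native)
print(learner)
print(gap)
```

Because each dimension is a difference between formants of the same token, the measure reduces the effect of speaker-specific vocal tract size, which is what makes learner and native realizations comparable.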
The last corpus presented in this section is rather innovative. The Multilingual, Traditional,
Immersion, and Native Corpus (MulTINCo: Meunier, Hendrikx, Bulon, Van Goethem, & Naets,
accepted) includes both learner and native data types. It contains:
• learner data for two target languages: Dutch and English;
• learners’ spoken and written, longitudinal data collected in two different educational settings:
  Content and Language Integrated Learning – CLIL – and traditional foreign language
  classes;
• data produced by the same learners in their L1 (on similar task types);
• comparable native data from native speakers of the learners’ L2 of about the same age;
• a variety of background variables (age, gender, home language, amount of L2 curricular and
  extracurricular input, etc.).
Van Mensel et al. (2020) explore the impact of formal and informal input on learners’ variability
in writing. The study compares two target language conditions (Dutch and English) in two different
instructed settings, namely Content and Language Integrated Learning (CLIL)
and traditional foreign language learning classes (non-CLIL) in French-speaking Belgium. The
study is part of a large project whose main objective is to investigate the influence of CLIL – and
other educational, motivational, and cognitive factors – on the acquisition of a foreign language.
Over 900 French-speaking primary and secondary school pupils learning English or Dutch in
CLIL and non-CLIL settings were followed longitudinally for two consecutive school years
(2015-2016 and 2016-2017) and various data types were collected. Using regression models to
check whether the type and amount of input that learners are exposed to (see Note 6) correlate with
proficiency levels, the study shows that CLIL is a significant predictor of L2 outcomes for both target
languages, but that the relative impact of formal and informal input differs depending on the
target language. The results also highlight the importance of the L2 status in research on CLIL,
because different L2s can yield different results.
Future Directions
As illustrated in section 2, LCR has constantly questioned its role, methods, and goals, and has, as a
result, evolved remarkably over the last 30 years. It is almost impossible to predict accurately what
lies ahead of us in, say, the next 30 years, but I have identified two promising areas.
The first one is related to the very status of LCR, which has always been considered a
product-oriented approach but which actually has the potential to combine both process and product
orientations in the future. Mäntylä et al. (2018), for instance, show how the use of keystroke
logging software (Strömqvist et al., 2006) can not only help researchers better understand the writing
process but also – and perhaps even more importantly here – lead to a reconsideration of what is
actually perceived or stored in the learner’s mind as a formulaic sequence (Wray, 2002). Previous
LCR research on formulaic language focused on ‘learner-external’ sequences (viz. the linguistic
patterns produced). The recording of keyboard activities during the writing process on computers
gives researchers unprecedented access to ‘learner-internal’ patterns (Myles & Cordier, 2017). A
careful analysis of the pauses between words, for instance, can reveal difficulties in accessing a
formulaic sequence but also the fact that a string of words considered a formulaic sequence on
the basis of learner-external patterns may not have psycholinguistic reality in the learner’s internally
stored holistic lexicon (Durrant, 2013). Studies like this one are only the first steps towards research
that integrates both product- and process-oriented approaches. New technologies and digital tools
make it possible to record processing ‘moves’ which will, hopefully, be integrated in LCR in the
future and hence help researchers revisit some of the LCR findings from a new perspective.
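The pause analysis mentioned above can be illustrated with a short Python sketch. It assumes a simplified keystroke log of (timestamp in milliseconds, key) pairs and a 2000 ms cut-off for a "long" pause; the log format, the threshold, the function name, and the sample data are all hypothetical and do not reproduce the output of any particular keystroke logging tool.

```python
from typing import List, Optional, Tuple

PAUSE_THRESHOLD_MS = 2000  # assumed cut-off, not taken from the source


def pauses_before_words(events: List[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Return (word, pause_ms) pairs, where pause_ms is the time elapsed
    between the previous keystroke and the first letter of the word."""
    results: List[Tuple[str, int]] = []
    word = ""
    onset_pause = 0
    prev_time: Optional[int] = None
    for time_ms, key in events:
        if key == " ":
            if word:
                results.append((word, onset_pause))
            word = ""
        else:
            if not word:  # first letter of a new word
                onset_pause = 0 if prev_time is None else time_ms - prev_time
            word += key
        prev_time = time_ms
    if word:
        results.append((word, onset_pause))
    return results


# Invented log of a learner typing "I think so".
events = [
    (0, "I"), (150, " "), (300, "t"), (420, "h"), (540, "i"), (650, "n"),
    (760, "k"), (900, " "), (3400, "s"), (3520, "o"),
]
for word, pause in pauses_before_words(events):
    flag = "LONG PAUSE" if pause >= PAUSE_THRESHOLD_MS else ""
    print(f"{word:<6} {pause:>5} ms {flag}")
```

In this invented example, the short gap between I and think would be compatible with the two words being retrieved as a unit, whereas the long pause before so would suggest a boundary. Real analyses would of course need to handle revisions, deletions, and individual typing speed.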
The second promising area is the interest in – and need for – collecting more interactive data
types. Whilst some learner corpora already include samples of language in interaction (such as the
Telekorp: Belz & Vyatkina, 2008), they still constitute a minority of the data types collected. The
interactive nature of communication is being increasingly stressed in SLA circles, with a focus
on ecological approaches (Kramsch & Vork Steffensen, 2008; Thorne, 2013) and multilingualism
(May, 2013; Ortega, 2013). As explained by Van Lier (2010, p. 2), “Ecological approaches
focus primarily on the quality of learning opportunities, of classroom interaction and of educational
experience in general. Important pedagogical principles in an ecological approach are the
creation of ecologically valid contexts, relationships, agency, motivation and identity”. Such an
approach calls for more attention to be paid to the ecological value of tasks given to learners. As
for the multilingual turn, it puts multilingualism at the forefront, thereby opening up new avenues
for intrinsically multilingual corpora where, for instance, learners with different mother tongues
can interact and translanguage. Instead of the rather homogeneous corpora of L2‘x’ with only
one ‘x’, one could collect learner corpora of L2‘xs’. This focus on interaction is also found in
pedagogical circles where official curricular documents clearly distinguish the features of spoken
and written competences with or without interaction and also insist on the non-verbal strategies
that are key to interactive competence. Learner corpora like the Giessen-Long Beach Chaplin
Corpus (GLBCC: Jucker, Müller, & Smith, 2006) or the Multimedia Adult ESL Learner Corpus
(MAELC: Reder, Harris, & Setzler, 2003) should inspire future learner corpus collections. The
GLBCC consists of transcribed interactions between native English, ESL, and EFL speakers.
As for MAELC, it contains videotaped classroom interactions associated with written materials
(copies of classroom written materials, student work, teacher logs, and teacher reflections). The
corpus includes materials from four years of adult ESL classes ranging from beginning to
upper-intermediate proficiency, with over 3600 hours of classroom interaction recorded by six cameras
and multiple microphones. The corpus has been partly coded for participation pattern and activity,
and portions of these classes have been transcribed, targeting student language during pair
work. As explained on the MAELC website, “examinations of dyadic interaction can focus on
interactions between students from different first language backgrounds as well as on developmental
studies of individual students who are recorded throughout several terms of study”. It is
also of primary importance to reach out and collect data from less favored groups of learners
(such as migrants) in order for LCR to be representative of all types of learners both in formal
(instructed) and informal (non-instructed) contexts.
Further Reading
Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research.
Cambridge: Cambridge University Press.
This volume is the first handbook entirely devoted to LCR. It offers a detailed overview of the field and
of the affordances of learner corpora.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149.
This article focusses on the formulaic/phraseological nature of language, one of the key aspects of lan-
guage that corpus-linguistic methodology has helped reveal.
Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus
research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163
This edited volume presents five studies addressing a topic that has received much attention in SLA, viz.
tense and aspect. It provides a fresh LCR perspective on tense and aspect issues.
Related Topics
Chapters 2, 6, 7, 8, 9, and 11.
Notes
1 Whilst instances of informal interactions may be more likely, few learners spontaneously decide to write
an argumentative or literary essay.
2 Corpus data are ideally continuous (i.e. consisting of longer stretches of discourse, not single words,
phrases or sentences) and contextualized (i.e. not produced in isolation but in the context of a meaning-
ful, set task).
3 Centre for English Corpus Linguistics (date of access 24th September 2018): Learner Corpora around the
World. Louvain-la-Neuve: Université Catholique de Louvain.
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
4 See https://www.cambridge.org/us/cambridgeenglish/catalog/english-academic-purposes/grammar-and-beyond/
5 Put simply, the Bark Difference Metric Method improves acoustic measurements by making it possible
to filter out physiological differences in pronunciation while retaining sociolinguistic differences.
6 Computed thanks to a proxy gathering various types of information collected through questionnaires on
input type and frequency.
References
Ädel, A. (2008). Involvement features in writing: Do time and interaction trump register awareness? In G.
Gilquin, S. Papp, & M. B. Díez-Bedmar (Eds.), Linking up contrastive and learner corpus research (pp.
35–53). Amsterdam: Rodopi.
Aijmer, K. (2009). Corpora and language teaching. Amsterdam: John Benjamins.
Axelsson, M. W. (2000). USE The Uppsala student English corpus: An instrument for needs analysis.
ICAME Journal, 24, 155–157.
Baayen, H., Davidson, D., & Bates, D. (2008). Mixed-effects modeling with crossed random effects for
subjects and items. Journal of Memory and Language, 59(4), 390–412.
Belz, J., & Vyatkina, N. (2008). The pedagogical mediation of a developmental learner corpus for classroom-
based language instruction. Language Learning and Technology, 12(3), 33–52.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and
written English. Harlow: Pearson Education Limited.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341–345.
Burnard, L., & McEnery, T. (Eds.) (2000). Rethinking language pedagogy from a corpus perspective:
Papers from the third international conference on teaching and language corpora. Frankfurt: Peter Lang
Publishing.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The
Cambridge handbook of learner corpus research (pp. 35–56). Cambridge: Cambridge University Press.
doi:10.1017/CBO9781139649414.003.
Callies, M., & Paquot, M. (2015). Learner corpus research: An interdisciplinary field on the move.
International Journal of Learner Corpus Research, 1(1), 1–6. doi: 10.1075/ijlcr.1.1.00edi
McIntosh, C. Cambridge Advanced Learner's Dictionary. (2008). Cambridge: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching,
assessment. Cambridge: Cambridge University Press.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language researchers.
Second Language Research, 28(3), 369–382.
Deshors, S. C. (2014). A case for a unified treatment of EFL and ESL: A multifactorial approach. English
World-Wide, 35(3), 277–305.
Deshors, S. C., Götz, S., & Laporte, S. (Eds.) (2018). Rethinking linguistic creativity in non-native Englishes
(Volume 98). John Benjamins Publishing Company.
De Smet, A., Mettewie, L., Galand, B., Hiligsmann, P., & Van Mensel, L. (2018). Classroom anxiety and
enjoyment in CLIL and non-CLIL: Does the target language matter? Studies in Second Language
Learning and Teaching, 8(1), 47–71. doi:10.14746/ssllt.2018.8.1.3
Durrant, P. (2013). Formulaicity in an agglutinating language: The case of Turkish. Corpus Linguistics and
Linguistic Theory, 9(1), 1–38.
Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford: Oxford University Press.
Flege, J. E. (1995). Second-language speech learning: Theory, findings and problems. In W. Strange (Ed.),
Speech perception and linguistic experience: Theoretical and methodological issues (pp. 229–273).
Timonium: York Press.
Fuchs, R., & Werner, V. (Eds.) (2018). Tense and aspect in second language acquisition and learner corpus
research [Special Issue]. International Journal of Learner Corpus Research, 4(2), 143–163
Gentil, G., & Meunier, F. (2018). A systemic functional linguistic approach to usage-based research and
instruction: The case of nominalization in L2 academic writing. In A. E. Tyler, L. Ortega, M. Uno, & H.
I. Park (Eds.), Usage-inspired L2 instruction. Researched pedagogy (pp. 267–289). Amsterdam: John
Benjamins.
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier
(Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge: Cambridge
University Press.
Gilquin, G., & Gries, S. (2009). Corpora and experimental methods: A state-of-the-art review. In G. Gilquin
(Ed.), Corpora and experimental methods [Special Issue]. Corpus Linguistics and Linguistic Theory,
5(1), 1–26.
Granger, S. (1994). The learner corpus: A revolution in applied linguistics. English Today, 10(3), 25–33.
doi:10.1017/S0266078400007665
Granger, S. (1996). From CA to CIA and back: An integrated approach to computerized bilingual and
learner corpora. In K. Aijmer, B. Altenberg, & M. Johansson (Eds.), Languages in contrast: Text-
based cross-linguistic studies. Lund studies in English (vol. 88, pp. 37–51). Lund: Lund University
Press.
Granger, S. (2008). Learner corpora in foreign language education. In N. Van Deusen-Scholl & N. H.
Hornberger (Eds.), Encyclopedia of language and education (vol. 4, pp. 337–351). Boston: Springer.
Granger, S. (2009). The contribution of learner corpora to second language acquisition and foreign language
teaching: A critical evaluation. In K. Aijmer (Ed.), Corpora and language teaching (pp. 13–32).
Amsterdam: John Benjamins.
Granger, S. (2015). Contrastive interlanguage analysis: A reappraisal. International Journal of Learner
Corpus Research, 1(1), 7–24. doi: 10.1075/ijlcr.1.1.01gra.
Granger, S., Dagneaux, E., & Meunier, F. (2002a). International corpus of learner English. Handbook and
CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (2009). International corpus of learner English.
Version 2 (Handbook + CD-ROM). Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). International corpus of learner
English. Version 3 (Handbook + web interface). Louvain-la-Neuve: Presses Universitaires de Louvain.
Granger, S., Gilquin, G., & Meunier, F. (Eds.) (2015). The Cambridge handbook of learner corpus research.
Cambridge: Cambridge University Press.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.) (2002b). Computer learner corpora, second language
acquisition and foreign language teaching. Amsterdam: John Benjamins.
Gries, S. (2009). What is corpus linguistics? Language and Linguistics Compass, 3(5), 1225–1241.
doi:10.1111/j.1749-818X.2009.00149.x
Gries, S. (2013). Statistical tests for the analysis of learner corpus data. In A. Diaz-Negrillo, N. Ballier, & P.
Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 287–310). Amsterdam:
John Benjamins.
Gries, S. (2015). Statistics for learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.),
The Cambridge handbook of learner corpus research (pp. 159–182). Cambridge: Cambridge University
Press. doi:10.1017/CBO9781139649414.008
Huat, C. M. (2012). Learner corpora and second language acquisition. In K. Hyland, C. M. Huat, & M.
Handford (Eds.), Corpus applications in applied linguistics (pp. 191–207). London: Continuum.
Jäschke, K. (2016). The dative alternation in English as a second language. PhD dissertation. Düsseldorf:
Heinrich-Heine-Universität. Retrieved from https://d-nb.info/1135382433/34
Jucker, A., Müller, S., & Smith, S. (2006). GLBCC (Giessen - Long Beach Chaplin Corpus). Oxford text
archive. Retrieved from http://hdl.handle.net/20.500.12024/2506. See also http://ota.ox.ac.uk/desc/2506.
Kramsch, C., & Vork Steffensen, S. (2008). Ecological perspectives on second language acquisition and
socialization. In N. H. Hornberger (Ed.), Encyclopedia of language and education (pp. 2595–2606).
Boston: Springer.
Longman Dictionary of Contemporary English (Fifth edition). (2009). Harlow: Pearson Education Limited.
Mäntylä, K., Lahtinen, S., Vaakanainen, V., & Mäkilä, M. (2018). Using keystroke logging to analyse the
writing process – tools for teaching writing. EuroCALL 2018 (book of abstracts, p. 27), Jyväskylä,
August 22.
May, S. (Ed.) (2013). The multilingual turn: Implications for SLA, TESOL, and bilingual education. London:
Routledge.
McCarthy, M. (2008). Accessing and interpreting corpus information in the teacher education context.
Language Teaching, 41(4), 563–574.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book.
London: Routledge.
Méli, A. (2013). Phonological acquisition in the French–English interlanguage: Rising above the phoneme.
In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner
corpus data (pp. 207–226). Amsterdam: Benjamins.
Meunier, F. (2010). Learner corpora and English language teaching: Checkup time. Anglistik: International
Journal of English Studies, 21(1), 209–220.
Meunier, F. (2015). Developmental patterns in learner corpora. In S. Granger, G. Gilquin, & F. Meunier
(Eds.), The Cambridge handbook of learner corpus research (pp. 379–400). Cambridge: Cambridge
University Press.
Meunier, F. (2016). Introduction to the LONGDALE project. In E. Castello, K. Ackerley, & F. Coccetta
(Eds.), Studies in learner corpus linguistics: Research and applications for foreign language teaching
and assessment (pp. 123–126). Berlin: Peter Lang Publishing. Retrieved from
https://uclouvain.be/en/research-institutes/ilc/cecl/longdale.html
Meunier, F. (2018). Promoting TPACK and professional learning communities: Focus on teaching and
learning multiword units. EuroCALL, Jyväskylä, August 23. doi:10.13140/RG.2.2.26823.14244.
Retrieved from https://www.researchgate.net/publication/327237628_Promoting_TPACK_and_professional_learning_communities_focus_on_teaching_and_learning_multiword_units_EuroCALL_conference_paper_Jyvaskyla_Finland_23_August_2018
Meunier, F., Hendrikx, I., Bulon, A., Van Goethem, K., & Naets, H. (2020). MulTINCo: Multilingual traditional
immersion and native corpus. Better-documented multi-literacy practices for more refined SLA studies.
International Journal of Bilingual Education and Bilingualism, DOI: 10.1080/13670050.2020.1786494
Meunier, F., & Littré, D. (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum experimental
data’ approach. The Modern Language Journal, 97(1), 61–76.
Mukherjee, J. (2009). The grammar of conversation in advanced spoken learner English: Learner corpus
data and language-pedagogical implications. In K. Aijmer (Ed.), Corpora and language teaching (pp.
203–230). Amsterdam: John Benjamins.
Myles, F., & Cordier, C. (2017). Formulaic sequences (FS) cannot be an umbrella term in SLA: Focusing
on psycholinguistic FSs and their identification. Studies in Second Language Acquisition, 39(1), 3–28.
doi:10.1017/S027226311600036X
Ortega, L. (2013). SLA for the 21st century: Disciplinary progress, transdisciplinary relevance, and the bi/
multilingual turn. Language Learning, 63(1), 1–24.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149.
Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus
research. International Journal of Learner Corpus Research, 3(1), 61–94.
Ployhart, R. E., & Vandenberg, R. J. (2010). Longitudinal research: The theory, design, and analysis of
change. Journal of Management, 36(1), 94–120.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks: Sage Publications, Inc.
Reder, S., Harris, K., & Setzler, K. (2003). A multimedia adult learner corpus. TESOL Quarterly, 37(3),
65–78. Retrieved from http://www.labschool.pdx.edu/maelc_access.html.
Römer, U. (2009). Corpus research and practice: What help do teachers need and what can we offer? In K.
Aijmer (Ed.), Corpora and language teaching (pp. 83–98). Amsterdam: John Benjamins.
Shirato, J., & Stapleton, P. (2007). Comparing English vocabulary in a spoken learner corpus with a native
speaker corpus: Pedagogical implications arising from an empirical study in Japan. Language Teaching
Research, 11(4), 393–412.
Strömqvist, S., Holmqvist, K., Johansson, V., Karlsson, H., & Wengelin, Å. (2006). What keystroke logging
can reveal about writing. In K. P. H. Sullivan & E. Lindgren (Eds.), Computer keystroke logging and
writing: Methods and applications (pp. 45–71). Amsterdam: Elsevier.
Tan, M. (2005). Authentic language or language errors? Lessons from a learner corpus. ELT Journal, 59(2),
126–134.
Thorne, S. (2013). Language learning, ecological validity, and innovation under conditions of superdiversity.
Bellaterra Journal of Teaching and Learning Language and Literature, 6(2), 1–27.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. Amsterdam: John Benjamins.
Van Lier, L. (2010). The ecology of language learning: Practice to theory, theory to practice. Procedia -
Social and Behavioral Sciences, 3, 2–6.
Van Mensel, L., Bulon, A., Hendrikx, I., Meunier, F., & Van Goethem, K. (2020). Effects of input on L2
writing skills in English and Dutch: CLIL and non-CLIL learners in French-speaking Belgium. Journal
of Immersion and Content-Based Language Education.
Wilson, J. (2013). Technology, pedagogy and promotion. How can we make the most of corpora and data-
driven learning (DDL) in language learning and teaching? The Higher Education Academy. Retrieved
from https://www.heacademy.ac.uk/system/files/corpus_technology_pedagogy_promotion2.pdf
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
... LCR applies the paradigm of corpus linguistics to a wide range of (research) purposes in foreign/second language acquisition (Granger 2002;2012). It is for this reason that LCR and corpus linguistics share a set of common features (Meunier 2021). ...
... Despite LCR has been the target of fierce criticism by the field of SLA since the 1990s, over the time LCR has improved consistently, by refining its theories and techniques, and SLA has recognised its developments (Granger 2012;Meunier 2021). In the last few years, LCR and SLA have started to interact, nonetheless Myles (2015; notes that there is not a real systematic collaboration. ...
Article
Full-text available
The article sheds light on Chinese as a Second Language Learner Corpus Research, emphasising advances and lacks in this field. First, the paper describes the potential of learner corpora in the investigation of learner language. Second, it provides an overview of Chinese learner corpus-based research and reviews existing L2 Chinese learner corpora. The paper highlights the lack of L2 Chinese learner corpora collecting data from Italian learners. Therefore, it emphasises the importance of compiling corpora to conduct studies on the acquisition of L2 Chinese by learners whose L1s are other than English or Asian languages.
... Due to the initial excitement over accessing previously inaccessible frequency information, many studies primarily focused on comparisons and productions of frequency lists for specific language features (e.g., verbs) and they made little use of tools other than concordancing software. According to Meunier (2020), most studies at the time could not go beyond the analysis of overused and/or underused linguistic items, generating lists of top n words in a corpus, and the documentation of the most frequently used linguistics phenomena. In Meunier's words, this led to "descriptive fever", a focus of interest or emphasis on the description of learner language rather than explaining what affects its development. ...
Article
Full-text available
Though rather rare and not favoured by corpus linguists due to computationally hard-to-handle problems, learner corpora consisting of spoken and written texts by students from different L1 backgrounds can benefit both researchers in the field of second language acquisition and language teachers. Growing from this need and considering corpora’s potential importance for the language teachers and learners in the Turkish context, our L2 English learner corpus is yet another humble attempt to build an error-tagged learner corpus particularly scrutinizing lexical errors, which play a key role in the language production of second language learners. Building on Hemchua and Schmitt’s lexical error taxonomy and developed following the strict methodological considerations in the literature (e.g., error naming and fixing through several rounds of tagging), the corpus consists of 369 written texts by 231 university students (with 104,864 words, 3000+ tagged and fixed errors). The corpus database is provided with a user-friendly web-interface, which consists of statistical output, modules highlighting lexical errors and correct versions, different search options including error types, and an error-tagging add-in for further development. In addition to being a resourceful website trying to guide language practitioners and second language learners, it can be considered a platform with a capacity to be developed further by applied linguists conducting studies in this line of research. Finally, thanks to its easy-to-use interface and versatile features, it has potential to become a reference learner corpus for English as a foreign/second language with the contribution of other universities in Türkiye.
... sia possibile osservare una timida tendenza a considerare anche altre lingue (Granger 2015). D'altro canto, i primi learner corpora elettronici di dimensioni considerevoli risalgono agli anni '90 del secolo scorso (Meunier 2021). Pertanto, nelle ricerche di tipo corpus-based in cui si voglia confrontare il parlato nativo con quello non nativo non è semplice reperire corpora di parlato confrontabili e diventa più difficile se la LT degli apprendenti non è l'inglese. ...
Article
Full-text available
The study examines dialogic speech in Foreign Language (FL) by comparing the pragmatic strategies adopted by Italian-speaking learners of Spanish and German and native speakers of the same languages. The aim of the work is to investigate whether and to what extent the strategies employed in FL present pragmatic patterns that can be traced back to those proper to the native language (NL, Italian), proper to the target language (LT, Spanish, German) or, rather, characteristics linked to linguistic and strategic competence in LS (thus independent on either L1 or LT). We consider the articulation of the textual structure, the preferences and “dispreferences” accorded to introducing and handling discourse topics, together with a degree of fluency based on some temporal parameters. Our findings indicate that FL speech presents a less elaborate and more fragmented textual structure than native speech: the topic entities explored tend to be arranged linearly, not hierarchically. At the same time, conversational games, although less and less in-depth, are on average concluded by employing a greater number of moves, in the context of a general slowness in processing and difficulty in managing the interaction. This also correlates with a slower speech rate and lower general fluency.
... 'Error' is, however, used here in accordance with other literature on corpus linguistics, e.g. Granger et al (2015). This is because it is necessary in corpus tagging to have a precise notion of what is being selected for analysis. ...
Article
Full-text available
This article presents CORYL (CORpus of Young Learner language), and demonstrates how using the corpus can help to reveal or shed further light on many phenomena which are manifested in the written English language of Norwegian school pupils. The paper begins by introducing CORYL, then focuses on learner language and the role of corpora in the study of this. Separate sections follow, devoted to what we term Computer-aided error analysis and Interlanguage analysis (not involving errors). Within these sections, extracts and other findings from CORYL-searching are presented to illustrate what we believe CORYL is able to indicate about the language of these learners.
Article
Full-text available
This article introduces the process of design and collection of the Mexican Learner Corpus (MexLeC), considering the notions of representativeness, authenticity (naturalness) and balance in building a learner corpus (Egbert, Biber y Gray, 2022; McEnery, Xiao y Tono, 2006; Biber, 2004). At the same time, it presents a literature review of materials and task type in learner spoken corpora and the availability or English learner corpora from Spanish native speakers. The result is an oral and longitudinal corpus unique in Latin America of approximately 200 000 tokens and 150 interviews in transcription. The recorded interviews sample three text-types in Biber (2004), informative, stance and narrative as well as two text genres (Council of Europe, 2020), descriptive monologue and argumentative monologue. The main applications of MexLeC are in the understanding of patterns of language acquisition and the development of teaching materials and program design for English as a Foreign Language.
Article
Full-text available
Este artículo describe tres propuestas pedagógicas orientadas a promover el aprendizaje de lenguas en la educación superior. También pretende discutir la importancia del aprendizaje continua de la primera lengua (L1) y lenguas extranjeras y/o adicionales (L2) para el éxito académico y profesional. Primero se describe un proyecto titulado Language-Integrated Knowledge Education (LIKE), que tiene el objeto de explicitar los recursos lingüísticos necesarios para participar plenamente en contextos académicos, tanto de forma oral como escrita, en el L1 y otras lenguas de los estudiantes de educación superior. En el segundo proyecto se describe los beneficios y las razones para ampliar el uso de intercambio intercultural y virtual dentro del campo de la enseñanza de lenguas extranjeras, así como el uso del diálogo intercultural y virtual en clases de disciplinas específicas (es decir, con enfoque no lingüística). El tercer proyecto, presentado más brevemente como una pieza conceptual, explora el uso del aprendizaje basado en el uso de juegos móviles con GPS para poder llevar los recursos y las actividades de aprendizaje de lenguas fuera de las aulas. Cada una de las tres propuestas es a la vez modesta y ambiciosa. Son modestas en que combinan elementos de enfoques pedagógicos ya existentes en el campo de la enseñanza de lenguas, como son la telecolaboración, el aprendizaje integrado de contenido y lengua extrajera (AICLE), portfolios (carpetas) y la utilización de las tecnologías móviles y experiencias de realidad aumentada para aprender idiomas. Son ambiciosos, ya que cada uno parte desde contextos tradicionales de aprendizaje de lenguas (por ejemplo, los departamentos de lenguas modernas), a la vez que fomenta la amplia y sistemática integración de la enseñanza de idiomas con tipos de actividades no tradicionales, incluyendo otras disciplinas académicas (el proyecto LIKE), comunidades virtuales y entornos fuera de las aulas (juegos de móvil habilitado con GPS).
Article
Full-text available
Whilst the links between learner corpus research (LCR) and Second Language Acquisition (SLA) have long been debated, McEnery et al. (2019. “Corpus Linguistics, Learner Corpora, and SLA: Employing Technology to Analyze Language Use.” Annual Review of Applied Linguistics 39: 74-92. doi:10.1017/S0267190519000096) claim that learner corpus data are not yet sufficiently integrated in SLA research. This article aims to go one way towards bridging the LCR/SLA gap by illustrating the benefits of collecting and analyzing data sets that better document multiliteracy practices. We first contextualize our work within the field of LCR where calls for more multidimensional data sets have been made. We then present a new database called MulTINCo - Multilingual Traditional, Immersion, and Native Corpus - collected in the framework of a project on Content and Language Integrated Learning in French-speaking Belgium. As our data set contains rich metadata and blends corpus data with other data types, we illustrate its potential for SLA research. In Sections 3 and 4, we describe the data collected and the interface. In the last section of the paper, we wrap up with a discussion on the methodological assets of such multidimensional data sets for SLA studies, and present directions for future research.
Book
The International Corpus of Learner English (ICLE) is a corpus of essay writing by upper intermediate and advanced learners. Founded and coordinated by Sylviane Granger at the University of Louvain, the corpus is the result of almost 30 years of collaborative activity between a large number of universities internationally. Since the release of the first and second editions in 2002 and 2009 respectively, the corpus has been used in a wide range of research projects internationally, served as the basis for many MA and PhD theses and generally played a key role in promoting the field of learner corpus research. The first version of the corpus contained 2.5 million words produced by learners from 11 mother tongue backgrounds. The second version was larger in terms of both words (3.7 million) and language backgrounds (16). The current version differs from the previous ones in two major ways. Firstly, it is even larger than the previous versions, as it includes data from 25 mother tongue backgrounds, amounting to 5.7 million words. In addition, unlike the preceding versions, which were both distributed on CD-ROMs, ICLEv3 is hosted on a brand-new web-based interface. This allows not only for easier and more flexible access but also for the regular inclusion of new subcorpora as they are completed, thereby highlighting the fundamentally dynamic nature of the ICLE project. Based on the abundant feedback that we have received over the years from scholars using the corpus, the functionalities that were already available in the previous versions of the ICLE have been enhanced significantly. The enhancement of the concordance tool and the corpus download facilities, for instance, is intended to answer the growing concern for the study of internal variability and individual differences in learner data.
Licences (for non-profit educational purposes only) give access to the ICLEv3 web interface and a copy of the ICLE handbook published by Louvain University Press, which contains a detailed description of the corpus, an overview of the ELT situation in the countries of origin of the learners and a user's guide to the web interface. For more information on the corpus, visit the ICLEv3 trial interface: https://corpora.uclouvain.be/cecl/icle/trial
Article
As part of a project on Content and Language Integrated Learning (CLIL) in French-speaking Belgium, this study aims to explore the impact of formal and informal input on learners' variability in writing, and to compare two target-language conditions (Dutch and English) in CLIL and non-CLIL settings in French-speaking Belgium. A regression model shows that CLIL is a significant predictor of L2 outcomes for both target languages, but that the relative impact of formal and informal input differs depending on the target language. In short, the amount of formal language exposure predicts the outcomes of the written productions of the learners of English, and the frequency of informal exposure those of the learners of Dutch. We argue that this observation is likely related to the difference in status that each of these languages holds among the pupils in our sample. The findings thus highlight the importance of the L2 status in research on CLIL, since different L2s can yield different results.
Chapter
The aim of our research is to understand how speech learning changes over the life span and to explain why "earlier is better" as far as learning to pronounce a second language (L2) is concerned. An assumption we make is that the phonetic systems used in the production and perception of vowels and consonants remain adaptive over the life span, and that phonetic systems reorganize in response to sounds encountered in an L2 through the addition of new phonetic categories, or through the modification of old ones. The chapter is organized in the following way. Several general hypotheses concerning the cause of foreign accent in L2 speech production are summarized in the introductory section. In the next section, a model of L2 speech learning that aims to account for age-related changes in L2 pronunciation is presented. The next three sections present summaries of empirical research dealing with the production and perception of L2 vowels, word-initial consonants, and word-final consonants. The final section discusses questions of general theoretical interest, with special attention to a featural (as opposed to a segmental) level of analysis. Although nonsegmental (i.e., prosodic) dimensions are an important source of foreign accent, the present chapter focuses on phoneme-sized units of speech. Although many different languages are learned as an L2, the focus is on the acquisition of English.
Presentation
This PowerPoint conference presentation is made available as an open educational resource (OER) in PDF form. Please check the OER specifications indicated on page 1.
Article
This study investigates pupils’ anxiety and enjoyment in the classroom when learning a second or foreign language. The particularity of this study lies in the comparison of two target languages (English and Dutch) in two educational contexts (CLIL and non-CLIL) at different instruction levels (primary and secondary education). While most research on content and language integrated learning (CLIL) focuses on English as a target language, the Belgian context calls for a comparison with the language of the “other” community, in this case Dutch. Data were collected from 896 pupils in French-speaking Belgium through a self-report questionnaire measuring pupils’ anxiety and enjoyment in the classroom, along with background characteristics. Results indicate that while CLIL pupils experience significantly less anxiety than their non-CLIL counterparts, English learners report significantly less anxiety and more enjoyment than Dutch learners. This suggests an important role of the target language for emotional engagement in the classroom and calls for further investigation into the role of target language perceptions. Finally, the interactions with instruction level reveal that while primary school pupils report stronger emotions, the effects of CLIL and English are much larger at secondary level.