
CEDEL2: Design, compilation and web interface of an online corpus for L2 Spanish acquisition research


Second Language Research
2022, Vol. 38(4) 965 –983
© The Author(s) 2021
DOI: 10.1177/02676583211050522
Cristóbal Lozano
Universidad de Granada, Spain
This article presents and reviews a new methodological resource for research in second language
acquisition (SLA), CEDEL2 (Corpus Escrito del Español L2 ‘L2 Spanish Written Corpus’), and its
free online search-engine interface ( CEDEL2 is a multi-first-language
corpus (Spanish, English, German, Dutch, Portuguese, Italian, French, Greek, Russian, Japanese,
Chinese, and Arabic) of L2 Spanish learners at all proficiency levels. It additionally contains
several native control subcorpora (English, Portuguese, Greek, Japanese, and Arabic). Its latest
release (version 2) holds material from around 4,400 speakers, which amounts to over 1,100,000
words. CEDEL2 follows strict corpus-design criteria (Sinclair, 2005) and L2 corpus-design
recommendations (Tracy-Ventura and Paquot, 2021), and all subcorpora are equally designed to
be fully contrastable, as recommended by Contrastive Interlanguage Analysis (Granger, 2015).
Thanks to its design and web interface, CEDEL2 allows for complex searches which can be further
narrowed down according to its SLA-motivated variables, e.g. first language (L1), proficiency
level, self-reported proficiency level, age of onset to the L2, length of exposure to the L2, length
of residence in a Spanish-speaking country, knowledge of other foreign languages, type of task,
etc. These CEDEL2 features allow L2 researchers to address SLA questions and hypotheses.
Keywords: L2 acquisition research, L2 corpora, L2 Spanish corpus, learner corpora, second language
acquisition (SLA)
I Introduction: Learner corpora and SLA
In learner corpus research (LCR), learner corpora are defined as systematic collections
of authentic and contextualized written/spoken language produced by second language
Corresponding author:
Cristóbal Lozano, Universidad de Granada, Facultad de Filosofía y Letras, Campus de Cartuja, Granada,
18071, Spain.
Review Article
(L2) learners (Callies and Paquot, 2015b: 1) and assembled according to explicit design
criteria (Granger, 2009: 14). They ‘can contribute to SLA [second language acquisition]
theory by providing a better description of interlanguage . . . and a better understanding
of the factors that influence it’ (Granger, 2008: 259). CEDEL2 contributes to SLA by (1)
incorporating these ‘factors’ into its corpus design via metadata, (2) adhering to specific
design principles (Sinclair, 2005), and (3) following LCR recommendations (Tracy-Ventura
and Paquot, 2021), which state that L2 corpora should:
1. focus on L2s other than English;
2. include learners at all proficiency levels, with varied L1s, from different ages,
and from different learning backgrounds and settings;
3. promote cross-linguistic comparisons;
4. include more learner and task variables (metadata);
5. include varied tasks, some of which promote hypothesis-testing research;
6. consider different perspectives on what a ‘control’ corpus is;
7. be freely available to the research community (Open Science).
SLA researchers have traditionally favoured experimental and controlled data to test SLA
hypotheses (Mackey and Gass, 2016). However, learner corpora are on the increase
(Granger, 2012; Mendikoetxea, 2014; Myles, 2015) and recent corpus studies have started
to test SLA hypotheses too (Lozano, 2021b). Large L2 corpora: (1) can offer a wide empirical
base to test SLA theories (Mendikoetxea, 2014); (2) are ecologically valid since they
sample unconstrained language, i.e. learners can choose their own wording (Granger, 2008);
(3) contain highly contextualized language that allows researchers to go beyond the lexical/
sentential level and explore discursive/pragmatic aspects (Myles, 2015); and (4) can be
interrogated to find patterns leading to the formulation of hypotheses (hypothesis-finding)
and also to test hypotheses against the corpus data (hypothesis-testing) (Mendikoetxea,
2014; Myles, 2005). In this context, we will discuss CEDEL2, which ‘represents a laudable
attempt to bring learner corpus research and SLA closer together’ (Gilquin, 2015: 24).
This article aims to present the CEDEL2 design, compilation and free web interface
in the context of SLA and L2 Spanish acquisition. CEDEL2 takes an SLA-motivated
approach to learner corpus design with a view to offering answers to SLA questions. This
article intends neither to discuss the role of learner corpora in SLA theory (for discussions,
see Lozano, 2021b; Lozano and Mendikoetxea, 2013; Mendikoetxea, 2014;
Myles, 2015, 2021) nor to present an overview of L2 Spanish corpora (see overviews in
Alonso-Ramos, 2016; Mendikoetxea, 2014; Rojo, 2021, and the L2 Spanish corpus
index atñol).
The article is structured as follows: Section II contextualizes CEDEL2. Section III
presents its design principles, current holdings, data collection, web-based search interface,
and some of its limitations. Section IV concludes with thoughts on the way forward
in L2 (Spanish) corpus research.
II Some representative L2 Spanish acquisition corpora
While L2 English corpora predominate in the LCR field, the increasing interest in L2
Spanish has triggered the creation of corpora for L2 Spanish acquisition research. To
contextualize CEDEL2, we will focus on a few representative corpora used in L2 Spanish
acquisition research: Spanish Learner Language Oral Corpora (SPLLOC: Mitchell et
al., 2008); Language and Social Networks Abroad Project (LANGSNAP: Tracy-Ventura
et al., 2016); and Corpus de Aprendices de Español ‘Spanish Learner Corpus’ (CAES:
Rojo and Palacios Martínez, 2016). We will see how CEDEL2 v.2 ultimately complements
these corpora in terms of written-spoken data (next paragraphs) and adds functionalities
in terms of SLA-informed design features (Section III).
Regarding medium (Figure 1), the four corpora range from the entirely spoken
SPLLOC (100% of spoken files) to the entirely written CAES (100% written), with the
mixed LANGSNAP (65% spoken, 35% written) and the mostly written CEDEL2 (97.5%
written, 2.5% spoken) in between. Crucially, collecting spoken data (and transcribing
them) is more labour intensive and costly than collecting written data
(Callies, 2015; Tracy-Ventura and Paquot, 2021). This results in spoken corpora like
LANGSNAP (331,554 words) and SPLLOC (unreported number of words) being smaller
in word size than written corpora like CAES (573,718) and CEDEL2 (1,106,013). For
SLA, however, representativeness is more relevant than corpus size since the corpus is
designed to represent the language type it intends to sample (see 3.1 for details on
Concerning the spoken vs. written dichotomy, LCR researchers have often assumed
that spoken data reflect learners’ competence better than written data (e.g. Myles, 2015).
However, many written learner corpora have been used to investigate competence and to
test SLA hypotheses (Granger et al., 2015) and most of the learner corpus studies in the
seminal book by Le Bruyn and Paquot (2021) use written corpora to ‘provide powerful
evidence that written data can make a significant contribution to some major SLA issues’
(Granger, 2021: 248). A written corpus is thus a valid and valuable source of evidence to
tap into learners’ competence and to test SLA hypotheses (Lozano, 2021b; Mendikoetxea,
2014), despite ‘SLA’s insistence on the supremacy of spoken data’ (Granger, 2021: 248).
LCR researchers could move beyond this dichotomy by exploring new avenues like the
triangulation of spoken and written data from different corpora. For example, Vázquez
Veiga (2016) triangulated spoken SPLLOC data and written CEDEL2 data to investigate
discourse markers at the lexicon-pragmatics interface. This approach ultimately yields a
more fully-rounded picture of the linguistic phenomenon under investigation.
Triangulation is even more promising when the spoken and written data come from the
same learners, same task and same corpus (Granger, 2021), as is the case in CEDEL2
(see Section III.1).
Regarding their speakers, SPLLOC (n = 150 speakers) samples British secondary-
school and university learners of L2 Spanish. LANGSNAP (n = 37) samples a cohort of
British university learners of L2 Spanish longitudinally over two academic courses at
Figure 1. Four representative L2 Spanish corpora plotted along the spoken–written
three points in time (before, during and after their year abroad in Spain/Mexico). CAES
(n = 1,423) samples instructed learners of Spanish at the Instituto Cervantes from six
different L1 backgrounds. CEDEL2 (n = 4,334) samples more heterogeneous learners,
with eleven L1s and diverse backgrounds in terms of countries, proficiency levels,
chronological ages, ages of exposure to L2 Spanish, lengths of instruction in Spanish and
learning environments. L2 Spanish researchers have thus at their disposal cross-sectional
data with a variety of instructed and naturalistic backgrounds, plus longitudinal data
As for the number of variables sampled, SPLLOC and LANGSNAP register 5 variables
each (4 learner variables and 1 task variable), CAES 10 (9 learner, 1 task) and
CEDEL2 25 (20 learner, 5 task). In terms of the number of subcorpora, SPLLOC
and LANGSNAP contain 2 subcorpora each (1 learner, 1 native), CAES 6 (learner only)
and CEDEL2 17 (11 learner, 6 native). Further details can be found in Appendix 1 in
supplemental material.
III The CEDEL2 (version 2) corpus
We describe CEDEL2 next: its general corpus design principles and SLA-motivated features
(3.1), its holdings (3.2), the data collection (3.3), its online web search interface
(3.4), the process and product (3.5) and its limitations (3.6).
1 CEDEL2 corpus design
While many learner corpora are not built according to specific design principles (Gilquin,
2015; Tono, 2016), Gilquin (2015) proposes CEDEL2 as a good-practice case in learner
corpus design since it rests on 10 corpus-design principles (Sinclair, 2005) that were
adapted for SLA purposes (Lozano and Mendikoetxea, 2013). Sinclair’s (2005) most
relevant principles are:
1. content selection: select corpus contents based on external (communicative function
of the texts) and not internal (the language of the texts) criteria;
2. representativeness: select contents to be as representative as possible of the language
it samples;
3. topic: select the subject matter of the corpus based on external criteria;
4. contrast: compare only those subcorpora that have been equally designed;
5. documentation: fully document the contents of the corpus (i.e. its linguistically-
motivated variables or metadata).
Based on Sinclair’s design principles and the seven recommendations by Tracy-Ventura
and Paquot (2021) seen above, CEDEL2 showcases nine features that make it a suitable
and valuable tool for L2 Spanish acquisition research.
Feature 1: Same design across subcorpora to ensure maximum comparability. The principle
of contrast requires all subcorpora to be equally designed and balanced. In line with
Tracy-Ventura and Paquot’s (2021) second recommendation and Granger’s (2015) Contrastive
Interlanguage Analysis (CIA), all CEDEL2 subcorpora follow the same design
principles to ensure maximum comparability. Both between-subcorpora and within-subcorpus
contrasts allow researchers to test different effects (e.g. L2 development, L1 transfer,
universal mechanisms, ultimate attainment, exposure, bimodality, bidirectionality, etc.), as will
be discussed in the following subsections.
Feature 2: SLA-motivated variables (learner profile). Though learner variables ‘have not
been controlled consistently across corpora and are seldom incorporated in metadata’
(Díaz-Negrillo and Thompson, 2013: 13), CEDEL2 registers, in line with Tracy-Ventura
and Paquot’s (2021) fourth recommendation, 20 learner variables (metadata) that can be
exploited in the investigation of many SLA phenomena (Table 1).
Proficiency level is a key learner variable that is often lacking in learner corpora
(Gilquin, 2015: 24). Some corpora use ad hoc criteria like course year as a proxy for
proficiency level (Callies, 2021), but ‘[b]eing in the same year group at school is not
always a sufficiently rigorous indication, and it is advisable to carry out independent
measures of proficiency’ (Myles, 2015: 316). CEDEL2 uses four measures: (1) objective
independent measure: learners do a 43-point standardized Spanish placement test
(University of Wisconsin, 1998); (2) subjective measure: learners self-rate each of their
skills (speaking, writing, listening, and reading) according to a 6-point scale (lower/
upper beginner, lower/upper intermediate, lower/upper advanced). Gilquin (2015) argues
that the ‘double proficiency measure in CEDEL2 is therefore a major asset (also because
. . . it makes it possible to compare self-rated and real proficiency)’ (p. 24). Additionally,
CEDEL2 records (3) language certificate (learners state any certificates they hold, if
any); and (4) length of instruction (LoI) in Spanish (learners report how long they have
been learning Spanish for). All these metadata provide a good estimation of the learner’s
proficiency level.
Feature 3: SLA-relevant variables (task profile). Five task metadata were recorded:
1. Task title (14 tasks to choose from; see Appendix 1 in supplemental material);
2. Task text (written text/spoken text transcription with audio file);
3. Approximate time to produce the task (in minutes);
4. Where the task was done (in class/outside class/both);
5. Resources used to produce the task (help from Spanish native/bilingual dictionary/
monolingual dictionary/spellchecker/grammar book/background readings/
Representativeness relates to both the sampled language (see this subsection) and the
sampled speakers (see Feature 9 below). Lozano and Mendikoetxea (2013) argue that L2
corpus design should adhere to external criteria to ensure that their language is representative
and authentic. This is achieved via ‘tasks that allow learners to choose their
own wording rather than being requested to produce a particular word or structure’
(Granger, 2008: 261), which ultimately leads to ‘a high degree of inclusiveness and a low
degree of language bias’ (Mendikoetxea, 2014: 14). Following these and Tracy-Ventura
Table 1. Learner variables (numbered list) and likely second language acquisition (SLA)
phenomena to investigate (bulleted list) in CEDEL2.
1. Learner’s L1
2. L1 of the learner’s father
3. L1 of the learner’s mother
4. Languages spoken at home
   • L1 transfer effects by contrasting the 11 different L1s of the learner subcorpora
   • L1 transfer effects vs. the effects of general/universal cognitive mechanisms by
     comparing L1s that are typologically similar/different
   • Indirect L2 input effects by analysing the native Spanish control subcorpus
   • Likely language-dominance patterns via the parents’ L1 and language spoken at home
5. Standardized placement test score (1–43 points)
6. Proficiency level based on the placement score (lower/upper beginner, lower/upper
   intermediate, lower/upper advanced)
7. Proficiency-level self-evaluation on each skill in Spanish (speaking, listening, writing, reading)
8. Proficiency-level self-evaluation on each skill in an additional foreign language (speaking,
   listening, writing, reading)
9. Spanish language certificates held, if any
   • L2 developmental effects by comparing beginner vs. intermediate vs. advanced learners
   • Interlanguage knowledge at a given proficiency level within a learner subcorpus or
     across learner subcorpora
   • Likely influence from the learners’ additional foreign language
   • Correlation between the learners’ objective placement test score and their
     self-evaluation score
10. Sex
   • Sex-related linguistic differences (males/females)
11. Age (chronological)
   • Maturational effects due to age (e.g. linguistic abilities in young/adult/senior learners)
   • Linguistic phenomena across the lifespan
12. Age of Exposure (AoE) to L2 Spanish
   • Likely AoE effects (critical periods in L2)
13. Years studying Spanish (length of instruction, LoI)
   • LoI effects (or lack thereof)
14. Stay(s) in Spanish-speaking countries for longer than 1 month? (yes/no)
15. Stay(s): Where?
16. Stay(s): When? (periods of residence)
17. Stay(s): How long? (length of residence)
   • Effects of Length of Residence (LoR) and the recency of the stays in
     Spanish-speaking country/countries
   • Effects of exposure to naturalistic input in a naturalistic setting
   • Ultimate attainment effects in near natives (e.g. learners with high proficiency levels
     and long stays in a Spanish-speaking country) vs. Spanish natives
18. School/University/Educational institution
19. Major degree (if any)
20. Year at university (if any)
   • Likely effects of general educational background (school/university)
   • Likely effects of educational background in Spanish (for those majoring in Spanish/
     Hispanic Studies vs. those who are not)
and Paquot’s (2021) fourth and fifth recommendations, CEDEL2 tasks meet five criteria
so as to potentially elicit authentic and varied linguistic phenomena within the text types
it samples (see additional details in Appendix 1 in supplemental material):
1. Range of text types: descriptive (tasks 1 and 2 on the description of your region
and a famous person), narrative (3–7 on the narration of a film, your last holidays,
your future plans, a recent trip and a life experience; 13–14 picture- and
video-based narratives), and argumentative (8–12 on terrorism, anti-tobacco law,
gay marriages, marijuana legalization, and immigration). Some tasks may trigger
a blending of descriptive-narrative styles (e.g. tasks 3–7).
2. Range of control: Most tasks are relatively open-ended (tasks 1–12), but tasks 13
and 14 impose a certain degree of control since participants narrate the same
visual prompts.
3. Range of difficulty: Some descriptive tasks are linguistically undemanding and
suitable for beginners (e.g. task 1), while narrative tasks require tense–aspect
contrasts and argumentative tasks (e.g. 8–12) are linguistically demanding.
4. Pedagogical and replication criteria: Tasks 1–12 were selected from essay topics
found in mainstream Spanish language textbooks. Tasks 13 (picture-based narrative
Frog, where are you?) and 14 (short video from Charles Chaplin’s The kid)
have been extensively used in SLA research, which allows for replication.
5. Tense–aspect contrasts are triggered by different tasks, e.g. 3 Describe a film you
have recently seen and 6 Describe a trip you have recently made (present perfect)
vs. 4 What did you do last year during your holidays? (preterite) vs. 5 Which are
your plans for the future? (future).
Purpose-designed, theoretically-motivated tasks are also necessary in learner corpora
(Callies and Paquot, 2015a; Myles, 2015; Tracy-Ventura and Myles, 2015; Tracy-Ventura
and Paquot, 2021). Domínguez et al. (2013) and Tracy-Ventura and Myles (2015) showed
that such tasks in SPLLOC revealed tense/aspect contrasts that would have gone undetected
if only one generic past-tense narrative task had been used. These contrasts are
also possible in CEDEL2; see point (5) above.
Theoretically-motivated CEDEL2 tasks (7 Retell a recent film you have seen recently;
13 Retell the frog story and 14 Retell the Chaplin video) trigger different information-status
contexts (topic continuity vs. topic shift with varying number of potential antecedents) that
constrain anaphora resolution in native and L2 Spanish. Such tasks have shed light on SLA
theoretical issues like: (1) the Pronominal Feature Geometry hypothesis (Lozano, 2009b)
since learners’ well-known deficits with anaphora resolution selectively affect only 3rd
person human anaphoric pronouns; (2) the Pragmatic Principles Violation Hypothesis
(Lozano, 2016; Martín-Villena and Lozano, 2020) since learners are more redundant (in
topic-continuity contexts) than ambiguous (in topic-shift contexts); and (3) the Position of
Antecedent Hypothesis and the Accessibility Hierarchy (Georgopoulos, 2017).
Feature 4: Multiple L1 backgrounds. A classic debate in SLA is L1 influence vs. language
universals. Following Tracy-Ventura and Paquot’s (2021) first and second recommendations,
CEDEL2 samples L2 Spanish learners from eleven L1s (English, German, Dutch,
Portuguese, Italian, French, Russian, Greek, Japanese, Chinese, and Arabic). Additionally,
based on the principle of contrast, CIA and Tracy-Ventura and Paquot’s (2021) third
recommendation, all CEDEL2 subcorpora were equally designed to allow contrasts
between typologically-(un)related L1s (e.g. Germanic: English vs. German vs. Dutch;
Romance: Portuguese vs. Italian vs. French; Germanic vs. Romance vs. Slavic vs.
Semitic vs. Sino-Tibetan vs. Japonic).
Feature 5: Cross-sectional, developmental corpus. Following Tracy-Ventura and Paquot’s
(2021) second recommendation, CEDEL2 samples learners at all proficiency levels
(based on a standardized placement test, see Feature 2 above), so L2 development can be
traced. Unlike CAES, CEDEL2 does not restrict composition topics to different proficiency
levels. CEDEL2 is a cross-sectional corpus, like SPLLOC and CAES, but unlike
the longitudinal LANGSNAP.
Feature 6: Bidirectionality. CEDEL2 has an equally-designed, mirror-image L2 English
corpus, COREFL (Corpus of English as a Foreign Language: http://corefl.learnercor- (Lozano et al., 2021), so it adheres to the principle of contrast and to Tracy-
Ventura and Paquot’s (2021) third recommendation. The same phenomenon can be
explored bidirectionally (L1↔L2), e.g. L1 English–L2 Spanish (CEDEL2) vs. L1
Spanish–L2 English (COREFL), which allows researchers to uncover L1-specific vs. universal
effects that are independent of the L1–L2 combinations. Bidirectionality is an
under-researched area in SLA/LCR.
Feature 7: Bimodal contrasts. As discussed above, spoken data are costly to collect and
process, hence written corpora are the norm in LCR (Tracy-Ventura and Paquot, 2021).
While CEDEL2 is predominantly a written corpus, it incorporates some data from participants
who did the same task twice (written then spoken) with a two-week gap to avoid
task-habituation effects (n = 104 participants: 26 L1 English–L2 Spanish learners, 59 L1
Spanish native controls, and 19 L1 English native controls; see Appendix 1 in supplemental
material). The classic argument that spoken data better reflect learners’ competence
than written data (see discussion in Section II) can be reliably tested in CEDEL2
since medium varies (spoken/written) but task and speaker are constant, as recommended
by Granger (2021: 248): ‘bimodal corpora allow interesting comparisons, especially
when the data are collected from the same learners’.
Feature 8: ‘Dual’ native control subcorpora. Some learner corpora include a native control
subcorpus as a benchmark of the (variety of) language learners are exposed to. Control
corpora are justified in LCR (see the ‘comparative fallacy’ vs. ‘comparative hypocrisy’
debate: Granger, 2009; Tracy-Ventura and Paquot, 2021). Following the recommendation
of using several native norms in learner corpus research (Gilquin, 2021b) and Tracy-Ventura
and Paquot’s (2021) sixth recommendation, CEDEL2 samples data from 1,112
natives across Spanish-speaking countries (Peninsular and Latin American varieties),
which turns it into a Spanish native corpus in its own right. Importantly, to determine
whether learners’ L2 knowledge is due to their L1, L2 input, or universal cognitive
mechanisms, two native control subcorpora are required: (1) the learners’ target (L2)
language to check for potential effects of input on L2 acquisition; (2) the learners’ L1 to
check for possible L1 transfer effects. Using two native control subcorpora provides
more information about the likely sources of knowledge than using only one control
corpus, as demonstrated in L2 Spanish fluency (Huensch and Tracy-Ventura, 2017) and
L2 English reference (Kang, 2004). CEDEL2 incorporates this ‘dual’ native-corpus
perspective (Table 2).
Feature 9: Heterogeneous sample. LCR researchers complain that many current learner
corpora oversample argumentative essays produced in university settings by young
adults who are advanced learners (Paquot and Plonsky, 2017), so, following Tracy-
Ventura and Paquot’s (2021) second recommendation, CEDEL2 samples heterogeneous
learners: eleven typologically (un)related L1s, as well as different proficiency levels,
chronological ages, AoE to Spanish, LoI and LoR, learning environments (instructed/
uninstructed) and educational backgrounds (universities/secondary schools). Such
CEDEL2 variety is argued to be an asset (Gilquin, 2015). The CEDEL2 web-based
search interface (Section III.4) allows researchers to filter results according to 12 metadata
variables (e.g. L1, age, proficiency level, AoE, LoR, LoI), so researchers can select
representative samples of the population they intend to investigate.
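The same kind of metadata-based selection can also be applied offline to downloaded files. The following is a minimal sketch in Python, not the interface's own implementation: the `Participant` class, its field names and the three records are hypothetical placeholders invented for illustration (they are not CEDEL2 data), chosen only to mirror variables named in the text (L1, proficiency level, age, AoE, LoI, LoR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    # Hypothetical field names mirroring variables named in the text.
    l1: str
    proficiency: str              # e.g. "upper intermediate"
    age: int                      # chronological age
    age_of_exposure: int          # AoE to L2 Spanish
    length_of_instruction: float  # LoI, in years
    length_of_residence: float    # LoR, in months


def select(participants, **criteria):
    """Keep participants for whom every named predicate returns True."""
    return [
        p for p in participants
        if all(test(getattr(p, field)) for field, test in criteria.items())
    ]


# Placeholder records invented for this sketch; they are NOT CEDEL2 data.
sample = [
    Participant("English", "upper intermediate", 21, 14, 6.0, 3.0),
    Participant("English", "lower beginner", 19, 18, 1.0, 0.0),
    Participant("Greek", "upper advanced", 34, 20, 10.0, 24.0),
]

# L1 English learners first exposed to Spanish at age 12 or later.
late_english = select(
    sample,
    l1=lambda v: v == "English",
    age_of_exposure=lambda v: v >= 12,
)
print([p.proficiency for p in late_english])
```

Expressing each filter as a predicate keeps range-based criteria (AoE, LoR, LoI) and categorical ones (L1, proficiency level) in a single call.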
2 CEDEL2 holdings: Subcorpora and statistics
CEDEL2 was designed and compiled by Cristóbal Lozano, who has directed the project since
2004 (Lozano, 2009a; Lozano and Mendikoetxea, 2013). Online written data collection
started in 2006. CEDEL2 v.2 currently holds 4,399 written and spoken files coming from
4,334 speakers, amounting to over one million word-tokens (n = 1,105,936) (see statis-
tics in Appendix 1 in supplemental material and on the CEDEL2 website) thanks to the
voluntary data-collection collaboration of both local (Universidad de Granada, UGR)
Table 2. Current native control subcorpora in CEDEL2 v.2.
Native control subcorpus 1    Learner subcorpus            Native control subcorpus 2
(learners’ mother tongue)                                  (learners’ target language)
L1 English                    L1 English–L2 Spanish        L1 Spanish
L1 Portuguese                 L1 Portuguese–L2 Spanish     L1 Spanish
L1 Greek                      L1 Greek–L2 Spanish          L1 Spanish
L1 Arabic                     L1 Arabic–L2 Spanish         L1 Spanish
L1 Japanese                   L1 Japanese–L2 Spanish       L1 Spanish
L1 German*                    L1 German–L2 Spanish         L1 Spanish
L1 Dutch*                     L1 Dutch–L2 Spanish          L1 Spanish
L1 Italian*                   L1 Italian–L2 Spanish        L1 Spanish
L1 French*                    L1 French–L2 Spanish         L1 Spanish
L1 Russian*                   L1 Russian–L2 Spanish        L1 Spanish
L1 Chinese*                   L1 Chinese–L2 Spanish        L1 Spanish
Note. * under development.
and international collaborators (for details, see Appendix 2 in supplemental material). Its
17 subcorpora (Figure 2) are still growing for the future CEDEL2 version 3.
3 CEDEL2 data collection
CEDEL2 written data were collected via dedicated online forms (v.1) from 2006 to 2016
and via Google Forms (v.2) from late 2016 ( L2 corpus online
data collection has been argued to be ‘exciting for conceptualizing new avenues for data
collection that could reach L2 users, rather than learners, that have largely been ignored’
(Bell and Payant, 2021: 64).
Forms are written in the participant’s native language to ensure full understanding
(Figure 3). Participants completed three sections (instructions and informed consent,
learner profile, and task profile), and learners additionally completed a Spanish placement
test. CEDEL2 participation is voluntary. Ethics approval was granted by the Human
Research Ethics Committee at the Universidad de Granada. Calls for participation were
advertised in distribution lists (Linguist List, Infoling, Corpora List, social media, etc.).
In return for their written participation, learners received their placement-test score and,
upon request, a written statement of participation.
An international team (Appendix 2 in supplemental material) collaborated in the data
collection and in the translation of the forms into the participants’ mother tongue.
Approximately 27% of the data (1,185 files out of 4,400 files) were collected by collaborators
and 73% (3,215 out of 4,399 files) were collected by C. Lozano. Data collection
was coordinated from the University of Granada (89% or 3,936 out of 4,399 files) and
the remaining 11% from other international universities (464 out of 4,400 files). These
figures represent the coordinating institutions and not the actual data collection
Figure 2. Number of files per subcorpus (CEDEL2 v.2).
While written data were collected online from participants all over the world, spoken
data were collected in situ at the Universidad de Granada in a quiet room with recording
equipment (Audio Technica AT2020: Cardioid condenser microphone, 74 dB, 1 kHz at
1 Pa) to ensure optimal audio quality. Audio files were orthographically transcribed into
text files. Transcripts followed basic transcription conventions (see the User guide on the
CEDEL2 website for details), e.g. xxx for unintelligible words, / for silent pauses, eh for
filled pauses, and = for false starts.
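The four conventions listed above lend themselves to simple automatic counting in a downloaded transcript. The sketch below is illustrative only: the regular expressions are an assumption about how the markers surface in the plain-text transcripts (the User guide remains the authoritative description), and the sample utterance is invented, not taken from the corpus.

```python
import re
from collections import Counter

# Marker inventory taken from the conventions named in the text; the exact
# regular expressions are assumptions about the transcript format.
MARKERS = {
    "unintelligible": re.compile(r"\bxxx\b"),
    "silent_pause": re.compile(r"/"),
    "filled_pause": re.compile(r"\beh\b", re.IGNORECASE),
    "false_start": re.compile(r"="),
}


def count_markers(transcript: str) -> Counter:
    """Count each transcription marker in one transcript string."""
    return Counter(
        {name: len(rx.findall(transcript)) for name, rx in MARKERS.items()}
    )


# Invented utterance for illustration (not corpus data).
example = "el niño / eh busca la ra= la rana xxx en el bosque"
counts = count_markers(example)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Counts of this kind could feed simple dysfluency measures (e.g. pauses per 100 words) when comparing the spoken and written versions of the same task.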
4 CEDEL2 search and download engine
Following the Open Science philosophy and Tracy-Ventura and Paquot’s (2021) seventh
recommendation, the newly developed CEDEL2 v.2 web-based interface (Figure 4) was
freely released in September 2020 under a Creative Commons license (CC BY-NC-ND
3.0 ES) at
The CEDEL2 interface offers multiple and sophisticated search and download options
(for details, see Appendix 1 in supplemental material, the User guide and the circled
question mark icon on the CEDEL2 website). There are several (sub)types of results (i.e.
outputs), which can be refined by the 12 filters (Figure 4).
a Output type
1. Texts: The output shows a tabulated list of corpus files with 10 columns representing
variables (Figure 5). Each text can be visualized by clicking on the
Figure 3. CEDEL2 v.2 forms.
tabulated list. The list of texts can be additionally filtered (see Section III.4.c
below), sorted according to certain criteria (Figure 6, left image), and downloaded
in several formats: TXT (actual written text or spoken text transcription),
TXT with metadata (text together with the 20 linguistic-profile variables and the
4 task variables), CSV (Comma Separated Values) for Excel, CSV for other software.
The MP3 audio files can also be downloaded.
2. Concordances: The searched element is displayed in the centre accompanied by
its surrounding context, i.e. keywords in context (KWIC, Figure 7). Concordances
can be filtered (see Section III.4.c below), sorted (Figure 6, right image) and
3. Simple frequency: The output shows the frequency of the searched element(s)
(e.g. word, lemma, grammatical word) out of the total number of words and documents
in the corpus, e.g. lemma SER ‘to be’ (21,016/851,675 results [24.68/million]
in 2,713/3,034 documents) vs. lemma ESTAR ‘to be’ (5,038/851,675
results [5.92/million] in 1,752/3,034 documents).
4. Full frequency: The output shows the frequency of the searched element(s)
according to eleven variables (L1, medium, L1 and medium, proficiency level,
text title, years studying Spanish (LoI), stay abroad (LoR), age of exposure (AoE)
to Spanish, age, sex, and placement test score).
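As a rough illustration of how the simple-frequency figures relate to the raw counts, the following minimal Python sketch recomputes a relative frequency and a document coverage for the SER and ESTAR counts quoted above. The function names and the scaling base are assumptions made for this example and are not part of the CEDEL2 interface. With a base of 1,000 words, the arithmetic yields 24.68 and 5.92, which coincide with the bracketed figures in the output.

```python
def relative_frequency(hits: int, corpus_size: int, base: int = 1_000) -> float:
    """Return hits per `base` words (the base is an illustrative assumption)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * base


def document_coverage(docs_with_hits: int, total_docs: int) -> float:
    """Return the share of documents (0 to 1) containing at least one hit."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return docs_with_hits / total_docs


# Raw counts quoted in the text for the two lemmas.
counts = {
    "SER": {"hits": 21_016, "docs": 2_713},
    "ESTAR": {"hits": 5_038, "docs": 1_752},
}
CORPUS_WORDS = 851_675
CORPUS_DOCS = 3_034

summary = {
    lemma: (
        round(relative_frequency(c["hits"], CORPUS_WORDS), 2),
        round(document_coverage(c["docs"], CORPUS_DOCS), 4),
    )
    for lemma, c in counts.items()
}

for lemma, (per_thousand, coverage) in summary.items():
    print(f"{lemma}: {per_thousand} per 1,000 words, in {coverage:.1%} of documents")
```

Normalizing by corpus size in this way is what makes frequencies comparable across subcorpora of different sizes.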
Figure 4. CEDEL2 v.2 web-based interface (
b Output subtype. For concordances and frequencies, the searchable element(s) can be:
1. Words: Strings (characters, word, word combinations) that can incorporate wild-
cards: * (any number of characters), ? (for any 1 character), | (for either one
string or the other).
2. Grammatical elements: In CEDEL2 v.2, all words in the Spanish and English texts
have been automatically tagged (i.e. part-of-speech annotated) with Freeling,
an automatic annotator (see details in the User guide on the CEDEL2 website).
The searchable grammatical elements can be:
a. Part-of-speech (POS) tags, available from a drop-down menu containing word
categories (e.g. Noun, Adjective, Verb, Adverb, etc.) and subcategories (e.g.
noun.masculine.plural; verb.indicative.imperfect;
pronoun.personal.3rd.singular.masculine; etc.). Researchers can search for, e.g.
imperfect/perfect past tense contrasts to test the well-known Aspect Hypothesis; 3rd
person singular personal pronouns to test hypotheses on anaphora resolution (see
above); etc.
Figure 5. Text output.
Figure 6. Sorting options.
b. Lemmas (e.g. SER and ESTAR would search for all verbal forms of ser and estar
‘to be’) to test hypotheses that can account for this well-known verbal contrast in
L2 Spanish.
3. Words proxim.: It searches for a first word separated by N words from a second
word. The user can define the separation (N), e.g. yo + 2 + estoy would retrieve
cases like yo no|siempre|nunca|ahora estoy.
4. Grammatical elements proxim.: It searches for a first grammatical word separated
by N words from a second grammatical word, e.g.
determiner-article-masculine-singular + 1 word + noun-feminine-singular, which would
find cases of masculine articles followed by feminine nouns (el acción, el
alcantarilla . . .). This can help researchers test hypotheses on gender
(mis)agreement. Other sophisticated lemma/tag combinations are possible, e.g. lemma
LLEGAR + 2 + noun would retrieve instances of postverbal subjects (llegar ‘arrive’
followed by either a noun or by another word, e.g. an article, before the noun, as in
Llega un policía, Llega otro personaje, etc.) to test the well-known unaccusative
hypothesis.
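The wildcard and proximity searches described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the CEDEL2 implementation: the function names `wildcard_to_regex` and `proximity_search` are hypothetical, tokens are plain whitespace-separated words, and N is read as the maximum distance in tokens between the two elements (an interpretation suggested by the LLEGAR + 2 + noun example, where the noun may be the first or the second following word).

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate the *, ? and | wildcards into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")        # any number of characters
        elif ch == "?":
            parts.append(r"\w")         # exactly one character
        elif ch == "|":
            parts.append("|")           # either one string or the other
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.IGNORECASE)


def proximity_search(tokens, first, second, max_gap):
    """Return (i, j) index pairs where `second` follows `first` within max_gap tokens.

    Assumption: max_gap is the largest allowed distance between the two
    positions, so max_gap=1 means the elements are adjacent.
    """
    first_re = wildcard_to_regex(first)
    second_re = wildcard_to_regex(second)
    hits = []
    for i, token in enumerate(tokens):
        if not first_re.fullmatch(token):
            continue
        for j in range(i + 1, min(i + max_gap + 1, len(tokens))):
            if second_re.fullmatch(tokens[j]):
                hits.append((i, j))
    return hits


# Toy sentence invented for the example; it is not CEDEL2 data.
tokens = "yo no estoy seguro pero yo siempre estoy aquí y ella está allí".split()
print(proximity_search(tokens, "yo", "est*", 2))  # prints [(0, 2), (5, 7)]
```

The same loop would work over (word, lemma, tag) triples instead of bare strings if the tokens carried annotation, which is how a tag-based proximity query could be sketched.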
c Filtering options. The output (results) can be filtered according to 12 filters (Figure 4):
learners’ L1, task medium (written/spoken/written and spoken by the same person), sex,
proficiency level on a 6-point scale (lower/upper beginner, intermediate, and advanced),
placement test score in Spanish (range: 0%–100%), self-evaluated proficiency level in
Spanish on a 6-point scale, task title (14 tasks to choose from), filename, age, AoE to
Spanish, LoI in Spanish (in years), and LoR in a Spanish-speaking country (in months).
Filters allow researchers to target those elements (learners, concordances, texts, grammatical words,
etc.) that meet their criteria.
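For researchers who download the texts with metadata and want to reproduce a filter combination offline, the logic can be sketched as follows. The record fields, the function name `apply_filters` and the toy records are all hypothetical; they only mirror a few of the 12 filters listed above and do not reflect the actual column names in the CEDEL2 downloads.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    # Field names are hypothetical stand-ins for the metadata described above.
    filename: str
    l1: str
    medium: str                      # "written" or "spoken"
    proficiency: str                 # e.g. "upper intermediate"
    placement_score: float           # placement test score, 0-100 (%)
    length_of_residence_months: int  # LoR in a Spanish-speaking country


def apply_filters(records, l1=None, medium=None, score_range=(0, 100), min_lor=0):
    """Keep the records that satisfy every filter that has been set."""
    low, high = score_range
    return [
        r for r in records
        if (l1 is None or r.l1 == l1)
        and (medium is None or r.medium == medium)
        and low <= r.placement_score <= high
        and r.length_of_residence_months >= min_lor
    ]


# Toy records invented for the example; they are not CEDEL2 data.
records = [
    TextRecord("toy_001", "English", "written", "lower advanced", 86.0, 12),
    TextRecord("toy_002", "English", "spoken", "upper beginner", 35.0, 0),
    TextRecord("toy_003", "Greek", "written", "upper intermediate", 72.0, 6),
    TextRecord("toy_004", "English", "written", "lower intermediate", 58.0, 3),
]

selected = apply_filters(records, l1="English", medium="written", score_range=(50, 100))
print([r.filename for r in selected])  # prints ['toy_001', 'toy_004']
```

Treating an unset filter as "no restriction" mirrors how a search interface typically behaves when a filter is left at its default value.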
5 CEDEL2 process and product
Learners/natives participate in the written component via online forms (http://learnercor- Data and metadata are received in spreadsheet format. Based on a protocol,
the corpus managers manually clean, standardize and join the spreadsheet data of the
different subcorpora, which are then exported to CSV files for automatic validation. The
task texts are POS-annotated with Freeling and joined with their metadata (headers).
The resulting tagged documents are loaded onto a database, which can finally
be searched/filtered/downloaded by users via the online interface (see Figure 8).
Figure 7. Concordance output.
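The automatic validation step is not specified in detail here, so the following Python sketch only illustrates what such a check on an exported CSV file might look like. The column names, the allowed values and the function name `validate_rows` are assumptions made for the example, not the protocol used by the corpus managers.

```python
import csv
import io

# Hypothetical column names and rules; the real protocol is not described here.
REQUIRED_COLUMNS = ["filename", "l1", "medium", "placement_score", "text"]
ALLOWED_MEDIA = {"written", "spoken"}


def validate_rows(csv_text: str):
    """Return a list of (line_number, message) problems found in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, f"missing columns: {', '.join(missing)}")]
    problems = []
    for row in reader:
        line = reader.line_num  # line of the source file just read
        if not (row["text"] or "").strip():
            problems.append((line, "empty text"))
        if row["medium"] not in ALLOWED_MEDIA:
            problems.append((line, f"unknown medium: {row['medium']!r}"))
        try:
            score = float(row["placement_score"] or "")
        except ValueError:
            problems.append((line, "placement_score is not a number"))
        else:
            if not 0 <= score <= 100:
                problems.append((line, "placement_score outside 0-100"))
    return problems


# Toy rows invented for the example; they are not CEDEL2 data.
sample = (
    "filename,l1,medium,placement_score,text\n"
    "toy_001,English,written,86,Charlot camina por la calle\n"
    "toy_002,Greek,writen,72,El niño busca la rana\n"
    "toy_003,Dutch,spoken,140,\n"
)
problems = validate_rows(sample)
for line, message in problems:
    print(line, message)
```

Reporting the line number with each problem makes it easier to go back to the spreadsheet and correct the entry by hand, which fits the manual cleaning step described above.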
At the time of printing of this paper, CEDEL2 data are the source of over 50 publications
and dissertations covering SLA phenomena like anaphora, collocations, lexicon,
morphology, orthography, reflexives, unaccusativity, (in)transitivity, determiners,
lexicon-pragmatics interface, lexicon-discourse interface, error analysis, interference and
transfer, automatic proficiency-level classification, natural language processing, and
computer-assisted language learning (see
6 CEDEL2: Current limitations and future improvements
Corpus balance requires having equally-proportioned samples of each language variety
(Mendikoetxea, 2014: 14). Data size in each cell (i.e. cells resulting from crossing the
corpus-design factors) must be balanced to ensure representativeness across subcorpora
(Lozano, 2021a: 143). Future versions of CEDEL2 will therefore need to strike a balance
between: (1) learner subcorpora, since the L1-English–L2-Spanish subcorpus is the largest
in word size, speakers and number of tasks when compared to the other L2 subcorpora,
which are smaller in size and contain either one task (Chaplin) or two tasks (Chaplin
and Frog); (2) control subcorpora, since not all learner subcorpora have their equivalent
native control subcorpus (Table 2); (3) spoken/written subcorpora, since only a small
spoken sample has been included in CEDEL2 v.2. As for tagging, it has been done
automatically with the Freeling tagger, which sometimes misclassifies words (particularly
learners’ novel words) into an incorrect word category. This feature is intended to be a
useful search tool for SLA researchers, though it needs to be improved in future versions.
Figure 8. CEDEL2 flowchart.
Source. Copyright of this flowchart by NLPGo (
Finally, a promising feature for future versions is the use of automated text metrics (e.g.
lexical sophistication/diversity, text cohesion, grammatical complexity, text readability,
etc.) as an additional measure of proficiency level.
IV Learner corpora: The way forward
LCR has a lot to offer to SLA (Granger, 2021; Tracy-Ventura and Paquot, 2021), particularly
to SLA theoretical models (see Lozano, 2021b for an overview). This can be achieved via SLA-
motivated corpus design and fine-grained, theoretically-informed tagsets to test particular
hypotheses, as done with CEDEL2 (e.g. Georgopoulos, 2017; Lozano, 2009b, 2016; Martín-
Villena and Lozano, 2020) and SPLLOC (Domínguez et al., 2013; Tracy-Ventura and
Myles, 2015).
Most L2 Spanish corpora are cross-sectional, so longitudinal corpora like LANGSNAP
are needed (Tracy-Ventura and Paquot, 2021). More spoken and written data coming
from the same learner and task would be ideal to test the effect of medium (Granger,
2021). Multi-task (as opposed to single-task) corpora are also welcome since task
variability is a key factor in understanding learners’ interlanguage (Domínguez et al., 2013;
Tracy-Ventura and Myles, 2015).
SLA/LCR researchers argue for ‘complementing the corpus data with experimental
data’ (Tracy-Ventura and Myles, 2015: 89). Such triangulation is gaining momentum
(Gilquin, 2021a) since complementing large quantities of corpus data with controlled
experimental data can offer SLA theories a more solid empirical base (Callies and
Paquot, 2015b). Researchers have already triangulated experimental and spoken
SPLLOC corpus data to advance our understanding of tense–aspect in L2 Spanish
(Domínguez et al., 2013). Triangulation is most fruitful when done in a cyclic fashion to
investigate the same phenomenon: the corpus results can provide insights that can be
later implemented in an experiment, the results of which can additionally shed light on
new aspects that can be later interrogated in the corpus, and so on, as done in the
investigation of word order in L2 English under the unaccusative hypothesis with corpus data
(Lozano and Mendikoetxea, 2010) and with experimental data (Mendikoetxea and
Lozano, 2018). Cyclic triangulation is therefore a welcome step in the contribution of
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/
or publication of this article: CEDEL2 has been publicly funded over the past 15 years by several
research project grants from the Spanish Government, which we gratefully acknowledge: FFI2016-
75106-P (Ministerio de Economía y Competitividad); FFI2012-30755 (Ministerio de Economía y
Competitividad); FFI2008-01584/FILO (Ministerio de Ciencia e Innovación); HUM2005-01728/
FILO (Ministerio de Educación y Ciencia).
Cristóbal Lozano
Supplemental material
Supplemental material for this article is available online.
References
Alonso-Ramos M (ed.) (2016) Spanish learner corpus research: Current trends and future
perspectives. Amsterdam: John Benjamins.
Bell P and Payant C (2021) Designing learner corpora: Collection, transcription, and annotation. In
Tracy-Ventura N and Paquot M (eds) The Routledge handbook of second language acquisition
and corpora. Abingdon: Routledge, pp. 53–67.
Callies M (2015) Learner corpus methodology. In Granger S, Gilquin G, and Meunier F (eds) The
Cambridge handbook of learner corpus research. Cambridge: Cambridge University Press,
pp. 35–55.
Callies M (2021) Proficiency. In Tracy-Ventura N and Paquot M (eds) The Routledge handbook of
second language acquisition and corpora. Abingdon: Routledge.
Callies M and Paquot M (2015a) An interview with Yukio Tono. International Journal of Learner
Corpus Research 1: 160–71.
Callies M and Paquot M (2015b) Learner corpus research: An interdisciplinary field on the move.
International Journal of Learner Corpus Research 1: 1–6.
Díaz-Negrillo A and Thompson P (2013) Learner corpora: Looking towards the future. In Díaz-
Negrillo A, Ballier N, and Thompson P (eds) Automatic treatment and analysis of learner
corpus data. Amsterdam: John Benjamins, pp. 9–29.
Domínguez L, Tracy-Ventura N, Arche MJ, Mitchell R, and Myles F (2013) The role of dynamic
contrasts in the L2 acquisition of Spanish past tense morphology. Bilingualism: Language
and Cognition 16: 558–77.
Georgopoulos A (2017) Anaphora resolution in the interlanguage of Greek and English learners of
Spanish: A corpus study. Studies in Greek Linguistics 37: 239–52.
Gilquin G (2015) From design to collection of learner corpora. In Granger S, Gilquin G, and
Meunier F (eds) The Cambridge handbook of learner corpus research. Cambridge: Cambridge
University Press, pp. 9–34.
Gilquin G (2021a) Combining learner corpora and experimental methods. In Tracy-Ventura N
and Paquot M (eds) The Routledge handbook of second language acquisition and corpora.
Abingdon: Routledge.
Gilquin G (2021b) One norm to rule them all? Corpus-derived norms in learner corpus research
and foreign language teaching. Language Teaching. Epub ahead of print 31 March 2021.
DOI: 10.1017/S0261444821000094.
Granger S (2008) Learner corpora. In Lüdeling A and Kytoe M (eds) Corpus linguistics: An
international handbook. Berlin: Mouton de Gruyter, pp. 259–75.
Granger S (2009) The contribution of learner corpora to second language acquisition and foreign
language teaching. In Aijmer K (ed.) Corpora and language teaching. Amsterdam: John
Benjamins, pp. 13–32.
Granger S (2012) How to use foreign and second language learner corpora. In Mackey A and
Gass SM (eds) Research methods in second language acquisition: A practical guide. Oxford:
Wiley-Blackwell, pp. 5–29.
Granger S (2015) Contrastive interlanguage analysis: A reappraisal. International Journal of
Learner Corpus Research 1: 7–24.
Granger S (2021) Have learner corpus research and second language acquisition finally met? In
Le Bruyn B and Paquot M (eds) Learner corpus research meets second language acquisition.
Cambridge: Cambridge University Press, pp. 243–57.
Granger S, Gilquin G, and Meunier F (eds) (2015) The Cambridge handbook of learner corpus
research. Cambridge: Cambridge University Press.
Huensch A and Tracy-Ventura N (2017) Understanding second language fluency behavior: The
effects of individual differences in first language fluency, cross-linguistic differences, and
proficiency over time. Applied Psycholinguistics 38: 755–85.
Kang JY (2004) Telling a coherent story in a foreign language: Analysis of Korean EFL learners’
referential strategies in oral narrative discourse. Journal of Pragmatics 36: 1975–90.
Le Bruyn B and Paquot M (eds) (2021) Learner corpus research meets second language
acquisition. Cambridge: Cambridge University Press.
Lozano C (2009a) CEDEL2: Corpus Escrito del Español como L2. In Bretones CM et al. (eds)
Applied linguistics now: Understanding language and mind / La lingüística aplicada actual:
Comprendiendo el lenguaje y la Mente. Almería: Universidad de Almería, pp. 197–212.
Lozano C (2009b) Selective deficits at the syntax–discourse interface: Evidence from the CEDEL2
corpus. In Leung Y-I, Snape N, and Sharwood-Smith M (eds) Representational deficits in
second language acquisition. Amsterdam: John Benjamins, pp. 127–66.
Lozano C (2016) Pragmatic principles in anaphora resolution at the syntax–discourse interface:
Advanced English learners of Spanish in the CEDEL2 corpus. In Alonso Ramos M (ed.)
Spanish learner corpus research: Current trends and future perspectives. Amsterdam: John
Benjamins, pp. 235–65.
Lozano C (2021a) Corpus textuales de aprendices para investigar sobre la adquisición del
español LE/L2 [Textual learner corpora for investigating the acquisition of Spanish as a second
language]. In Cruz Piñol M (ed.) E-Research y español LE/L2: Investigar en la era digital.
Abingdon: Routledge, pp. 138–63.
Lozano C (2021b) Generative approaches. In Tracy-Ventura N and Paquot M (eds) The Routledge
handbook of second language acquisition and corpora. Abingdon: Routledge, pp. 213–27.
Lozano C and Mendikoetxea A (2010) Interface conditions on postverbal subjects: A corpus study
of L2 English. Bilingualism: Language and Cognition 13: 475–97.
Lozano C and Mendikoetxea A (2013) Learner corpora and second language acquisition: The
design and collection of CEDEL2. In Díaz-Negrillo A, Ballier N, and Thompson P (eds)
Automatic treatment and analysis of learner corpus data. Amsterdam: John Benjamins, pp.
Lozano C, Díaz-Negrillo A, and Callies M (2021) Designing and compiling a learner corpus of
written and spoken narratives: COREFL. In Bongartz C and Torregrossa J (eds) What’s in
a narrative? Variation in story-telling at the interface between language and literacy. Bern:
Peter Lang, pp. 21–46.
Mackey A and Gass SM (2016) Second language research: Methodology and design. 2nd edition.
Mahwah, NJ: Lawrence Erlbaum.
Martín-Villena F and Lozano C (2020) Anaphora resolution in topic continuity: Evidence from L1
English–L2 Spanish data in the CEDEL2 corpus. In Ryan J and Crosthwaite P (eds) Referring
in a second language: Studies on reference to person in a multilingual world. Abingdon:
Routledge, pp. 119–41.
Mendikoetxea A (2014) Corpus-based research in second language Spanish. In Geeslin KL (ed.)
The handbook of Spanish second language acquisition. Oxford: Wiley-Blackwell, pp. 11–29.
Mendikoetxea A and Lozano C (2018) From corpora to experiments: Methodological triangulation
in the study of word order at the interfaces in adult late bilinguals (L2 learners). Journal of
Psycholinguistic Research 47: 871–98.
Mitchell R, Domínguez L, Arche M, Myles F, and Marsden E (2008) SPLLOC: A new database
for Spanish second language acquisition research. In Roberts L, Myles F, and David A (eds)
EUROSLA Yearbook 8. Amsterdam: John Benjamins, pp. 287–304.
Myles F (2005) Interlanguage corpora and second language acquisition research. Second Language
Research 21: 373–91.
Myles F (2015) Second language acquisition theory and learner corpus research. In Granger
S, Gilquin G, and Meunier F (eds) The Cambridge handbook of learner corpus research.
Cambridge: Cambridge University Press, pp. 309–32.
Myles F (2021) An SLA perspective on learner corpus research. In Le Bruyn B and Paquot M
(eds) Learner corpus research meets second language acquisition. Cambridge: Cambridge
University Press, pp. 258–73.
Paquot M and Plonsky L (2017) Quantitative research methods and study quality in learner corpus
research. International Journal of Learner Corpus Research 3: 61–94.
Quesada T (2021) Studies on Anaphora Resolution in L1 Spanish – L2 English and L1 English –
L2 Spanish Adult Learners: Combining Corpus and Experimental Methods. PhD dissertation,
Universidad de Granada, Granada.
Rojo G (2021) Introducción a la lingüística de corpus en español [Introduction to corpus
linguistics in Spanish]. Abingdon: Routledge.
Rojo G and Palacios Martínez I (2016) Learner Spanish on computer: The CAES ‘Corpus de
Aprendices de Español’ project. In Alonso Ramos M (ed.) Spanish learner corpus research:
Current trends and future perspectives. Amsterdam: John Benjamins, pp. 55–87.
Sinclair J (2005) How to build a corpus. In Wynne M (ed.) Developing linguistic corpora: A guide
to good practice. Oxford: Oxbow Books, pp. 79–83.
Tono Y (2016) What is missing in learner corpus design? In Alonso Ramos M (ed.) Spanish learner
corpus research: Current trends and future perspectives. Amsterdam: John Benjamins, pp.
Tracy-Ventura N and Myles F (2015) The importance of task variability in the design of learner
corpora for SLA research. International Journal of Learner Corpus Research 1: 58–95.
Tracy-Ventura N and Paquot M (2021) The future of corpora in SLA. In Tracy-Ventura N and
Paquot M (eds) The Routledge handbook of second language acquisition and corpora.
Abingdon: Routledge.
Tracy-Ventura N, Mitchell R, and McManus K (2016) The LANGSNAP longitudinal learner
corpus: Design and use. In Alonso Ramos M (ed.) Spanish learner corpus research: Current
trends and future perspectives. Amsterdam: John Benjamins, pp. 117–42.
University of Wisconsin (1998) The University of Wisconsin College-Level Placement Test:
Spanish (Grammar) Form 96M. Madison, WI: University of Wisconsin Press. Available at: (accessed September 2021).
Vázquez Veiga N (2016) Discourse markers in CEDEL2 and SPLLOC corpora of learner Spanish:
Analysis of some lexical–pragmatic failures. In Alonso Ramos M (ed.) Spanish learner
corpus research: Current trends and future perspectives. Amsterdam: John Benjamins, pp.
Appendix 1: A comparative overview of four large publicly available L2 Spanish corpora
v. 1 and v. 2 (March 2020)
v. 3 (March 2020)
v. 1.2 (Aug 2018)
Mitchell et al. (2008)
Tracy-Ventura et al. (2016)
Rojo & Palacios (2016)
Web-based search interface
SPLLOC interface features:
Search options: string only.
Search sensitivity: not available.
Search output format: audio files, transcripts.
File filtering options: corpus, task, level of participant, sex, years learning Spanish, other languages studied, speakers’ ID.
Sorting options: not available.
File formats: audio files, text transcriptions (CHAT format).
File download options: not available.
(Note: SPLLOC is also downloadable via TALKBANK)
LANGSNAP interface features:
Search options: not available.
Search sensitivity: not available.
Search output format: not available.
File filtering options: task, data collection time (before, during, after), speaker’s ID number.
Sorting options: not available.
File formats: audio files, text transcriptions (CHAT format).
File download options: not available.
(Note: LANGSNAP is also downloadable via
CAES interface features:
Search options: string, POS, lemma.
Search sensitivity: orthographic accents (on|off), case-sensitive (on|off).
Search output format: concordances, simple statistics, complex statistics.
File filtering options: proficiency level, L1, country, sex, age range.
Sorting options: yes.
File formats: not available.
File download options: not available.
Lozano (2021). CEDEL2: Design, compilation and web interface of an online corpus for L2 Spanish
acquisition research. Second Language Research. First Published online October 16, 2021
Spoken and written
Subcorpora statistics:
speakers, files, words
: No. of speakers
🔊: No. of spoken audio files
: No. of written text files
: No. of words
Shaded cells: native control
Not reported
L1Spa (natives)
L1Spa (natives)
Note: Audio files contain their
corresponding transcription.
Note: Audio files contain their corresponding
L1 Arab-L2 Spa
L1 Chin-L2 Spa
L1 Fren-L2 Spa
L1 Eng-L2 Spa
L1 Portu-L2 Spa
L1 Russ-L2 Spa
CEDEL2 v.1
L1Spa (natives)
CEDEL2 v.2
L1 English-L2 Spa
L1 Greek-L2 Spa
L1 Italian-L2 Spa
L1 Portug-L2 Spa
L1 French-L2 Spa
L1 German-L2 Spa
L1 Dutch-L2 Spa
L1 Russian-L2 Spa
L1 Japanese-L2 Spa
L1 Chinese-L2 Spa
L1 Arabic-L2 Spa
L1 Spa (natives)
L1 English (natives)
L1 Greek (natives)
L1 Portug (natives)
L1 Japanese (natives)
L1 Arabic (natives)
Files/Speakers ratio
585 files / 150 speakers = 3.9 files per speaker
556 files / 37 speakers = 15.03 files per speaker
3,873 files / 1,423 speakers = 2.72 files per speaker
L1 Spanish native control
Variety: Peninsular Spanish
Variety: Peninsular Spanish, Mexican Spanish
Other control corpora
(learners’ L1)
Age range
13–22 years old
20–27 years old
15–61+ years old
Learners: types
Instructed only (secondary
education, university in the UK).
Instructed only (university students from the UK)
Instructed (students at the Instituto
Cervantes in different countries)
Learners: Independent
proficiency measure
No: Placement according to hours
of instruction
Yes (CEFR-based proficiency from
Instituto Cervantes)
Learners: proficiency levels
All (beginner, intermediate,
Unknown/Not reported
All (beginner, intermediate and advanced:
A1 to C1 according to CEFR)
Learners: profile and
linguistic variables
Educational level (year 9, year
10, year 13, undergraduate)
Years learning Spanish (3, 4, 6
Other foreign languages studied
Other foreign languages studied
Educational level (university)
Work type while abroad (exchange student,
teaching assistant)
Proficiency level (A1-C1)
Age of exposure to Spanish
Months studying Spanish
Length of stay in a Spanish-speaking
country (in months)
Educational level (primary, secondary,
university, other).
Contacts in Spanish-speaking countries
(no, friends, family, friends & family,
Learners: Task variables
Type of task (see cell below)
Type of task (see cell below)
Type of task (see cell below)
Task types
9 tasks in total:
SPLLOC v. 1:
Clitic production task
Loch Ness narrative
Chaplin Modern Times narrative
Pair discussion task
One-to-one picture description &
interview task
SPLLOC v. 2:
Past-tense guided interview task
2 past-tense picture-based
narratives (varying degree of
control: Nati & Pancho; Hermanas)
Present-tense simultaneous
actions task
7 tasks in total:
•Oral task: Personal semi-structured interview.
•Oral task: Past-tense picture-based narrative
(brothers’ story), as in SPLLOC v.2.
•Oral task: Present-tense picture-based narrative
(cat story), as in SPLLOC v.2.
•Oral task: Past-tense picture-based narrative
(sisters’ story), as in SPLLOC v.2.
•Written argumentative task: Do you think
marihuana should be legalised? (as task no. 11 in
CEDEL2 v.1).
•Written argumentative task: Do you think gay
couples have the right to get married and adopt
children? (as task no. 10 in CEDEL2 v.1).
•Written argumentative task: Do you think junk
food should be taxed?
13 tasks in total:
Tasks are graded according to proficiency
level with an increasing number of words.
There is a combination of genres and
communicative functions (descriptive,
narrative, argumentative, requests and
A1 level (lower beginner):
1. Write an email to your work/class mates introducing yourself (75-100 words).
2. Write a note to your flatmates telling them you will be late (30-40 words).
3. Write an email to a friend talking about your family (75-100 words).
A2 level (upper beginner), 100-150 words:
4. Write a postcard to your friends about your last holidays.
5. Write a biography about a person you
6. Make a booking for a hotel room.
B1 level (lower intermediate), 175-200 words:
7. Write a letter to a friend requesting several favours.
8. Write to a flight company complaining about your lost luggage.
9. Write a (real or imaginary) funny story.
B2 level (upper intermediate), 275-300 words:
10. Write an admission letter to a university programme.
11. Write an essay on the importance of new technologies.
C1 level (lower advanced), 400-500 words:
12. Write a critical commentary about a recent film you have seen.
13. Write an email to your electricity company complaining about power cuts and high prices.
Appendix 2: Main data collaborators (ranked in order of N of files collected as of 15th July 2020)
# documents
# audios
CEDEL2 v.1 | L1 English - L2 Spanish | Cristóbal Lozano (Uni Granada)
CEDEL2 v.1 | Control: L1 Spanish | Cristóbal Lozano (Uni Granada)
CEDEL2 v.2 | L1 English - L2 Spanish | Cristóbal Lozano (Uni Granada)
CEDEL2 v.2 | Control: L1 Spanish | Cristóbal Lozano (Uni Granada)
CEDEL2 v.2 | L1 Japanese - L2 Spanish | Nobuo Ignacio López-Sako (Uni Granada)
CEDEL2 v.1 | L1 Greek - L2 Spanish | Athanasios Georgopoulos (Uni Granada)
CEDEL2 v.2 | Control: L1 English* | Cristóbal Lozano & Ana Díaz-Negrillo (Uni Granada)
CEDEL2 v.2 | L1 Portuguese - L2 Spanish | Joana Teixeira & Ana Madeira (Uni Nova Lisboa UNL)
CEDEL2 v.2 | L1 Russian - L2 Spanish | Tatiana Portnova & Benamí Barros (Uni Granada)
CEDEL2 v.2 | L1 Italian - L2 Spanish | Pau Montserrat (Uni Florence)
CEDEL2 v.2 | L1 German - L2 Spanish | Jacopo Torregrossa (Uni Frankfurt)
CEDEL2 v.2 | L1 Arabic - L2 Spanish | Amal Haddad (Uni Granada)
CEDEL2 v.2 | L1 Dutch - L2 Spanish | Kim Collewaert (Vrije Uni Brussel - Uni Granada), An Vande Casteele (Vrije Uni Brussel), Mª Carmen Parafita (Uni Leiden)
CEDEL2 v.2 | L1 French - L2 Spanish | Hugues Lacroix (Uni Montréal), Ismael Ramos Ruíz (Uni Paris Diderot)
CEDEL2 v.2 | Control: L1 Japanese | Nobuo Ignacio López-Sako (Uni Granada)
CEDEL2 v.2 | L1 Greek - L2 Spanish | Athanasios Georgopoulos (Uni Athens), Aphrodite Amanatidou (Uni Ioanina - Uni Granada)
CEDEL2 v.2 | L1 Chinese - L2 Spanish | Juan José Ciruela (Uni Granada), Cristóbal Lozano (Uni Granada), Aida García (Universidad Autónoma de Madrid UAM)
CEDEL2 v.2 | Control: L1 Portuguese | Joana Teixeira & Ana Madeira (Uni Nova Lisboa UNL)
CEDEL2 v.2 | Control: L1 Greek | Aphrodite Amanatidou (Uni Ioanina - Uni Granada)
CEDEL2 v.2 | Control: L1 Arabic | Amal Haddad (Uni Granada)
*The L1 English native data are part of COREFL (Corpus of English as a Foreign Language) (Lozano et al., 2021), available at
... corpus de reducido tamaño compilados para llevar a cabo investigaciones particulares, existen en este momento tres que pueden ser consultados libremente (CEDEL2 (Lozano, 2022); CAES (Palacios et al., 2019); y COWS-L2H (Yamada et al., 2020)), otros tres mediante registro (Aprescrilov (Buyse y González, 2013), CATE (Lu, 2010) y CORESPI (Bailini y Frigerio, 2019)) y uno mediante compra (CORANE (Cestero Mancera et al., 2001)). ...
... CEDEL2 ( (Lozano, 2022) es un corpus de aprendices destinado a la investigación sobre la adquisición de segundas lenguas que recoge datos desde 2006 a través de un formulario en línea, donde cada voluntario escribe un texto sobre un tema elegido libremente entre catorce temas propuestos. En 2020 se incorporó un subcorpus de L1 japonés, como muestra la Tabla 1. ...
Full-text available
En este artículo se presenta el Corpus de ELE en Japón, CELEN (, una colección de textos escritos por hablantes de japonés (L1) con distintos grados de dominio del español como lengua extranjera, desde el nivel A1 hasta el nivel C2 del MCER. Los datos proceden de (1) universidades en Japón, donde el español se estudia como asignatura de lengua extranjera o como carrera, y (2) contextos de interacción real en Internet, como blogs electrónicos y foros. La versión 1.2, de abril de 2023, consta de 6.196 textos escritos por 1.035 aprendices, con un total de 658.467 palabras. En el apartado 1 se resume brevemente la situación del español en Japón y los corpus de aprendices existentes. En el apartado 2 se describen las características principales de CELEN, el proceso de recogida y anotación de los datos y la interfaz de consulta. En el apartado 3 se ilustra su uso con varios tipos de búsquedas (concordancias, colocaciones, listas de palabras y n-gramas), aplicadas a fenómenos lingüísticos relevantes en la docencia o la investigación en ELE: el uso de se, las preposiciones, la concordancia de género, el orden de palabras, las colocaciones verbales, la frecuencia léxica o las secuencias de categorías gramaticales más frecuentes. Se trata de un recurso abierto, que se actualiza periódicamente, y esperamos que otros profesores e investigadores puedan albergar sus textos en él para ofrecer a la comunidad científica una amplia muestra de aprendices japoneses de español. En la página web del proyecto ( se puede consultar la guía de uso detallada y descargar íntegramente algunas partes del corpus bajo una licencia CC BY-NC 4.0.
... In the Italian context, for instance, the growing number of students and the widespread interest in Chinese language teaching (Romagnoli and Conti 2021) have not been matched by an equally flourishing research on corpus compilation to support research on the acquisition of L2 Chinese by Italian learners (Iurato 2022a). The compilation of a learner corpus is a challenging issue due to the strict criteria that need to be observed for corpus design and data collection (Castillo Rodríguez et al. 2020;Dutra and Gomide 2015;Lozano 2021). This paper addresses these issues and presents the methodological steps necessary for the compilation of a written Italian learner corpus of L2 Chinese. ...
... In what follows, I will describe: a) general corpus design principles and SLA-motivated features; b) the corpus typology; c) environment, learner, and task variables; d) the data collection procedure. (Granger 1996) for the study of cross-linguistic influence, the learner corpus is accompanied by a control corpus of 30 Chinese native speakers as a benchmark of the (variety of ) language learners are exposed to (Lozano 2021) 18 . Moreover, following one of the most important corpus design criteria outlined by Sinclair (2005), the two corpora are comparable because the tasks that were administered to the learners and the control group were identical. ...
... In the Italian context, for instance, the growing number of students and the widespread interest in Chinese language teaching (Romagnoli and Conti 2021) have not been matched by an equally flourishing research on corpus compilation to support research on the acquisition of L2 Chinese by Italian learners (Iurato 2022a). The compilation of a learner corpus is a challenging issue due to the strict criteria that need to be observed for corpus design and data collection (Castillo Rodríguez et al. 2020;Dutra and Gomide 2015;Lozano 2021). This paper addresses these issues and presents the methodological steps necessary for the compilation of a written Italian learner corpus of L2 Chinese. ...
... Moreover, following one of the most important corpus design criteria outlined by Sinclair (2005), the two corpora are comparable because the tasks that were administered to the learners and the control group were identical. In other words, the same design across the two corpora ensures comparability, a key issue particularly emphasized by Lozano (2021). 4. It contains a rich set of metadata on learner variables, as it is important to document learner variables accurately both to support data interpretation (Bell and Payant 2021) and to increase reliable comparability across studies (Tracy-Ventura et al. 2021). ...
Full-text available
This article introduces a new methodological resource for research in L2 Chinese acquisition: the written sub-corpus of the Bimodal Italian Learner Corpus of Chinese (BILCC). The corpus, methodologically grounded in the Learner Corpus Research (LCR) framework, has been assembled according to strict design criteria. It is a specific-purpose corpus, as it has been specifically designed to explore the pragmalinguistic knowledge of Chinese shì clefts by L1 Italian learners. The corpus consists of contextualized written data produced by 103 Italian learners at beginner, intermediate, and advanced levels, totaling 53,437 Chinese characters, 38,793 tokens, and 693 word types. Additionally, the corpus design includes an equivalent sub-corpus of native Chinese speakers, consisting of data from 30 L1 Chinese speakers. The paper presents the features of the corpus design, and describes the corpus typology, as well as the environment, learner, and task variables. The data collection procedure and the corpus size are also discussed. Finally, the paper demonstrates the effectiveness of SLA theoretically motivated tasks used for data collection by presenting statistical analyses of the collected data on the production of shì clefts.
... Corpus Escrito del Español L2 (CEDEL2) (Lozano, 2022) is a multi-L1 corpus of L2 Spanish learners coming from 11 different L1 backgrounds, plus a Spanish monolingual control subcorpus. CEDEL2 (version 2) currently holds 1,105,936 words, 4,399 participants, and 14 task topics. ...
This study investigates the acquisition of anaphora resolution (AR) in Spanish as a second language (L2). According to the Position of Antecedent Strategy (PAS), in native Spanish null pronominal subjects are biased toward subject antecedents, whereas overt pronominal subjects show a "flexible" bias (typically toward non-subject but also toward subject antecedents). The PAS has been extensively investigated in experimental studies, though little is known about real production. We show how naturalistic production (corpus methods) can uncover crucial factors in the PAS that have not been explored in the experimental literature. We analyzed written samples from the CEDEL2 corpus: L1 English-L2 Spanish adult late-bilingual learners (intermediate, lower-advanced and upper-advanced proficiency levels) and a control group of adult Spanish monolinguals (N = 75 texts). Anaphors were manually annotated via a fine-grained, linguistically-motivated tagset in UAM Corpus Tool. Against traditional assumptions, our results reveal that (i) the PAS is not a privileged mechanism for resolving anaphora; (ii) it is more complex than assumed (in terms of the division of labor of anaphoric forms, their antecedents and the syntactic configuration in which they appear); (iii) the much-debated "flexible" bias of overt pronouns is apparent since they are hardly produced and are replaced by repeated NPs, which show a clear non-subject antecedent bias; (iv) at the syntax-discourse interface, the PAS is constrained by information structure in more complex ways than assumed: null pronouns mark topic continuity, whereas overtly realized referential expressions (overt REs: overt pronouns and NPs) mark topic shift. 
Learners show more difficulties with topic continuity (where they redundantly use overt pronouns) than with topic shift (where they normally disambiguate by using overtly realized REs), thus being more redundant than ambiguous, in line with the Pragmatic Principles Violation Hypothesis (PPVH) (Lozano, 2016). We finally argue that the insights from corpora should be implemented into experiments. The triangulation of corpus and experimental methods in bilingualism ultimately provides a clearer understanding of the phenomenon under investigation.
... 3 The examples from natives in this paper are taken from the following corpora: (i) Spanish natives: CEDEL2 corpus (Corpus Escrito del Español L2) (Lozano 2022) (cf. section 5.1); (ii) Greek natives: GLC corpus (Greek Learner Corpus) ( ...
Anaphora Resolution (AR) is a pervasive phenomenon in natural languages. AR relates to how referring expressions (REs) (e.g., null/overt subject pronouns, and NPs) corefer with their antecedents in discourse. We use corpus methods to simultaneously compare AR in two null-subject languages (Spanish vs. Greek). We analyse a Spanish-native sample (CEDEL2 corpus, N=341 REs analysed) and an equally-designed Greek-native sample (GLC corpus, N=400 REs analysed), while keeping constant the text type (Chaplin narrative task), the annotation scheme (tagset), the tagging procedure, and the profile of the natives. Our corpus results reveal similarities in the way Spanish and Greek natives construct their narratives regarding the distribution of the information status of the REs (topic continuity/shift) and the distribution of characters (main/secondary) in discourse. Crucially, our two languages differ in relation to topicality (Greek capitalises on discourse topic whereas Spanish relies more on sentential topic), which leads to a different distribution in the realization of REs in discourse. These similarities and differences are accounted for by a new theoretical proposal, the Type of Topic Hypothesis (TTH), which postulates that there is a tension between discourse-topic vs. sentential-topic oriented languages. The TTH captures the idea that, while narratives are constructed in the same way in both languages, RE realization varies as a result of the discourse-topic orientation of Greek vs. the sentential-topic orientation of Spanish.
... For example, SPLLOC has been used to study the preterit/imperfect contrast (e.g., yo comí/yo comía 'I ate'/'I was eating') in Spanish (Domínguez et al., 2013). Comparable written data can be found in the Corpus Escrito del Español L2 (CEDEL2; Lozano, 2021), while the LANGSNAP corpora focus on learners of L2 French and Spanish in a study abroad setting (Tracy-Ventura et al., 2016) and have been used to study lexical sophistication (Tracy-Ventura, 2017). For recent overviews of L2 corpora, see Tracy-Ventura and Paquot (2021), and for more on L2 usage-based approaches, see Geeslin et al. (Chapter 19, in this volume). ...
This chapter discusses the compatibility of the approaches with increasingly-sophisticated corpus research and considers the study of usage-based factors across different corpora, languages, and linguistic structures. It explores the pending issues related to understudied phenomena, languages, regional varieties, and language users, along with the ability of corpora to serve as primary and/or supplementary data. The chapter explains the cognitive-functional origins of usage-based approaches to language development, key constructs within these approaches, and their compatibility with increasingly sophisticated corpus research. It also discusses the scope of usage-based corpus analyses, ways that corpora can inform more controlled measures of data elicitation, and the languages and types of users considered. Usage-based linguists will continue to uncover information about human cognition across different structures, languages, regional varieties, levels of language expertise, and language pairings.
... Greek: Promotes exploration of the relationship between input quantity and interlanguage (Muñoz 2011, 2014; Granena & Long, 2013). In combination with the variable concerning the Greek Language Learning Setting, this variable records any prior experience in learning Greek and offers insight into interlanguage phenomena such as fossilization. Considering the context of our data collection (i.e., during an instructed Greek language course), it was determined that the implementation of an additional tool (e.g., an online test (Lozano 2021)) would have presented an extra burden for the learners. This was deemed unnecessary, as the students had already completed an extensive placement test at the start of the course and were regularly assessed through the terms with the completion of short tests. ...
This paper presents an error-annotated learner corpus, the Greek Learner Corpus II (GLCII). GLCII responds to the need for representational data of less spoken and taught languages that will support research on Greek as a second (L2) or foreign language (FL). Therefore, focus has been given to recording both written and spoken learners' productions from a range of genres accompanied with metadata relevant to both L2/FL teaching and acquisition. GLCII has drawn on practices and current trends in corpus construction when adopting specific design criteria to ensure its originality, suitability, and availability as a language resource (Brezina, 2018; Tracy-Ventura et al., 2021). Currently, the GLCII is the largest online, freely available, Learner Corpus (LC) of L2-Greek, compiled in the framework of the research project Latent Aspects in L2 Acquisition (LAL2A). There have been recent attempts to compile a L2-Greek Learner Corpus that complies with today's modern standards of LC compilation. In 2010, Tzimokas compiled a LC which consisted of 291 written productions (around 65000 words), gathered from adult learners that studied at the School of Modern Greek, hosted by the University of Athens. This was the first methodical endeavor to collate a representative sample of L2-Greek and demonstrate error-annotations. Whilst this compilation carefully addressed data collection and annotation issues, it can be criticised for its sample size as it does not accommodate large scale analytical supposition. Furthermore, the annotated data are not widely accessible, and the error annotations scheme is arduous. Navigation through this scheme for the end-user is therefore overly complicated and affects the ease of annotations analysis. Tantos and Papadopoulou (2014) addressed these issues with the Greek Learner Corpus I (GLCI), a L2 Greek LC which consists of approximately 450 written productions produced by adolescent students (around 33000 words).
GLCI's error annotation scheme was the basis for the GLCII's error annotation scheme. Moreover, GLCII is freely available within the CLARIN-EL ... (Footnote 1: GLCI was co-funded by the Greek Ministry of Education and the European Union within the research project Education for foreign and repatriated pupils.)
This article introduces the CELI corpus, a new learner corpus of written Italian consisting of ca. 600,000 tokens, evenly distributed among CEFR (Common European Framework of Reference for Languages) proficiency levels B1, B2, C1 and C2. The collected texts derive from the language certification exams administered by the University for Foreigners of Perugia all around the world. The corpus contains rich metadata pertaining to text-related and learner-related variables. It expands the domain of learner corpora by being, among other things, both freely available online to the research community, and by focusing on a target language other than English. The article also presents and evaluates the POS-tagging procedure, thus contributing to best practices in learner corpus annotation.
In 2016, Lawley proposed an easy-to-build spellchecker specifically designed to help second language (L2) learners in their writing process by facilitating self-correction. The aim was to overcome the disadvantages to L2 learners posed by generic spellcheckers (GSC), such as that embedded in Microsoft Word. Drawbacks include autocorrection, misdiagnoses, and overlooked errors. With the aim of imparting explicit L2 spelling knowledge, this correcting tool does not merely suggest possible alternatives to the detected error but also provides explanations of any relevant spelling patterns. Following Lawley’s (2016) recommendations, the present study developed a prototype computer-based pedagogic spellchecker (PSC) to aid L2 learners in self-correcting their written production in Spanish. First, a corpus was used to identify frequent spelling errors of Spanish as a foreign language (SFL) learners. Handcrafted feedback was then designed to tackle the commonest misspellings. To subsequently evaluate this PSC’s efficacy in error detection and correction, another learner Spanish corpus was used. Sixty compositions were analysed to determine the PSC’s capacity for error recognition and feedback provision in comparison with that of a GSC. Results indicate that the PSC detected over 90% of the misspellings, significantly outperforming the GSC in error detection. Both provided adequate feedback on two out of three detected errors, but the pedagogic nature of the former has the added advantage of facilitating self-learning (Blázquez-Carretero & Woore, 2021). These findings suggest that it is feasible to develop spellcheckers that provide synchronous feedback, allowing SFL learners to confidently self-correct their writing while saving time and effort on the teacher’s part.
While research on second language (L2) tense-aspect acquisition has flourished, most studies have focused on lexical aspect as an explanatory variable (Bardovi-Harlig and Comajoan-Colomé 2020). However, the role of the features of first language (L1) production in L2 Spanish preterit-imperfect acquisition has never been tested before. Prior research has found that the frequency and distinctiveness of verb forms in corpora of L1 English production predict L2 English learners’ tense-aspect production (Wulff et al. 2009). The present study aims to replicate these findings and test the predictions of hypotheses of L2 tense-aspect acquisition in another group of learners: English-dominant, instructed Spanish learners. Analyses were performed on longitudinal data from the Corpus of Written Spanish of L2 and Heritage Speakers (COWS-L2H; Yamada et al. 2020) and cross-sectional data from the Corpus Escrito del Español L2 (CEDEL2; Lozano 2021). Results indicate that L1 verb frequency and distinctiveness predict learners’ emergent use of the preterit and the imperfect.
This chapter deals with the combined use of learner corpus data and experimental data to gain a better understanding of learner language and how it is acquired. It presents the advantages of such a combination and some of its challenges. It also describes the experimental methods that have most often been combined with learner corpus analyses. Examples of studies that have successfully combined learner corpus data and experimental data are provided. The chapter advocates the use of more – and more diversified – multimethod approaches and suggests that this could contribute to the theoretical rapprochement between learner corpus research and second language acquisition.
This paper considers the issue of the norm in the context of learner corpus research and its implications for foreign language teaching. It seeks to answer three main questions: Does learner corpus research require a native norm? What corpus-derived norms are available and how do we choose? What do we do with these norms in the classroom? The first two questions are more research-oriented, reviewing the types of reference corpora that can be used in the analysis of learner corpora, whereas the third one looks into the pedagogical use of corpus-derived norms. It is shown that, while studies in learner corpus research can dispense with a native norm, they usually rely on one, and that a wide range of native and non-native norms are available, from which choosing the most appropriate one(s) is of crucial importance. This large repertoire of corpus-derived norms is then reconsidered in view of the reality of the foreign language classroom.
This paper shows the need to triangulate different approaches in Bilingualism and Second Language Acquisition (SLA) research to fully understand late bilinguals’ interlanguage grammars. Methodologically, we show how experimental and corpus data can be (and should be) triangulated by reporting on a corpus study (Lozano and Mendikoetxea in Biling Lang Cognit 13(4):475–497, 2010) and a new follow-up offline experiment investigating Subject–Verb inversion (Subject–Verb/Verb–Subject order) in L1 Spanish–L2 English (n = 417). Theoretically, we follow a recent line in psycholinguistic approaches to Bilingualism and SLA research (Interface Hypothesis, Sorace in Linguist Approaches Biling 1(1):1–33, 2011). It focuses on the interface between syntax and language-external modules of the mind/brain (syntax-discourse [end-focus principle] and syntax-phonology [end-weight principle]) as well as a language-internal interface (lexicon-syntax [unaccusative hypothesis]). We argue that it is precisely this multi-faceted interface approach (corpus and experimental data, core syntax and the interfaces, representational and processing models) that provides a deeper understanding of (i) the factors that favour inversion in L2 acquisition in particular and (ii) interlanguage grammars in general.
Learners of Spanish show persistent deficits with the distribution of overt and null pronouns in subject position. The interface between syntax and discourse has been claimed to account for these deficits (Sorace & Filiaci 2006; Sorace, Serratrice, Filiaci & Baldo 2009). This study uses corpus methodology to explore the anaphoric 3rd person subject usage in the interlanguage of Greek and English learners of Spanish. Learners of two different proficiency levels (elementary and upper-advanced) for each group (English and Greek) were examined and compared to a native Spanish control group. Results indicate that although elementary Greek-speaking learners of Spanish show some tendency to overuse overt subjects, they do so in a significantly lower percentage than their English counterparts. Moreover, at the upper-advanced level, Greek-speaking learners exhibit native-like preferences, in contrast to the English-speaking learners, who show deficits even at the highest levels of proficiency.
The increasing interest in research on the acquisition of Spanish as a second language (L2) has led to the creation of large language databases (learner corpora). Such corpora allow researchers to investigate the type of knowledge or competence (interlanguage) that learners acquire in their L2 Spanish. We will justify the need for good learner corpus design and will illustrate how corpus methods help researchers understand typical L2 Spanish interlanguage phenomena in a systematic way. An overview of freely available online L2 Spanish corpora will be presented along with some representative L2 Spanish acquisition studies that investigate several interlanguage phenomena. Finally, we will present practical cases using two free software tools: AntConc, which supports concordance searches on the corpora, and UAM Corpus Tool, which supports tagging and statistical analysis of the corpora. Both tools will be illustrated with samples from the CEDEL2 corpus.
Variability in subject expression has been a widely studied phenomenon over the last few decades and is still the focus of a considerable body of research in both native (L1) and second language (L2) grammars. Crucially, the production of L2 Spanish learners, both written and oral, has been investigated in depth with a view to understanding how they use referential expressions (REs) like null and overt pronominals (i.e. what has been traditionally called anaphora resolution) and other REs such as lexical noun phrases (NPs), as well as which factors constrain their use in real discourse (e.g. Blackwell & Quesada, 2012; Lozano, 2009b, 2016). Even though L2 learners acquire the morphosyntactic features that license null subjects in L2 Spanish from very early stages (Liceras, 1989; Phinney, 1987), results from both experimental and corpus-based developmental studies (e.g. Lozano, 2009b, 2018; Montrul & Rodríguez-Louro, 2006) have shown that certain features are particularly difficult for non-native speakers even at end-states of acquisition. L2 learners show persistent deficits in selecting felicitous null/overt pronouns when constrained at the interfaces (e.g. syntax–discourse interface), following Sorace’s Interface Hypothesis (2011, 2012), which holds that such features are more difficult to acquire than merely syntactic ones. However, Lozano (2009b, 2016) used a near-native corpus of L2 Spanish learners to show that these deficits are rather selective and do not necessarily affect the whole pronominal paradigm: most of these deficits (i) were attributed to third person human singular subject REs (whereas the rest of the pronominal paradigm was unproblematic), and (ii) were mainly observable in topic continuity scenarios (whereas topic shift and other scenarios were not problematic). 
These scenarios will be further explored in this chapter using a corpus approach, which will also allow for the investigation of other less-explored factors that constrain the form of subject REs in native and non-native grammars.