CEDEL2: Design, compilation and web interface of an online corpus for L2 Spanish acquisition research

Authors: Lozano, C.

Abstract

This article presents and reviews a new methodological resource for research in second language acquisition (SLA), CEDEL2 (Corpus Escrito del Español L2, ‘L2 Spanish Written Corpus’), and its free online search-engine interface (cedel2.learnercorpora.com). CEDEL2 is a multi-first-language corpus (Spanish, English, German, Dutch, Portuguese, Italian, French, Greek, Russian, Japanese, Chinese, and Arabic) of L2 Spanish learners at all proficiency levels. It additionally contains several native control subcorpora (English, Portuguese, Greek, Japanese, and Arabic). Its latest release (version 2) holds material from around 4,400 speakers, which amounts to over 1,100,000 words. CEDEL2 follows strict corpus-design criteria (Sinclair, 2005) and L2 corpus-design recommendations (Tracy-Ventura and Paquot, 2021), and all subcorpora are equally designed to be fully contrastable, as recommended by Contrastive Interlanguage Analysis (Granger, 2015). Thanks to its design and web interface, CEDEL2 allows for complex searches which can be further narrowed down according to its SLA-motivated variables, e.g. first language (L1), proficiency level, self-reported proficiency level, age of onset to the L2, length of exposure to the L2, length of residence in a Spanish-speaking country, knowledge of other foreign languages, type of task, etc. These CEDEL2 features allow L2 researchers to address SLA questions and hypotheses.
OPEN ACCESS:
Lozano, C. (2022). CEDEL2: Design, compilation and web interface of an online corpus for
L2 Spanish acquisition research. Second Language Research, 38(4), 965-983.
https://doi.org/10.1177/02676583211050522
... 2 Corpus Escrito del Español como L2 (Lozano 2022), see section 4.1. for a description of the corpus. ...
... The data were collected through dedicated online forms. In addition to the task prompting the production, participants also contributed information about their demographic and linguistic background and participated in a Spanish placement test (Lozano 2022). The production data used in the present study are narrative texts. ...
Article
Full-text available
This article deals with the realization of referential subjects in the L2 Spanish of German (adult) native speakers. The acquisition of a null subject grammar by speakers of a non-null subject language has drawn considerable attention in generative approaches to L2 acquisition. This article revisits the issue and compares the predictions made by the Interface Hypothesis (Sorace 2005, 2011, Sorace and Filiaci 2006, Tsimpli and Sorace 2006) to an alternative, the Feature Reassembly Hypothesis (Lardiere 2008, Slabakova 2013, Cho and Slabakova 2014). Relying on corpus data, the study presents a novel empirical approach and applies an innovative statistical analysis procedure from learner corpus research. The results of the study corroborate previous empirical findings, namely that pronouns, yet not null subjects, are problematic, but also bring in new insights, in particular that issues with pronouns are consistent and go beyond the contexts predicted by the Interface Hypothesis. The contrasts between L1 and L2 subject realization found in the data can therefore only in part be explained as resulting from interface issues. The Feature Reassembly Hypothesis offers a suitable additional explanation relating the issues to the properties of the L1 and L2 learnability.
... James describes error analysis as the study of "the incidence, nature, cause, and consequences of unsuccessful language use" [9]. The methodology involves detecting errors, categorizing them, and diagnosing their underlying causes, which often stem from factors such as L1 interference or the overgeneralization of L2 rules [10]. ...
Article
Full-text available
This study investigates the typical linguistic errors made by Uzbek students learning English as a foreign language through a learner corpus approach. The analysis is based on 40 academic essays written by undergraduates at Urgench State University. Using the Sketch Engine platform, the errors were identified, categorized into 13 distinct types, and examined for their frequency and patterns. The findings highlight that errors in spelling, article usage, punctuation, and word choice are the most common, whereas issues with sentence structure and the use of linking words are less frequent. Additionally, the research explores gender-based differences in error patterns and examines how native language structures influence English acquisition. These insights provide valuable guidance for enhancing teaching strategies and addressing the specific challenges faced by English learners.
... The study of Ardura (2004) introduced a listening comprehension task and examined the effect of the mass/count noun distinction and verb type on intermediate L1-English L2-Spanish learners' interpretation of definite plurals, using both psychological and non-psychological verbs to neutralise syntactic 1 It is imperative to emphasise that studies on article acquisition not based on learner output, presenting descriptive and contrastive approaches to interlanguage in a theoretical manner, are excluded from this study. For those interested in a more descriptive approach, the following sources are recommended: Bop (2023), Garachana (2008), Montero Gálvez (2011, 2014), Rothman et al. (2018), Santiago Alonso (2010, 2017b, 2022), Yang (2000). Additionally, other works focusing on the general analysis of the entire Spanish interlanguage have also been omitted. ...
Article
Full-text available
This study delves into the use of Spanish articles by Estonian students learning Spanish as their third language (L3). This topic has received little attention to date, despite evidence that it represents one of the greatest challenges Estonian learners of Spanish face, and it appears that these difficulties persist even at higher levels of Spanish proficiency (Kruse 2018: 126). This article aims to provide a comprehensive overview of how Estonian students use Spanish articles. 345 written texts from a learner corpus have been analysed. Results show that the most common error among learners is omission of the article, in line with previous research, but also an unexpectedly high incidence of morphological and syntactic errors. Moreover, it has been detected that there is not a significant improvement in article acquisition between levels A1 and B1. By addressing these objectives, this study contributes to a deeper understanding of the specific difficulties faced by Estonian learners and the broader landscape of article acquisition in a third language context.
... This is usually not the case in corpus linguistics, where corpora tend to be designed as multi-purpose resources that can be exploited within the frame of various studies. However, Lozano (2021) shows how CEDEL2, a corpus of L2 Spanish, has been developed with certain SLA questions in mind, for example by including in the metadata learner variables that are relevant to SLA research. The same could be done for teaching applications, for example by deciding to include full texts in the corpus rather than text samples if the aim is to have students examine discourse features that bring cohesion to the text. ...
Chapter
Full-text available
This study contributes to research on the acquisition of differential object marking (DOM) in Spanish as a second language (L2) by quantitatively evaluating the written use of DOM by native speakers of European Spanish and by speakers of German as a first language (L1). Using a corpus-based approach, more than 100 written texts are analysed in terms of the a/∅ alternation with nominal direct objects, evaluating a set of semantic and information-structure predictors by means of a multifactorial analysis based on random forest classifications. This approach is used to investigate to what extent, and depending on which factors, the group of L2 speakers differs from the native group in its use of DOM. The results indicate that learners are able to use DOM in a way similar to native speakers, but omit a considerable number of markers with animate direct objects in cases where native speakers would use DOM. Keywords: differential object marking, L1 German speakers, written texts, target-likeness, random forest analysis
Article
While general first language corpora are composed of samples from various naturalistic sources (e.g., websites, books), language samples in most written learner corpora (LC) are texts produced in response to prompts. In this context, LC users need to develop a clear awareness of the affordances and limitations of specific prompts and how responses to said prompts may affect the investigation of their intended object(s) of study. Through an analysis of the presence/absence of specific Spanish verb tenses in texts written in response to two supposedly narrative prompts in a Spanish LC (COWS-L2H; Yamada et al., 2020), this article illustrates the impact of inter- and intra-prompt response variation on LC data interpretation. Based on this evidence, we caution against rapid assumptions about text content based solely on the superficial phrasing of LC writing prompts. Instead, we recommend that LC users perform in-depth quantitative and qualitative analyses of learners’ samples written in response to each prompt they aim to include in their study prior to running statistical models on those data.
Chapter
Full-text available
This chapter deals with the combined use of learner corpus data and experimental data to gain a better understanding of learner language and how it is acquired. It presents the advantages of such a combination and some of its challenges. It also describes the experimental methods that have most often been combined with learner corpus analyses. Examples of studies that have successfully combined learner corpus data and experimental data are provided. The chapter advocates the use of more – and more diversified – multimethod approaches and suggests that this could contribute to the theoretical rapprochement between learner corpus research and second language acquisition.
Article
Full-text available
This paper considers the issue of the norm in the context of learner corpus research and its implications for foreign language teaching. It seeks to answer three main questions: Does learner corpus research require a native norm? What corpus-derived norms are available and how do we choose? What do we do with these norms in the classroom? The first two questions are more research-oriented, reviewing the types of reference corpora that can be used in the analysis of learner corpora, whereas the third one looks into the pedagogical use of corpus-derived norms. It is shown that, while studies in learner corpus research can dispense with a native norm, they usually rely on one, and that a wide range of native and non-native norms are available, from which choosing the most appropriate one(s) is of crucial importance. This large repertoire of corpus-derived norms is then reconsidered in view of the reality of the foreign language classroom.
Article
Full-text available
Full text: http://rdcu.be/Gkk5 This paper shows the need to triangulate different approaches in Bilingualism and Second Language Acquisition (SLA) research to fully understand late bilinguals’ interlanguage grammars. Methodologically, we show how experimental and corpus data can be (and should be) triangulated by reporting on a corpus study (Lozano and Mendikoetxea in Biling Lang Cognit 13(4):475–497, 2010) and a new follow-up offline experiment investigating Subject–Verb inversion (Subject–Verb/Verb–Subject order) in L1 Spanish–L2 English (n = 417). Theoretically, we follow a recent line in psycholinguistic approaches to Bilingualism and SLA research (Interface Hypothesis, Sorace in Linguist Approaches Biling 1(1):1–33, 2011). It focuses on the interface between syntax and language-external modules of the mind/brain (syntax-discourse [end-focus principle] and syntax-phonology [end-weight principle]) as well as a language-internal interface (lexicon-syntax [unaccusative hypothesis]). We argue that it is precisely this multi-faceted interface approach (corpus and experimental data, core syntax and the interfaces, representational and processing models) that provides a deeper understanding of (i) the factors that favour inversion in L2 acquisition in particular and (ii) interlanguage grammars in general.
Article
Full-text available
Learners of Spanish show persistent deficits with the distribution of overt and null pronouns in subject position. The interface between syntax and discourse has been claimed to account for these deficits (Sorace & Filiaci 2006; Sorace, Serratrice, Filiaci & Baldo 2009). This study uses corpus methodology to explore the anaphoric 3rd person subject usage in the interlanguage of Greek and English learners of Spanish. Learners of two different proficiency levels (elementary and upper-advanced) for each group (English and Greek) were examined and compared to a native Spanish control group. Results indicate that although elementary Greek-speaking learners of Spanish show some tendency to overuse overt subjects, they do so in a significantly lower percentage than their English counterparts. Moreover, at the upper-advanced level, Greek-speaking learners exhibit native-like preferences, in contrast to the English-speaking learners, who show deficits even at the highest levels of proficiency.
Chapter
The increasing interest in research on the acquisition of Spanish as a second language (L2) has led to the creation of large language databases (learner corpora). Such corpora allow researchers to investigate the type of knowledge or competence (interlanguage) that learners can acquire in their L2 Spanish. We will justify the need for good learner corpus design and will illustrate how corpus methods help researchers understand typical L2 Spanish interlanguage phenomena in a systematic way. An overview of freely available L2 Spanish corpora will be presented along with some representative L2 Spanish acquisition studies that investigate several interlanguage phenomena. Finally, we will do hands-on practice with two free software tools: AntConc, which allows users to run concordance searches on the corpora, and UAM Corpus Tool, which allows users to tag and statistically analyse the corpora. Both tools will be illustrated with samples from the CEDEL2 corpus.
Chapter
Variability in subject expression has been a widely studied phenomenon over the last few decades and is still the focus of a considerable body of research in both native (L1) and second language (L2) grammars. Crucially, the production of L2 Spanish learners, both written and oral, has been investigated in depth with a view to understanding how they use referential expressions (REs) like null and overt pronominals (i.e. what has been traditionally called anaphora resolution) and other REs such as lexical noun phrases (NPs), as well as which factors constrain their use in real discourse (e.g. Blackwell & Quesada, 2012; Lozano, 2009b, 2016). Even though L2 learners acquire the morphosyntactic features that license null subjects in L2 Spanish from very early stages (Liceras, 1989; Phinney, 1987), results from both experimental and corpus-based developmental studies (e.g. Lozano, 2009b, 2018; Montrul & Rodríguez-Louro, 2006) have shown that certain features are particularly difficult for non-native speakers even at end-states of acquisition. L2 learners show persistent deficits in selecting felicitous null/overt pronouns when constrained at the interfaces (e.g. syntax–discourse interface), following Sorace’s Interface Hypothesis (2011, 2012), which holds that such features are more difficult to acquire than merely syntactic ones. However, Lozano (2009b, 2016) used a near-native corpus of L2 Spanish learners to show that these deficits are rather selective and do not necessarily affect the whole pronominal paradigm: most of these deficits (i) were attributed to third person human singular subject REs (whereas the rest of the pronominal paradigm was unproblematic), and (ii) were mainly observable in topic continuity scenarios (whereas topic shift and other scenarios were not problematic).
These scenarios will be further explored in this chapter using a corpus approach, which will also allow for the investigation of other less-explored factors that constrain the form of subject REs in native and non-native grammars.