Conference Paper

Tagging Collocations for Learners

... As of March 2011, CEDEL2 has reached around 750,000 words in electronic format, since data are being gathered via an online application. While the data collection is still work in progress, some CEDEL2 samples have been used in published research on the acquisition of pronominal subjects (Lozano 2009b) and learner collocations (Alonso et al. 2010a, 2010b). ...
... While CEDEL2 is not fully tagged yet, some samples have been preliminarily tagged (see published work in Lozano 2009b; Alonso-Ramos 2010a, 2010b). We are using the tagging and concordancing software UAM CorpusTool (O'Donnell 2009), which is freely available. ...
Chapter
Second language acquisition (SLA) research has traditionally relied on elicited experimental data, and it has disfavoured natural language use data. Learner corpus research has the potential to change this but, to date, the research has contributed little to the interpretation of L2 acquisition, and some of the corpora are flawed in design. We analyse the reasons why many SLA researchers are still reticent about using corpora, and how good corpus design and adequate tools to annotate and search corpora can help overcome some of the problems observed. We do so by describing how the ten standard principles used in corpus design (Sinclair 2005) were applied to the design of CEDEL2, a large learner corpus of L1 English – L2 Spanish (Lozano 2009a).
... A subcorpus of 200 texts was used to annotate collocations: 100 learner texts and 100 native texts. The annotation process was carried out in two stages: first, the learner corpus was annotated (Alonso Ramos et al. 2010a, 2010b); and later, the native corpus. We followed the same steps in both stages: recognition of the collocation, identification of the elements of the collocation, assignment of the semantic and syntactic pattern, and screening of correct and incorrect collocations. ...
... Interference is the main cause of error: interlingual in learners and intralingual in natives. Learners make errors such as *gastar el tiempo instead of perder el tiempo, as a literal translation of to spend time (Alonso Ramos et al. 2010a, 2010b). Native errors are produced by interference within their own language: a speaker produces an incorrect collocation, but has a correct collocation in mind which has been mixed with another correct collocation («greffes collocationnelles» in Polguère 2007). ...
Article
This study proposes a method for evaluating the written production of Spanish collocations. We begin by asking if the native speaker model is the appropriate one for learners. In order to answer this question we undertook the annotation of collocations in two parallel corpora, one by native speakers, and another by learners. Once both corpora were annotated, the collocational richness of learners and native speakers was compared. In order to measure collocational richness, four parameters were established (density, variety, sophistication and number of errors). Our results show that learners do, in fact, use collocations, but their choices lack the variety, sophistication and correctness exhibited by native speakers.
... This is because the choice of one of the two elements in a collocation is free while the choice of the second depends on the first. Grammatical constructions, including those that are very different from constructions in L1, follow generalized patterns and can thus be applied by analogy once some representative samples have been learned, whereas collocations are much less generalizable and must be learned nearly one by one (Hausmann, 1984; Nation, 2001; Futagi et al., 2008). Even advanced learners with a good command of L2 grammar make collocation mistakes: they often translate collocation elements literally from L1 or another foreign language, use non-existing words as collocation elements, get the subcategorization of one of the elements wrong, etc. (Alonso Ramos et al., 2010a). Automatic means for detection and correction of collocation mistakes in L2 writings are thus in high demand. ...
... The first question has been addressed in Alonso Ramos et al. (2010a), where a fine-grained multi-dimensional collocation error typology has been presented. The dimensions of the typology capture: (1) the scope of the error (collocate, base or collocation as a whole); (2) the type of the error (lexical or grammatical) and the subtype of the error (choice of a wrong element, creation of a non-existing element, use of a correct collocation which has a different meaning from the intended one, etc. in the case of a lexical error, and error in determination, number, government, etc. in the case of a grammatical error); and (3) the source (or motivation) of the error (erroneous phonetic similarity, erroneous morphological derivation, L1 calque, etc.). ...
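The three dimensions listed in this snippet can be pictured as a small record type. The following is only a minimal sketch: the class names, field names and string labels are hypothetical and are not taken from the authors' tagset, and the example annotation of *gastar el tiempo is an illustrative reading rather than a label drawn from the annotated corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Dimension 1: which part of the collocation the error affects."""
    COLLOCATE = "collocate"
    BASE = "base"
    WHOLE = "collocation as a whole"


class ErrorType(Enum):
    """Dimension 2: top-level error type."""
    LEXICAL = "lexical"
    GRAMMATICAL = "grammatical"


@dataclass(frozen=True)
class CollocationError:
    """One annotated miscollocation (hypothetical record layout)."""
    observed: str          # the learner's combination
    scope: Scope           # dimension 1
    error_type: ErrorType  # dimension 2, type
    subtype: str           # dimension 2, subtype (free text in this sketch)
    source: str            # dimension 3, source or motivation of the error


# Illustrative annotation only: the verb is treated as the collocate and the
# error is read as a calque of English "to spend time".
example = CollocationError(
    observed="gastar el tiempo",
    scope=Scope.COLLOCATE,
    error_type=ErrorType.LEXICAL,
    subtype="choice of a wrong element",
    source="L1 calque",
)
```

Keeping the subtype and source as free text avoids committing to a closed label inventory that the snippet does not spell out in full.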
Article
Collocations in the sense of idiosyncratic binary lexical co-occurrences are one of the biggest challenges for any language learner. Even advanced learners make collocation mistakes in that they literally translate collocation elements from their native tongue, create new words as collocation elements, choose a wrong subcategorization for one of the elements, etc. Therefore, automatic collocation error detection and correction is increasingly in demand. However, while state-of-the-art models predict, with reasonable accuracy, whether a given co-occurrence is a valid collocation or not, only a few of them manage to suggest appropriate corrections with an acceptable hit rate. Most often, a ranked list of correction options is offered from which the learner then has to choose. This is clearly unsatisfactory. Our proposal focuses on this critical part of the problem in the context of the acquisition of Spanish as a second language. For collocation error detection, we use a frequency-based technique. To improve on collocation error correction, we discuss three different metrics with respect to their capability to select the most appropriate correction of miscollocations found in our learner corpus.
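The abstract names only "a frequency-based technique" for detection and gives no further detail. One possible reading, sketched below under that assumption, is to flag a learner's combination when it occurs fewer than some threshold number of times in a reference corpus. The function name, the threshold and the toy counts are all hypothetical and invented for illustration; they are not the authors' method or data.

```python
from collections import Counter


def is_suspect(pair, reference_counts, min_count=1):
    """Flag a (collocate, base) pair seen fewer than min_count times.

    reference_counts is a Counter of pairs from a reference corpus;
    a Counter returns 0 for pairs it has never seen.
    """
    return reference_counts[pair] < min_count


# Toy reference counts, made up for this sketch.
reference_counts = Counter({
    ("perder", "tiempo"): 120,
    ("tomar", "decisión"): 95,
})

flag_learner = is_suspect(("gastar", "tiempo"), reference_counts)
flag_native = is_suspect(("perder", "tiempo"), reference_counts)
```

Here `flag_learner` is true because the pair is absent from the toy counts, and `flag_native` is false. A raw count threshold is the simplest choice; it says nothing about which correction to propose, which is the part the abstract addresses with its three metrics.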
... For NLP purposes, in general, this investigation could possibly lead to the specification of suitable statistical methods for the identification of inheritance patterns in corpora (cf. Roark/Sproat 2007 and work done by Alonso Ramos et al. 2010). The development of collocation-based interlinguistic models would be particularly useful in the field of Machine Translation and in enhancing functionality of Translation Memories. ...
Conference Paper
The paper presents part of the results obtained in the frame of investigations conducted at Heidelberg University on corpus methods in translation practice and, in particular, on the topic of paradigmatic collocates variation. It concentrates on collocates inheritance across emotion words by focusing on different syntactic frames and a multilingual perspective in order to highlight the potential benefits of this approach for automatic analysis of word combinations and its applications, e.g. in the fields of e-lexicography and machine translation.
... Some CEDEL2 samples have been tagged for syntactic structure and collocations using UAM CorpusTool (O'Donnell, 2009). In fact, the study of collocations in CEDEL2 has begun (Prieto et al., 2009; Pérez Serrano, 2012; Orol González and Alonso Ramos, 2013) with a view to designing online applications, computer assistants that serve as writing aids for Spanish by detecting collocation errors and offering correction strategies (Ferraro et al., 2011; Vincze et al., 2011; Wanner et al., 2013a; Ferraro et al., 2014). This requires a collocation error typology with which to tag the learner corpus (Alonso Ramos et al., 2010a and 2010b; Wanner et al., 2013b). Sánchez Rufat (2015) analyses in detail the collocations and other lexical combinations projected by the verb dar in CEDEL2, combining several techniques and procedures: frequency relations, the significance test and the error typology. ...
... The chapter will be organized around the following two main topics: (1) an examination and comparison of the whole array of the collocations found in writings produced by learners and native speakers of Spanish and (2) a detailed analysis of collocation errors found in the learner corpus, according to the error typology proposed in Alonso Ramos et al. (2010a, 2010b). Accordingly, our study has two main aims. ...
... Some applications of the CEDEL2 corpus to study learners' collocations are described in Alonso et al. (2010). Pravec (2002) offers a comprehensive list of learner corpora. ...
Article
This paper describes the collection of a semi-spontaneous spoken corpus of learners of Spanish from more than nine different mother tongues. The corpus was tagged to conduct a computer-aided error analysis. In addition, the corpus was used to develop a computer-based tool that has practical pedagogical applications (e.g., to train teachers of Spanish). The interface is available online to allow teachers and linguists to consult the data. This paper explains the methodology I followed to gather the data. First, I consider the data collection method and the corpus design. Secondly, I describe the transcription conventions and the XML tags used to code learners' metadata and their errors. Thirdly, I explain the criteria used to mark oral production errors and the error typology. I then consider the design, development and evaluation of the corpus search tool. Lastly, some pedagogical applications are put forward. The conclusions and limitations of the project are outlined in the final section.
Chapter
We explore the context of verb-noun collocations using a corpus of the Excelsior newspaper issues in Spanish. Our purpose is to understand to what extent the context is able to distinguish the semantics of collocations represented by lexical functions of the Meaning-Text Theory. For experiments, four lexical functions were chosen: Oper1, Real1, CausFunc0, and CausFunc1. We inspected different parts of the eight-word window context: the left context, the right context, and both the left and right context. These contexts were retrieved from the original corpus as well as from the same corpus after stopwords deletion. For the vector representation of the context, word counts and tf-idf of words were used. To estimate the ability of the context to predict lexical functions, we used various machine-learning techniques. The best F-measure of 0.65 was achieved for predicting Real1 by Gaussian Naïve Bayes using the left context without stopwords and word counts as features in vectors.
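The context extraction described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name and parameters are hypothetical, the abstract does not say how the eight-word window is split between the two sides (four words per side is assumed here), and the tf-idf weighting and the classifiers are left out.

```python
from collections import Counter


def context_counts(tokens, start, end, side="both", width=4,
                   stopwords=frozenset()):
    """Word counts for the context of the collocation at tokens[start:end].

    side selects the left context, the right context, or both;
    width is the number of words taken on each side (an assumption);
    words listed in stopwords are dropped before counting.
    """
    left = tokens[max(0, start - width):start]
    right = tokens[end:end + width]
    if side == "left":
        window = left
    elif side == "right":
        window = right
    elif side == "both":
        window = left + right
    else:
        raise ValueError("side must be 'left', 'right' or 'both'")
    return Counter(word for word in window if word not in stopwords)


# Made-up example sentence; the collocation "tomar una decisión"
# occupies tokens[3:6].
tokens = ("el presidente decidió tomar una decisión importante "
          "sobre el presupuesto ayer").split()

left_no_stop = context_counts(tokens, 3, 6, side="left", stopwords={"el"})
both_raw = context_counts(tokens, 3, 6, side="both")
```

Each resulting `Counter` corresponds to one word-count feature vector of the kind the abstract mentions; the six configurations it compares come from combining the three `side` values with the presence or absence of a stopword list.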
Article
This article presents a state-of-the-art discussion of second language (L2) Spanish corpus-based research on lexical competence. While L2 Spanish learner corpus research (LCR) is still in its infancy, we will review the major findings of relevant studies on the production of several lexical aspects: copula choice with ser/estar; overt/null pronoun distribution; collocations and lexico-syntactic verbal competence. Due to the highly contextualised nature of learner corpus data, many of these studies show that learners do not always behave differently from natives in terms of frequency of use, though they may differ in terms of discursive and pragmatic uses. The article ends with some theoretical and methodological caveats about L2 Spanish learner corpus research. An argument is made for the need to conduct L2 corpus-based research which (1) is theoretically motivated and explanatory (as opposed to descriptive and pedagogical), (2) uses fine-grained annotation (as opposed to coarse-grained, general tagsets), (3) exploits learner corpora that are properly designed and where learner variables are properly controlled for.