Chapter

Annotation of Collocations in a Learner Corpus for Building a Learning Environment

Abstract

Collocations in the sense of idiosyncratic lexical co-occurrences are one of the main barriers and challenges for any second language (L2) learner. In Computer Assisted Language Learning (CALL), a number of works deal with the automatic recognition of collocation errors and compilation of candidate lists for their correction. However, this is not sufficient. Firstly, to obtain a clear picture of the difficulties experienced by learners in order to be able to offer targeted aid to learners, a fine-grained linguistic analysis of collocation errors and their annotation in learner corpora is necessary. Secondly, programs must be developed that make concrete correction suggestions, besides providing correction candidate lists, and supply a learner with illustration and didactic material that is oriented towards the types of collocations with which this learner has difficulties. In our work, we attempt to push the state-of-the-art one step further in both of these strands of research, focusing on Spanish as L2. Within the first strand, we carry out a detailed collocation-oriented annotation of a fragment of the corpus of learners of Spanish CEDEL2. Within the second strand, we experiment with a number of strategies for choosing the most likely correction of a collocation error.

... To assess the errors that we have found, we used the location dimension of the Wanner et al. (2011) taxonomy to evaluate students' errors when producing collocations. The first two categories show errors that were found on one of the two elements of the collocation (cf. ...
... On all four MT systems, the majority of the errors occur when choosing the collocate. This was also observed for foreign language learners in the already mentioned study by Wanner et al. (2011). The sources of the errors are literal translations of the collocate ("grey" - cinzento), use of a wrong synonym ("angle" - perspectiva) or untranslations (e.g. ...
Book
This volume documents the proceedings of the 2nd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015), held on 1-2 July 2015 as part of the EUROPHRAS 2015 conference: "Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives" (Málaga, 29 June – 1 July 2015). The workshop was sponsored by European COST Action PARSing and Multi-word Expressions (PARSEME) under the auspices of the European Society of Phraseology (EUROPHRAS), the Special Interest Group on the Lexicon of the Association for Computational Linguistics (SIGLEX), and SIGLEX's Multiword Expressions Section (SIGLEX-MWE). The workshop was co-chaired by Gloria Corpas Pastor (Universidad de Málaga), Ruslan Mitkov (University of Wolverhampton), Johanna Monti (Università degli Studi di Sassari), and Violeta Seretan (Université de Genève). It received the support of the Advisory Board, composed of Dmitrij O. Dobrovol'skij (Russian Academy of Sciences, Moscow), Kathrin Steyer (Institut für Deutsche Sprache, Mannheim), Agata Savary (Université François Rabelais Tours), Michael Rosner (University of Malta), and Carlos Ramisch (Aix-Marseille Université). The topic of the workshop was the integration of multi-word units in machine translation and translation technology tools. In spite of the recent progress achieved in machine translation and translation technology, the identification, interpretation and translation of multi-word units still represent open challenges, both from a theoretical and from a practical point of view. The idiosyncratic morpho-syntactic, semantic and translational properties of multi-word units pose many obstacles even to human translators, mainly because of intrinsic ambiguities, structural and lexical asymmetries between languages, and, finally, cultural differences.
After a successful first edition held in Nice on 3 September 2013 as part of the Machine Translation Summit XIV, the present edition provided a forum for researchers working in the fields of Linguistics, Computational Linguistics, Translation Studies and Computational Phraseology to discuss recent advances in the area of multi-word unit processing and to coordinate research efforts across disciplines.
... Even more sophisticated and intricate ICALL programs are engineered to provide automatic feedback on the accuracy and style of learners' texts. It is worth noting, however, that while such automatic feedback generators may be highly beneficial, achieving a satisfactory level of precision in their implementation remains challenging (Meurers, 2015; Wanner et al., 2013). That said, one ICALL program that has demonstrated considerable success is E-Tutor. ...
... Research in NLP has already addressed a number of collocation-related tasks, in particular: (1) collocation error detection, categorization, and correction in writings of second language learners (Ferraro et al., 2011; Wanner et al., 2013; Ferraro et al., 2014; Rodríguez-Fernández et al., 2015); (2) creation of collocation-enriched lexical resources (Maru et al., 2019; Di Fabio et al., 2019); (3) use of knowledge on collocations in downstream NLP tasks, among them, e.g., machine translation (Seretan, 2014), word sense disambiguation (Maru et al., 2019), natural language generation (Wanner and Bateman, 1990), or semantic role labeling (Scozzafava et al., 2020); (4) probes involving collocations for understanding to which extent language models are able to identify non-compositional meanings (Shwartz and Dagan, 2019; Garcia et al., 2021); and (5) detection and categorization of collocations with respect to their semantics (Wanner et al., 2006; Espinosa Anke et al., 2019; Levine et al., 2020; Espinosa-Anke et al., 2021). It is this last task which is the focus of this paper. ...
Preprint
Recognizing and categorizing lexical collocations in context is useful for language learning, dictionary compilation and downstream NLP. However, it is a challenging task due to the varying degrees of frozenness lexical collocations exhibit. In this paper, we put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
... For example, Nesselhauf (2005) reported a 25% error rate for verb + noun combinations produced by advanced German learners of English. A large proportion of collocation errors are due to the influence of the mother tongue: about 50% for Nesselhauf (2005) and up to 67% for Wanner et al. (2013), who investigate miscollocations of all types produced by Spanish learners. Learners also tend to underuse the most collocationally restricted combinations. ...
Chapter
In the current study, we investigate aspects of the learner phrasicon based on a section of the Longitudinal Database of Learner English (LONGDALE), made up of written data produced by learners of English as a Foreign Language (EFL) followed over a period of three years. Section 2 describes the phraseological measure used in the study, which we refer to as collgram. After providing information on the data and methodology used (Section 3), we present the main developmental trends displayed by the data (Section 4). In Section 5, two caveats are highlighted – one related to the density of the data collection, the other to the variability of learners’ trajectories. Section 6 presents our conclusions and outlines some potential applications of collgram-based research.
... Some samples from CEDEL2 have been tagged for syntactic structure and collocations using UAM Corpus Tool (O'Donell, 2009). Indeed, the study of collocations in CEDEL2 has already begun (Prieto et al., 2009; Pérez Serrano, 2012; Orol González and Alonso Ramos, 2013) with a view to designing online applications: software assistants that serve as writing aids for Spanish, detect collocation errors and offer correction strategies (Ferraro et al., 2011; Vincze et al., 2011; Wanner et al., 2013a; Ferraro et al., 2014). This requires a typology of collocation errors with which to tag the learner corpus (Alonso Ramos et al., 2010a and 2010b; Wanner et al., 2013b). Sánchez Rufat (2015) analyses in detail the collocations and other lexical combinations projected by the verb dar in CEDEL2, combining several techniques and procedures: frequency relations, the significance test and the error typology. ...
... Overall, the manual annotation of collocations was found to be rather demanding given that both identification of collocations and deciding on error categories often constituted far from trivial tasks. For a more detailed account of the main difficulties encountered, see Vincze et al. (2011) and Wanner et al. (2013). ...
... The COLOCATE research group at the University of La Coruña has used the CEDEL2 corpus to explore L2 Spanish learners' use of collocations (e.g., Alonso Ramos et al. 2010; Prieto González, Mosqueira Suárez, and Vázquez Veiga 2009; Vincze et al. 2011; Wanner et al. 2013). In a key study, Orol González and Alonso Ramos (2013) explored collocational richness in the CEDEL2 corpus by comparing a learner sample (N = 1,863 collocations) vs. a native sample (N = 1,105 collocations) to (dis)confirm the long-held assumption that learners use fewer collocations than natives. ...
Article
This article presents a state-of-the-art discussion of second language (L2) Spanish corpus-based research on lexical competence. While L2 Spanish learner corpus research (LCR) is still in its infancy, we will review the major findings of relevant studies on the production of several lexical aspects: copula choice with ser/estar; overt/null pronoun distribution; collocations and lexico-syntactic verbal competence. Due to the highly contextualised nature of learner corpus data, many of these studies show that learners do not always behave differently from natives in terms of frequency of use, though they may differ in terms of discursive and pragmatic uses. The article ends with some theoretical and methodological caveats about L2 Spanish learner corpus research. An argument is made for the need to conduct L2 corpus-based research which (1) is theoretically motivated and explanatory (as opposed to descriptive and pedagogical), (2) uses fine-grained annotation (as opposed to coarse-grained, general tagsets), (3) exploits learner corpora that are properly designed and where learner variables are properly controlled for.
Article
The lexicon plays a key role in learning any natural language, and Latin is certainly no exception. From a practical perspective, this is borne out by its presence in the official Spanish Bachillerato curriculum and its relevance to the Evaluación para el Acceso a la Universidad (University Entrance Exam, abbreviated in Spanish as EvAU). Considering this, the aim of this paper is twofold. On the one hand, it seeks to analyse, in terms of frequency, the vocabulary of the texts proposed for the Latin II EvAU exam in the Community of Madrid between 2004 and 2020. On the other hand, once the most frequent words have been identified, it aims to offer several activities based on the «lexical constellations» strategy with the purpose of acquiring and consolidating this vocabulary in Latin class.
Chapter
This chapter reviews the contribution of learner corpus research to the study of two types of formulaic expression, namely collocations and lexical bundles. It sums up some of the key findings related to frequency of use, accuracy and appropriacy, L1 transfer and development. Special attention is paid to aspects of research design that have a strong impact on the results, in particular the quantitative measures used to identify multiword units and assess their degree of significance.
Article
One of the weaknesses of most current academic word lists is that they fail to do justice to the large stock of multiword units that is typical of academic language. The objective of this chapter is to raise awareness of the importance of phrasal academic vocabulary. After a brief critical survey of three recently compiled phrasal academic lists, the chapter highlights the potential contribution of learner corpus data to identifying the most useful units for teaching purposes. The approach is illustrated with a case study of phrasal metadiscourse based on corpora of novice and expert native writing, and subcorpora from the International Corpus of Learner English representing L2 writers from six different mother tongue backgrounds.
Article
The main purpose of this paper is to explain the application of usage labels in the Diccionario de colocaciones del español (DiCE), an online dictionary of Spanish collocations. The first part of the paper offers a brief description of the dictionary, before moving on to examine how the attributes of a stylistically marked base or collocate are projected onto the collocation as a whole. The architecture of DiCE allows us to provide information about collocation use in relation to both base and collocate, thereby allowing users - especially learners - to select the combination best suited to their particular purpose. The next section deals with the type of socio-pragmatic information included in the usage labels for DiCE (this list is currently under review), followed by a detailed account of diaphasic marking and diaevaluative marking, and the usage labels within those categories (formal, informal, vulgar, euphemistic and pejorative). The final section examines the possibility of including ironic as a label in DiCE.
Article
This article studies advanced French-speaking learners' knowledge of make-collocations. It suggests that, while an investigation of the errors found in a learner corpus may be enlightening, it should ideally be complemented by two other types of analyses, namely a comparison of the learner corpus data with native data, which highlights phenomena of overuse or underuse, and elicitation tests, which focus on competence rather than performance. Using such a threefold approach, this study shows that, while the learners under study do not make many errors, they tend to underuse make-collocations and limit themselves to those which have a direct equivalent in their mother tongues and are therefore safer. When forced to produce certain collocations or judge their acceptability, on the other hand, they reveal their collocational deficiencies and unreliable judgements.
Chapter
Over the last twenty years, phraseology has become a major field of pure and applied research in Western European and North American linguistics. This book is made up of authoritative contributions from leading specialists who examine the increasingly crucial role played by ready-made word-combinations in language acquisition and adult language use. After a wide-ranging introduction by the editor, the book introduces the main theoretical approaches, analyses the corpus data and phrase typology, and finally considers the application of phraseology to associated disciplines including lexicography, language learning, stylistics, and computational analysis. This book is the first comprehensive and up-to-date account of the subject to be published in English. Series Information Series ISBN: 0-19-961811-9 Series Editors: Richard W. Bailey, Noel Osselton, and Gabriele Stein; Oxford Studies in Lexicography and Lexicology provides a forum for the publication of substantial scholarly works on all issues of interest to lexicographers, lexicologists, and dictionary users. It is concerned with the theory and history of lexicography, lexicological theory, and related topics such as terminology, and computer applications in lexicography. It focuses attention too on the purposes for which dictionaries are compiled, on their uses, and on their reception and role in society today and in the past.
Article
Previous work in the literature reveals that EFL learners are deficient in collocations, which are a hallmark of near-native fluency in learners' writing. Among the different types of collocations, the verb-noun (V-N) type was found to be particularly difficult to master, and learners' first language was also found to heavily influence their collocation production. In this paper, we develop an online collocation aid for EFL writers in Taiwan, aimed at detecting and correcting learners' miscollocations attributable to L1 interference. Relevant correct collocations are suggested as feedback messages according to the translation equivalents between the learner's L1 and L2. The system utilizes natural language processing (NLP) techniques to segment sentences in order to extract V-N collocations in given texts, and to derive a list of candidate English verbs that share the same Chinese translations by consulting electronic bilingual dictionaries. After combining nouns with these derived candidate verbs as V-N pairs, the system makes use of a reference corpus to exclude the inappropriate V-N pairs and single out the proper collocations. The system can effectively pinpoint the miscollocations and provide the learner with adequate collocations that the learner intends to write but misuses. It is hoped that this online assistant can facilitate EFL learner-writers' collocation use and help them transfer this essential knowledge to their future writing.
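The candidate-generation and filtering steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the dictionary `BILINGUAL`, the counts in `REFERENCE_COUNTS`, and the function names are hypothetical toy stand-ins for an electronic bilingual dictionary and a reference corpus, and the L1 translations are written as romanized glosses for readability.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary: English verb -> romanized L1 glosses.
BILINGUAL = {
    "eat": {"chi"},
    "take": {"chi", "na"},
    "learn": {"xue"},
    "acquire": {"xue", "huode"},
    "gain": {"huode"},
}

# Hypothetical verb-noun counts standing in for a reference corpus.
REFERENCE_COUNTS = Counter({
    ("take", "medicine"): 120,
    ("acquire", "knowledge"): 80,
    ("gain", "knowledge"): 95,
    ("learn", "lesson"): 60,
})


def candidate_verbs(verb):
    """Return verbs that share at least one L1 translation with `verb`."""
    translations = BILINGUAL.get(verb, set())
    return sorted(
        other for other, glosses in BILINGUAL.items()
        if other != verb and glosses & translations
    )


def suggest(verb, noun, min_count=1):
    """Suggest replacement verbs for a V-N pair not attested in the corpus.

    Returns an empty list when the pair itself is attested; otherwise returns
    the attested candidate verbs, most frequent first.
    """
    if REFERENCE_COUNTS[(verb, noun)] >= min_count:
        return []
    scored = [(REFERENCE_COUNTS[(cand, noun)], cand) for cand in candidate_verbs(verb)]
    return [cand for count, cand in sorted(scored, reverse=True) if count >= min_count]


print(suggest("eat", "medicine"))
print(suggest("learn", "knowledge"))
```

Note that this sketch only reaches verbs sharing a translation with the learner's verb, so in the toy data "gain" is never proposed for "learn knowledge"; a real system would depend heavily on dictionary coverage.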
Article
This paper describes the first prototype of an automated tool for detecting collocation errors in texts written by non-native speakers of English. Candidate strings are extracted by pattern matching over POS-tagged text. Since learner texts often contain spelling and morphological errors, the tool attempts to automatically correct them in order to reduce noise. For a measure of collocation strength, we use the rank-ratio statistic calculated over one billion words of native-speaker texts. Two human annotators evaluated the system's performance. We report the overall results, as well as detailed error analyses, and discuss possible improvements for the future.
Article
Collocation is one of the most difficult aspects of second language learning, but has been largely neglected by researchers and practitioners. A questionnaire survey shows advanced Chinese learners' collocational ability in English to be significantly inferior to that of native speakers. Our research attempts to address this problem by developing an online correction program which is able to detect some collocational errors in learners' English writing and offer examples of standard collocations from a large corpus for reference. The system is based on two kinds of corpora: a Learner Corpus, which is used for the study of known collocational errors, and a Reference Corpus, which is used to extract standard English collocations. The system also makes use of a Dictionary of Synonyms derived from WordNet to discover potential collocational errors in learners' input, as well as a Paraphrase Database gathered from the learners themselves to help diagnose un-collocational learner phrases. Altogether, it is hoped that the result of this research not only produces a usable online collocational aid, but also demonstrates a simple and efficient way of using learner corpora and reference corpora to support CALL software design.
Conference Paper
In recent years, collocation has been widely acknowledged as an essential characteristic to distinguish native speakers from non-native speakers. Research on academic writing has also shown that collocations are not only common but serve a particularly important discourse function within the academic community. In our study, we propose a machine learning approach to implementing an online collocation writing assistant. We use a data-driven classifier to provide collocation suggestions to improve word choices, based on the result of classification. The system generates and ranks suggestions to assist learners' collocation usages in their academic writing with satisfactory results.
Conference Paper
This paper provides an insight into ongoing research focusing on the exploitation of data from a learner corpus in order to enhance the performance of an automatic tool aimed at the correction of collocation errors of L2 Spanish speakers. The procedure adopted for collocation annotation is described together with the main difficulties involved in the annotation task, such as the problem of distinguishing collocations from other kinds of idiomatic expressions and from free combinations, the problem of correction judgment, and the problem of assigning concrete error types. It is shown that the fine-grained typology used in the course of error annotation sheds light on certain collocation error types that are generally not taken into account by automatic error correction tools, such as errors concerning the base of the collocation, target language non-words, and grammatical collocation errors.
Article
Usage-based models claim that first language learning is based on the frequency-based analysis of memorised phrases. It is not clear though, whether adult second language learning works in the same way. It has been claimed that non-native language lacks idiomatic formulas, suggesting that learners neglect phrases, focusing instead on orthographic words. While a number of studies challenge the claim that non-native language lacks formulaicity, these studies have two important shortcomings: they fail to take account of appropriate frequency information and they pool the writing of different learners in ways that may mask individual differences. Using methodologies which avoid these problems, this study found that non-native writers rely heavily on high-frequency collocations, but that they underuse less frequent, strongly associated collocations (items which are probably highly salient for native speakers). These findings are consistent with usage-based models of acquisition while accounting for the impression that non-native writing lacks idiomatic phraseology.
Article
One of the most common and persistent error types in second language writing is the collocation error, such as learn knowledge instead of gain or acquire knowledge, or make damage rather than cause damage. In this work-in-progress report, we propose a probabilistic model for suggesting corrections to lexical collocation errors. The probabilistic model incorporates three features: word association strength (MI), semantic similarity (via WordNet) and the notion of shared collocations (or intercollocability). The results suggest that the combination of all three features outperforms any single feature or any combination of two features.
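The three features named in this abstract can be made concrete with a small sketch. The abstract does not say how the features are combined, so the weighted sum below is an assumption for illustration only, not the paper's probabilistic model. The counts in `PAIR_COUNTS`, the `SIMILARITY` table (a stand-in for a WordNet-based measure) and all function names are hypothetical; MI is computed here as pointwise mutual information, and shared collocations as the Jaccard overlap of the nouns two verbs combine with.

```python
import math

# Hypothetical toy verb-noun counts standing in for a reference corpus.
PAIR_COUNTS = {
    ("gain", "knowledge"): 40,
    ("acquire", "knowledge"): 30,
    ("learn", "knowledge"): 1,
    ("learn", "lesson"): 50,
    ("gain", "experience"): 45,
    ("acquire", "experience"): 20,
    ("learn", "language"): 60,
    ("acquire", "language"): 25,
}
TOTAL = sum(PAIR_COUNTS.values())

# Hypothetical precomputed similarities, standing in for a WordNet-based measure.
SIMILARITY = {("learn", "acquire"): 0.8, ("learn", "gain"): 0.6}


def marginal(word, position):
    """Total count of pairs with `word` in the given slot (0 = verb, 1 = noun)."""
    return sum(c for pair, c in PAIR_COUNTS.items() if pair[position] == word)


def pmi(verb, noun):
    """Pointwise mutual information (base 2) of a verb-noun pair."""
    joint = PAIR_COUNTS.get((verb, noun), 0)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * TOTAL / (marginal(verb, 0) * marginal(noun, 1)))


def shared_collocates(verb_a, verb_b):
    """Jaccard overlap of the sets of nouns each verb is attested with."""
    nouns_a = {n for (v, n) in PAIR_COUNTS if v == verb_a}
    nouns_b = {n for (v, n) in PAIR_COUNTS if v == verb_b}
    union = nouns_a | nouns_b
    return len(nouns_a & nouns_b) / len(union) if union else 0.0


def rank_corrections(wrong_verb, noun, candidates, weights=(1.0, 1.0, 1.0)):
    """Rank candidate verbs by an assumed weighted sum of the three features."""
    w_mi, w_sim, w_shared = weights
    scored = []
    for cand in candidates:
        score = (
            w_mi * pmi(cand, noun)
            + w_sim * SIMILARITY.get((wrong_verb, cand), 0.0)
            + w_shared * shared_collocates(wrong_verb, cand)
        )
        scored.append((score, cand))
    return [cand for _, cand in sorted(scored, reverse=True)]


print(rank_corrections("learn", "knowledge", ["gain", "acquire"]))
```

Passing `weights=(1.0, 0.0, 0.0)` reduces the ranking to association strength alone, which makes it easy to see on toy data how adding the other two features can change the order of suggestions.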
Article
This paper describes Project CANDLE, an ongoing 3-year project which uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation checking program, MUST, was developed for detecting V-N miscollocations and suggesting adequate collocates in students' writing, based on the hypothesis of L1 interference and on data from the BNC and the bilingual Sinorama Corpus. Other computational scaffoldings are under development. It is hoped that this project will help intermediate learners in Taiwan enhance their English proficiency with effective pedagogical approaches and versatile language reference tools.
Wu, S. (2010). Supporting collocations learning. PhD thesis, University of Waikato, Hamilton, NZ.
Lewis, M. (2000). Teaching collocation. Further developments in the lexical approach. London: Language Teaching Publications.
AwkChecker: An assistive tool for detecting and correcting errors. In UIST '08: Proceedings of the 21st ACM symposium on User interface software and technology.
Lozano, C. (2009). CEDEL2: Corpus Escrito del Español L2. In C. M. Bretones Callejas et al. (eds.), Applied Linguistics Now: Understanding Language and Mind / La Lingüística Aplicada Hoy: Comprendiendo el Lenguaje y la Mente (pp. 80-93).
Hausmann, F. J. (1984). Wortschatzlernen ist Kollokationslernen. Zum Lehren u.
Martelli, A. (2007). Lexical Collocations in Learner English: a corpus-based approach. Alessandria: Edizioni dell'Orso.
Park, T., Lank, E., Poupart, P., & Terry, M. (2008). "Is the sky pure today?"
Hausmann, F. J. (1989). Le dictionnaire de collocations. In F. J. Hausmann et al. (eds.)
Chen, H. H-J. (2010). Developing an English Collocation Retrieval Web Site for ESL Learners, pp. 25–34.
Ferraro, G., Nazar, R., & Wanner, L. (2011). Collocations: A challenge in computer assisted language learning. In Proceedings of the 5th International Conference on Meaning-Text Linguistics.