Article · PDF available

Abstract

Learner corpora are used to investigate computerised learner language so as to gain insights into foreign language learning. One of the methodologies that can be applied to this type of research is computer-aided error analysis (CEA), which, in general terms, consists in the study of learner errors as contained in a learner corpus. Surveys of existing learner corpora and of issues in learner corpus research do contain information on CEA, although it is usually limited. This article centres on CEA research and is intended as a review of error tagging systems, including their error categorisations, dimensions and levels of description.

Keywords: second language acquisition, learner corpus research, computer-aided error analysis.
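To make the notion of computer-aided error analysis concrete, the sketch below parses a single error-tagged learner sentence and counts errors by category. The inline tag format `#CATEGORY{error}{correction}` and the tag names are hypothetical, loosely inspired by the inline error-tagging systems the article surveys; no real system's tag set is reproduced here.

```python
import re
from collections import Counter

# Hypothetical inline error-tag scheme: #CATEGORY{erroneous form}{correction}.
# The sentence and tags are invented for illustration only.
TAGGED = (
    "She #GVT{go}{went} to the market and #LS{did}{made} "
    "a #XNUC{decisions}{decision} quickly."
)

ERROR_TAG = re.compile(r"#(?P<cat>[A-Z]+)\{(?P<err>[^}]*)\}\{(?P<corr>[^}]*)\}")

def extract_errors(text):
    """Return (category, erroneous form, correction) triples found in a tagged text."""
    return [(m["cat"], m["err"], m["corr"]) for m in ERROR_TAG.finditer(text)]

errors = extract_errors(TAGGED)
freq = Counter(cat for cat, _, _ in errors)
print(errors)
print(freq.most_common())
```

A real CEA workflow would apply the same extraction over an entire corpus and cross-tabulate the counts with learner metadata (L1, proficiency level, task type).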
... Since language errors differ from one language to another, it is not possible to create a single error annotation system for all existing languages. Thus, each researcher has to develop their own error annotation system (Safari, 2017; Díaz-Negrillo & Fernández-Domínguez, 2006). Granger (2003) considers it necessary to take certain points into account when developing an error annotation system. ...
... Error annotation has undergone remarkable development since its rather modest introduction in first-generation learner corpus research (for surveys, see Díaz-Negrillo & Fernández-Domínguez, 2006;Lüdeling & Hirschmann, 2015). A wide range of error annotation systems has been designed for English (e.g. ...
Article
Full-text available
The aim of this article is to survey the field of learner corpus research from its origins to the present day and to provide some future perspectives. Key aspects of the field — learner corpus design and collection, learner corpus methodology, statistical analysis, research focus and links with related fields, in particular SLA, FLT and NLP — are compared in first-generation LCR, which extends from the late 1980s to 2000, and second-generation LCR, which covers the period from the early 2000s until today. The survey shows that the field has undergone major theoretical and methodological changes and considerably extended its range of applications. Future developments that are likely to gain ground are grouped into three categories: increased diversity, increased interdisciplinarity and increased automation.
... In the 1980s, a review of the literature by Dulay et al. (1982) identified four bases commonly used in error classifications: (a) the linguistic unit involved; (b) the surface strategy, which describes how the error differs from the target form; (c) a comparative analysis with the L1 or with interlanguage development; and (d) the communicative effect of the error. The first two have been employed more frequently, as can be seen in James (1998) and in Díaz-Negrillo and Fernández-Domínguez (2006), who reviewed several error taxonomies designed for corpus annotation. Taxonomies can vary in the tags they contain and how they are organized. ...
Article
The potential of the technique known as round-trip translation to detect errors in language use has been exploited in the design of programs for automatic error detection (Hermet & Désilets, 2009; Madnani et al., 2012), but to my knowledge, no study has explored the potential of translators as a tool that learners themselves can use to correct their writing in a second language. Consequently, there is no information as to how many of the transformations introduced by round-trip translation are useful for learners, how many simply rephrase the original text, or how many actually make it worse. Hermet and Désilets (2009) report a “repair rate” of 66.4% working with prepositions in French, while Madnani et al. (2012) report 36% successful changes, 33% paraphrasing and 31% changes for the worse in 200 sentences in English. The present study found a significant improvement in the number of corrections in texts written in English by Spanish students (97%) at the cost of generating an excessive number of false positives (34%). The most reliable transformations are those affecting spelling or word morphology, which correct errors in 88.33% and 78.57% of cases, respectively. These results show the progress made in machine translation and the reliability of the round-trip translation technique for correcting errors, and they indicate which transformations are most useful.
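The round-trip idea described above can be sketched in a few lines: translate a learner sentence into a pivot language and back, then diff the result against the original to surface candidate corrections. The `translate` function here is a stand-in stubbed with a lookup table purely for illustration; the studies cited used real machine translation systems, and the example sentence is invented.

```python
import difflib

# Stub standing in for a real MT system; keys are (source, target, text).
STUB_MT = {
    ("en", "es", "She have three cat."): "Ella tiene tres gatos.",
    ("es", "en", "Ella tiene tres gatos."): "She has three cats.",
}

def translate(text, src, tgt):
    return STUB_MT[(src, tgt, text)]

def round_trip(text, src="en", pivot="es"):
    """Translate into the pivot language and back into the source language."""
    return translate(translate(text, src, pivot), pivot, src)

def diff_words(original, transformed):
    """Word-level edits suggested by the round trip (non-'equal' opcodes only)."""
    a, b = original.split(), transformed.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return [(op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

back = round_trip("She have three cat.")
print(back)                                      # round-trip result
print(diff_words("She have three cat.", back))   # candidate corrections
```

Each resulting edit would then need classifying as a genuine correction, a paraphrase, or a change for the worse, which is exactly the three-way split the studies above report.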
Article
In this paper, we describe the development of the tagging editor for learner corpora annotation and computer-aided error analysis. We collect essays written by learners of Chinese as a foreign language for grammatical error annotation and correction. Our tagging editor is effective and enables the annotated corpus to be used in a shared task in ICCE 2014.
Article
Full-text available
Classification of errors in language use plays a crucial role in language learning & teaching, error analysis studies, and language technology development. However, there is no standard and inclusive error classification method agreed upon among different disciplines, which causes repetition of similar efforts and a barrier in front of a common understanding in the field. This article brings a new and holistic perspective to error classifications and annotation schemes across different fields (i.e., learner corpora research, error analysis, grammar error correction, and machine translation), all serving the same purpose but employing different methods and approaches. The article first reviews previous error annotation efforts from different fields for nineteen languages with different characteristics, including the morphologically rich ones that pose diverse challenges for language technologies. It then introduces a faceted taxonomy for errors in language use, comprising multidimensional and hierarchical facets that can be utilized to create both fine- and coarse-grained error annotation schemes depending on specific requirements. We believe that the proposed taxonomy based on the principles of universality and diversity will address the emerging need for a common framework in error annotation.
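The faceted-taxonomy idea described above, where orthogonal facets combine into fine-grained tags or are truncated for coarse-grained ones, can be sketched with a small data structure. The facet names and values below are invented for illustration and are not the article's own scheme.

```python
from itertools import product

# Illustrative facets: the linguistic unit involved and the target
# modification (how the learner form differs from the target form).
FACETS = {
    "unit": ["noun", "verb", "preposition"],
    "modification": ["omission", "addition", "misselection"],
}

def fine_grained_tags(facets):
    """Cartesian product of facet values -> a fine-grained tag inventory."""
    return ["-".join(combo) for combo in product(*facets.values())]

tags = fine_grained_tags(FACETS)
print(len(tags))   # 3 units x 3 modifications = 9 fine-grained tags
print(tags[:3])
```

Dropping a facet (e.g. keeping only `unit`) yields the corresponding coarse-grained scheme, which is how one taxonomy can serve annotation projects with different levels of detail.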
Chapter
Full-text available
In recent years, the compilation and exploitation of linguistic corpora have been extended and applied to different domains, including language didactics. In particular, learner corpora have emerged as electronic collections of language data produced by second-language (L2) or foreign-language (FL) students. In the literature, many researchers have focused on the benefits and pitfalls of compiling and exploiting a learner corpus; however, essential aspects related to the participants or the nature of the samples frequently go unmentioned. In our study, we aim to investigate these key aspects that should be taken into consideration when working with learner corpora. Firstly, it is important to record certain data, such as the learners’ native language and the target language of the study. Secondly, it should be considered whether the samples are ‘authentic’, that is, whether they reflect a degree of naturalness in what students produce under possible experimental conditions. Thirdly, the potential formats of the texts (oral and written) and the format required by exploitation programs should be taken into account, along with a coding scheme to preserve the learners’ identities. Fourthly, we explore various types of annotation that can be applied to texts. Lastly, a brief description of relevant tools to manage, code, tag and annotate is provided. We aim to offer valuable recommendations for researchers and teachers interested in languages and in the processes of language acquisition (L2) or language learning (FL), so as to shed light on the process of compiling learner corpora.
Chapter
With the development of technology, the need for compiling computer-based learner corpora has gradually gained more attention from language teachers and researchers. A learner corpus can reflect learners’ authentic use of a target language, which provides useful information for language teachers, researchers, and textbook editors. Limitations in retrieving errors from learner corpora, however, still exist. For example, it is difficult to retrieve omission errors if a corpus is not error-tagged beforehand. To offer researchers an error-tagged learner corpus of Chinese, this study manually error-tagged the two-million-word Chinese Learner Written Corpus of National Taiwan Normal University. A preliminary analysis of errors tagged in the learner corpus shows a total of 48,266 errors distributed across 119 tags. These 48,266 errors mostly involve the incorrect selection of words or the omission of necessary word-level components, and the misuse of nouns, action verbs, adverbs, and structural particles is especially common. Among the 119 tags, the top 12 error tags (i.e., those occurring more than 1,000 times) accounted for more than 50% of the total errors, and incorrect selections of nouns and action verbs together constituted more than 27% of the total errors. These 12 common error types, especially the wrong choice of nouns and action verbs, should thus be regarded as particularly difficult for second language (L2) learners of Chinese to acquire. Analysis of the top 12 common errors also reveals that learners’ misuse of verbs, adverbs, and structural particles was somewhat varied (i.e., involving different types of target modification, such as omission, redundancy, and incorrect selection), whereas their misuse of nouns mostly resulted from incorrect selection. A comparison of the top 10 common error types in this study with those in Lee et al. (2016) reveals that, regardless of some discrepancies in ranking, 90% of the top 10 error tags overlapped in the two studies, suggesting that these error types are indeed difficult for L2 Chinese learners to acquire and should be investigated further. Based on the findings yielded in this study, suggestions for further research on L2 Chinese learners’ errors are provided.

Keywords: Chinese teaching, learner corpus, error-tagging, error analysis
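The frequency analysis described above, computing how large a share of all errors the most common tags cover, reduces to a few lines of counting. The toy tag list below is invented for illustration; the study's real corpus contains 48,266 errors across 119 tags.

```python
from collections import Counter

# Invented toy distribution of error tags (tag names are illustrative only).
tags = (["noun-misselection"] * 50 + ["verb-misselection"] * 30 +
        ["adverb-misuse"] * 15 + ["particle-missing"] * 5)

def top_n_share(tags, n):
    """Fraction of all errors accounted for by the n most frequent tags."""
    counts = Counter(tags)
    top = counts.most_common(n)
    return sum(c for _, c in top) / len(tags)

print(top_n_share(tags, 2))  # share covered by the 2 most frequent tags
```

Applied to a real error-tagged corpus, this is the computation behind statements such as "the top 12 tags accounted for more than 50% of the total errors".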
Chapter
Natural language processing (NLP) is concerned with the automated processing of human language. It addresses the analysis and generation of written and spoken language, though speech processing is often regarded as a separate subfield. NLP can be seen as the applied side of computational linguistics, the interdisciplinary field concerned with formal analysis and computational modeling of language at the intersection of linguistics, computer science, and psychology. In terms of the aspects analyzed by NLP, traditionally lexical, morphological, and syntactic aspects of language were the center of attention, but aspects of meaning, discourse, and the relation to the extralinguistic context have become increasingly prominent in the last decade. This entry focuses on showing the relevance, characterizing the techniques, and delineating the uses of NLP for second language learning. It distinguishes two broad uses of NLP related to language learning. On the one hand, NLP can be used to analyze learner language, that is, words, sentences, or texts produced by language learners. This includes the development of NLP techniques for the analysis of learner language by tutoring systems in intelligent computer-assisted language learning (ICALL), automated scoring in language testing, as well as the analysis and annotation of learner corpora. On the other hand, NLP for the analysis of native language can also play an important role in the language learning context. Applications in this second domain support the search for and the enhanced presentation of reading material for language learners as well as the generation of exercises and tests based on authentic materials.
Chapter
This volume illustrates the high potential of learner corpus investigations for research into the CAF triad by presenting eleven original learner corpus-based studies which are set within solid theoretical frameworks, examine learner corpora with state-of-the-art analytical techniques and yield highly interesting findings. The volume’s major strength lies in the range of issues it undertakes and in its interdisciplinary thematic novelty. The chapters collectively address all three dimensions of L2 performance related to different linguistic subsystems (i.e. lexical, phraseological and grammatical complexity and accuracy, along with fluency) as well as the interactions among these constructs. The studies are based on data drawn from carefully compiled learner corpora which are analysed with the help of diverse corpus-based methods. The theoretical discussions and the empirical results shall contribute to the advancement of the fields of SLA and writing and speech research and shall inspire further investigations in the area of the CAF triad.
Article
This article reflects on the status of corpus‐linguistic methodologies in English linguistics, and on the role of English linguistics in the development of corpus linguistics: what does English‐language corpus linguistics look like from the outside? To what extent is English‐language corpus linguistics comparatively well‐endowed with resources, in a way that other languages are not? And finally, what are key corpus‐linguistic approaches and methodologies that were mainly or entirely developed in the context of English linguistics? In connection with that last question, we then go on to sketch seven corpus‐linguistic approaches and methodologies that have (or had initially) a strong English‐linguistics bent: the British tradition in corpus linguistics, critical discourse analysis, corpus‐based approaches to dialectology and regional varieties, multidimensional analysis, corpus‐based psycholinguistics, variationist linguistics, and learner corpus research.
Book
Collocations are both pervasive in language and difficult for language learners, even at an advanced level. In this book, these difficulties are for the first time comprehensively investigated. On the basis of a learner corpus, idiosyncratic collocation use by learners is uncovered, the building material of learner collocations examined, and the factors that contribute to the difficulty of certain groups of collocations identified. An extensive discussion of the implications of the results for the foreign language classroom is also presented, and the contentious issue of the relation of corpus linguistic research and language teaching is thus extended to learner corpus analysis.
Book
Corpus Annotation gives an up-to-date picture of this fascinating new area of research, and will provide essential reading for newcomers to the field as well as those already involved in corpus annotation. Early chapters introduce the different levels and techniques of corpus annotation. Later chapters deal with software developments, applications, and the development of standards for the evaluation of corpus annotation. While the book takes detailed account of research world-wide, its focus is particularly on the work of the UCREL (University Centre for Computer Corpus Research on Language) team at Lancaster University, which has been at the forefront of developments in the field of corpus annotation since its beginnings in the 1970s.