Liu, C.-C. et al. (Eds.) (2014). Proceedings of the 22nd International Conference on Computers in
Education. Japan: Asia-Pacific Society for Computers in Education
A Tagging Editor for Learner Corpora
Annotation and Error Analysis
Lung-Hao LEE a,d, Kuei-Ching LEE a,d, Li-Ping CHANG b,
Yuen-Hsien TSENGa*, Liang-Chih YUc,d & Hsin-Hsi CHENe
aInformation Technology Center, National Taiwan Normal University, Taiwan
bMandarin Training Center, National Taiwan Normal University, Taiwan
cDepartment of Information Management, Yuen-Ze University, Taiwan
dInnovation Center for Big Data and Digital Convergence, Yuen-Ze University, Taiwan
eDepartment of Computer Science and Information Engineering, National Taiwan University, Taiwan
Abstract: In this paper, we describe the development of a tagging editor for learner corpus
annotation and computer-aided error analysis. We collect essays written by learners of
Chinese as a foreign language for grammatical error annotation and correction. Our tagging
editor has proven effective, and the resulting annotated corpus is used in a shared task at ICCE 2014.
Keywords: Computer-aided error analysis, learner corpora, interlanguage, Mandarin Chinese
1. Introduction
Learner corpora are collections of the language produced by foreign language learners, and they
are valuable resources for research on second language learning and teaching. For example, the
International Corpus of Learner English (ICLE) is considered one of the most important learner
corpora. ICLE consists of argumentative essays written by advanced learners of English as a Foreign
Language from different native language backgrounds (Granger, 2003). The first version was
published in 2002, and a third version is currently in preparation. In addition,
the Cambridge Learner Corpus (CLC) is made up of more than 200,000 examination scripts
written by English learners speaking 148 different native languages (Nicholls, 2003). CLC was
established to help English Language Teaching (ELT) publishers create learning materials,
e.g., Cambridge dictionaries and ELT course books, tailored to their target users.
Annotated learner corpora play an important role in developing natural language processing
techniques for educational applications. For example, the Assessing LExical Knowledge (ALEK)
system (Chodorow and Leacock, 2000) applied statistical analysis over learner corpora to detect
errors in English sentences. Izumi et al. (2003) detected English grammatical and lexical errors
made by Japanese learners. Lee et al. (2013) proposed a linguistic rule-based approach to detect
grammatical errors in texts written by learners of Chinese as a foreign language. Recently, the CoNLL
2013/2014 shared tasks have focused on grammatical error correction for learners of English as a foreign
language (Ng et al., 2013), while the SIGHAN 2013/2014 bake-offs on Chinese spelling check evaluation
have focused on developing automatic checkers to detect and correct spelling errors (Wu et al., 2013).
However, to make learner corpora useful for such tasks, they must be annotated correctly
before automated analysis can be applied. In this work, we design a tagging editor that helps
annotators insert error tags and rewrite the correct usages for sentences in a learner corpus. In addition, our
editor provides an error analysis function, which further helps annotators instantly discover
incorrect or inconsistent tagging instances during the annotation process.
2. The Error Tagging Editor
The construction of a tagging editor involves designing tag-associated error categories arranged in a
menu interface, which helps annotators select and insert error tags alongside the erroneous parts of
the learners’ written texts. In addition to error tagging, reconstruction of the correct usage is usually
needed during the annotation process. After the learner corpus is tagged and corrected, error analysis can
be performed quantitatively according to various user interests.
Figure 1 shows a screenshot of the tagging editor. Its functions can be
divided into three main parts. (1) Searching zone (the left panel): learners’ written texts are stored in
individual files together with their metadata, such as the level describing the learner’s language
proficiency, the score of the learner’s written text, and the learner’s mother-tongue language (ML).
Our tagging editor can search learners’ texts using these metadata fields. The search results are
listed by unique ID along with their (tagging | correction) status: a check symbol indicates that the
text has been finished, whereas an “X” indicates that the text still needs to be annotated. (2)
Tagging zone (the middle panel): once a text is loaded into this zone, annotators can select error tags
from the menu bar and insert them at the relevant positions in the learner’s text. Inserted tags are shown
in red within square brackets. (3) Correction zone (the right panel): annotators usually need to
correct the erroneous parts into correct usages. The correction zone is aligned paragraph by paragraph
with the tagging zone to facilitate annotators’ corrections, and the changed text is highlighted in blue.
In addition, our tagging editor reports error analysis, which helps annotators find incorrect or
inconsistent tagging instances to be fixed in the verification procedure.
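The error analysis report is essentially a frequency count over the inserted tags. As a minimal sketch of such a count, the function below assumes tags are stored inline in square brackets, as in the tagging zone; the function name and tag pattern are illustrative, not the editor’s actual implementation:

```python
import re
from collections import Counter

def error_tag_distribution(annotated_texts):
    """Count how often each error tag occurs across annotated texts.

    Assumes error tags are marked inline in square brackets,
    e.g. the snippet "[Sv]" carries one verb mis-selection tag.
    """
    tags = []
    for text in annotated_texts:
        # A tag is one capital letter followed by optional lowercase letters.
        tags.extend(re.findall(r"\[([A-Z][a-z]*)\]", text))
    return Counter(tags)

# The tags below mirror the top-3 tags reported in Section 3.
essays = ["我[Sv]想看书", "他[Madv]去了[Sn]学校"]
print(error_tag_distribution(essays))
```

A report like Figure 2 can then be produced by sorting the resulting counts in descending order.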
Our tagging editor is flexible enough to support various annotation tasks for learner corpora in
different languages. The character encoding is Unicode and the editor is developed in Java, both of
which are cross-platform. Moreover, annotators can add, delete, or modify self-defined error tags for
their own annotation tasks. The metadata are also optional: the tagging editor can load the learners’
written texts even without the accompanying metadata.
Figure 1. Screenshot of our tagging editor
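To make the tagging-and-correction workflow concrete, the following sketch shows one way such annotations could be represented and rendered. The data structure, field names, and the convention of placing the bracketed tag immediately after the erroneous span are our own illustrative assumptions, not the editor’s actual (Java) implementation:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int       # character offset where the erroneous span begins
    end: int         # character offset where the erroneous span ends
    tag: str         # error tag selected from the menu, e.g. "Sv"
    correction: str  # corrected usage entered in the correction zone

def render_tagged(text, annotations):
    """Insert each bracketed error tag right after its erroneous span."""
    out, pos = [], 0
    for ann in sorted(annotations, key=lambda a: a.start):
        out.append(text[pos:ann.end])
        out.append(f"[{ann.tag}]")
        pos = ann.end
    out.append(text[pos:])
    return "".join(out)

text = "I go school yesterday"
anns = [Annotation(5, 11, "Mp", "to school")]  # "Mp" is a hypothetical missing-preposition tag
print(render_tagged(text, anns))  # I go school[Mp] yesterday
```

Keeping the correction alongside the span in the same record is what allows the correction zone to stay aligned with the tagging zone.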
3. The Annotation of TOCFL Learner Corpus
The corpus annotated using this tagging editor comes mainly from the computer-based writing Test of
Chinese as a Foreign Language (TOCFL). The writing test is designed according to the six
proficiency levels of the Common European Framework of Reference (CEFR). Test takers have to
complete two different tasks for each level. For example, A2 (Waystage) candidates
are asked to write a note and to describe a story after looking at four pictures. All candidates
complete their writings online. Each text is then scored on a 0-5 point scale: a score of 5 denotes
high-quality writing, a score of 3 is the threshold for passing the test, and so on. So far, 4,567 essays
have been collected in the corpus.
For the purpose of studying Chinese learners’ interlanguage, hierarchical error tags are
designed to annotate grammatical errors. There are two levels of error tagging: one is a target
modification taxonomy, the other a linguistic category classification. The former includes four PADS
error types: mis-ordering (Permutation), redundancy (Addition), omission (Deletion), and
mis-selection (Substitution). The latter includes 36 linguistic types, e.g., noun, verb, preposition,
specific construction, and so on. Using our tagging editor, 51 essays at CEFR B1 level with a
score of 5 were annotated by two linguists simultaneously. Figure 2 shows the distribution of error
tags sorted by occurrence. In total, there are 678 errors in the 51 essays. The top three error tags are Sv
(mis-Selection of verbs), Madv (Missing of adverbs), and Sn (mis-Selection of nouns), with
frequencies of 53, 40, and 39, respectively. In these error tag abbreviations, the first capital letter of
the tag represents the higher level of error tagging.
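The two-level tag naming can be unpacked mechanically: the leading capital letter gives the target-modification type and the lowercase remainder gives the linguistic category. A small sketch follows, where the letter codes and the category subset are inferred from the three tags mentioned above rather than taken from the full 36-type tag set:

```python
# Assumed letter codes, inferred from the example tags Sv, Madv, Sn.
SURFACE_TYPES = {"S": "mis-selection", "M": "missing"}
# Illustrative subset of the 36 linguistic categories.
LINGUISTIC_TYPES = {"v": "verb", "n": "noun", "adv": "adverb"}

def decompose_tag(tag):
    """Split a hierarchical error tag into its two annotation levels."""
    return SURFACE_TYPES[tag[0]], LINGUISTIC_TYPES[tag[1:]]

for tag in ("Sv", "Madv", "Sn"):
    print(tag, "->", decompose_tag(tag))
```

Because the two levels are encoded independently, error statistics can be aggregated either by surface type or by linguistic category without re-annotating the corpus.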
Figure 2. Distribution of error tags in the annotated corpus
4. Conclusions and Future Work
This article describes our tagging editor, which can be employed for annotation tasks on learner
corpora. The editor is effective for annotating grammatical errors in CFL learners’ essays. We
will further collect annotators’ feedback and discuss it with them to enhance the editor’s functions,
and we plan to release the editor for public use in the future. The corpus annotated with this editor,
as described above, is currently being used for a shared task in the ICCE workshop.
Acknowledgements
This research was supported by the Ministry of Science and Technology under grants MOST 102-
2221-E-002-103-MY3, 102-2221-E-155-029-MY3, 103-2221-E-003-013-MY3, and 103-2911-I-003-301,
and by the “Aim for the Top University Project” and “Center of Learning Technology for Chinese” of
National Taiwan Normal University, sponsored by the Ministry of Education, Taiwan, R.O.C.
References
Chodorow, M., & Leacock, C. (2000). An unsupervised method for detecting grammatical errors. Proceedings
of NAACL’00 (pp. 140-147). Seattle, Washington: ACL Anthology.
Díaz-Negrillo, A., & Fernández-Domínguez, J. (2006). Error tagging systems for learner corpora. RESLA, 19,
Granger, S. (2003). The International Corpus of Learner English: a new resource for foreign language learning
and teaching and second language acquisition research. TESOL Quarterly, 37(3), 538-546.
Izumi, E., Uchimoto, K., Saiga, T., Supnithi, T., & Isahara, H. (2003). Automatic error detection in the Japanese
learner’s English spoken data. Proceedings of ACL’03 (pp. 145-148). Sapporo, Japan: ACL Anthology.
Lee, L.-H., Chang, L.-P., Lee, K.-C., Tseng, Y.-H., & Chen, H.-H. (2013). Linguistic rules based Chinese error
detection for second language learning. Proceedings of ICCE’13 (pp. 27-29). Bali, Indonesia: Asia-Pacific
Society for Computers in Education.
Ng, H. T., Wu, S. M., Wu, Y., Hadiwinoto, C., & Tetreault, J. (2013). The CoNLL-2013 shared task on
grammatical error correction. Proceedings of CoNLL’13 (pp. 1-12), Sofia, Bulgaria: ACL Anthology.
Nicholls, D. (2003). The Cambridge Learner Corpus error coding and analysis for lexicography and ELT.
Proceedings of CL’03 (pp. 572-581), Lancaster, UK.
Wu, S.-H., Liu, C.-L., & Lee, L.-H. (2013). Chinese spelling check evaluation at SIGHAN bake-off 2013.
Proceedings of SIGHAN’13 (pp. 35-42), Nagoya, Japan: ACL Anthology.