Ogata, H. et al. (Eds.) (2015). Proceedings of the 23rd International Conference on Computers in Education.
China: Asia-Pacific Society for Computers in Education
A Retrieval System for Interlanguage Analysis
Lung-Hao LEEᵃ, Li-Ping CHANGᵇ, Bo-Shun LIAOᵃ, Wan-Ling CHENGᵇ & Yuen-Hsien TSENGᵃ*
ᵃ Information Technology Center, National Taiwan Normal University, Taiwan
ᵇ Mandarin Training Center, National Taiwan Normal University, Taiwan
*samtseng@ntnu.edu.tw
Abstract: In this paper, we describe the development of a retrieval system designed for
analyzing interlanguage. We adopt the annotated TOCFL learner corpus as the target to
explore language acquisition by learners of Chinese as a foreign language. An
illustrative scenario is presented to demonstrate the functionalities of the implemented prototype
system. This system can be regarded as a computer-assisted tool for contrastive interlanguage
analysis research.
Keywords: second language acquisition, learner corpora, Mandarin Chinese
1. Introduction
Learner corpora, such as the Longman Learners’ Corpus, the International Corpus of Learner English (ICLE)
(Granger, 2003), and the Cambridge Learner Corpus (CLC) (Nicholls, 2003), to name but a few, are
important collections of foreign language learners’ linguistic production for research on second language
acquisition and foreign language teaching (Granger, 2002). To make learner corpora more useful,
they must be annotated with defined error types for automatic or manual analysis (Díaz-Negrillo &
Fernández-Domínguez, 2006).
From an engineering viewpoint, annotated learner corpora can be employed to develop
specific Natural Language Processing (NLP) systems for educational applications, for instance,
computer-assisted essay writing (Milton, 1998), spelling error checking (Yu et al., 2014; Tseng et al.,
2015), and grammatical error detection/correction (Chodorow and Leacock, 2000; Izumi et al., 2003;
Lee et al., 2013; Ng et al., 2014; Yu et al., 2014; Lee et al., 2015). From a linguistic perspective,
interlanguage is the linguistic system used by second or foreign language learners who are in
the process of learning a target language. Contrastive Interlanguage Analysis (CIA) is the main
methodology that combines the research areas of corpus linguistics and second language acquisition
(Granger, 2015). By comparing a learner corpus with native speakers’ usage, researchers can identify
learners’ incorrect usages or overgeneralizations (Ishikawa, 2009). In addition,
linguistic features of learners with different L1s can also be obtained from CIA research (Chang, 2014).
In this work, we develop and implement a retrieval system to help researchers analyze
interlanguage. The system is flexible enough to meet information needs expressed as various search
conditions. In addition, the search results can be easily downloaded if needed.
2. The Retrieval System for the Annotated Learner Corpus
The learner corpus originates mainly from the computer-based writing Test of Chinese as a Foreign
Language (TOCFL). The writing test is designed according to the six proficiency levels of the Common
European Framework of Reference (CEFR). Test takers have to complete two different tasks for each
level. For example, A2 (Waystage level) candidates are asked to write a note and to
describe a story after looking at four pictures. All candidates complete their writing online.
Each text is then scored on a 0-5 point scale: a score of 5 indicates high-quality writing, and a score of 3 is the
threshold for passing the test. A total of 4,567 essays have been collected in the TOCFL
learner corpus.
Native Chinese speakers are then trained and asked to label the grammatical error types in
learners’ writings using the tagging editor (Lee et al., 2014). For the study of Chinese
learners’ interlanguage, hierarchical error tags are designed that combine two classifications. One is the target
modification taxonomy, which includes mis-ordering (permutation), redundancy (addition), omission (deletion), and
mis-selection (substitution). The other is the linguistic category classification, which consists of linguistic
types such as noun, verb, preposition, and specific construction. So far, 2,837 essays with
a score above 3 have been annotated. In total, there are 33,497 error instances. The top three error tags are
Sv (mis-selection of verbs), Sn (mis-selection of nouns), and Madv (missing adverbs), with
frequencies of 3,838, 2,252, and 1,714, respectively.
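The two-level tag scheme can be illustrated with a short sketch. It assumes that each tag consists of one capital letter for the target modification followed by lowercase letters for the linguistic category, as the examples Sv, Sn, and Madv suggest. The letters R and W for redundancy and mis-ordering, and the names split_tag and share_of_total, are assumptions made for illustration and are not taken from the annotation guidelines.

```python
# Illustrative sketch only: the tag layout is inferred from Sv, Sn, and Madv.
# "S" and "M" follow the examples in the text; "R" and "W" are assumed letters.
MODIFICATION_TYPES = {
    "S": "mis-selection (substitution)",
    "M": "omission (deletion)",
    "R": "redundancy (addition)",        # assumed letter
    "W": "mis-ordering (permutation)",   # assumed letter
}


def split_tag(tag):
    """Split an error tag into (modification type, linguistic category)."""
    head, category = tag[:1], tag[1:]
    if head not in MODIFICATION_TYPES:
        raise ValueError(f"unknown modification type in tag: {tag!r}")
    return MODIFICATION_TYPES[head], category


# Frequencies of the three most frequent tags, as reported in the text.
TOP_TAGS = {"Sv": 3838, "Sn": 2252, "Madv": 1714}
TOTAL_ERRORS = 33497


def share_of_total(counts, total):
    """Return the fraction of all error instances covered by `counts`."""
    return sum(counts.values()) / total


if __name__ == "__main__":
    for tag, freq in TOP_TAGS.items():
        modification, category = split_tag(tag)
        print(f"{tag}: {modification}, category '{category}', {freq} instances")
    print(f"Top-3 share: {share_of_total(TOP_TAGS, TOTAL_ERRORS):.1%}")
```

With the reported figures, the three most frequent tags together cover 7,804 of the 33,497 error instances, which is a little under a quarter of all annotations.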
The search functions of our retrieval system can be divided into two main parts. (1) Basic
search: users can select the main types of error tags, i.e., the modification types, and the linguistic
categories. The learners’ CEFR proficiency levels and the scores of their written
essays can be chosen by ticking all checkboxes that apply. Users can also choose the learners’
mother tongue and the type of writing style. In addition, a concordance function
shows the characters surrounding the search target in the search results. (2) Advanced search:
once the search targets are determined, users can further filter the search results by
including or excluding characters that occur on the left-hand or right-hand side. Moreover, the search
results can be easily downloaded in plain text format for further research.
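The two search modes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the system: the record layout (ErrorInstance and all of its fields), the function names, and the tab-separated export format are hypothetical, and the writing-style filter is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ErrorInstance:
    """One annotated error; all field names are hypothetical."""
    sentence: str   # learner sentence containing the error
    start: int      # character offset where the erroneous span starts
    end: int        # character offset just past the erroneous span
    tag: str        # error tag, e.g. "Sv"
    cefr: str       # CEFR level of the test, e.g. "A2"
    score: int      # essay score on the 0-5 scale
    l1: str         # learner's mother tongue


def basic_search(records, modification=None, category=None,
                 cefr_levels=None, scores=None, l1s=None):
    """Keep records matching every condition that is not None."""
    hits = []
    for r in records:
        if modification is not None and r.tag[:1] != modification:
            continue
        if category is not None and r.tag[1:] != category:
            continue
        if cefr_levels is not None and r.cefr not in cefr_levels:
            continue
        if scores is not None and r.score not in scores:
            continue
        if l1s is not None and r.l1 not in l1s:
            continue
        hits.append(r)
    return hits


def concordance(record, width=5):
    """Return (left context, target, right context), measured in characters."""
    left = record.sentence[max(0, record.start - width):record.start]
    target = record.sentence[record.start:record.end]
    right = record.sentence[record.end:record.end + width]
    return left, target, right


def advanced_search(records, width=5, left_include=None, left_exclude=None,
                    right_include=None, right_exclude=None):
    """Filter hits by characters occurring in the left/right context."""
    hits = []
    for r in records:
        left, _, right = concordance(r, width)
        if left_include is not None and left_include not in left:
            continue
        if left_exclude is not None and left_exclude in left:
            continue
        if right_include is not None and right_include not in right:
            continue
        if right_exclude is not None and right_exclude in right:
            continue
        hits.append(r)
    return hits


def to_plain_text(records, width=5):
    """Format hits as tab-separated lines suitable for a plain-text download."""
    lines = []
    for r in records:
        left, target, right = concordance(r, width)
        lines.append("\t".join(
            [r.tag, r.cefr, str(r.score), r.l1, left, target, right]))
    return "\n".join(lines)
```

In this sketch, a basic search such as basic_search(records, modification="S", cefr_levels={"A2"}) narrows the corpus first, and advanced_search then filters the remaining hits by their surrounding characters. Treating every unset condition as "no restriction" mirrors the checkbox behaviour described above.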
3. An Illustrative Scenario for Interlanguage Analysis
We present a scenario to illustrate the effectiveness of our retrieval system for interlanguage
analysis. Taking the ‘讓’ (rang4 ‘to make’) sentence as an example, we can choose the main error type S
and the sub-type rang. Figure 1 shows the search results. We found that learners usually confuse
‘讓’ (rang4 ‘to make’) with ‘把’ (ba3 ‘disposal marker’), ‘對’ (dui4 ‘to someone’), and ‘給’ (gei3 ‘to give’).
If the corpus had no error tags, we could only search for the keyword ‘讓’ (rang4 ‘to make’) and
inspect the sentences one by one to find erroneous usages, and we would still miss the sentences
in which ba3 or dui4 is misused in its place. With the help of this retrieval system, we can shorten this time considerably. Moreover,
we can limit the search results to a specific word only, such as ‘把’ (ba3 ‘disposal marker’),
which facilitates in-depth observation and analysis. In addition to the filtering function, we can also
select specific learner attributes such as the learners’ mother tongue or proficiency. By taking
advantage of these functions, the analysis of interlanguage can be done more easily and quickly.
Figure 1. Search results for the rang4 sentence in the annotated TOCFL corpus
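The kind of observation made in this scenario, namely which words learners write where rang4 is expected, amounts to a frequency count over the retrieved hits. The following sketch shows one way to tally such hits and to restrict them to a single confused word or mother tongue. The toy data are invented for illustration and are not drawn from the TOCFL corpus; the function names are hypothetical.

```python
from collections import Counter

# Invented toy hits for the main type S with sub-type rang. Each pair is
# (word the learner wrote, learner's mother tongue). Not real corpus data.
TOY_HITS = [
    ("把", "Japanese"),   # ba3
    ("對", "English"),    # dui4
    ("把", "English"),    # ba3
    ("給", "Korean"),     # gei3
    ("把", "Japanese"),   # ba3
]


def confusion_counts(hits):
    """Count how often each word was written where rang4 was expected."""
    return Counter(word for word, _ in hits)


def restrict_to(hits, word, l1=None):
    """Keep hits for one confused word, optionally for one mother tongue."""
    return [h for h in hits if h[0] == word and (l1 is None or h[1] == l1)]


if __name__ == "__main__":
    for word, count in confusion_counts(TOY_HITS).most_common():
        print(word, count)
    print(len(restrict_to(TOY_HITS, "把", l1="Japanese")))
```

Such a tally is only possible because the error tag links each misused word to the intended target; a plain keyword search for rang4 would never surface these sentences.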
4. Conclusions and Future Work
This article describes our retrieval system, which can be applied to analyze error types in annotated learner
corpora. An illustrative scenario of this prototype system is presented for interlanguage analysis. We
will further collect researchers’ feedback and discuss it with them to enhance the system’s functions.
Acknowledgements
This research was partially supported by the Ministry of Science and Technology, under grants
MOST 103-2221-E-003-013-MY3 and MOST 104-2911-I-003-301, and by the “Aim for the Top University
Project” and the “Center of Learning Technology for Chinese” of National Taiwan Normal University,
sponsored by the Ministry of Education, Taiwan.
References
Chang, L.-P. (2014). Salient linguistic features of Chinese learners with different L1s: a corpus-based study.
International Journal of Computational Linguistics and Chinese Language Processing, 19(2), 53-72.
Chodorow, M., & Leacock, C. (2000). An unsupervised method for detecting grammatical errors. Proceedings of
NAACL’00 (pp. 140-147). Seattle, Washington: ACL Anthology.
Díaz-Negrillo, A., & Fernández-Domínguez, J. (2006). Error tagging systems for learner corpora. RESLA, 19,
83-102.
Granger, S. (2002). A bird’s eye view of learner corpus research. In Granger, S., Huang, J. & Petch-Tyson, S.
(eds.) Computer Learner Corpora, Second Language Acquisition, and Foreign Language Teaching.
Amsterdam & Philadelphia: Benjamins, 3-33.
Granger, S. (2003). The International Corpus of Learner English: a new resource for foreign language learning
and teaching and second language acquisition research. TESOL Quarterly, 37(3), 538-546.
Granger, S. (2015). Contrastive interlanguage analysis: a reappraisal. International Journal of Learner Corpus
Research, 1(1), 7-24.
Ishikawa, S. (2009). Phraseology overused and underused by Japanese learners of English: a contrastive interlanguage
analysis. Phraseology, Corpus Linguistics and Lexicography, 87-100.
Izumi, E., Uchimoto, K., Saiga, T., Supnithi, T., & Isahara, H. (2003). Automatic error detection in the Japanese
learner’s English spoken data. Proceedings of ACL’03 (pp. 145-148), Sapporo, Japan: ACL Anthology.
Lee, L.-H., Chang, L.-P., Lee, K.-C., Tseng, Y.-H., & Chen, H.-H. (2013). Linguistic rules based Chinese error
detection for second language learning. Proceedings of ICCE’13 (pp. 27-29), Bali, Indonesia: Asia-Pacific
Society for Computers in Education.
Lee, L.-H., Lee, K.-C., Chang, L.-P., Yu, L.-C., Tseng, Y.-H., & Chen, H.-H. (2014). A tagging editor for learner
corpus annotation and error analysis. Proceedings of ICCE’14 (pp. 806-808), Nara, Japan: Asia-Pacific
Society for Computers in Education.
Lee, L.-H., Yu, L.-C., & Chang, L.-P. (2015). Overview of the NLP-TEA 2015 shared task for Chinese
grammatical error diagnosis. Proceedings of the 2nd Workshop on Natural Language Processing Techniques
for Educational Applications (pp. 1-6), Beijing, China: ACL Anthology.
Milton, J. (1998). WORDPILOT: enabling learners to navigate lexical universes. Proceedings of the International
Symposium on Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching,
Hong Kong, China.
Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared
task on grammatical error correction. Proceedings of CoNLL’14 (pp. 1-14), Baltimore, Maryland: ACL
Anthology.
Nicholls, D. (2003). The Cambridge Learner Corpus error coding and analysis for lexicography and ELT.
Proceedings of CL’03 (pp. 572-581), Lancaster, UK.
Tseng, Y.-H., Lee, L.-H., Chang, L.-P., & Chen, H.-H. (2015). Introduction to SIGHAN 2015 bake-off for Chinese
spelling check. Proceedings of SIGHAN’15 (pp. 32-37), Beijing, China: ACL Anthology.
Yu, L.-C., Lee, L.-H., Tseng, Y.-H., & Chen, H.-H. (2014). Overview of SIGHAN 2014 bake-off for Chinese
spelling check. Proceedings of CLP’14 (pp. 126-132), Wuhan, China: ACL Anthology.
Yu, L.-C., Lee, L.-H., & Chang, L.-P. (2014). Overview of grammatical error diagnosis for learning Chinese as a
foreign language. Proceedings of the 1st Workshop on Natural Language Processing Techniques for
Educational Applications (pp. 42-47), Nara, Japan: Asia-Pacific Society for Computers in Education.