Annotating an Arabic Learner Corpus for Error.
ABSTRACT This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.
- Source available from: Sylviane Granger
ABSTRACT: Learner corpora, electronic collections of foreign or second language learner data, constitute a new resource for second language acquisition (SLA) and foreign language teaching (FLT) specialists. They are especially useful when they are error-tagged, that is, when all errors in the corpus have been annotated with the help of a standardized system of error tags. This article describes the three-tiered error annotation system designed to annotate the French Interlanguage Database (FRIDA) corpus. The research took place within the framework of the FreeText project, which aims to produce a learner corpus-informed CALL program for French as a Foreign Language. Once annotated, the FRIDA corpus was put through standard text retrieval software to extract detailed error statistics and to carry out concordance-based analyses of specific error types. The results were used to focus the CALL exercises on learners' attested difficulties and to improve the error diagnosis system integrated in the CALL program. 01/2003; 20:465-480.
- ABSTRACT: The Montclair Electronic Language Database (MELD) is an expanding collection of essays written by students of English as a second language. This paper describes the content and structure of the database and gives examples of database applications. The essays in MELD consist of the timed and untimed writing of undergraduate ESL students, dated so that progress can be tracked over time. Demographic data is also collected for each student, including age, sex, L1 background, and prior experience with English. The essays are continuously being tagged for errors in grammar and academic writing as determined by a group of annotators. The database currently consists of 44,477 words of tagged text and another 53,826 words of text ready to be tagged. The database allows various analyses of student writing, from assessment of progress over time to relation of error type and L1 background. Language and Computers. 10/2004;