Annotating an Arabic Learner Corpus for Error.

Conference: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco
ABSTRACT This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.


