Conference Paper

Annotating an Arabic Learner Corpus for Error.

Conference: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco
Source: DBLP


This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.



Available from: Anna Feldman, May 10, 2014
  • Source
    ABSTRACT: We describe a new Arabic spelling correction system which is intended for use with electronic dictionary search by learners of Arabic. Unlike other spelling correction systems, this system does not depend on a corpus of attested student errors but on student- and teacher-generated ratings of confusable pairs of phonemes or letters. Separate error modules for keyboard mistypings, phonetic confusions, and dialectal confusions are combined to create a weighted finite-state transducer that calculates the likelihood that an input string could correspond to each citation form in a dictionary of Iraqi Arabic. Results are ranked by the estimated likelihood that a citation form could be misheard, mistyped, or mistranscribed for the input given by the user. To evaluate the system, we developed a noisy-channel model trained on students' speech errors and used it to perturb citation forms from a dictionary. We compare our system to a baseline based on Levenshtein distance and find that, when evaluated on single-error queries, our system performs 28% better than the baseline (overall MRR) and is twice as good at returning the correct dictionary form as the top-ranked result. We believe this to be the first spelling correction system designed for a spoken, colloquial dialect of Arabic.
    Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta; 08/2010
  • Source

    Leeds Language, Linguistics and Translation PGR Conference 2013, University of Leeds, UK; 01/2013