Supervised collaboration for syntactic annotation of Quranic Arabic

Language Resources and Evaluation (Impact Factor: 0.62). 03/2013; 47(1):1-30. DOI: 10.1007/s10579-011-9167-7


The Quranic Arabic Corpus is a collaboratively constructed linguistic resource initiated at the University of Leeds, with multiple layers of annotation
including part-of-speech tagging, morphological segmentation (Dukes and Habash 2010) and syntactic analysis using dependency grammar (Dukes and Buckwalter 2010). The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400-year-old
central religious text of Islam. This project contrasts with other Arabic treebanks by providing a deep linguistic model based
on the historical traditional grammar known as i′rāb (إعراب). By adapting this well-known canon of Quranic grammar into a familiar tagset, it is possible to encourage online
annotation by Arabic linguists and Quranic experts. This article presents a new approach to linguistic annotation of an Arabic
corpus: online supervised collaboration using a multi-stage approach. The different stages include automatic rule-based tagging,
initial manual verification, and online supervised collaborative proofreading. A popular website attracting thousands of visitors
per day, the Quranic Arabic Corpus has approximately 100 unpaid volunteer annotators each suggesting corrections to existing
linguistic tagging. To ensure a high-quality resource, a small number of expert annotators are promoted to a supervisory role,
allowing them to review or veto suggestions made by other collaborators. The Quran also benefits from a large body of existing
historical grammatical analysis, which may be leveraged during this review. In this paper we evaluate and report on the effectiveness
of the chosen annotation methodology. We also discuss the unique challenges of annotating Quranic Arabic online and describe
the custom linguistic software used to aid collaborative annotation.
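
The supervised review step described above can be sketched in code. This is only an illustrative model under stated assumptions, not the project's actual software: the names `Token`, `Suggestion`, `Status` and `review`, the token identifier format, and the example tags are all hypothetical. It shows the one rule the abstract states, namely that volunteers propose corrections while only supervisors may accept or veto them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    VETOED = "vetoed"


@dataclass
class Suggestion:
    token_id: str          # hypothetical identifier, e.g. "1:1:1"
    proposed_tag: str      # tag proposed by a volunteer annotator
    author: str
    status: Status = Status.PENDING


@dataclass
class Token:
    token_id: str
    tag: str               # current tag, initially from automatic tagging
    history: list = field(default_factory=list)  # earlier tags, oldest first


def review(token, suggestion, reviewer, supervisors, accept):
    """Accept or veto a suggestion; only supervisors may decide."""
    if reviewer not in supervisors:
        raise PermissionError(f"{reviewer} is not a supervisor")
    if suggestion.token_id != token.token_id:
        raise ValueError("suggestion does not refer to this token")
    if accept:
        token.history.append(token.tag)   # keep the replaced tag for audit
        token.tag = suggestion.proposed_tag
        suggestion.status = Status.ACCEPTED
    else:
        suggestion.status = Status.VETOED  # the token is left unchanged
    return suggestion.status


if __name__ == "__main__":
    supervisors = {"expert_1"}                      # assumed supervisor set
    token = Token("1:1:1", "N")                     # placeholder tag
    suggestion = Suggestion("1:1:1", "PN", "volunteer_7")
    print(review(token, suggestion, "expert_1", supervisors, accept=True))
    print(token.tag, token.history)
```

Keeping the replaced tags in `history` is a design choice of this sketch; it makes every accepted correction reversible, which suits a workflow where later reviewers may consult existing grammatical analyses and change a decision.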

Keywords: Collaborative annotation, Arabic, Treebank, Quran, Corpus



Available from: Eric Atwell, Oct 13, 2015
  • Source
    • "These make dealing with Arabic language a challenging task when applying machine learning and artificial intelligence techniques. Few research studies have considered the Arabic text of Quran [5], [6], [7], [8], instead many studies deal with the translations of the meaning of the words of the holy Quran [9], [10], [11], [12], [13], [14]. Kais and his colleagues have created an open source Quranic corpus [15] using both arabic words as well as translations of these words. "
    ABSTRACT: The Holy Quran is the reference book for more than 1.6 billion Muslims all around the world. Extracting information and knowledge from the Holy Quran is of high benefit to both specialists in Islamic studies and non-specialists. This paper initiates a series of research studies that aim to serve the Holy Quran and provide helpful and accurate information and knowledge to all human beings. The planned research studies also aim to lay out a framework for researchers in the field of Arabic natural language processing by providing a "Golden Dataset" along with useful techniques and information that will advance this field further. The aim of this paper is to find an approach for analyzing Arabic text and then to provide statistical information that might be helpful to people in this research area. In this paper the Holy Quran text is preprocessed, and different text mining operations are then applied to it to reveal simple facts about the terms of the Holy Quran. The results show a variety of characteristics of the Holy Quran, such as its most important words, its word cloud and its chapters with high term frequencies. All these results are based on term frequencies calculated using both Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) methods.
    International Journal of Advanced Computer Science and Applications 03/2015; 6(2):262-267. DOI:10.14569/IJACSA.2015.060237 · 1.32 Impact Factor
  • Source
    • "There are online (Dukes, Atwell & Habash, 2012; "Project Root List", 2012) and offline (Dror, Shaharabani, Talmon & Wintner, 2004; Talmon & Wintner, 2003) applications that serve the need for these lexical lookups. Foremost among these is the Qur'an Corpus (Dukes et al., 2012 "
    ABSTRACT: About 80 percent of the world's Muslim population are non-native speakers of the Arabic language. Since it is obligatory for all Muslims to recite the Qur'an in Arabic during regular prayers, an extraordinary social phenomenon has taken place in some parts of the Muslim world: in schools, children are only taught the complex phonetic rules of the Arabic language in the context of the Qur'an. This has given rise to a demographic segment of adult learners who are interested in a Language for Specific Purposes (LSP) curriculum that would help them learn a closed set of syntactic rules and vocabulary in the context of the Qur'an, so that they can recall an idiomatic translation in their native language while they recite or listen to the Qur'an. Little research exists on task modeling and user modeling for this purpose. This research explores the possibilities of using user stereotypes in the creation of task models to be used in the development of a comprehensive Computer Assisted Language Learning (CALL) module. In this paper, first, the design and initial prototype of a ubiquitous web-based language learning software are presented and some results of the user modeling are shared. Second, changes made to the initial implementation based on the user modeling are presented. Finally, merits and drawbacks of the new implementation are shared.
  • ABSTRACT: In this paper, we describe and compare two statistical parsing approaches for the hybrid dependency-constituency syntactic representation used in the Quranic Arabic Treebank (Dukes and Buckwalter, 2010). In our first approach, we apply a multi-step process in which we use a shift-reduce algorithm trained on a pure dependency preprocessed version of the treebank. After parsing, the dependency output is converted into the hybrid representation. This is compared to a novel one-step parser that is able to learn the hybrid representation without preprocessing. We define an extended labelled attachment score (ELAS) as our performance metric for hybrid parsing, and report 87.47% (F1 score) for the multi-step approach, and 89.03% (F1 score) for the one-step integrated algorithm. We also consider the effect of using different sets of morphological features for parsing the Quran, comparing our results to recent work on Modern Standard Arabic.
    Proceedings of the 12th International Conference on Parsing Technologies, IWPT 2011, October 5-7, 2011, Dublin City University, Dublin, Ireland; 01/2011