Arabic Rule-based Named Entity Recognition System
Using GATE
Hatem M. ElSherif 1[0000-0002-4402-9425], Khaled Mohammad Alomari2[0000-0001-6677-6301],
Ahmad Qasim AlHamad3 [0000-0002-7083-5375] and Khaled Shaalan4 [0000-0003-0823-8390]
1Faculty of Engineering & IT, The British University in Dubai, Dubai, UAE
2Faculty of IT, Abu Dhabi University, Abu Dhabi, UAE
3Faculty of IT, Abu Dhabi University, Abu Dhabi, UAE
4Faculty of Engineering & IT, The British University in Dubai, Dubai, UAE
Abstract. As Arab users' presence and participation in the digital world continue to grow, electronic content comprising Arabic textual information also increases significantly, driven by the transition of traditional media channels to online Arabic news and social networking. Such a rich source of information has attracted many Arabic natural language processing researchers to tackle the named entity recognition problem, which plays a very influential role in the implementation of information processing systems. The proposed Arabic named entity recognition system adopts the rule-based approach, using the General Architecture for Text Engineering (GATE) as a development environment. The system is applied to the ANERcorp, and the results in terms of F-measure are 83% for Person NEs, 89% for Organisation NEs, and 92% for Location NEs.
Keywords: Arabic Named Entity Recognition, GATE, Natural Language Processing
1 Introduction
Named Entity Recognition (NER) is a subtask of Information Extraction that was first introduced at the Sixth Message Understanding Conference (MUC-6) [1]. It is the task of both recognising and classifying definite named entities such as names of persons, organizations, and locations; numeric expressions, including monetary amounts; and date and time expressions [2], [3].
NER is widely used in Natural Language Processing (NLP) applications to identify names in context and classify them according to a specific predefined set of categories. Despite the many successful ANER research efforts and systems, the Arabic Named Entity Recognition (ANER) task is still considered to be in its early stages due to continuously changing requirements. This study presents the challenges of implementing an ANER system adopting the rule-based approach using the General Architecture for Text Engineering (GATE)1. This study used the ANERcorp2, which is formatted according to the Conference on Computational Natural Language Learning (CoNLL) shared task framework CoNLL-20023, i.e., following the "Begin, Inside, Outside" (BIO) chunking representation format [4]. Illustrative examples extracted from the ANERcorp are shown in Table 1, providing the Arabic word, its English translation, and the word tag in the CoNLL-2002 tagging format. The type classifies the words according to the CoNLL-2002 task into four types (persons/PERS, locations/LOC, organizations/ORG and miscellaneous/MISC), followed by the named entity chunking.
Table 1. Example of ANERcorp CoNLL tagging (the Arabic words are not preserved in this copy; the surviving translations include "He favored" and a person name)
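The BIO scheme just described can be illustrated with a short sketch (the token/tag pairs below are invented for illustration, not drawn from ANERcorp) that groups CoNLL-2002-style tagged tokens into named-entity chunks:

```python
def bio_to_chunks(tagged_tokens):
    """Group (token, BIO-tag) pairs into (entity_text, entity_type) chunks."""
    chunks, current, ctype = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)          # continue the open entity
        else:                              # "O" closes any open entity
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        chunks.append((" ".join(current), ctype))
    return chunks

# Invented example in the CoNLL-2002 style used by ANERcorp
tags = [("United", "B-LOC"), ("Nations", "I-LOC"), ("met", "O"),
        ("John", "B-PERS"), ("Smith", "I-PERS")]
print(bio_to_chunks(tags))  # [('United Nations', 'LOC'), ('John Smith', 'PERS')]
```
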
1.1 Approaches for Named Entity Recognition
The various approaches proposed for Named Entity Recognition tasks can be grouped into three main categories: the Rule-Based (Handcrafted) Approach, the Machine Learning Approach, and the Hybrid Approach [5]–[8]. Several researchers have conducted comprehensive surveys of NER approaches [2], [9]–[12].
1.2 Arabic Named Entity Recognition Issues and Challenges
A comprehensive review by Farghaly and Shaalan [13] provides insight into the ANLP challenges facing researchers. Arabic NER faces additional challenging issues: Latin characters are used as a replacement for the Arabic script [14]; the capitalization orthographic feature used in Roman-script languages to recognize named entities such as proper names, acronyms, and abbreviations is absent [15]; an analysed Arabic word can have more than one interpretation consisting of a root plus prefix, suffix, and clitic characters [16]; Arabic has a high level of transcriptional ambiguity, leading to various ways of writing and transliterating foreign NEs [17]; different diacritic signs on the same word can produce different meanings [9]; and freely available resources are limited while licensed resources are very costly [18].
2 ANER Systems and Related work
Many ANER systems have been developed using various NER approaches and techniques. In this section, we provide examples of ANER systems.
Maloney and Niv [19] presented the TAGARAB system as one of the earliest efforts to implement an Arabic NER system. The system relies on a rule-based approach using a pattern-matching engine consisting of two main modules, a morphological tokenizer and a name finder, to recognize Person, Organization, Location, Number, and Time entities. Fourteen random texts from Al-Hayat were used to evaluate the system's performance. The results show better accuracy when the two modules (NE finder and morphological tokenizer) are combined than when either is used individually.
The ANERsys system used the ANERcorp: 125,000 tokens for training and 25,000 tokens for testing. The system used the Maximum Entropy approach to identify a named entity using the two words before and after a token, excluding stop words. The training set was weighted using YASMET, and the CoNLL framework script was used for evaluation with and without the use of gazetteers [4]. The second version, ANERsys2, adopted a different architecture using Part-Of-Speech tags and Base Phrase Chunks, improving the performance by 10 points [16]. A further performance improvement was achieved by replacing the Maximum Entropy probabilistic model with Conditional Random Fields [20].
Shaalan and Raza [17] introduced an Arabic NER system called Named Entity Recognition for Arabic (NERA) that extracts 10 types of Arabic named entities using a rule-based approach built from gazetteers, grammar rules, and a filtering mechanism: NE whitelists are extracted first, then grammar rules supported by gazetteers extract further NEs, and finally a blacklist filters out invalid entities. The system achieved F-measures of 87.7% for persons, 85.9% for locations, and 83.15% for organizations [21], [22].
Oudah and Shaalan [6] developed a new hybrid system based on NERA that can recognize 11 Arabic named entity types. The machine learning (ML) approach is used to enhance the accuracy of recognising named entities. The rule-based module was developed using the General Architecture for Text Engineering (GATE), with MADA for morphological analysis and POS tagging. Tested on the ANERcorp corpus, the system achieved F-measures of 94.4% for Person, 90.1% for Location, and 88.2% for Organisation.
Several researchers have used the GATE environment for developing ANER systems, such as [5], [6], [23], [24]. Rule-based systems still achieve significant results and play an important role when integrated with the machine learning approach to form a unified hybrid system [6].
3 Methodology
Building a rule-based NER system from scratch requires five main stages. The first
stage is the data collection for providing the needed language resources. The second is
to identify and select the development environment and processing resources. The third
stage is concerned with processing the linguistic resources. The fourth stage is to de-
velop the recognition rules, and customize the system components. The last stage is to
test and evaluate the system’s results.
3.1 Linguistic Resources
Linguistic resources are a critical component for implementing the NER system. The collection process for such resources required reviewing the literature and other related Arabic NER systems, such as [18], [21], [22], [25], in addition to a wide internet search for free Arabic language resources. Different methods and techniques were used to extract and refine the data sources using automated and manual approaches. The linguistic resources consist of two main parts: the corpus and the gazetteers. This study used freely available corpora due to the lack of access to commercial corpora. Zaghouani [18] conducted an intensive survey of freely available Arabic corpora, including six freely available Named Entity corpora.
A corpus is a very large collection of texts, regarded as much as a methodology for teaching and studying aspects of language as a resource [26]. Corpora are often used in Computational Linguistics research, and it is essential for NE researchers to use corpora in developing, training, and evaluating NER systems. A monolingual corpus contains texts in a single language, a parallel or bilingual corpus consists of two natural languages, and a multilingual corpus may include more [9]. Corpora used in the training and evaluation process should be annotated. Corpora annotated manually by individuals or teams of researchers [27] are commonly called gold-standard corpora. Another approach uses richly structured electronic resources (e.g., Wikipedia, Arabic WordNet4, YAGO5) to automatically generate annotated corpora, commonly called silver-standard corpora [28]. Shaalan [9] pointed to the importance of balance in corpora, particularly in the distribution of named entities, for a corpus used in NER to be reliable. Meanwhile, Al-Thubaity et al. [29] proposed a framework for evaluating standalone Arabic corpora to assist researchers in testing their corpora (usability, functionality, and performance). Corpora are usually collected from Arabic newspaper archives [30] or other digital resources focused on Modern Standard Arabic, such as RSS feeds [31]. With the growing use of social media networks and mobile devices by Arabic users, researchers have started to build corpora from SMS/chat by transliterating Arabizi into Arabic [32], and from code-switched dialectal text such as Algerian Arabic-French [33].
ANERcorp. The main corpus used in training and evaluating this system is the ANERcorp; additional corpora, such as WikiFANEGold6 and NewsFANEGold7 developed by Alotaibi & Lee [34], were used for development and testing. Benajiba et al. [4] developed the manually annotated Arabic corpus (ANERcorp) for the NER task according to the CoNLL-2002 framework in their work implementing ANERsys (Arabic Named Entity Recognition System). The corpus consists of 316 news articles from different topics and newspapers, comprising 150,286 tokens, of which 11% are named entities distributed as 39% persons, 30.4% locations, 20.6% organizations, and 10% miscellaneous. The ANERcorp contains some spelling and annotation errors, as noted by researchers [35].
An example of an annotation error is a location token tagged B-PERS instead of B-LOC (the Arabic token itself is not preserved in this copy).
3.2 Gazetteers
The study refined and used the ANERgazet8, a manually built set of gazetteers collected from several internet resources. It consists of three gazetteers (Location Gazetteer: 1,950 entries; Person Gazetteer: 1,920; Organizations Gazetteer: 262) [4]. In addition, location gazetteers were collected from several resources, such as WikiLocation9, a multilanguage structured data set collected from the Wikipedia database dump, as well as the GeoNames Gazetteer10.
Example of the GeoNames (alternateNames) file format (the Arabic name column is not preserved in this copy):
4471603 292231 ar …
1603445 347497 ar …
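A minimal sketch of pulling the Arabic rows out of such a tab-separated file might look as follows; the column layout (alternateNameId, geonameId, language code, name) is assumed from the GeoNames download documentation, and the sample names are ASCII placeholders since the Arabic text was not preserved in this copy:

```python
import io

def arabic_alternate_names(fileobj):
    """Yield (geoname_id, name) pairs for rows whose language code is 'ar'."""
    for line in fileobj:
        fields = line.rstrip("\n").split("\t")
        # assumed columns: alternateNameId, geonameId, isoLanguage, name, ...
        if len(fields) >= 4 and fields[2] == "ar":
            yield fields[1], fields[3]

# Placeholder rows mimicking the format shown above (names invented)
sample = io.StringIO(
    "4471603\t292231\tar\tDubai-AR\n"
    "4471604\t292231\ten\tDubai\n"
)
print(list(arabic_alternate_names(sample)))  # [('292231', 'Dubai-AR')]
```
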
WikiFANEGazet11 is a gazetteer that includes 683,555 named entities collected from Wikipedia, organized into 8 main classes (Person, Organization, Location, Geo-Political, Facility, Vehicle, Weapon, and Product) and 55 subclasses [36].
Example of the WikiFANEGazet gazetteer's entries (the Arabic entity text is not preserved in this copy):
<ORG_Educational> … </ORG_Educational>
JRC-Names12 is a multilingual named entity gazetteer consisting of 205,000 Person and Organisation named entities collected over seven years from multilingual news and Wikipedia mining, including a number of spelling variants [37]. After extracting, refining, and solving some encoding problems, we ended up with 19,464 Arabic Person and 773 Organization named entities.
Example of the JRC-Names gazetteer's entries (the Arabic names are not preserved in this copy):
185274 P ar …
185311 P u …
100044 O u …
3.3 Arabic Tokenizer
This system uses the GATE Arabic language plugin tokenizer13, which is customized from the English tokenizer. It splits text into the simplest tokens, such as punctuation, numbers, and words, aiming to increase efficiency and to support other processing such as grammar rules. The default settings do not recognize Arabic diacritics and require modification to recognize them as NON_SPACING_MARK [24].
3.4 Part-Of-Speech Tagging
The Part-Of-Speech tagger reads text and assigns a part of speech to each token (for example, noun -> NN, verb -> VB, adjective -> JJ, and plural noun -> NNS). As shown in Table 2, the Arabic word for "Lebanon" is tagged NNP (proper noun, singular). Proper POS tagging requires clitic segmentation; without it, the tool tags words that include clitics as nouns. For example, a verb with the conjunction proclitic attached ("and described") is tagged NNP as a whole, while it should be separated into the proclitic ("and"), tagged CC, and the verb ("described"), tagged VBD; see Fig. 1. Monroe et al. argue that although clitic segmentation improves accuracy on Arabic NLP tasks, existing segmenters are limited to formal Modern Standard Arabic and perform poorly on dialectal Arabic text, so they propose a single accurate clitic segmentation model for both MSA and informal Arabic [38]. On the other hand, Mohamed & Kübler claim that segmentation is not required for POS tagging and propose a higher-accuracy POS tagging approach without any word segmentation, described in [39].
Stanford Part-Of-Speech (POS) Tagger. Developed originally for English at Stanford University based on the maximum-entropy model [40], the enhanced version improved performance and supports other languages (Arabic, Chinese, French, Spanish, and German) [41]. The latest version (v3.5.2)14 includes an Arabic model trained on parts 1-3 of the Penn Arabic Treebank (ATB) using the pre-processing described in [42]. The Stanford POS tagger requires tokenization and segmentation; although the Stanford Word Segmenter15 supports the Arabic language, GATE currently does not support this tool as part of the Stanford CoreNLP plugin [43].
Table 2. Stanford POS Tagger example
Fig. 1. Stanford POS Tagger\ Parser example
3.5 System development environment
This study used the GATE development environment, a free open-source NLP toolkit with multilingual support, including Arabic, that provides an infrastructure for developing and deploying software components that process human language [44]. It is widely used by NLP researchers in many communities and has been used in the development of several Arabic NER systems [9], [45], [46]. A Nearly-New Information Extraction System (ANNIE) is included within GATE as a package of reusable processing resources for common natural language processing tasks (tokenizer, sentence splitter, POS tagger, gazetteer, name matcher); ANNIE depends on finite state algorithms and the JAPE language [47].
3.6 Processing linguistic resources and System Development
GATE was used to build the system with the Lang_Arabic plugin components (Gazetteer Collector, Gazetteer, Tokenizer, Sentence Splitter, and JAPE Transducer) and the Stanford POS Tagger plugin. The corpora were divided equally into three parts. The first part was used for the initial development process, to implement the grammar rules in JAPE and to test the quality of the gazetteers. The second part was used to test, refine, and implement additional rules and gazetteers covering other domains such as sports, and to tune the rules, optimizing priorities and effectiveness. The last part was held out as evaluation corpora used to evaluate the system.
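The three-way division described above can be sketched as a simple deterministic split; the paper does not state the exact splitting procedure, so this is only an illustration:

```python
def split_three_ways(documents):
    """Divide a document list into equal development, tuning, and evaluation parts."""
    n = len(documents) // 3
    development = documents[:n]        # implement rules, check gazetteer quality
    tuning = documents[n:2 * n]        # refine rules, add domains, tune priorities
    evaluation = documents[2 * n:]     # held-out data, touched only at evaluation
    return development, tuning, evaluation

dev, tune, held_out = split_three_ways(list(range(9)))
print(dev, tune, held_out)  # [0, 1, 2] [3, 4, 5] [6, 7, 8]
```
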
GATE currently supports the CoNLL-2002 format, which keeps the original marks for named entity words as one chunk. For example, an organization name whose tokens were originally separated and tagged as shown in Table 3 becomes one chunk annotated with type ORG.
Table 3. Example of ANERcorp annotation by GATE (the Arabic tokens are not preserved in this copy)
Original CoNLL-2002 format: … B-ORG / … I-ORG / … I-ORG
GATE annotation: one chunk of type ORG
Gazetteer building, cleansing, and refining required massive work using several approaches. The default GATE Arabic Gazetteer Collector was customized to automate the extraction of 50 named-entity sub-gazetteers (such as Sports, Politician, Artist, and Educational) from the WikiFANEGold corpora, incorporated with the WikiFANEGazet16 [36] and other collected gazetteers. The system was also used to extract the LOC, ORG, PERS, and MISC entries from the two ANERcorp parts used in development, which were included only in the last stage of testing. In addition to several automated cleansing and refining processes (removing invalid characters such as ":", removing extra spaces, deleting duplicates, and trimming leading and trailing whitespace), a manual process was important to review and group gazetteers such as JRC-Names. The location gazetteers were collected from the GeoNames17 Gazetteer (alternateNames), extracting over 100,000 Arabic location names, which were refined and incorporated with the WikiFANEGazet locations gazetteer to reach about 125,000 locations, including countries' short and long names; other types (landmarks, mountains, water bodies, rivers, continents, and others) were also collected.
The persons gazetteer set includes different types (Artist, Business, Engineer, Police, Politician, Religious, Scientist, Lawyer, and others), collected and refined, including the male, female, and surname gazetteers from the GATE Arabic gazetteers.
A supporting set of gazetteers was created to provide the common words that come before and after named entities, such as titles, nationalities, organization types, and directions. Other gazetteers (date, time, currency, percent, and numbers) support the monetary and numeric expression rules, in addition to conjunctions and stopwords collected from several resources.
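The automated cleansing steps described above (removing invalid characters, collapsing extra spaces, trimming, and de-duplicating) can be sketched as follows; the exact set of invalid characters removed is an assumption:

```python
import re

INVALID_CHARS = re.compile(r'[:"]')  # assumed set of invalid characters

def clean_gazetteer(entries):
    """Normalize gazetteer entries and drop duplicates, preserving order."""
    seen, cleaned = set(), []
    for entry in entries:
        entry = INVALID_CHARS.sub("", entry)         # remove invalid characters
        entry = re.sub(r"\s+", " ", entry).strip()   # collapse and trim whitespace
        if entry and entry not in seen:              # drop empties and duplicates
            seen.add(entry)
            cleaned.append(entry)
    return cleaned

print(clean_gazetteer(["  Cairo ", "Cairo", "New  York:", ""]))  # ['Cairo', 'New York']
```
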
4 Rule-based Named Entity Extractors
The named entity extraction approach has two main stages. In the first stage, we extract the identified named entities directly from gazetteers (whitelists). The second stage extracts named entities indirectly, through POS features and grammatical rules implemented using the JAPE Transducer processing resource.
4.1 Locations Named Entities Recognition rules
The extraction of Location Named Entities is implemented in JAPE by identifying the matched patterns in the Left Hand Side (LHS) part of the rule and defining the annotation in the Right Hand Side (RHS). The implemented rules can be classified into three types:
Direct rules. These use gazetteer whitelists and JAPE macros to annotate tokens and add features for (Country, City, Continent, Waterbody, River). Additional features (Landmarks and Religious Locations) were annotated as LOCF, as they are not represented in the ANERcorp.
Indirect rules. These match Location NEs using keyword gazetteers, in the simplest form as pre-location keywords ("City of", "Country of") and directions, in order to annotate Location Named Entities not recognised from the whitelist gazetteers and to attach features. An example of a location NE annotated by this rule is "Java Island", recognised via the "Island" keyword; this named entity cannot be identified by the gazetteer lookup process because its spelling in the text differs from the correct Arabic spelling in the gazetteer.
Context relational rules. Contextual relations are a very powerful technique for recognising NEs, and the same keyword approach can be integrated with this method. Relations can be resolved using specific keywords such as "the two cities", indicating that the next two tokens will most probably be the names of two cities. The rule LocRule_DirRelation1, for instance, is triggered by text containing the relevant keyword and a direction part, recognizing the direction affix plus the following token as a location Named Entity.
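The indirect and relational rules above are written in JAPE; the underlying keyword-matching idea can be sketched in plain Python (English placeholder keywords stand in for the Arabic keyword gazetteers, which are not preserved in this copy):

```python
# Pre-location keywords that signal the next token is a location NE
# (English placeholders; the actual system uses Arabic keyword gazetteers)
PRE_LOCATION_KEYWORDS = {"city", "island", "country"}

def find_keyword_locations(tokens):
    """Return tokens annotated as LOC because they follow a pre-location keyword."""
    found = []
    for i, token in enumerate(tokens[:-1]):
        if token.lower() in PRE_LOCATION_KEYWORDS:
            found.append(tokens[i + 1])  # the token after the keyword is the candidate
    return found

print(find_keyword_locations(["the", "island", "Java", "is", "large"]))  # ['Java']
```

In the real system this matching is expressed as a JAPE LHS pattern over gazetteer Lookup annotations, with the RHS attaching the LOC annotation and its features.
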
4.2 Organisations Extractor
A direct rule annotates the organisation NE whitelist according to the corpora (ORG), adding features defined by the researchers for the types (Airport, Educational, Commercial, Entertainment, Medical-Science, Media, Non-Government, Religious, and Sports). The direct rule relies on the quality of the gazetteers. Recognising sports organisation NEs in particular proved very difficult for the direct rule, as the majority of sports club names contain city names and nationalities. An example of the results for rule OrgSports4 is a phrase such as "for the Manchester Portuguese player", where the rule uses pre-organization words, location gazetteers, and nationality gazetteers to recognise the candidate sports team, and utilizes POS tagging to reduce the matching scope to NNP ("proper noun") and NN ("noun").
4.3 Persons Extractor
This component annotates Person NEs according to the corpora (PERS) and includes additional features defined by the researchers for person named entities of the types (Artist, Business, Engineer, Police, Politician, Religious, Scientist, Lawyer, and others). Additional rules were implemented to chunk and collect person name sets using gazetteers (e.g., titles, pre-person keywords) and various POS-tag techniques. The following are examples of the implemented rules for recognising Person NEs using job titles and lexical information from POS tagging. For example, rule PERSaffix3 is triggered when the first token ("the artist") is found in the person pre-words gazetteer and is followed by a token tagged DTJJ; the Stanford POS tagger follows the Penn Treebank tagset, where DT is "determiner" and JJ is "adjective", matching the second part ("the genius"), and the rule then looks for a minimum of one and a maximum of four following tokens of "noun" types.
Another rule uses both a person title and organisation pre-keywords to recognize the person NE.
5 Evaluation
The confusion matrix shown in Table 4 mathematically defines the way the system's performance ("goodness") is measured against a human-annotated gold standard. It is used to derive the evaluation metrics (Precision, Recall, and F-measure), which are widely used in evaluating the performance of Information Retrieval systems and have become a standard evaluation method [48].
Table 4. Confusion matrix

                            System: entity    System: non-entity
Original marks: entity      true positive     false negative
Original marks: non-entity  false positive    true negative
The following formulas define the calculation of the evaluation metrics:

Precision = true positive / (true positive + false positive)                               (1)
Recall = true positive / (true positive + false negative)                                  (2)
F-measure = (2 × true positive) / ((2 × true positive) + false positive + false negative)  (3)
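Formulas (1)-(3) can be computed directly from the confusion-matrix counts, for example:

```python
def ner_metrics(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts,
    following formulas (1)-(3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f_measure

# Example counts (invented): 80 correct, 20 spurious, 20 missed annotations
p, r, f = ner_metrics(tp=80, fp=20, fn=20)
print(p, r, f)  # 0.8 0.8 0.8
```

Note that the F-measure in formula (3) is the harmonic mean of precision and recall.
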
The system's performance was evaluated using the GATE AnnotationDiff tool, as shown in Fig. 2. This tool compares the original marks with the generated annotations; the evaluation results provide statistics on system performance together with the standard measures Recall, Precision, and F-measure.
Fig. 2. GATE AnnotationDiff tool snapshot for the Arabic Person NE.
The system performance results generated by AnnotationDiff for Person NEs are shown in Table 5, which reports the measures according to the confusion-matrix approach. Counting partially correct matches (lenient), the system achieved an F-measure of 0.86; excluding them (strict), it achieved 0.80; the average of the strict and lenient results is 0.83.
Table 5. Person NE Results (correct, partially correct, missing, and false positive counts; values not preserved in this copy)
The system performance results generated by AnnotationDiff for Location NEs are shown in Table 6, calculated according to the confusion-matrix approach. Counting partially correct matches (lenient), the system achieved an F-measure of 0.93; excluding them (strict), it achieved 0.90; the average of the strict and lenient results is 0.92.
Table 6. Location NE Results (correct, partially correct, missing, and false positive counts; values not preserved in this copy)
The system performance results generated by AnnotationDiff for Organisation NEs are shown in Table 7, calculated according to the confusion-matrix approach. Counting partially correct matches (lenient), the system achieved an F-measure of 0.93; excluding them (strict), it achieved 0.86; the average of the strict and lenient results is 0.89.
Table 7. Organization NE Results (correct, partially correct, missing, and false positive counts; values not preserved in this copy)
Table 8 provides a summary of the NE system results in three settings (Strict, Lenient, and Average) for Person, Location, and Organisation NEs.
Table 8. NE System Results Summary
Comparing the average results with related work, this study achieved better results for Location NEs (92% F-measure) and Organisation NEs (89% F-measure): [21], [22] reported 85.9% for locations and 83.1% for organisations, and [6] reported 90.1% and 88.2%, respectively. However, both of those studies achieved better Person NE results (87.7% and 94.4% F-measure, respectively) than this study (83%).
6 Conclusion
A Named Entity Recognition system using the rule-based approach requires intensive effort to achieve high performance and is considered more successful when focused on a specific domain. This system's biggest limitations are the lack of segmentation and the need for a filtering process to resolve and remove incorrect annotations produced by the different NE recognition processes, especially conflicts between location and organisation NEs due to embedding. An important lesson learned from implementing this system is that although most recent research tends to use the large knowledge bases available online to increase gazetteer size and scope, the price of such an approach is very high: it reduces system precision and requires intensive effort to clean and manage. Building ANER rule-based systems is more art than science; it requires experience and skill to develop and judge the right techniques, which demands deep knowledge of the Arabic language.
7 Bibliography
1. R. Grishman and B. Sundheim, “Design of the MUC-6 evaluation,” in Proceedings of a workshop
on held at Vienna, Virginia: May 6-8, 1996, 1996, pp. 413422.
2. D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae
Investig., vol. 30, no. 1, pp. 326, 2007.
3. R. Gaizauskas and Y. Wilks, Information extraction: beyond document retrieval, vol. 54, no. 1.
4. Y. Benajiba, P. Rosso, J. BenedíRuiz, and M. Bened, “ANERsys: an Arabic named entity
recognition system based on maximum entropy,” Gelbukh, A. CICLing 2007. LNCS, pp. 143153,
5. S. Abdallah, K. Shaalan, and M. Shoaib, “Integrating Rule-Based System with Classification for
Arabic Named Entity Recognition,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif.
Intell. Lect. Notes Bioinformatics), vol. 7182, no. PART 2, pp. 311322, 2012.
6. M. Oudah and K. Shaalan, “A Pipeline Arabic Named Entity Recognition using a Hybrid
Approach.,” Coling, vol. 2, no. December 2012, pp. 21592176, 2012.
7. O. Zayed, S. El-Beltagy, and O. Haggag, “A Novel Approach for Detecting Arabic Persons ’
Names using Limited Resources,” Complement. Proc. 14th Int. Conf. Intell. Text Process. Comput.
Linguist. CICLing 2013 (Accepted Present., vol. 70, pp. 8193, 2013.
8. B. Bengfort, “A Survey of Stochastic and Gazetteer Based Approaches for Named Entity
Recognition,” no. 7, pp. 1–9, 2014.
9. K. Shaalan, “A Survey of Arabic Named Entity Recognition and Classification,” Comput. Linguist.,
vol. 40, no. 2, pp. 469510, Jun. 2014.
10. A. Satpreet, “Named Entity Recognition Literature Survey,” pp. 1–47, 2012.
11. V. Gupta, “A survey of Named Entity Recognition in English and other Indian Languages,” IJCSI
Int. J. Comput. Sci. Issues, vol. 7, no. 6, pp. 239245, 2010.
RECOGNITION TOOL,” IJRET Int. J. Res. Eng. Technol., pp. 23192322, 2013.
13. A. Farghaly and K. Shaalan, “Arabic Natural Language Processing : Challenges and Solutions,”
ACM Trans. Asian Lang. Inf. Process., vol. 8, no. 4, pp. 122, 2009.
14. M. Maamouri, A. Bies, S. Kulick, and M. Ciul, “Developing an Egyptian Arabic Treebank : Impact
of Dialectal Morphology on Annotation and Tool Development,” in Proc. of LREC, 2010, pp.
15. B. Farber, D. Freitag, N. Habash, and O. Rambow, “Improving NER in Arabic Using a
Morphological Tagger,” in Proceedings of the Sixth International Conference on Language
Resources and Evaluation (LREC’08), 2008, pp. 25092514.
16. Y. Benajiba and P. Rosso, “ANERsys 2.0: Conquering the NER Task for the Arabic Language by
Combining the Maximum Entropy with POS-tag Information.,” in 3rd Indian International
Conference on Artificial Intelligence (IICAI-07), 2007, pp. 18141823.
17. K. Shaalan and H. Raza, “Person Name Entity Recognition for Arabic,” Comput. Linguist., no.
June, pp. 1724, 2007.
18. W. Zaghouani, “Critical Survey of the Freely Available Arabic Corpora Current situation of the
freely available,” in Proceedings of the Workshop on Free/Open-Source Arabic Corpora and
Corpora Processing Tools Workshop Programme, 2014, p. 1.
19. J. Maloney and M. Niv, “Tagarab: a fast, accurate arabic name recognizer using high-precision
morphological analysis,” Proc. Work. Comput. Approaches to Semit. Lang., pp. 815, 1998.
20. Y. Benajiba and P. Rosso, “Arabic Named Entity Recognition using Conditional Random Fields,”
in Proc. of Workshop on HLT & NLP within the Arabic World, 2008, pp. 143153.
21. K. Shaalan and H. Raza, “Arabic named entity recognition from diverse text types,” Lect. Notes
Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 5221
LNAI, pp. 440451, 2008.
22. K. Shaalan and H. Raza, “NERA: Named entity recognition for Arabic,” J. Am. Soc. Inf. Sci.
Technol., vol. 60, no. 8, pp. 16521663, 2009.
23. a Elsebai, F. Meziane, and F. Belkredim, “A Rule Based Persons Names Arabic Extraction
System,” Commun. IBIMA, vol. 11, no. August, pp. 5359, 2009.
24. A. Alfaries, M. Albahlal, M. Almazrua, and A. Almazrua, “A Rule Based Annotation system to
extract Tajweed Rules from Quran,” in Taibah University International Conference on Advances
in Information Technology for the Holy Quran and Its Sciences, 2013.
25. K. Shaalan and M. Oudah, “A hybrid approach to Arabic named entity recognition,” J. Inf. Sci.,
vol. 40, no. 1, pp. 6787, 2013.
26. L. Al-Sulaiti and E. S. Atwell, “The design of a corpus of Contemporary Arabic,” Int. J. Corpus
Linguist., vol. 11, no. 2, pp. 135171, 2006.
27. D. Farwell et al., “Interlingual annotation of multilingual text corpora,” in Proceedings of The
North American Chapter of the Association for Computational Linguistics Workshop on Frontiers
in Corpus Annotation, 2004, pp. 55–62.
28. J. Nothman, N. Ringland, W. Radford, T. Murphy, and J. R. Curran, “Learning multilingual named
entity recognition from Wikipedia,” Artif. Intell., vol. 194, pp. 151–175, 2013.
29. A. Al-Thubaity, H. Al-Khalifa, R. Alqifari, and M. Almazrua, “Proposed Framework for the
Evaluation of Standalone Corpora Processing Systems: An Application to Arabic Corpora,” Sci.
World J., vol. 2014, pp. 1–10, 2014.
30. A. M. Saif and M. J. A. Aziz, “An Automatic Collocation Extraction from Arabic Corpus,” J.
Comput. Sci., vol. 7, no. 1, pp. 6–11, 2011.
31. S. Khoja, “An RSS Feed Analysis Application and Corpus Builder,” Interface J. Educ. Community
Values, vol. 9, no. 3, pp. 115–118, 2009.
32. A. Bies et al., “Transliteration of Arabizi into Arabic Orthography: Developing a Parallel
Annotated Arabizi-Arabic Script SMS/Chat Corpus,” ANLP 2014, vol. 93, 2014.
33. R. Cotterell, A. Renduchintala, N. Saphra, and C. Callison-Burch, “An Algerian Arabic-French
Code-Switched Corpus,” in LREC-2014 Workshop on Free/Open-Source Arabic Corpora and
Corpora Processing Tools, 2014, p. 34.
34. F. Alotaibi and M. Lee, “A Hybrid Approach to Features Representation for Fine-grained Arabic
Named Entity Recognition,” in Proceedings of COLING 2014, the 25th International Conference
on Computational Linguistics: Technical Papers, 2014, pp. 23–29.
35. M. M. Oudah, “Integrating Rule-based Approach and Machine learning Approach for Arabic
Named Entity Recognition,” The British University in Dubai, 2012.
36. F. Alotaibi and M. Lee, “Automatically Developing a Fine-grained Arabic Named Entity Corpus
and Gazetteer by utilizing Wikipedia,” in Proceedings of the Sixth International Joint Conference
on Natural Language Processing, 2013, pp. 392–400.
37. R. Steinberger, B. Pouliquen, and M. Kabadjov, “JRC-NAMES: A Freely Available, Highly
Multilingual Named Entity Resource,” arXiv Prepr., pp. 104–110, 2013.
38. W. Monroe, S. Green, and C. D. Manning, “Word Segmentation of Informal Arabic with Domain
Adaptation,” Proc. 52nd Annu. Meet. Assoc. Comput. Linguist. (Volume 2: Short Papers), pp. 206–
211, 2014.
39. E. Mohamed and S. Kübler, “Is Arabic Part of Speech Tagging Feasible Without Word
Segmentation?,” in Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, 2010, pp. 705–708.
40. K. Toutanova and C. D. Manning, “Enriching the Knowledge Sources Used in a Maximum Entropy
Part-of-Speech Tagger,” in Proceedings of the Joint SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large Corpora, 2000, pp. 63–70.
41. K. Toutanova, D. Klein, and C. D. Manning, “Feature-rich part-of-speech tagging with a cyclic
dependency network,” Proc. 2003 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang.
Technol. - Vol. 1 (NAACL ’03), pp. 252–259, 2003.
42. S. Green, M. Galley, and C. D. Manning, “Improved Models of Distortion Cost for Statistical
Machine Translation,” Hum. Lang. Technol. 2010 Annu. Conf. North Am. Chapter Assoc. Comput.
Linguist., vol. 5, pp. 867–875, 2010.
43. “Stanford CoreNLP,” 2015.
44. H. Cunningham et al., “Developing Language Processing Components with GATE Version 8.”
45. D. Maynard et al., “A Survey of Uses of GATE,” 2000.
46. S. Zaidi, M. T. Laskri, and A. Abdelali, “Arabic collocations extraction using GATE,” in 2010
International Conference on Machine and Web Intelligence, ICMWI 2010 - Proceedings, 2010,
pp. 473–475.
47. K. Bontcheva, D. Maynard, V. Tablan, and H. Cunningham, “GATE: A Unicode-based
infrastructure supporting multilingual information extraction,” Proc. Work. Inf. Extr. Slavon. other
Cent. East. Eur. Lang. (IESL03), Borovets, 2003.
48. A. De Sitter, T. Calders, and W. Daelemans, “A Formal Framework for Evaluation of Information
Extraction,” 2004.