A Single-Model Approach
for Arabic Segmentation, POS Tagging,
and Named Entity Recognition
1st Abed Alhakim Freihat
DISI
University of Trento
Trento, Italy
abed.freihat@unitn.it
2nd Gabor Bella
DISI
University of Trento
Trento, Italy
gabor.bella@unitn.it
3rd Hamdy Mubarak
Qatar Computing Research Institute
Hamad Bin Khalifa University
Doha, Qatar
hmubarak@hbku.edu.qa
4th Fausto Giunchiglia
DISI
University of Trento
Trento, Italy
fausto.giunchiglia@unitn.it
Abstract—This paper presents an entirely new, two-million-
word annotated corpus for a comprehensive, machine-learning-
based preprocessing of text in Modern Standard Arabic. Con-
trary to the conventional pipeline architecture, we solve the NLP
tasks of word segmentation, POS tagging and named entity recog-
nition as a single sequence labeling task. This single-component
configuration results in a faster operation and is able to provide
state-of-the-art precision and recall according to our evaluations.
The fine-grained tag set output by our annotator greatly
simplifies downstream tasks such as lemmatization. Provided as a
trained OpenNLP component, the annotator is free for research
purposes.
Index Terms—NLP, segmentation, POS tagging, lemmatization,
named entity recognition, machine learning
I. INTRODUCTION
Common natural language understanding tasks, such as
information retrieval, word sense disambiguation, or query
answering, are usually built on top of a set of basic NLP
preprocessing operations. These operations are supposed to
bring text to a more canonical form with dictionary words
(lemmas) and named entities clearly identified. The precise
solutions applied depend greatly on the language; however,
state-of-the-art approaches typically involve a pipeline of
components, such as a part-of-speech tagger, a morphological
analyzer, a lemmatizer and a named entity recognizer (NER).
Compared to English, both lemmatization and NER are harder for Arabic text: the former because of the inflectional complexity and ambiguity inherent to the written language, and the latter because, among other reasons, Arabic does not mark named entities with capitalization.
There has been extensive research on solving each of the
tasks mentioned above. In the case of Arabic POS tagging,
the approaches are typically based on statistical classifiers
such as SVM [1], [2], sometimes combined with rule-based
methods [3] or with a morphological analyzer [4]–[6]. The
idea of POS tagging applied to unsegmented words has been
investigated in [5] and in [7].
Likewise, for NER several solutions and tools have been
reported. They can be classified as rule-based systems such as
the approach presented in [8], machine-learning-based ones
such as [9], [10], and hybrid systems such as [11]. The cor-
relation between NER and POS tagging is illustrated in [12].
Good-quality Arabic annotated corpora for machine learn-
ing are few and far between. The Penn Arabic Treebank [13] is
a non-free, half-a-million-word annotated corpus destined for
POS tagging and syntactic parsing, upon which a large number
of research results are based. The KALIMAT corpus1, while
freely available, is a silver standard corpus on which a POS
tagging accuracy of 96% was reported [14].
In terms of NLP pipeline architecture, most existing solu-
tions perform the aforementioned tasks as a cascade of several
processing steps. For example, POS tagging in FARASA [15],
[16] and in MADAMIRA [17] supposes that word segmenta-
tion has been done as a previous step. Segmentation, in turn,
relies on further preprocessing tasks such as morphological
analysis in MADAMIRA.
Likewise, NER and lemmatization are often implemented
as separate downstream tasks that rely on the results of POS
tagging, base phrase chunking, and morphological analysis.
In several Arabic pipelines in the literature [18], however,
upstream tasks such as POS tagging are implemented in
a coarse-grained manner, which amounts to delegating the
resolution of certain cases of ambiguity to downstream com-
ponents. For example, by using single VERB and PART
tags, the POS tagger in [1] avoids challenging ambiguities
in Arabic verbs and particles, respectively. Consequently, an
additional downstream component is needed for morphological
disambiguation, e.g., to find out whether تعرف is an imperative (تَعَرَّفْ/recognize), a past tense verb (تَعَرَّفَ/recognized), or a present tense verb (تَعْرِف/you know or she knows); whether the noun سحب is singular (in which case it means withdrawal) or plural (meaning clouds); or whether أن is an accusative particle (أنَّ) or a subordinating particle (أنْ).
In this paper, we present a corpus-based approach that
performs word segmentation, POS tagging and named entity
1https://sourceforge.net/projects/kalimat/
978-1-5386-4543-7/18/$31.00 © 2018 IEEE
recognition as a single processing component and without
any other auxiliary or preprocessing tool. Such a single-step
solution has several advantages:
it is faster to execute than running several machine
learning models in series;
it is easier to reuse as part of a natural language under-
standing application;
it does not suffer from the problem of cumulative errors
that are inherent to solutions that solve the same tasks in
series.
Another design goal of our corpus was to provide a rich output
by using fine-grained POS tags in the training corpus. This
way, a great deal of ambiguity in Arabic text is resolved by
our tool and subsequent operations such as lemmatization are
largely simplified. The tool can be tested online2 and is freely
available for research upon request.
The rest of the paper is organized as follows. Section 2
highlights challenging cases of ambiguity in Arabic that are
solved by our tool. Section 3 provides a high-level overview
and principles of our solution, followed by a detailed pre-
sentation of the tag set used. Section 4 presents the corpus
building process. In section 5 we demonstrate how the output
of the tool serves downstream tasks such as lemmatization and
named entity extraction. Finally, sections 6 and 7 present the
evaluation and the conclusions.
II. AMBIGUITY IN ARABIC
To illustrate the challenging cases that low-level NLP tasks
such as word segmentation or lemmatization typically need
to solve, in the following we list some common examples of
ambiguity in Arabic.
A. Ambiguity in Word Segmentation
Certain words can be segmented into morphemes in more
than one valid way. In such cases the correct segmentation can
only be determined in context. In table I we list some common
examples of ambiguity that occur at the segmentation level.
B. Ambiguity in POS Tagging
While correct segmentation decreases the ambiguity in
Arabic text, polysemy and the lack of short vowels result in
morphemes having multiple meanings with distinct parts of
speech. In table II we show some examples of this kind.
Even with correct segmentation and POS tagging, chal-
lenging cases of ambiguity still remain on the level of fine-
grained POS tags, mostly due to MSA words overwhelmingly
being written without diacritics. In the following, we list some
examples of ambiguity with which we deal on the fine-grained
level.
1) Verb ambiguities, passive vs active voice: Many verbs in Arabic have the same written form in the active and the passive voice. Verbs like سجّل/reported, or has been reported, can only be disambiguated through context.
2http://www.arabicnlp.pro/alp/
2) Verb ambiguities, past vs present tense:
The same verb word form that denotes a verb in the first person singular present may also denote (another) verb in the third person singular masculine past. Consider for example the verb أجمل, which can be أُجْمِل/(I) illustrate or أَجْمَلَ/(he) illustrated.
A third person singular feminine present verb form may denote (another) verb in the third person singular masculine past. Consider for example the verb تحمل, which can be تَحْمِل/(she) carries or تَحَمَّلَ/(he) sustained.
3) Verb ambiguities, imperative:
An imperative verb form (second person singular masculine) can be read as a third person singular masculine past tense verb. For example, the verb تعرف may be an imperative verb (تَعَرَّفْ/recognize) or a past tense verb (تَعَرَّفَ/(he) recognized).
An imperative verb form (second person plural masculine) can be read as a third person plural masculine past tense verb; it can also be read as a second person plural masculine present tense verb after some particles. For example, the verb تعرفوا may be an imperative verb (تَعَرَّفُوا/recognize), a past tense verb (تَعَرَّفُوا/(they) recognized), or a present tense verb in cases like كي تعرفوا/so that (you) know.
An imperative verb form (second person singular feminine) can be read as a second person singular feminine present tense verb (after some particles). For example, the form تعرفي can be a second person singular feminine imperative (تَعَرَّفِي/recognize) or a second person singular feminine present tense verb after subordinating particles, as in كي تعرفي/so that you know.
4) Noun ambiguities, singular vs plural: In Arabic, there are several word forms that denote (different) singular and plural nouns. For example, the word سحب denotes the singular noun سَحْب/dragging and the plural noun سُحُب/clouds.
5) Noun ambiguities, dual vs singular: The ا accusative case ending in Arabic leads to dual vs singular ambiguity. For example, the word form كتابا may be read as the singular noun كِتَابًا/one book or the dual كِتَابَا/two books (in genitive dual constructions such as كتابا العلوم).
6) Noun ambiguities, dual vs plural: Dual form nouns and masculine plural nouns are ambiguous in general. For example, the word مؤمنين can be read as the dual form مُؤْمِنَيْن or as the masculine plural form مُؤْمِنِين.
7) Noun ambiguities, feminine vs masculine singular: There are cases in which the same word form denotes a singular noun with different genders. For example, the word قدم can be the feminine قَدَم/foot or the masculine قِدَم/antiquity.
TABLE I
AMBIGUITY EXAMPLES AT THE SEGMENTATION LEVEL
Ambiguity | Example
Noun vs conjunction+pronoun | وهن/weakness vs وهن/and they (feminine)
Noun vs conjunction+verb | وحل/mud vs وحل/and (he) solved
Noun vs conjunction+noun | وصفة/recipe vs وصفة/and (a) character
Noun vs singular noun+pronoun | كتابي/two books (in genitive) vs كتابي/my book
Noun vs preposition+noun | لسعة/sting vs لسعة/for capacity
Proper noun vs preposition+noun | بعقوبة/a city in Iraq vs بعقوبة/with punishment
Proper noun vs conjunction+noun | وهران/a city in Algeria vs وهران/and two cats
Proper noun vs definite article+noun | الباب/a city in Syria vs الباب/the door
Noun vs interrogative particle+negation particle | ألم/pain vs ألم/did I not
Adjective vs noun+pronoun | جانبي/lateral vs جانبي/my side
Adjective vs preposition+noun | بحرية/nautical vs بحرية/with freedom
Verb vs conjunction+pronoun | فهم/(he) understood vs فهم/and they (masculine)
Verb vs conjunction+verb | وفر/saved vs وفر/and (he) escaped
Verb vs verb+pronoun | علمنا/we knew vs علمنا/(he) taught us
Verb vs interrogative particle+verb | أتذكر/(I) remember vs أتذكر/do (you) remember
TABLE II
AMBIGUITY EXAMPLES AT THE POS TAGGING LEVEL
Ambiguity | Example
Verb vs noun | حمل/carried vs حمل/carrying
Verb vs comparative | أثقل/overburdened vs أثقل/heavier
Verb vs adjective | سهل/facilitate vs سهل/easy
Verb/noun vs particle | لم/gathered vs لم/not
Verb vs number | تسع/expanded vs تسع/nine
Verb vs proper noun | طلعت/rose vs طلعت/Talat
Noun vs number | سبع/lion vs سبع/seven
Noun vs proper noun | إحسان/philanthropy vs إحسان/Ehsan
Adjective vs noun | نوعية/qualitative vs نوعية/quality
Adjective vs proper noun | جميل/nice vs جميل/Jamil
Interrogative particle vs relative pronoun | according to their position in the sentence
Particle ambiguity in أن | أن/subordinating vs أنّ/accusative
Particle ambiguity in إن | إن/conditional vs إنّ/accusative
Particle ambiguity in لم | لم/negation vs لِمَ/interrogative
Particle ambiguity in ما | ما/negation vs ما/interrogative
C. Ambiguity in Named Entity Recognition
Below we present two examples of ambiguity related to NER, referring the reader to [19] for a more detailed treatment of the matter.
1) Inherent ambiguity in named entities: It is possible for a word or a sequence of words to denote named entities that belong to different classes. For example, واشنطن denotes both a person and a location. Another problem is that organizations and establishments are frequently named after persons, for example, جامعة الملك عبد الله للعلوم التقنية/King Abdullah University of Science and Technology.
2) Ellipses: Ellipses (omitting parts of nominal phrases and entity names) contribute to the high ambiguity of natural languages. Given the lack of orthographic features in Arabic, ellipses increase this ambiguity further. For example, a text about البحر الأبيض المتوسط/the Mediterranean Sea mentions it explicitly at the beginning of the text. After that, it may omit البحر الأبيض/the White Sea and refer to it by المتوسط/the Mediterranean alone. This word is mostly used as an adjective (meaning the average), and there are no orthographic triggers that disambiguate the entity from the adjective reading.
III. ANNOTATION DESIGN
The core idea of our approach is to combine three linguis-
tically related tasks, namely part-of-speech tagging, named
entity recognition, and word segmentation, into a single task
that we solve through supervised sequence labeling. A further
design goal is to provide as much information as possible to
the downstream NLP tasks such as lemmatization.
We achieve these goals by annotating a single large training
corpus with complex and fine-grained tags that encode infor-
mation with respect to part of speech, word segments, and
named entities. These three main kinds of tags are composed
as follows:
<TAG> ::= <PREFIX> <BASETAG> <POSTFIX>
<BASETAG> ::= <POSTAG> | <NERTAG>
<PREFIX> ::= <PROCLITIC> "+" <PREFIX> | ""
<POSTFIX> ::= "+" <ENCLITIC> <POSTFIX> | ""
A tag is thus composed of a mandatory base tag and of zero
or more (i.e., optional) proclitics and enclitics concatenated
with the “+” sign indicating word segments. A base tag,
in turn, is either a POS tag or a NER tag, but not both
(in other words, we do not annotate named entities by part
of speech). For example, the full tag of the word وبكتابه (and with his book)
is a noun tag preceded by two proclitic tags (conjunction
and preposition) and followed by a pronoun enclitic tag. The
choice of our base and clitic tags was inspired by the coarse-
grained tags used in MADA 2.32 [2] and 3.0 [17], as well as
by the more fine-grained tags used in the Quran Corpus [20].
For compatibility with other NLP tools, mapping our tags to
MADA 2.32, MADA 3.0 and Stanford [21] or any other tag
sets is straightforward.
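To make the tag composition concrete, the following sketch (our illustration, not part of the released annotator; the clitic inventories shown are small assumed subsets of the full tag set in tables IV and V) splits a complex tag into its proclitic, base, and enclitic parts:

```python
# Illustrative parser for the complex tag grammar above.
# The PROCLITICS/ENCLITICS inventories are assumed subsets for the example.
PROCLITICS = {"C", "P", "D", "FUT"}  # conjunction, preposition, article, future
ENCLITICS = {"PRO"}                  # pronoun enclitic

def parse_tag(tag: str):
    """Split e.g. 'C+P+SMN+PRO' into (['C', 'P'], 'SMN', ['PRO'])."""
    parts = tag.split("+")
    proclitics = []
    while parts and parts[0] in PROCLITICS:
        proclitics.append(parts.pop(0))
    enclitics = []
    while len(parts) > 1 and parts[-1] in ENCLITICS:
        enclitics.insert(0, parts.pop())
    if len(parts) != 1:
        raise ValueError(f"not a well-formed tag: {tag}")
    return proclitics, parts[0], enclitics
```

For instance, the tag of the وبكتابه example above parses as two proclitics (conjunction and preposition), the base noun tag, and one pronoun enclitic.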
A. Base POS Annotation
The POS tag set consists of 58 tags classified into five main
categories:
<POSTAG> ::= <NOUN> | <ADJECTIVE> | <VERB> | <ADVERB>
| <PREPOSITION> | <PARTICLE>
Nouns. The noun class has 13 tags as shown in table III.
The first 9 tags are fine-grained annotations of common nouns
that we classify according to their number (singular, dual or
plural) and gender (masculine, feminine or irregular). We use
the feature irregular to annotate the irregular plural nouns. As
it is the case in MADA, we consider quantifiers, numbers and
foreign words as special noun classes. Following the Quran
corpus, we consider pronouns, demonstrative pronouns and
relative pronouns as special noun classes.
Adjectives. The adjective class has 9 tags as described in
table III. Similar to nouns, the first 7 tags are fine-grained
annotations of adjectives that we classify according to their
number and gender. As it is the case in MADA, we consider
comparative adjectives and numerical adjectives as special
adjective classes.
Verbs. The verb class contains 5 tags as described in table III.
The first four tags are fine-grained annotations of verbs that we
classify according to their passive marking (active or passive)
and tense (past or present). Annotating future tense in Arabic
is explained in the particle class. For imperative verbs, we use
the tag IMPV.
Adverbs. It is not clear in the modern Arabic linguistics community whether the adverb belongs to the Arabic part-of-speech system or not. In this study, we follow FARASA and MADA in considering adverbs a category of the Arabic part-of-speech system, where we treat adverbs as predicate modifiers that we classify into three classes as shown in table III.
Prepositions and particles. This class contains 28 preposition
and particle tags from the Quran Corpus tagset, that we list in
table IV.
Let us finally note that, while our POS annotation does reduce the problem of ambiguity due to missing vocalization, it is not a complete word sense disambiguation method. Thus, we do not consider cases such as transitive/ditransitive verb ambiguity: for example, verbs such as علم (عَلِمَ/knew or عَلَّمَ/taught) remain ambiguous under our current annotation tag set.
TABLE III
NOUN, ADJECTIVE, VERB, AND ADVERB TAGS
Tag | Explanation | Example
SMN | Singular masculine noun | كتاب، جبل، رجل
SFN | Singular feminine noun | جريدة، قرية، بنت
DMN | Dual masculine noun | كتابان، كتابين، كتابا، كتابي
DFN | Dual feminine noun | قريتان، قريتين، قريتا، قريتي
PMN | Plural masculine noun | موظفون، موظفين، موظفو، موظفي
PFN | Plural feminine noun | موظفات، اتهامات، حالات، كتابات
PIN | Plural irregular noun | كتب، قرى، بنات، رجال
FWN | Foreign noun | برلين، موبايل، لابتوب، تاكسي
NQ | Quantifiers | جميع، كل، بعض، أي
NM | Numbers | واحد، 1، اثنان، اثنين
PRO | Pronouns | هو، ه، هي، ها
DM | Demonstrative pronouns | هذا، هذان، هؤلاء، تلك
REL | Relative pronouns | الذي، التي، اللذان، اللواتي
SMAJ | Singular masculine adjective | جميل، قوي، سليم
SFAJ | Singular feminine adjective | جميلة، قوية، سليمة
DMAJ | Dual masculine adjective | جميلان، جميلين، جميلا، جميلي
DFAJ | Dual feminine adjective | جميلتين، قويتان، سليمتان
PMAJ | Plural masculine adjective | جميلين، جميلون، جميلو، جميلي
PFAJ | Plural feminine adjective | جميلات، قويات، سليمات
PIAJ | Plural irregular adjective | أقوياء
AJCMP | Comparative adjectives | أجمل، أقوى، أسلم
AJNM | Ordinal adjectives | أول، ثاني، ثالث
PRSV | Present verb (active) | يقول، يسأل، يحمل
PSTV | Past verb (active) | قال، سأل، حمل
PPRSV | Present verb (passive) | يقال، يسأل، يحمل
PPSTV | Past verb (passive) | قيل، سئل، حمل
IMPV | Imperative verb | قل، اسأل، احمل
T | Temporal adverb | صباحا، أحيانا، بعد
LC | Location adverb | فوق، تحت
AV | Adverb | يوميا، خاصة، تماما
B. Word Segment Annotation
We represent the morphology of words through complex
tags that correspond to their internal structure. As shown
above, the structure of a complex tag is
[PROCLITIC+]BASETAG [+ENCLITIC]
where BASETAG is one of the base POS tags, ENCLITIC, when present, stands for one or two clitic tags at the end of the word, and PROCLITIC, when present, is a combination of one to three tags out of the set of proclitic tags at the beginning of the word.
TABLE IV
PREPOSITION AND PARTICLE TAGS
Tag | Explanation | Example
D | Definite article | ال
C | Conjunctions | و، أو، ف
P | Prepositions | من، إلى، ل
Q | Interrogative particles | ماذا، هل، كيف
COND | Conditional particles | لو، إذا، إن
NEG | Negation particles | لم، لا، لن
ACC | Accusative particles | أنّ، لكن، لعل
SUB | Subordinating particles | أن، كي
FUT | Future particles | س، سوف
VOC | Vocative particles | يا
ANS | Answer particles | نعم، كلا
EXL | Explanation particles | أي، إذ
EXP | Exceptive particles | سوى، عدا
EXC | Exclamation particles | يا
RES | Restriction particle | إلا
CERT | Certainty particle | قد
SUR | Surprise particles | إذا، إذن
EMPH | Emphatic particle | ل
PRP | Purpose particle | ل
RET | Retraction particle | بل
REM | Resumption particles | ف، و
INTG | Interrogative particle | أ
PREV | Preventive particle | ما
INC | Inceptive particle | ألا
IMPP | Imperative particle | ل
PR | Prohibition particle | لا
ABB | Abbreviation | د (دكتور)
PX | Punctuation | . ، :
In our corpus, the number of distinct individual tags (in-
cluding both simple and complex tags) is 358, as shown in
table V.
C. Named Entity Annotation
Our NER tags use the following syntax:
<NERTAG> ::= <POSITION> "-" <CLASS>
<POSITION> ::= "B" | "I"
<CLASS> ::= "PER" | "LOC" | "ORG" | "EVENT"
| "MISC"
Our approach does not mark named entities with POS tags;
rather, we annotate them directly with named entity tags.
Following the conventions of CONLL-20033, the NER tags
provide both the class of the entity and its boundaries through
indicating the positions of the tokens composing it. B- stands
for beginning, i.e., the first token of the entity, while I- stands
for internal, marking subsequent tokens of the same entity.
3http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt
TABLE V
CLITIC TAGS
# clitics | # tags | Example
0 | 78 | SMN كتاب
1 | 163 | C+PRSV ويفعل
2 | 105 | C+PIN+PRO وأصدقاؤه
3 | 12 | C+FUT+PRSV+PRO وسيكتبه
TABLE VI
NAMED ENTITY TAGS
Tag | Explanation | Example
B-PER, I-PER | Person | نجيب B-PER محفوظ I-PER
B-LOC, I-LOC | Location | البحر B-LOC الأبيض I-LOC المتوسط I-LOC
B-ORG, I-ORG | Organization | حزب B-ORG الحرية I-ORG والعدالة I-ORG
B-EVENT, I-EVENT | Event | الحرب B-EVENT العالمية I-EVENT الثانية I-EVENT
B-MISC, I-MISC | Misc | درب B-MISC التبانة I-MISC
Our corpus currently distinguishes between the most com-
mon types of named entities: Persons, Locations, Organiza-
tions, Events, and Others. We have not yet introduced entity classes such as date, time, currency, or measurement, nor subclasses of organizations (e.g., we do not differentiate between a football team and a university).
Thus the total number of NER tags is eight as shown in
table VI; however, as shown earlier, NER tags can be further
combined with clitic tags.
IV. ANNOTATION METHOD
In order to be free from licensing restrictions and modeling
choices of existing resources such as the Penn Arabic Tree-
bank [13], we assembled and hand-annotated an entirely new
corpus. In the following, we present our corpus annotation
method that takes into consideration the challenging ambigu-
ities discussed in section II.
A. Sources
The corpus was assembled from documents and text seg-
ments from a varied set of online resources in Modern
Standard Arabic, such as medical consultancy web pages,
Wikipedia, news portals, online novels and social media, as
shown in table VII. The diversity of sources serves the purpose
of increasing the robustness of the model with respect to
changes in domain, style, form and orthography. For consis-
tency within the corpus and with the type of texts targeted
by our annotator, we removed all short vowels from the input
corpora.
TABLE VII
RESOURCES USED IN THE TRAINING CORPUS
Resource Proportion
Aljazeera online 30%
Arabic Wikipedia 20%
Novels 15%
Alquds newspaper 10%
Altibbi medical corpus 10%
IslamWeb 5%
Social networks (Facebook, Twitter) 5%
Other resources 5%
The current corpus consists of more than 130k annotated sentences with more than two million Arabic words and 200k punctuation marks.
B. The Annotation Process
The annotation was performed semi-automatically in two
major steps:
1) annotation of a corpus large enough to train an initial
machine learning model;
2) iterative extension of the corpus, where new sets of
sentences were annotated by the initial trained model,
added to the corpus after hand-correction, and the model
retrained after each iteration.
Step 1 was an iterative process. It was bootstrapped using a
200-sentence gold-standard seed corpus that was fully hand-
annotated. The goal of each iteration was to add a set of 100 new sentences to the seed corpus, until about 15k sentences were tagged. Each iteration consisted of the following
steps:
1.a For each word in the untagged corpus that was already
tagged in the seed corpus, simply copy the tag from the
tagged word (this of course can lead to wrong tagging
as the process does not take the context into account;
we fix such mistakes later).
1.b Find the 100 sentences with the highest number of tags
obtained through replacement in the previous step.
1.c Manually verify, correct and complete the annotation of
the sentences extracted in step 1.b.
1.d Add the annotated and verified sentences to the seed
corpus and repeat.
At the end of step 1 many rounds of manual verification were
performed on the annotated corpus.
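The projection-and-selection heuristic of steps 1.a and 1.b can be sketched roughly as follows (our simplification of the manual-in-the-loop process; all function and variable names are illustrative):

```python
def project_and_rank(seed, untagged_sentences, batch_size=100):
    """Step 1.a: copy tags of words already seen in the seed corpus;
    step 1.b: pick the sentences with the most projected tags.

    seed: tagged sentences, each a list of (word, tag) pairs.
    untagged_sentences: list of token lists.
    """
    # Naive word -> tag lookup; context is deliberately ignored here, so
    # wrong projections are corrected by hand afterwards (step 1.c).
    lexicon = {}
    for sentence in seed:
        for word, tag in sentence:
            lexicon.setdefault(word, tag)

    def coverage(words):
        return sum(1 for w in words if w in lexicon)

    ranked = sorted(untagged_sentences, key=coverage, reverse=True)
    # Return a partially annotated batch; unknown words get None,
    # to be filled in manually.
    return [[(w, lexicon.get(w)) for w in s] for s in ranked[:batch_size]]
```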
In step 2, we extended the corpus in an iterative manner:
2.a Train an initial machine learning model from the anno-
tated corpus.
2.b Use this model to label a new set of 100 sentences.
2.c Verify and correct the new annotations obtained.
2.d Add the newly annotated sentences to the annotated
corpus and repeat.
For training the machine learning model we used the POS
tagger component of the widely-known OpenNLP tool with
default features and parameters. The annotation work was
accomplished by two linguists: the annotator and a consultant who, besides designing the tag set, was active in corrections and consultations, especially in the first phase. In figure 1 we
provide an example of a complete annotated sentence.
V. PIPELINE INTEGRATION
In this section, we show through a few typical examples how
our annotator can be integrated into an Arabic NLP pipeline
and its output effectively used for downstream language pro-
cessing tasks.
The input of the annotator is expected to be UTF-8-encoded,
whitespace-tokenized, but otherwise unannotated text in Mod-
ern Standard Arabic. We also assume that sentences have been previously split at the usual sentence-end markers (“.”, “!”, “?”, “…”) and newlines.
A. Word Segmentation
Word segmentation is executed based on the clitic tags
provided by the annotator. The input of the method is a word
and its corresponding tag. The output is a list of tokens that
correspond to the PROCLITIC, BASETAG and the ENCLITIC
tags. Given that clitics are linguistically determined, segmen-
tation becomes a simple string splitting task. An example of
the output of a segmentation tool we implemented is shown
in figure 2.
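A minimal sketch of such tag-driven splitting follows; the clitic surface forms in the two lookup tables are illustrative assumptions on our part, not the tool's actual mapping:

```python
# Sketch of tag-driven word segmentation. The clitic surface forms
# below are assumed for illustration only.
PROCLITIC_SURFACE = {
    "C": ("و", "ف"),       # conjunctions
    "P": ("ب", "ل", "ك"),  # prepositions
    "D": ("ال",),           # definite article
    "FUT": ("س",),          # future particle
}
ENCLITIC_SURFACE = {"PRO": ("ها", "هم", "نا", "ه", "ي", "ك")}  # pronoun suffixes

def segment(word: str, tag: str):
    """Split a word into clitic and stem tokens according to its tag."""
    parts = tag.split("+")
    tokens = []
    # Strip proclitic surface forms from the front, in tag order.
    while parts and parts[0] in PROCLITIC_SURFACE:
        for surface in PROCLITIC_SURFACE[parts.pop(0)]:
            if word.startswith(surface):
                tokens.append(surface)
                word = word[len(surface):]
                break
    # Strip enclitic surface forms from the end (longest match first).
    tail = []
    while len(parts) > 1 and parts[-1] in ENCLITIC_SURFACE:
        for surface in sorted(ENCLITIC_SURFACE[parts.pop()], key=len, reverse=True):
            if word.endswith(surface):
                tail.insert(0, surface)
                word = word[:-len(surface)]
                break
    return tokens + [word] + tail
```

For instance, a word tagged C+P+SMN+PRO splits into conjunction, preposition, noun stem, and pronoun tokens, consistent with the tag structure described in section III.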
B. Lemmatization
For lemmatization, we distinguish regular and irregular
cases. Regular cases include singular nouns and adjectives,
regular plural nouns and regular verbs. The fine-grained tags
output by our annotator encode the morphological features of
number and gender for nouns and adjectives, and the tense and
voice (active or passive) for verbs. This makes lemmatization a
straightforward task of normalization that removes inflectional
prefixes and suffixes. For example, the verb form يحملوا is normalized into the lemma حمل. In the case of singular and regular plural nouns, it is sufficient to remove the plural suffixes and case endings. For example, the lemma of the dual noun form ولدين is ولد.
Irregular cases, including broken plurals and irregular verbs,
are more complex and are typically processed using finite-
state transducers and/or lemmatization dictionaries, such as
AraComLex [22] or the OpenNLP lemmatizer4.
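For the regular cases, the normalization can be sketched as suffix stripping keyed on the fine-grained tag; the tag-to-suffix mapping below is an illustrative subset (sound dual and masculine plural endings only), not the authors' actual rule set:

```python
# Illustrative suffix stripping for regular cases only; broken plurals
# and irregular verbs still require a dictionary, as noted above.
SUFFIXES_BY_TAG = {
    "DMN": ("ان", "ين", "ا", "ي"),  # dual masculine noun endings
    "PMN": ("ون", "ين", "و", "ي"),  # sound masculine plural endings
}

def lemmatize_regular(word: str, tag: str) -> str:
    """Strip the inflectional suffix implied by the fine-grained tag."""
    for suffix in SUFFIXES_BY_TAG.get(tag, ()):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word  # singular forms and unhandled tags pass through
```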
C. Named Entity Extraction
By named entity extraction, we mean the identification of
all entity occurrences present in the text and the extraction
of their canonical names. Based on the NER tags output by
our annotator, identifying the start and end of a named entity
is a trivial task. Then through subsequent word segmentation,
the clitics can be removed from the entity and the canonical
name obtained. An example result of this process can be seen
in figure 3.
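The boundary identification amounts to standard BIO decoding over (token, tag) pairs; a minimal sketch (ours, not the authors' implementation), assuming clitic-combined entity tags such as C+B-LOC as in table IX:

```python
def ner_part(tag):
    """Return the B-/I- component of a possibly clitic-combined tag
    (e.g. 'C+B-LOC' -> 'B-LOC'), or None for non-entity tags."""
    for part in tag.split("+"):
        if part.startswith(("B-", "I-")):
            return part
    return None

def extract_entities(tagged_tokens):
    """Collect maximal B-X (I-X)* runs as (entity_class, tokens) spans."""
    entities, current = [], None
    for token, tag in tagged_tokens:
        part = ner_part(tag)
        if part and part.startswith("B-"):
            current = (part[2:], [token])
            entities.append(current)
        elif part and current and part[2:] == current[0]:
            current[1].append(token)  # continue the open entity span
        else:
            current = None  # non-entity token or dangling I- tag
    return entities
```

Word segmentation (section V-A) can then be applied to each span's tokens to strip clitics and obtain the canonical name.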
4https://opennlp.apache.org/docs/1.8.4/manual/opennlp.html
Fig. 1. An example annotated sentence
Fig. 2. Segmentation output example
Fig. 3. Example text with named entities segmented and their boundaries within the text indicated.
VI. EVALUATION
To evaluate the performance of the proposed solution, we
trained a machine learning model on the annotated corpus
using the OpenNLP Maximum Entropy POS tagger with de-
fault features and cutoff = 3. We did not apply any preliminary
normalization to the evaluation corpus. The evaluation corpus
was taken from two sources: the Aljazeera news portal and the
Altibbi medical consultancy web portal. The text contained
9990 tokens (9075 words and 915 punctuation marks). Manual
validation of the evaluation results was performed. The per-
task accuracy figures are shown in table VIII.
The Segmentation error type refers to words that were
not segmented correctly. The Coarse-grained POS error type
refers to words that were correctly segmented but the base
POS tag was wrong. This also includes incorrect named entity
POS tags. Finally, the Fine-grained POS error type means that
the word segmentation and the coarse-grained POS tag were
correct but the fine-grained information within the tag was
incorrect in one of the following ways:
for nouns and adjectives: number/gender error;
for verbs: tense error or passive/active voice error.
In some cases, the tag included more than one type of error.
For example, a word tagged SFN instead of C+SFAJ includes both a segmentation and a POS-tagging error and was counted twice.
We also evaluated named entity recognition separately. Our
evaluation corpus contains 674 named entity tags that denote
297 named entities (for example, روبرت B-PER واتسون I-PER is one named entity that contains two named entity tags). The total number of true positives (correctly detected
TABLE VIII
EVALUATION RESULTS
Error type Number of errors Accuracy
Segmentation 25 99.7%
Coarse-grained POS 131 98.7%
Fine-grained POS 206 97.9%
and classified named entities) was 265. The number of false negatives (a non-named-entity tag assigned, partial tagging, a named entity boundary error, or a wrong named entity class) was 32, and the number of false positives was 15, yielding 94.6% precision, 89.2% recall, and an F1-measure of 91.8%. In table IX
we provide some examples of these errors. The evaluation
data, and how to replicate the evaluation tests are available
online5.
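With the reported counts (TP = 265, FP = 15, FN = 32), the standard definitions give precision = TP/(TP+FP) and recall = TP/(TP+FN); the scores can be recomputed directly:

```python
# Recomputing the NER scores from the reported counts.
tp, fn, fp = 265, 32, 15

precision = tp / (tp + fp)  # 265 / 280, about 94.6%
recall = tp / (tp + fn)     # 265 / 297, about 89.2%
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The reported F1 of 91.8% is consistent with computing it from the rounded percentages (2 × 94.6 × 89.2 / 183.8 ≈ 91.8).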
VII. CONCLUSION AND FUTURE WORK
In this study, we have demonstrated a single-corpus, single-model approach to solving low-level NLP tasks in Arabic. We showed how our tool manages to resolve a
large number of cases of ambiguity in Arabic, facilitating
subsequent operations such as lemmatization. A trained model
and corresponding tools are available online and are free
for research purposes upon request. We are also planning to
release the annotated corpus itself in the near future.
Nevertheless, our work is still in progress and can be improved in several ways.
5http://www.arabicnlp.pro/alp/eval.zip
TABLE IX
EXAMPLES OF NER MISTAKES
Error type | Example
Non-NER tag | وفلوريدا tagged C+SMN instead of C+B-LOC
Partially tagged | only the second token of a two-word person name tagged I-PER, while the first token was tagged SMN instead of C+B-PER
Boundary error | the adjective الروسية following an organization name tagged I-ORG instead of D+SFAJ
Wrong classification | عرين tagged B-LOC (in ودعا عرين) instead of B-PER
False positive | للنظم P+B-ORG الإيكولوجية I-ORG instead of للنظم P+D+PIN الإيكولوجية D+PIAJ
Fine-tuning. While the tool reached very good results with default OpenNLP features, we believe that they can still be improved by customizing the classifier and the features, or by using other machine learning or deep learning algorithms such as CRFs and biLSTMs.
Noun classification. In the current tag set, we do not differentiate between gerunds (المصدر) and other noun classes. For example, the noun قلب/heart is tagged the same as the gerund قلب/overthrow.
Named entity classification. The classification of named entities in our corpus is still incomplete and coarse-grained. For example, expressions such as اللغة العربية ("the Arabic language") are not classified as named entities. We plan to introduce new classes such as dates and currencies, as well as a finer-grained classification of organizations.
Other tools and corpora. We plan to use the same corpus
and tag set to produce annotations for other NLP tasks such
as chunking and parsing.
Some languages lack large knowledge bases and good discriminative features for Name Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.