Procedia Computer Science 142 (2018) 132–140
www.elsevier.com/locate/procedia
The 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17–19, 2018, Dubai, United Arab Emirates
Towards an Optimal Solution to Lemmatization in Arabic
Abed Alhakim Freihat (a,*), Mourad Abbas (b), Gábor Bella (a), Fausto Giunchiglia (a)

(a) Department of Information Engineering and Computer Science, University of Trento, via Sommarive 5, 38123 Trento, Italy
(b) Computational Linguistics Department, CRSTDLA, Algeria

* Corresponding author. E-mail address: abdel.fraihat@gmail.com
Abstract
Lemmatization—computing the canonical forms of words in running text—is an important component in any NLP system and a
key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is
a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to the absence of short vowels in
writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization
dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of
over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes.
Keywords: lemmatization; Arabic; natural language processing; machine-learning-based lemmatization; dictionary-based lemmatization
1. Introduction
Lemmatization consists of assigning to the surface form of each word in a text its corresponding lemma, that is,
its canonical form as the word is commonly found in a dictionary. As such, lemmatization decreases morphological
variations in text, in turn facilitating operations such as semantic analysis [1], information retrieval [2], question
answering [3], or search [4]. For this reason, lemmatization is a crucial preprocessing operation in a wide range of
applications that involve dealing with natural language.
The difficulty of the lemmatization task greatly depends on the nature of the language. In morphologically poor languages such as English, lemmatization can be considered an easy task, already solved by simple normalization rules and a list of exceptional cases. For morphologically rich—highly inflecting or agglutinative—languages such as Arabic, on the other hand, it remains difficult and requires diverse, more complex approaches that are often specific to the language [5].
The major challenges specific to Arabic lemmatization, and NLP in general, are the rich morphology [6], which
includes agglutinative properties [7], and the optional—and mostly omitted—marking of short vowels in writing [8].
This last property results in pervasive lexical ambiguity even when the part of speech is known: for example, the same undiacritized past tense verb form can be vocalized in one way, with a corresponding lemma meaning become, or in another way, with a corresponding lemma meaning grit.
[Figure 1. The high-level NLP pipeline architecture for lemmatization: preprocessing (POS and name tagging followed by word segmentation) turns tokenized text into word segments with POS tags, which the lemmatization stage (learning-based and dictionary-based lemmatizers combined by fusion) maps to lemmatized text.]
As deciding on the correct lemma is ultimately a word sense disambiguation problem, such cases put considerable stress on the quality of lemmatization. Tools that are capable of outputting multiple solutions in an order of preference are more robust in this sense, as they potentially allow the disambiguation problem to be deferred to later syntactic or semantic processing steps.
There has been extensive research so far on solving the lemmatization problem for Arabic. While several approaches have been proposed, there are no more than a handful of actual tools available. Existing tools typically combine multiple techniques to achieve efficient lemmatization. The Alkhalil lemmatizer [9] first applies morpho-syntactic analysis to the input sentence to generate all potential word surface forms. Then, among these, only one form is selected per word, using a technique based on hidden Markov models. The accuracy of the tool is reported to be about 94%. Another lemmatizer is Madamira [10], which relies on a preliminary morphological analysis of the input word that outputs a list of possible analyses. As a second step, it predicts the correct lemma using language models. The reported accuracy of the tool is 96.6%. The Farasa lemmatizer [11] uses a dictionary of words and their diacritizations, ordered according to their number of occurrences. The accuracy reported for Farasa is 97.32%.
Besides these tools, there are other proposed approaches: for example, [12] propose a pattern-based approach, while [13] and [14] present rule-based solutions.
In this paper, we present a new, freely available lemmatization tool that fuses a machine-learning-based classifier, acting as the main lemmatizer, with an auxiliary dictionary-based lemmatizer. The underlying idea is that the classifier, which relies on context, is well suited to solving cases of lexical ambiguity, while the dictionary-based extension provides an extra performance boost, easy extensibility with new lemmas (e.g., neologisms), and the possibility of retrieving multiple candidate lemmas per word form for subsequent analysis. The two lemmatizers are, however, implemented as separate tools and can also be used independently, each one yielding an accuracy beyond 95%. Their high performance is partly explained by the preceding fast and lightweight steps of POS tagging, morphological analysis, and word segmentation—earlier contributions of the authors reused in this work—that provide rich morphological information to the lemmatizers.
The lemmatizer was implemented as a component in a new, free, comprehensive pipeline for Arabic NLP [15] and is freely available for private or research purposes (http://www.arabicnlp.pro/alp/). Both the 3-million-entry dictionary and the 2-million-token annotated corpus used to train the classifier were entirely generated by the authors and are contributions of this paper.
2. The Lemmatization Pipeline
As with most approaches, our lemmatizer operates over an input pre-annotated by previous preprocessing steps. The pipeline specific to our method is shown in Figure 1 and is composed of the following main steps:
1. Preprocessing: taking whitespace-tokenized Arabic text as input, we pre-annotate the text through the following operations:
(a) POS and name tagging: tokens are annotated by a machine-learning-based sequence labeler that outputs
both POS and named entity tags, later used by the lemmatizer;
(b) word segmentation: using the POS output, cliticized words are segmented into a proclitic, a base word, and
an enclitic, making the subsequent lemmatization step simpler.
2. Lemmatization: the segmented and pre-annotated text is fed into the following lemmatizer components:
(a) dictionary-based lemmatizer: words are lemmatized through dictionary lookup;
(b) machine-learning-based lemmatizer: words are lemmatized by a trained machine learning lemmatizer;
(c) fusion: the outputs of the two lemmatizers are combined into a single output (see the interface sketch after this list).
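To make the data flow concrete, the sketch below expresses the pipeline stages as plain Java interfaces. These interfaces, the composite tag strings, and the placeholder fusion rule are our own illustration of Figure 1, not the API of the released tool.

import java.util.List;

/** Hypothetical interfaces mirroring Figure 1; the released tool's API may differ. */
public final class PipelineSketch {

    interface PosNameTagger { String[] tag(String[] tokens); } // composite tags, e.g. "C+P+SMN"
    interface Segmenter { List<String[]> split(String[] tokens, String[] tags); } // {segments, per-segment tags}
    interface Lemmatizer { String[] lemmatize(String[] segments, String[] tags); }

    static String[] run(String[] tokens, PosNameTagger tagger, Segmenter segmenter,
                        Lemmatizer learned, Lemmatizer dictionary) {
        String[] tags = tagger.tag(tokens);                 // 1(a) POS and name tagging
        List<String[]> seg = segmenter.split(tokens, tags); // 1(b) clitic segmentation
        String[] segments = seg.get(0), segTags = seg.get(1);
        String[] fromModel = learned.lemmatize(segments, segTags);   // 2(b) learning-based
        String[] fromDict = dictionary.lemmatize(segments, segTags); // 2(a) dictionary-based
        String[] fused = new String[fromModel.length];               // 2(c) fusion
        for (int i = 0; i < fused.length; i++) {
            // Placeholder policy only: prefer the classifier, fall back to the dictionary.
            fused[i] = fromModel[i] != null ? fromModel[i] : fromDict[i];
        }
        return fused;
    }
}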
In the following sections we present each component in detail, with a focus on the lemmatization components, as the preprocessors have already been discussed in our earlier work [15].
3. Preprocessing
The role of preprocessing is to enrich the input of the lemmatizer (and of other subsequent components) with morphological and other contextual information that simplifies the lemmatization task. Compared to state-of-the-art tools [9, 10], the preprocessing required by our lemmatization approach is lightweight and fast. The tokenized input text is first enriched by part-of-speech tagging, implemented as a fast machine-learning-based sequence labeler. It is then further segmented by a very simple word segmenter component that reduces the complexity and ambiguity of words. The simplicity of the process, presented in detail in [15], is explained by the rich morphological information output by the POS tagger, which makes word segmentation a nearly trivial task. For example, the word meaning and by using is first identified by the POS tagger as containing two proclitics and an inflected base word. On the basis of this result, the segmenter outputs the word segment sequence <and, by, using> and the corresponding POS tags <C, P, SMN> (<conjunction, preposition, singular masculine noun>).
In this section we provide a brief overview of the preprocessing from the point of view of the subsequent lemmatization task.
3.1. POS and Name Tagging
In the lemmatization pipeline, the main goal of the POS and name tagger is to reduce the ambiguity of words by
extracting information from their morphology and context:
- whether the word is part of a name or not;
- the corresponding part of speech;
- whether the word contains proclitics or enclitics (prefixes or suffixes).
The tagger is implemented as a single machine learning component, described in our earlier work [15] and freely available online (http://www.arabicnlp.pro/alp/). In the following we provide only a brief presentation of it, in order to demonstrate the level of detail it provides to the subsequent lemmatizer.
<TAG> ::= <PREFIX> <BASETAG> <POSTFIX>
<BASETAG> ::= <POSTAG> | <NERTAG>
<PREFIX> ::= <PROCLITIC> "+" <PREFIX> | ""
<POSTFIX> ::= <POSTFIX> "+" <ENCLITIC> | ""
A tag is thus composed of a mandatory base tag and of zero or more proclitics and enclitics, concatenated with the "+" sign indicating word segments. A base tag, in turn, is either a POS tag or a named entity (NER) tag. We do not consider name tags in the rest of the paper, as they are irrelevant for lemmatization beyond the fact that names are skipped by the lemmatizer.
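As an illustration of this grammar, the sketch below decomposes a composite tag string into proclitic tags, base tag, and enclitic tags. The proclitic inventory is an assumption: C and P appear as proclitic tags in the example of Section 3, while DET is a hypothetical placeholder; everything after the base tag (e.g., a hypothetical PRON) would be treated as an enclitic.

import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Splits a composite tag such as "C+P+SMN" or "SMN+PRON" into its parts. */
public final class TagParser {
    // Illustrative subset; the full tag inventory is defined by the tagger.
    private static final Set<String> PROCLITICS = Set.of("C", "P", "DET");

    public static void main(String[] args) {
        String tag = "C+P+SMN"; // conjunction + preposition + singular masculine noun
        List<String> parts = Arrays.asList(tag.split("\\+"));
        int base = 0;
        while (base < parts.size() - 1 && PROCLITICS.contains(parts.get(base))) {
            base++; // skip proclitic tags until the mandatory base tag
        }
        System.out.println("proclitics: " + parts.subList(0, base));                // [C, P]
        System.out.println("base tag:   " + parts.get(base));                       // SMN
        System.out.println("enclitics:  " + parts.subList(base + 1, parts.size())); // []
    }
}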
On a coarse-grained level, POS tags are divided into the following categories:
<POSTAG> ::= <NOUN> | <ADJECTIVE> | <VERB> | <ADVERB> | <PREPOSITION> | <PARTICLE>
Coarse-grained parts of speech [16] are enriched with verb tenses and morphological features, the goal of which is to solve a large part of lexical ambiguity problems already at the level of POS tagging (we omit some tags here, such as the named entity tags, which are irrelevant for lemmatization):
<NOUN> ::= ( <NUMBER> <GENDER> "N" ) | "PIN" /* PIN: broken plural noun */
<ADJECTIVE> ::= ( <NUMBER> <GENDER> "N" ) | "PIAJ" /* PIAJ: broken plural adjective */
<NUMBER> ::= "S" | "D" | "P" /* Singular, Dual, or Plural */
<GENDER> ::= "M" | "F" /* Masculine or Feminine */
<VERB> ::= ( <PASSIVE> <TENSE> "V" ) | "IMPV" /* IMPV: imperative verbs */
<PASSIVE> ::= "P" | "" /* empty for active verbs */
<TENSE> ::= "PST" | "PRS" /* Past, Present */
Examples of noun tags are SMN and SFN, meaning singular masculine noun and singular feminine noun, respectively. They enable us to differentiate between, for example, the word for man (SMN) and the word for leg (SFN). Examples of verb tags are PSTV and PRSV (past and present tense), which enable us, for example, to differentiate between (he) sustained (PSTV) and (she) carries (PRSV). In Figure 2, we provide an example of an annotated text.

[Figure 2. An example piece of text annotated with parts of speech and morphology]
3.2. Word Segmentation
Word segmentation is executed based on the segmentation information embedded within POS tags. It serves a double goal: to reduce the number of distinct word forms, resulting in smaller and more robust lemmatizers, and to reduce the lexical ambiguity caused by multiple possible interpretations. For example, word segmentation reduces the number of possible word forms of a given lemma from several hundred cliticized nouns to only six forms. It also reduces lexical ambiguity in cases where the same surface string may be either a single word (sting) or a cliticized word (for capacity).
The input of the segmentation component is a word and its corresponding POS tag. The output is a list of tokens
that correspond to the proclitic, base, and enclitic components of the POS tag. Given that the presence of clitics is
identified upstream, segmentation becomes a simple rule-based string splitting task, as described in detail in [15]. An example output of the segmentation tool is shown in Figure 3.

[Figure 3. An example output of the word segmenter]
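Under the simplifying assumption that each proclitic tag corresponds to a fixed surface string, the splitting itself can be sketched as follows. The two-entry proclitic map is derived from the and by using example above; the actual segmenter described in [15] naturally covers more clitics and spelling variations.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Rule-based splitting of a word whose composite POS tag names its clitics. */
public final class SegmenterSketch {
    // Assumed tag-to-surface map: C = "و" (and), P = "ب" (by).
    private static final Map<String, String> PROCLITIC_FORM = Map.of("C", "و", "P", "ب");

    public static List<String> segment(String word, String compositeTag) {
        List<String> segments = new ArrayList<>();
        String rest = word;
        for (String tag : compositeTag.split("\\+")) {
            String form = PROCLITIC_FORM.get(tag);
            if (form != null && rest.startsWith(form)) {
                segments.add(form);                    // peel one proclitic off the front
                rest = rest.substring(form.length());
            } else {
                segments.add(rest);                    // what remains is the base word
                break;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // "and by using" with tag C+P+SMN -> [و, ب, استخدام]
        System.out.println(segment("وباستخدام", "C+P+SMN"));
    }
}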
4. Lemmatization
The principal component of our lemmatization approach is a machine-learning-based classifier. It takes as input
word segments and their corresponding POS tag, also taking context (words and tags) into account. The learning-based
approach is justified by the inherent ambiguity of diacritic-free Arabic words, whose meanings are typically deduced, by humans and machines alike, from context. While the preliminary POS tagging resolves a great deal of ambiguity, some cases still remain, such as verb forms that may belong to either of two different verb lemmas. A downside of learning-based lemmatization is that rarer and more exceptional cases, such as the word for spears, may not be covered by its training corpus, which leads to lemmatization mistakes. The addition of new cases requires retraining the classifier. Another inconvenience is that classifiers—such as the OpenNLP classifier we used—typically commit to a single output result, which may or may not be correct. In such ambiguous cases, subsequent NLP processing steps may be able to select the correct result from the full set of possible lemmas, based on, e.g., syntactic or semantic analysis. In order to support these cases, we complement the learning-based lemmatizer with a dictionary-based one. The dictionary lemmatizer can be run independently, but we also provide a simple fusion method that combines the results of the two lemmatizers, as described below.

[Figure 4. Examples of the contents of (a) the dictionary and (b) of the corpus used to train the classifier.]
Both lemmatizer components were implemented using the Apache OpenNLP machine learning toolkit (http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.cli.lemmatizer), using its maximum entropy classifier.
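For orientation, the snippet below assembles the two components with the OpenNLP 1.9 lemmatizer API. The model and dictionary file names are placeholders, and the fusion rule at the end is a plausible sketch of combining the two outputs, not necessarily the authors' exact policy.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class FusionSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("ar-lemmatizer.bin");  // placeholder name
             InputStream dictIn = new FileInputStream("ar-lemma-dict.txt")) { // placeholder name
            LemmatizerME learned = new LemmatizerME(new LemmatizerModel(modelIn));
            DictionaryLemmatizer dictionary = new DictionaryLemmatizer(dictIn);

            // Segments and POS tags as produced by the preprocessing stage.
            String[] segments = {"و", "ب", "استخدام"};
            String[] tags = {"C", "P", "SMN"};

            String[] fromModel = learned.lemmatize(segments, tags);
            String[] fromDict = dictionary.lemmatize(segments, tags); // "O" for unknown pairs

            for (int i = 0; i < segments.length; i++) {
                String lemma = fromModel[i];
                // Sketch of a fusion rule: if the dictionary knows the pair but the
                // classifier's lemma is not among its "#"-joined candidates, fall
                // back to the dictionary's first candidate.
                if (!"O".equals(fromDict[i])
                        && !Arrays.asList(fromDict[i].split("#")).contains(lemma)) {
                    lemma = fromDict[i].split("#")[0];
                }
                System.out.println(segments[i] + "\t" + tags[i] + "\t" + lemma);
            }
        }
    }
}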
4.1. Dictionary Lemmatization
The dictionary of the dictionary-based lemmatizer consists of a text file containing, in each row, a word, its POS tag, and the corresponding lemma, with the columns separated by tab characters. An example of the contents of the dictionary is shown in part (a) of Figure 4. In the case of ambiguous word forms (i.e., a word form and POS tag pair that has several lemmas), the corresponding lemmas are separated by the "#" character.
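A minimal standalone loader for this format, returning every candidate lemma of a (word form, POS tag) pair, could look as follows; the file path and the absence of error handling are purely illustrative.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Reads word<TAB>POS<TAB>lemma(s) rows, where ambiguous lemmas are joined by '#'. */
public final class LemmaDictionary {
    private final Map<String, String[]> entries = new HashMap<>();

    public LemmaDictionary(String path) throws IOException {
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] cols = line.split("\t");
            if (cols.length == 3) {
                // Key on the (word form, POS tag) pair; split multi-lemma entries.
                entries.put(cols[0] + "\t" + cols[1], cols[2].split("#"));
            }
        }
    }

    /** All candidate lemmas in dictionary order; empty array if the pair is unknown. */
    public String[] lookup(String wordForm, String posTag) {
        return entries.getOrDefault(wordForm + "\t" + posTag, new String[0]);
    }
}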
In the following we describe the method we used to build the dictionary. The corpus used is the same one we used for segmentation, POS tagging, and named entity recognition, as described in our previous work [15].
1. Segmentation: the corpus was segmented as explained in the previous section. This step produced a segmented corpus that contains more than 3.1 million segmented tokens.
2. POS-tag based classification: In this step, we classified the word forms according to their POS-tag.
3. Inherent feminine and adjectival feminine classification: in this step, we classified the feminine nouns into inherent feminine and adjectival feminine nouns. For example, the noun meaning family (SFN) is inherent feminine, while the noun meaning (female) prisoner (SFN) is adjectival. This differentiation is important because the lemma of an adjectival feminine noun is the masculine singular form of the noun, while an inherent feminine noun has no masculine singular lemma.
4. Plural type classification: in this step, we classified the singular and dual nouns (after extracting their singular forms) according to their plural type into six classes, as shown in Table 1. This classification enables us to build the possible number and gender word forms of a given lemma automatically. For example, the class SMN PMN has six different possible number and gender forms. The feminine classification lists from the previous step, in turn, enabled us to differentiate between the classes SMN PFN and SFN PFN. In the class SMN PFN, the lemma of a singular feminine noun (SFN) is the singular masculine noun (SMN); in the class SFN PFN, on the other hand, the lemma is the singular feminine noun itself. The adjectives were classified into three classes. The first class is similar to the class SMN PMN, which allows six different word forms. The second class contains a seventh possible form, the broken plural adjective form; an adjective belongs to this class when it has two possible plural forms, a sound plural and a broken plural. The third class contains PIAJ as the single possible plural form.
5. Lemma extraction: this step was semi-automatic, as follows:
- Manual: assigning the lemmas to broken plural nouns and adjectives was performed manually.
- Automatic: based on the morphological features in the tags, it was possible to extract lemmas for singular, dual, masculine plural, and feminine plural nouns and adjectives automatically. We also used rules, such as removing the affixes, to extract the verb lemmas.
6. Lemma enrichment: using the lemmas from the previous step, we enriched the corpus with new verbs, adjectives, and nouns. For example, if the lemma of a plural noun or adjective was missing, we added it to the noun and adjective lemma lists.
7. Dictionary generation: the files produced so far are as follows.
(a) Noun files: three files for masculine, feminine, and foreign nouns. The lemmas in these files were classified according to Table 1. A fourth file contains quantifiers, pronouns, adverbs, etc.
(b) Adjectives: three files for adjectives, comparatives, and ordinal adjectives. The lemmas in the adjective file are classified according to Table 1.
(c) Verbs: one file that contains all extracted verb lemmas.
Using these files, the dictionary was generated as follows:
- Noun and adjective generation: according to its plural class, the word forms of each noun and adjective were generated (a simplified sketch of this step follows this list); case endings and the associated spelling changes were also considered.
- Verb generation: for each verb in the verb lemma list, we automatically generated the verb conjugations in the past, present, and imperative. We also considered accusative and asserted verb forms.
- Dictionary building: using the results of the previous steps, we built the dictionary as described in Figure 4, where the lemmas of ambiguous surface forms were joined into a single string using the # separator.
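As an illustration of the noun and adjective generation step, the sketch below derives the six forms of the class SMN PMN (see Table 1 below) from a lemma by sound suffixation. The example lemma (meaning teacher) is ours, and the case endings and spelling changes handled by the real generator are omitted.

import java.util.LinkedHashMap;
import java.util.Map;

/** Derives the number/gender forms of an SMN PMN lemma (Table 1) by suffixation. */
public final class FormGenerator {
    public static Map<String, String> smnPmnForms(String lemma) {
        Map<String, String> forms = new LinkedHashMap<>();
        forms.put("SMN", lemma);          // the lemma itself
        forms.put("SFN", lemma + "ة");    // feminine singular (ta marbuta)
        forms.put("DMN", lemma + "ان");   // masculine dual (nominative)
        forms.put("DFN", lemma + "تان");  // feminine dual (nominative)
        forms.put("PMN", lemma + "ون");   // sound masculine plural (nominative)
        forms.put("PFN", lemma + "ات");   // sound feminine plural
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(smnPmnForms("معلم")); // teacher -> معلمة, معلمان, معلمتان, ...
    }
}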
Table 1. Plural classes

Class      Possible Word Forms
SMN PMN    SMN, SFN, DMN, DFN, PFN, PMN
SMN PFN    SMN, DMN, PFN
SFN PFN    SFN, DFN, PFN
SMN PIN    SMN, DMN, PIN
SFN PIN    SFN, DFN, PIN
FWN PFN    FWN, DMN, PFN
4.2. Machine Learning Lemmatization
The format of the training corpus is similar to that of the dictionary described in the previous section. The only difference is that the entries are ordered according to their original position in the sentence in the segmented corpus, with an empty line separating consecutive sentences.
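Given such a corpus, with one word–POS–lemma triple per line and sentences separated by empty lines, an OpenNLP maximum entropy lemmatizer model can be trained roughly as follows; the corpus file name is a placeholder and default training parameters are assumed.

import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainLemmatizer {
    public static void main(String[] args) throws Exception {
        // One "word<TAB>POS<TAB>lemma" triple per line; an empty line ends a sentence.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("ar-lemma-train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines);

        LemmatizerModel model = LemmatizerME.train(
                "ar", samples, TrainingParameters.defaultParams(), new LemmatizerFactory());
        // model.serialize(...) would persist the model for use in the pipeline.
    }
}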
Abed Alhakim Freihat et al. / Procedia Computer Science 142 (2018) 132–140 137
Freihat et al. /Procedia Computer Science 00 (2017) 000–000 5
(a) (b)
Fig. 4. Examples of the contents of (a) the dictionary and (b) of the corpus used to train the classifier.
approach is justified by the inherent ambiguity of diacritic-free Arabic words whose meanings are typically deduced,
by humans and machines alike, from context. While the preliminary POS tagging resolves a great deal of ambiguity,
some cases still remain such as the verb form

which may be the verb form of the verb
, or the verb
.
A downside of learning-based lemmatization is that more rare and exceptional cases, such as
(spears), may not
be covered by its training corpus, which leads to lemmatization mistakes. The addition of new cases requires the re-
training of the classifier. Another inconvenience is that classifiers—such as OpenNLP that we used—typically commit
on a single output result, which may or may not be correct. In case of such ambiguity, from the full set of possible
lemmas further NLP processing steps may be able to provide a correct results based on, e.g., syntactic or semantic
analysis. In order to support these cases, we complement the learning-based lemmatizer by a dictionary-based one.
The dictionary lemmatizer can be run independently, but we also provide a simple fusion method that combines the
results of the two lemmatizers as described below.
Both lemmatizer components were implemented using the Apache machine learning based toolkit OpenNLP,4
using the Maximum Entropy classifier.
4.1. Dictionary Lemmatization
The dictionary of a db lemmatizer consists of a text file containing, for each row, a word, its POS tag and the
corresponding lemma, each column separated by a tab character. An example of the of the dictionary is shown in
column (A) of figure 4.
In case of ambiguous word forms (i.e., a word form POS-tag pair that has several lemmas), the corresponding
lemmas are separated by ”#” character. For example lemmas of the word form 
are 
#
In the following we describe the method we used to build the dictionary. The used corpus is the same corpus we
used for segmenting, POS-tagging, and named entity recognition as described in our previous work [15].
1. Segmentation: The corpus was segmented as explained in the previous section. The result of this step was
generating a segmented corpus that contains more than 3.1 million segmented tokens.
2. POS-tag based classification: In this step, we classified the word forms according to their POS-tag.
3. Inherent feminine and adjectival feminine classification: In this step, we classified the feminine nouns into inherent feminine and adjectival feminine nouns. For example, the noun meaning "family" (SFN) is inherently feminine, while the noun meaning "(female) prisoner" (SFN) is adjectival. This differentiation is important because the lemma of an adjectival noun is its masculine singular form, whereas an inherently feminine noun has no masculine singular lemma.
4. Plural type classification: In this step, we classified the singular and dual nouns (after extracting their singular forms) according to their plural type into six classes, as shown in Table 1. This classification enables us to build the possible number-gender word forms of a given lemma automatically. For example, the class SMN PMN has six different possible number-gender forms. Furthermore, the feminine classification lists from the previous step enabled us to differentiate between the classes SMN PFN and SFN PFN. In the class SMN PFN, the lemma of a singular feminine noun (SFN) is the singular masculine noun (SMN). In the class SFN PFN, on the other hand, the lemma is the singular feminine noun itself. The adjectives were classified into three classes. The first class is similar to the class SMN PMN, which allows six different word forms. The second class contains a seventh possible form, namely the broken plural adjective form. The third class contains PIAJ as the single possible plural form. For example, one adjective belongs to this class since it has two possible plural forms.
4 http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.cli.lemmatizer
5. Lemma extraction: This step was semi-automatic, as follows:
Manual: Assigning the lemmas to broken plural nouns and adjectives was performed manually.
Automatic: Based on the morphological features in the tags, it was possible to extract lemmas for singular, dual, masculine plural, and feminine plural adjectives and nouns. We also used rules to extract the verb lemmas, such as removing the affixes.
6. Lemma enrichment: Using the lemmas from the previous step, we enriched the corpus with new verbs, adjectives, and nouns. For example, if the lemma of a plural noun or adjective was missing, we added it to the noun and adjective lemma lists.
7. Dictionary generation: The files produced so far are as follows:
(a) Noun files: Three files for masculine, feminine, and foreign nouns. The lemmas in these files were classified according to Table 1. A fourth file contains quantifiers, pronouns, adverbs, etc.
(b) Adjective files: Three files for adjectives, comparatives, and ordinal adjectives. The lemmas in the adjective file are classified according to Table 1.
(c) Verb file: One file that contains all extracted verb lemmas.
Using these files, the dictionary was generated as follows:
Noun and adjective generation: According to the plural class, the noun and adjective forms were generated. Case endings, as well as the required letter changes, were also considered in this step.
Verb generation: For each verb in the verb lemma list, we automatically generated the verb conjugations in the present, past, and imperative. We also considered accusative and asserted verb forms.
Dictionary building: Using the results from the previous steps, we built the dictionary as described in column (a) of Figure 4, where the lemmas of ambiguous surface forms were joined into a single string using the "#" operator.
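A dictionary-building step along these lines can be sketched as follows (a simplified illustration, assuming the generated word forms are available as (form, POS, lemma) triples; all names are ours):

def build_dictionary(entries, out_path):
    """Merge (word form, POS, lemma) triples into dictionary rows;
    the lemmas of an ambiguous (form, POS) pair are joined with '#'.
    'entries' is any iterable of (form, pos, lemma) triples."""
    lemmas = {}
    for form, pos, lemma in entries:
        candidates = lemmas.setdefault((form, pos), [])
        if lemma not in candidates:
            candidates.append(lemma)
    with open(out_path, "w", encoding="utf-8") as out:
        for (form, pos), ls in sorted(lemmas.items()):
            out.write(f"{form}\t{pos}\t{'#'.join(ls)}\n")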
Table 1. Plural classes

Class      Possible Word Forms
SMN PMN    SMN, SFN, DMN, DFN, PFN, PMN
SMN PFN    SMN, DMN, PFN
SFN PFN    SFN, DFN, PFN
SMN PIN    SMN, DMN, PIN
SFN PIN    SFN, DFN, PIN
FWN PFN    FWN, DMN, PFN

(The Arabic examples of the original third column are omitted.)
4.2. Machine Learning Lemmatization
The format of the corpus here is similar to that of the dictionary described in the previous section. The only difference is that the entries are ordered according to their original position in the sentence in the segmented corpus. An empty line indicates the end of a sentence. An example of the training corpus is shown in column (b) of Figure 4. The lemmatization corpus was built from the segmented corpus of the previous section in two steps:
1. Lemma assignment: In this step, we used the dictionary lemmatizer to assign the word forms to their corresponding lemmas. In the case of prepositions, particles, and numbers, the lemma of the word form was a normalized form of the word form itself. The lemmas of named entities were the named entities themselves. If a word form was ambiguous, all its possible lemmas were assigned.
2. Validation: In this step, we manually disambiguated the lemmas of the ambiguous word forms.
The size of the generated corpus is 3,229,403 lines. The number of unique word forms, after discarding digits, is 59,049, as specified in Table 2.
Table 2. Distribution of lemmas and unique word forms in the corpus of the mlb lemmatizer
POS Number of lemmas Number of word forms
Noun 18,165 26,337
Adjective 6,369 13,703
Verb 4,258 19,009
Named entity 20,407 20,407
Particle 605 649
In a final step, we added to the corpus all generated word forms and their corresponding lemmas from the dictionary described in the previous section. This increased the size of the corpus to 3,890,737 lines.
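The corpus construction of step 1 can be sketched as follows (a simplified illustration reusing the dictionary lookup sketched earlier; the manual validation of step 2 is not shown, and the fallback to the word form itself is our simplification of the normalization rule):

def write_training_corpus(sentences, dictionary, out_path):
    """Write a lemmatizer training corpus: one 'word<TAB>POS<TAB>lemma'
    line per token, with an empty line marking the end of a sentence.
    'sentences' is an iterable of lists of (word, pos) pairs."""
    with open(out_path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            for word, pos in sentence:
                # Ambiguous forms keep all their '#'-joined lemmas here;
                # they are disambiguated manually in the validation step.
                lemmas = dictionary.get((word, pos), [word])
                out.write(f"{word}\t{pos}\t{'#'.join(lemmas)}\n")
            out.write("\n")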
4.3. Fusion
While the learning-based lemmatizer outputs a single candidate lemma for each word, multiple solutions may be retrieved from the dictionary even for a single part of speech (for example, a verb form whose lemma may be either the verb meaning "gave up" or the verb meaning "converted to Islam"). The goal of the simple fusion component is to produce a final result from these solutions. The final output is a list of one or more lemmas in decreasing order of confidence.
The idea underlying the fusion method is that we usually trust the dictionary to provide a correct solution space (a small set of possible lemmas), while we usually trust the classifier to return the most likely lemma from that set. However, in the case of out-of-corpus words, the classifier may return incorrect results extrapolated from similar examples, such as returning a nonexistent lemma for an unseen word form. Thus, whenever the classifier returns a lemma that is not included in the dictionary, it is still included as a solution, but with a lower confidence.
Accordingly, our simple fusion method is as follows. We take as input the results output by the two lemmatizers, namely L_DIC = {l_1, ..., l_n} for the dictionary-based one and L_CL = {l} for the classifier-based one, and output L_F, the fusion result. We start by comparing the results of the two lemmatizers:
- if |L_DIC| = 1 and l_1 = l, i.e., the outputs are identical, then the solution is trivial: we return either output and we are done: L_F = {l};
- otherwise, two further cases are distinguished:
  - if l ∈ L_DIC, that is, the dictionary contains the classification output, then we prioritize the result of the classifier by making it first (i.e., the preferred lemma): L_F = {l, l_1, ..., l_n};
  - otherwise, we add the classifier result as the last element: L_F = {l_1, ..., l_n, l}.
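This fusion rule translates directly into a few lines of code; the Python sketch below (the function name is ours) mirrors the three cases:

def fuse(l_dic, l_cl):
    """Combine the dictionary lemmas (an ordered list) with the single
    classifier lemma, returning lemmas in decreasing confidence."""
    # Trivial case: both lemmatizers agree on a unique lemma.
    if len(l_dic) == 1 and l_dic[0] == l_cl:
        return [l_cl]
    # The classifier output is confirmed by the dictionary:
    # promote it to the first (preferred) position.
    if l_cl in l_dic:
        return [l_cl] + [lem for lem in l_dic if lem != l_cl]
    # The classifier output is unknown to the dictionary:
    # keep it, but with the lowest confidence (last position).
    return l_dic + [l_cl]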
5. Evaluations
For evaluation we used a corpus of 46,018 tokens, retrieved and assembled from several news portals (such as the Aljazeera news portal5 and the Al-Quds Al-Arabi newspaper6). We excluded from the evaluation the categories of tokens that cannot be lemmatized: 5,853 punctuation tokens, 3,829 tokens tagged as named entities, 482 digit tokens, and 10 malformed tokens (i.e., tokens containing typos). Thus the number of tokens considered was 35,844.
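As a sanity check on these counts (our own illustration, not part of the published evaluation scripts), the excluded categories subtracted from the corpus size reproduce the number of evaluated tokens:

# Tokens excluded from evaluation: punctuation, named entities,
# digits, and malformed tokens, subtracted from the corpus size.
total, punct, ne, digits, malformed = 46_018, 5_853, 3_829, 482, 10
considered = total - punct - ne - digits - malformed
assert considered == 35_844  # matches the figure reported above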
In order to have a clear idea of the efficiency of the lemmatization pipeline, we evaluated it in a fine-grained manner, manually classifying the mistakes according to the component involved. This allows us to compute a comprehensive accuracy for the entire pipeline as well as to evaluate individual components: the POS tagger, the segmenter, each lemmatizer, and the fusion lemmatizer. The evaluation data files are available online.7
Table 3. Types of mistakes committed by the learning-based lemmatizer, and their proportions

Type of mistake                             Occurrences
POS tag (coarse-grained) mistakes           199
Morphological tag (fine-grained) mistakes   201
Segmentation tag mistakes                   103
Classifier mistakes: nonexistent lemma      158
Classifier mistakes: wrong disambiguation   12
Dictionary mistakes: missing word form      1,207
Fusion mistakes                             50

(The Arabic examples of the original "Example" column are omitted.)
The fine-grained evaluation is summed up in Table 3.8 "Nonexistent lemma" stands for cases where the POS tag and the segmentation were correct, yet the classifier gave a wrong, non-linguistic result. "Wrong disambiguation" means that the lemmatizer chose an existing but incorrect lemma for an ambiguous word form.
Table 4. Accuracy values computed for various components of the lemmatization pipeline

Component                     Evaluation method                                  Accuracy
preprocessing                 all mistakes (POS, morphological, segmentation)    98.6%
classifier-based lemmatizer   in isolation                                       99.5%
classifier-based lemmatizer   in isolation, built-in OpenNLP cross-validation    99.7%
classifier-based lemmatizer   entire pipeline                                    98.1%
dictionary-based lemmatizer   in isolation                                       96.6%
dictionary-based lemmatizer   entire pipeline                                    95.2%
fusion lemmatizer             entire pipeline                                    98.4%
The accuracy measures reported in Table 4 were computed based on the results in Table 3. On these we make the following observations. The performance of preprocessing (98.6%) represents an upper bound for the entire lemmatization pipeline. In this perspective, the near-perfect results of the classifier (99.5% when evaluated in isolation, 98.1% on the entire pipeline) are remarkable. We cross-checked these results using the built-in cross-validation feature of OpenNLP and obtained similar results (99.7%).
5 http://www.aljazeera.net/
6 http://www.alquds.co.uk/
7 http://www.arabicnlp.pro/alp/lemmatizationEval.zip
8 While after tagging and segmentation the number of (segmented) tokens rose to 62,694, we computed our evaluation results based on the number of unsegmented tokens.
The dictionary-based lemmatizer reached a somewhat lower yet still very decent result (96.6% in isolation, 95.2% on the entire pipeline), due to the 1,207 OOV word forms. The fusion of the two lemmatizers, finally, improved slightly on the classifier: of the 170 mistakes made by the classifier, 120 could be correctly lemmatized using the dictionary. Thus the fusion method reached a full-pipeline result of 98.4%, only slightly below the performance of preprocessing itself.
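These accuracy figures can be reproduced from the mistake counts of Table 3 over the 35,844 evaluated tokens; the following snippet (our own check, not part of the published tooling) performs the arithmetic:

TOKENS = 35_844

# Mistake counts taken from Table 3.
preprocessing = 199 + 201 + 103  # POS + morphological + segmentation = 503
classifier    = 158 + 12         # nonexistent lemma + wrong disambiguation = 170
dictionary    = 1_207            # missing word forms
fusion        = 50               # classifier mistakes not recovered (170 - 120)

for name, errors in [
    ("preprocessing", preprocessing),
    ("classifier, entire pipeline", preprocessing + classifier),
    ("dictionary, entire pipeline", preprocessing + dictionary),
    ("fusion, entire pipeline", preprocessing + fusion),
]:
    print(f"{name}: {1 - errors / TOKENS:.1%}")
# Prints 98.6%, 98.1%, 95.2%, 98.5%; the values agree with
# Table 4 up to rounding.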
6. Conclusion and Future Work
We presented an optimized approach to Arabic lemmatization, based on the combination of machine learning and a lemmatization dictionary, that provides excellent accuracy. Besides the result itself, the addition of a lemmatization dictionary provides additional robustness to the underlying NLP pipeline. Firstly, it makes the lemmatizer easy to extend with new lemmas that could potentially be mislabeled by the classifier. Secondly, it allows the lemmatizer to return not only one result but an ordered list of candidate lemmas, allowing the decision to be delayed to subsequent NLP components.
Both the machine learning model and the dictionary were built using a corpus of 2.2 million tokens annotated and
manually validated by the authors. The dictionary, the trained model, and corresponding tools are all free for research
purposes upon request.
The presented tool was implemented as a component of the ALP comprehensive NLP pipeline. We plan to extend
the current pipeline with new components such as a vocalizer, a phrase chunker, a dependency parser, or a multiword
expression detector.
References
[1] R. Navigli, "Word sense disambiguation: A survey," ACM Comput. Surv., vol. 41, pp. 10:1–10:69, Feb. 2009.
[2] V. Balakrishnan and L.-Y. Ethel, "Stemming and lemmatization: A comparison of retrieval performances," vol. 2, pp. 262–267, Jan. 2014.
[3] A. A. Freihat, M. R. H. Qwaider, and F. Giunchiglia, "Using Grice maxims in ranking community question answers," in Proceedings of the Tenth International Conference on Information, Process, and Knowledge Management, eKNOW 2018, Rome, Italy, March 25-29, 2018, pp. 38–43, 2018.
[4] F. Giunchiglia, U. Kharkevich, and I. Zaihrayeu, "Concept search," in The Semantic Web: Research and Applications (L. Aroyo, P. Traverso, F. Ciravegna, P. Cimiano, T. Heath, E. Hyvönen, R. Mizoguchi, E. Oren, M. Sabou, and E. Simperl, eds.), (Berlin, Heidelberg), pp. 429–444, Springer Berlin Heidelberg, 2009.
[5] A. Farghaly and K. Shaalan, "Arabic natural language processing: Challenges and solutions," vol. 8, pp. 14:1–14:22, Dec. 2009.
[6] M. Gridach and N. Chenfour, "Developing a new approach for Arabic morphological analysis and generation," CoRR, vol. abs/1101.5494, 2011.
[7] K. Shaalan, "A survey of Arabic named entity recognition and classification," Comput. Linguist., vol. 40, pp. 469–510, June 2014.
[8] O. Hamed and T. Zesch, "A survey and comparative study of Arabic diacritization tools," JLCL: Special Issue - NLP for Perso-Arabic Alphabets, vol. 32, no. 1, pp. 27–47, 2017.
[9] M. Boudchiche, A. Mazroui, M. Ould Abdallahi Ould Bebah, A. Lakhouaja, and A. Boudlal, "AlKhalil Morpho Sys 2: A robust Arabic morpho-syntactic analyzer," J. King Saud Univ. Comput. Inf. Sci., vol. 29, pp. 141–146, Apr. 2017.
[10] A. Pasha, M. Al-Badrashiny, M. T. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow, and R. Roth, "MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic," in LREC, vol. 14, pp. 1094–1101, 2014.
[11] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A fast and furious segmenter for Arabic," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 11–16, Association for Computational Linguistics, San Diego, California, 2016.
[12] M. Attia, A. Zirikly, and M. T. Diab, "The power of language music: Arabic lemmatization through patterns," in Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon, CogALex@COLING 2016, Osaka, Japan, December 12, 2016, pp. 40–50, 2016.
[13] E. Al-Shammari and J. Lin, "A novel Arabic lemmatization algorithm," in Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, AND '08, (New York, NY, USA), pp. 113–118, ACM, 2008.
[14] T. El-Shishtawy and F. El-Ghannam, "An accurate Arabic root-based lemmatizer for information retrieval purposes," CoRR, vol. abs/1203.3584, 2012.
[15] A. A. Freihat, G. Bella, H. Mubarak, and F. Giunchiglia, "A single-model approach for Arabic segmentation, POS tagging, and named entity recognition," in 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), pp. 1–8, April 2018.
[16] I. Zeroual, A. Lakhouaja, and R. Belahbib, "Towards a standard part of speech tagset for the Arabic language," Journal of King Saud University - Computer and Information Sciences, vol. 29, no. 2, pp. 171–178, 2017. Arabic Natural Language Processing: Models, Systems and Applications.