Arabic Morphological Analysis: a New Approach
Riad Sonbol
Informatics Department, HIAST
Damascus, SYRIA
rsonbol@gmail.com
Nada Ghneim
Informatics Department, HIAST
Damascus, SYRIA
nghnem@scs-net.org
Mohammed Said Desouki
Informatics Department, HIAST
Damascus, SYRIA
msdos@scs-net.org
Abstract—Morphological analysis is an important step in Arabic language processing, which has become a necessity in today's IT development. We present a new approach for Arabic root extraction. Our algorithm outperforms most other morphological analysis algorithms: we achieve more reliable results with better performance and minimal storage. Our morphological analyzer provides different levels of reliability and performance to support the needs of different applications. It also provides a separate stage which can easily be plugged into any other morphological analyzer to improve its performance, with no negative effect on its reliability.
Keywords—Morphological Analyzer; O-Letters Detection Algorithm; Arabic Language Processing; Root Extraction.
I. INTRODUCTION
Morphological analysis is an important step in Arabic language processing because of the complex morphological structure of Arabic. Moreover, morphological analysis is a basic step in almost all Natural Language Processing (NLP) applications, including text mining, information retrieval (IR), machine translation, and automatic summarization. Therefore, the time spent on creating an "intelligent", reliable, and efficient Arabic morphological analyzer is justified by its reuse in many of these applications.
Our aim in this paper is to present a new root-extraction algorithm which outperforms other approaches in both reliability and performance. We aim to provide an approach that can be used efficiently in both IR and NLP applications.
In Section 2 we present an overview of previous work on Arabic morphology. Section 3 describes our approach to morphological analysis. In Section 4, we present experiments that evaluate the approach. We conclude our study in Section 5.
II. PREVIOUS WORK
Several approaches have been proposed for Arabic stemming; the most common are light stemmers and root-based stemmers.
Most light stemmers [1][4][6] are based on the same idea: delete the most frequent prefixes and suffixes to obtain the stem. These stemmers are not concerned with producing the Arabic root, because they are designed to serve IR applications.
Root-based stemmers try to find the correct morphological analysis. Many morphological analyzers have been developed, such as the Khoja stemmer [5], the ISRI stemmer [7], and the Buckwalter morphological analyzer [3].
Khoja and Garside produced an effective stemmer. It removes prefixes and suffixes, then matches the remaining word against a set of patterns to extract the root; finally, it checks whether the extracted root is valid using a list of Arabic roots. The Khoja stemmer is considered a high-performance stemmer. However, it causes some problems, especially when removing prefixes and suffixes, which can lead to wrong solutions or to failure of the stemming operation. It also generates wrong roots for words that contain an Ebdal case. Moreover, this stemmer gives one solution for each word, ignoring other possible solutions, which makes it less effective in NLP applications, where all possible solutions must be provided. For example, the Arabic word 'نقول' has two possible roots: 'قول' (say) and 'نقل' (transfer), but according to the Khoja stemmer the root is always 'نقل'.
Taghva, Elkhoury, and Coombs proposed the ISRI stemmer, which is similar to the Khoja stemmer but uses no root dictionary. This stemmer performs equivalently to the Khoja stemmer, as well as to light stemmers, in IR tasks.
Buckwalter's morphological analyzer follows another approach: it produces all possible morphological analyses using three Arabic lexicon files: a prefixes dictionary (299 entries), a suffixes dictionary (618 entries), and a stems dictionary (82158 entries), together with three compatibility tables representing pairs of compatible morphological categories (prefix-suffix, prefix-stem, and stem-suffix). It provides highly reliable results, which makes it one of the most useful analyzers for NLP tasks. However, it is not a suitable choice for IR tasks, where performance is an important metric.
III. OUR APPROACH
A. Main Idea:
The principal idea of our algorithm is to encode Arabic letters with a new code that (1) preserves morphologically useful information, and (2) simplifies capturing that information toward retrieving the root.
We divide our algorithm into two main stages. In the first stage, we try to extract all possible roots without using any dictionary. The second stage is designed to improve the reliability and to solve some special problems such as Ebdal and Ealal.
B. Algorithm Detail:
Stage 1. Extracting Possible Roots:
1) Normalization:
Before stemming, we normalize the Arabic word by applying the following steps (a code sketch follows the list):
- Remove diacritics, and the Shadda.
- Replace all distinct forms of Hamza with (أ).
- Replace Madda (آ) with Hamza and Alef (أا).
- Replace Alef Maksura (ى) with Alef (ا).
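For illustration, here is a minimal sketch of this normalization in Python. The function name and the exact Unicode range are our choices, not the paper's:

    import re

    # Arabic diacritics block U+064B..U+0652 (tanween, short vowels, shadda, sukun)
    DIACRITICS = re.compile(r'[\u064B-\u0652]')

    def normalize(word: str) -> str:
        word = DIACRITICS.sub('', word)       # remove diacritics and the Shadda
        word = re.sub('[إؤئء]', 'أ', word)     # unify all Hamza forms to أ
        word = word.replace('آ', 'أا')         # Madda -> Hamza + Alef
        word = word.replace('ى', 'ا')          # Alef Maksura -> Alef
        return word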
2) Encoding:
In this step, we encode the Arabic letters in our new coding. This coding is based on six symbols {O, P, S, PS, U, A}, representing six groups of letters, each of which shares certain characteristics:
O: Original letters. These letters are surely part of the root. They are: {ث، ج، ح، خ، د، ذ، ر، ز، ش، ص، ض، ط، ظ، ع، غ، ق}.
P: Prefix letters. These letters can be added only in the prefix part. They are: {ب، ف، س، ل}.
S: Suffix letters. These letters can be added only in the suffix part. They are: {ه} (only Haa).
PS: Prefix-Suffix letters. These letters can be added on both sides of the word, i.e. in the suffix part or in the prefix part. They are: {ك، م، ن}.
U: Uncertain letters. These letters can be added anywhere in the word. They are: {ا، أ، و، ي، ت}.
A: Added letters. These letters are always considered additional letters. They are: {ة} (only Taa Marbuta).
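A minimal sketch of this letter-group table, assuming the letter sets as reconstructed above and a normalized input word (names are ours; the position conditions below are applied as a separate refinement):

    # Letter-group table for the six-symbol coding {O, P, S, PS, U, A}.
    LETTER_GROUPS = {
        **{c: 'O' for c in 'ثجحخدذرزشصضطظعغق'},  # original: surely part of the root
        **{c: 'P' for c in 'بفسل'},               # prefix-only letters
        'ه': 'S',                                 # suffix-only letter (Haa)
        **{c: 'PS' for c in 'كمن'},               # prefix-or-suffix letters
        **{c: 'U' for c in 'اأويت'},              # uncertain: may appear anywhere
        'ة': 'A',                                 # always an added letter (Taa Marbuta)
    }

    def encode(word: str) -> list[str]:
        """Map each letter of a normalized word to its group symbol."""
        return [LETTER_GROUPS[c] for c in word]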
We add some improvements to this coding by adding position conditions when processing some letters. There is a maximum index for each letter when it occurs in a prefix. For example, the letter Baa 'ب' can be part of a prefix only if it is found in the first three letters of the word; otherwise it cannot be part of the prefix.
We can apply the same idea to suffixes by using a minimum index in suffixes for each letter, but we found that this idea is not effective for S letters (except Haa 'ه').
Table 1 presents the maximum index in prefixes and the minimum index in suffixes for the relevant letters (P, S, and PS letters). The symbol '*' indicates that it is not effective to put such a condition on that letter; '_' indicates that the condition does not apply.
Table 1- The position of some letters in the word

Letter      Maximum index in prefixes   Minimum index in suffixes
Baa 'ب'     3                           _
Lam 'ل'     5                           _
Seen 'س'    4                           _
Faa 'ف'     2                           _
Haa 'ه'     _                           3
Kaf 'ك'     3                           *
Noon 'ن'    *                           *
Meem 'م'    *                           *
In addition, the maximum length of any prefix or suffix is (len-2), where len is the length of the word; in other words, at least two letters of the word must remain outside the affixes.
So, the position rules are (a code sketch follows the list):
- We consider the letter Baa 'ب' a prefix letter if it is found in the first three letters; otherwise it is an original letter.
- We consider the letter Faa 'ف' a prefix letter if it is found in the first two letters; otherwise it is an original letter.
- We consider the letter Seen 'س' a prefix letter if the next letter is one of the letters {أ، ن، ي، ت} (Hamza, Noon, Yaa, and Taa); otherwise it is an original letter.
- We consider the letter Lam 'ل' a prefix letter if it is found in the first five letters; otherwise it is an original letter.
- We consider the letter Haa 'ه' a suffix letter if it is found in the last three letters; otherwise it is an original letter.
- We consider the letter Kaf 'ك' a prefix letter if it is found in the first three letters; otherwise we consider it a suffix letter.
- We consider any prefix letter an original letter when it is not found in the first (len-2) letters (note that the per-letter prefix rules cannot discover such a case).
- We consider any suffix letter an original letter when it is not found in the last (len-2) letters (note that the per-letter suffix rules cannot discover such a case).
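A sketch of these position conditions as a refinement pass over the raw codes. Positions are 1-based; `pos` counts from the word start, `rpos` from the end. The exact control flow (and keeping the raw code when a letter is within its threshold) is our reading of the rules, not code from the paper:

    PREFIX_MAX = {'ب': 3, 'ل': 5, 'س': 4, 'ف': 2, 'ك': 3}

    def apply_position_rules(word: str, codes: list[str]) -> list[str]:
        n = len(word)
        out = list(codes)
        for i, ch in enumerate(word):
            pos, rpos = i + 1, n - i
            if ch in PREFIX_MAX and pos > PREFIX_MAX[ch]:
                # past its maximum prefix index: Kaf becomes a suffix letter,
                # the other prefix letters become original letters
                out[i] = 'S' if ch == 'ك' else 'O'
            if ch == 'س' and (i + 1 >= n or word[i + 1] not in 'أنيت'):
                out[i] = 'O'      # Seen is a prefix only before أ / ن / ي / ت
            if ch == 'ه' and rpos > 3:
                out[i] = 'O'      # Haa outside the last three letters
            # an affix letter outside the first/last (len-2) letters is original
            if out[i] == 'P' and pos > n - 2:
                out[i] = 'O'
            if out[i] == 'S' and rpos > n - 2:
                out[i] = 'O'
        return out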
After applying this coding, we obtain an encoded word which is much more useful for morphological analysis. In some cases it finds the root directly: if we have 3 O-Letters (or more) in the encoded word, these letters are considered the root letters, and we can terminate the process here. Otherwise, we have to apply the next step. For example, we can extract the root of the word in Figure 1 using this step alone, but we cannot do so for the word in Figure 2.
Figure 1- Applying the Encoding step on 'ّا'
Figure 2- Applying the Encoding step on 'ا'
3) Apply Transformation Rules:
In this step, we apply transformation rules between the groups to obtain the maximum number of original letters we can discover. Note that the words "before" and "after" in the transformation rules follow the Arabic reading direction (right to left).
Transformation rules:
R1) Change each 'P' after an 'O' to 'O'.
R2) Change each 'S' before an 'O' to 'O'.
R3) Change each 'P' after an 'S' to 'O', and each 'S' before a 'P' to 'O'.
R4) Change each 'PS' before a 'P' to 'P'.
R5) Change each 'PS' before an 'O' to 'P'.
R6) Change each 'PS' after an 'S' to 'S'.
R7) Change each 'PS' after an 'O' to 'S'.
These rules follow from the properties of our encoding. For example, we cannot have a P-Letter after an O-Letter: if we did, all letters before this P would have to be part of the prefix, which is a contradiction, because there is an O-Letter before it. A code sketch of these rules follows.
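A minimal sketch of R1-R7, run to a fixed point. Two assumptions are ours: "before"/"after" mean anywhere earlier/later in reading order (index 0 is the first letter read, so string order matches reading order), and when two rules compete the listed order wins:

    def transform(codes: list[str]) -> list[str]:
        out = list(codes)
        changed = True
        while changed:                      # iterate until no rule fires
            changed = False
            for i, c in enumerate(out):
                before = out[:i]            # letters read earlier than position i
                after = out[i + 1:]         # letters read later than position i
                new = c
                if c == 'P' and 'O' in before:
                    new = 'O'               # R1
                elif c == 'S' and 'O' in after:
                    new = 'O'               # R2
                elif c == 'P' and 'S' in before:
                    new = 'O'               # R3, first half
                elif c == 'S' and 'P' in after:
                    new = 'O'               # R3, second half
                elif c == 'PS' and 'P' in after:
                    new = 'P'               # R4
                elif c == 'PS' and 'O' in after:
                    new = 'P'               # R5
                elif c == 'PS' and 'S' in before:
                    new = 'S'               # R6
                elif c == 'PS' and 'O' in before:
                    new = 'S'               # R7
                if new != c:
                    out[i], changed = new, True
        return out

The loop terminates because every change moves a code toward 'O' (PS to P or S, then P or S to O), and 'O' never changes.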
As in the previous step, if we now have 3 O-Letters in the encoded word, these letters are considered the root letters and the process terminates here; otherwise we apply the next step.
We call the last two steps (Encoding and Apply Transformation Rules) the O-Letters Detection Algorithm. For example, we can extract the root of the word in Figure 3 immediately after this step.
Figure 3- Applying the O-Letters Detection Algorithm on 'ا'
4) Extract Possible Roots:
We apply this step only if we have fewer than 3 O-Letters. In this step, we use the idea of traditional algorithms, but with the aid of our encoded word. Traditional algorithms store lists of Arabic prefixes, suffixes, and patterns; they delete prefixes and suffixes, then use the pattern to extract the root from the remainder. With the aid of the encoded word, we achieve more reliable results without storing all these lists, as follows:
o We replace the list of classical prefixes with a new, adequate list (78 entries). We consider as a prefix each string of letters that can be found before the first original letter in the word (which is called Faa Al-Feal); such a prefix can extend beyond the classical prefix.
o We replace the list of suffixes with 4 equivalent conditions (a code sketch of this check follows Table 2):
  - The encoded suffix contains only the symbols S, PS, or U; i.e., there is no O or P in it.
  - If we have the letter Meem 'م' in the suffix, the suffix must be one of a fixed list of valid Meem suffixes.
  - If we have the letter Taa Marbuta 'ة' in the suffix, the suffix must be one of a fixed list of valid Taa Marbuta suffixes.
  - If we have a Hamza in the suffix, the letter preceding the Hamza must be Alef 'ا'.
For example, in the Arabic word of Figure 1, the available suffixes include 'ة'.
o We replace the list of patterns for triliteral roots with 6 equivalent forms using our coding (Table 2); the sketch below matches against these forms.

Table 2- Encoded patterns

Equivalent Arabic patterns                                   The new form
(bare triliteral pattern)                                    OOO
(one added letter after the first root letter)               OUOO
(one added letter after the second root letter)              OOUO
(added letters after the first and second root letters)      OUOUO
(two added letters after the first root letter; uncommon)    OUUOO
(two added letters after the second root letter; uncommon)   OOUUO
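A minimal sketch of the suffix conditions and the encoded-pattern matching, reusing the helpers from the earlier sketches. The Meem and Taa Marbuta suffix sets are placeholders (the paper's exact lists are not reproduced here), and trying every subset of uncertain letters is our way of enumerating the pattern matches:

    from itertools import combinations

    MEEM_SUFFIXES = {'هم', 'كم', 'تم'}       # placeholder entries
    TAA_MARBUTA_SUFFIXES = {'ة'}             # placeholder entries

    def is_valid_suffix(suffix: str, codes: list[str]) -> bool:
        """The four conditions that replace a stored suffix list."""
        if any(c in ('O', 'P') for c in codes):   # 1) only S, PS, or U symbols
            return False
        if 'م' in suffix and suffix not in MEEM_SUFFIXES:          # 2)
            return False
        if 'ة' in suffix and suffix not in TAA_MARBUTA_SUFFIXES:   # 3)
            return False
        for i, ch in enumerate(suffix):           # 4) Hamza must follow Alef
            if ch == 'أ' and (i == 0 or suffix[i - 1] != 'ا'):
                return False
        return True

    ENCODED_PATTERNS = {'OOO', 'OUOO', 'OOUO', 'OUOUO', 'OUUOO', 'OOUUO'}

    def candidate_roots(stem: str, codes: list[str]) -> set[str]:
        """Match a remainder against the encoded pattern forms of Table 2.

        Each uncertain letter may or may not belong to the root, so we try
        every choice and keep those whose overall shape is a valid pattern.
        """
        roots = set()
        u_positions = [i for i, c in enumerate(codes) if c != 'O']
        for k in range(len(u_positions) + 1):
            for keep in combinations(u_positions, k):
                shape = ''.join('O' if (c == 'O' or i in keep) else 'U'
                                for i, c in enumerate(codes))
                if shape in ENCODED_PATTERNS:
                    roots.add(''.join(ch for ch, s in zip(stem, shape) if s == 'O'))
        return roots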
Stage 2. Improving the Reliability:
The aim of this stage is to enhance the reliability by deleting wrong solutions and adding missed ones (which represent special cases). In all steps, we consider the prefix, suffix, and encoded-pattern conditions described in the previous stage.
1) Root Existence Test:
In this test, we check for the existence of the concluded root in the list of Arabic roots. If it does not exist, we apply the next step (step 2) to it, and then delete this wrong root. We can implement this test as an O(1) lookup using a suitable data structure (a two-dimensional array) to store the roots; a sketch follows.
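The paper stores the roots in a two-dimensional array; a hash set gives the same constant-time behavior in Python, so this sketch uses one. The listed roots are placeholder entries, not the full root list:

    ARABIC_ROOTS = frozenset({'قول', 'نقل', 'كتب'})   # placeholder entries

    def is_valid_root(root: str) -> bool:
        """O(1) membership test in the Arabic root list."""
        return root in ARABIC_ROOTS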
2) Adding Special Solutions (Ebdal & Ealal):
We apply this step when we have a wrong root, to check whether it is wrong because of a special case. We process two special cases: Ebdal and Ealal.
We detect an Ebdal problem according to a list of Arabic rules that are well known in Arabic language references [10][11]. We cannot do the same to solve the Ealal problem, because most Ealal rules depend on the diacritics. So we solve this problem as follows: when we have a weak letter (Alef, Yaa, or Waw), we replace this letter with each of the two other weak letters and check whether the result is a valid root. If so, we add this root to the possible roots; a sketch of this substitution follows.
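A minimal sketch of the weak-letter substitution, reusing the root test above (the function name is ours):

    WEAK = 'اوي'   # the weak letters: Alef, Waw, Yaa

    def ealal_candidates(root: str) -> set[str]:
        """Replace each weak letter by the other two and keep valid roots."""
        found = set()
        for i, ch in enumerate(root):
            if ch in WEAK:
                for repl in WEAK.replace(ch, ''):
                    cand = root[:i] + repl + root[i + 1:]
                    if is_valid_root(cand):
                        found.add(cand)
        return found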
3) Shadda Test:
When only two letters remain after deleting a possible prefix and a possible suffix, we try duplicating a letter (undoing a Shadda that was removed during normalization). We add the result to the available roots if it is a valid root.
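A sketch of this test; duplicating the final letter is our assumption, covering the common geminate case where a root has the form C1 C2 C2:

    def shadda_candidates(stem: str) -> set[str]:
        """Expand a two-letter remainder into a possible geminate root."""
        if len(stem) != 2:
            return set()
        cand = stem + stem[1]              # duplicate the final letter
        return {cand} if is_valid_root(cand) else set()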
4) Non-Triliteral Roots Test:
We add these roots by applying the procedure shown in the previous stage (Stage 1, step 4), but using suitable non-triliteral root patterns. In this step, it makes no difference whether we use the traditional patterns or the encoded ones: we achieve the same results with the same efficiency, because of the small number of these patterns.
IV. EXPERIMENTS AND DISCUSSION
A. Corpus:
We conducted our experiments using two different corpora. The first corpus consists of lists of word-root pairs (167162 pairs) extracted from the HIAST Arabic lexical database [2], which covers the morphological categories of Arabic (verbs, nouns, infinitives, plurals of nouns, analogous adjectives, exaggeration forms of the active participle, non-standard plurals, etc.). Because of this variety, this corpus plays an important role in determining whether a morphological analyzer handles all morphological categories with the same effectiveness.
The second corpus is a collection of 585 Arabic articles from different categories (politics, economy, culture, science and technology, and sport). This corpus consists of 377793 words.
B. Experiments:
a) Evaluating the O-Letters Detection algorithm:
Applying the O-Letters Detection algorithm (Encoding, then applying the transformation rules) yields the root directly whenever we retrieve three O-Letters (or more). Here, we measure the probability of such a case, the accuracy of these solutions, and the performance of the algorithm.
We applied the algorithm to the two corpora and found that we detect a sufficient number of O-Letters to retrieve the root directly in more than 30% of Arabic words (Table 3):
Table 3- The probability of each possible output of the O-Letters Detection Algorithm

Corpus       No O-Letters   One O-Letter   Two O-Letters   Three O-Letters (or more)
Dictionary   3.2%           16.6%          41.3%           38.9%
Articles     8.5%           22.1%          36.9%           32.6%
To evaluate the accuracy of the O-Letters Detection algorithm, we performed two evaluations (Table 4). In the first, we used the word-root pairs to check the behavior of the algorithm for each morphological category; in general, the accuracy of these results is more than 95%. In the second evaluation, we used the second corpus, which represents running Arabic text; here the accuracy is about 99%.
We studied the errors in each case and conclude that:
- About 75% of the errors are caused by words with non-triliteral roots. This result is expected, because whenever we have three O-Letters we conclude the root, even if the actual root is not triliteral.
- Some errors are a result of the Ebdal problem.
- The remaining errors are caused by words from rare patterns.
Table 4- O-Letters Detection Algorithm's accuracy

Corpus                         No. of words   Accuracy on tri-root words   Accuracy on non-tri-root words   Total
Dictionary                     167162         98.8%                        59.8%                            95.7%
Articles (manual evaluation)   3217           99.7%                        68.7%                            98.9%
Performance: this algorithm is very efficient; it processes more than 55000 words/sec.
b) Evaluating the Reliability of the Complete Approach:
We evaluated our approach using the two corpora. For about 85% of words, all concluded morphological analyses are right, and for less than 1% of words all concluded analyses are wrong. In most of the other cases, there are more right analyses than wrong ones. Overall, the accuracy is about 96%-98% (Table 5), and the average number of concluded analyses per word is 1.4 (Figure 4).
Table 5- Our approach's reliability

Corpus                         No. of words   Failure (no solution)   Accuracy
Dictionary                     167162         0.4%                    98.1%
Articles (manual evaluation)   1408           1.3%                    95.5%

Figure 4- Number of analyses for each word
According to these results, this stemmer is considered a highly reliable stemmer. We note that the Khoja stemmer's accuracy is about 96% [5].
c) Evaluating the Performance of the Complete Approach:
We compared our approach with two well-known approaches: the Khoja stemmer and the Buckwalter morphological analyzer. We applied the three algorithms to five files of increasing length, chosen long enough that we can ignore the effect of fixed overhead (initialization, reading stored data, opening files, etc.) and thus emulate the IR setting. Our approach's performance significantly outperforms the others (Table 6, Figure 5).
Table 6- Comparing the performance of Buckwalter, Khoja, and our approach

No. of words   Buckwalter (words/sec)   Khoja (words/sec)   Our approach (words/sec)
19000          1496                     3125                15289
78000          2956                     8765                20781
186000         4473                     9672                22111
262000         4969                     9459                22649
350000         5469                     9872                22803
Figure 5- Comparing the performance of Buckwalter, Khoja, and our approach
V. THE BENEFIT OF O-LETTERS DETECTION ALGORITHM
Our tests showed that the O-Letters Detection algorithm produces highly reliable results (about 99%) with very effective performance (55000 words/sec). It is 3 to 10 times faster than any other stemmer.
We can therefore improve the performance and the reliability of other stemmers by plugging this component into them, as follows (a Python sketch follows):
For each word w:
  Apply O-Letters Detection to w.
  If we have 3 O-Letters or more, extract the root directly: it consists of the O-Letters.
  Else, apply the morphological analyzer algorithm.
We can apply the same procedure to light stemmers.
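A minimal sketch of this plug-in procedure, composed from the earlier sketches; `fallback_analyzer` stands for any third-party stemmer or analyzer:

    def analyze(word: str, fallback_analyzer) -> set[str]:
        """Fast path via O-Letters Detection; delegate the hard cases."""
        w = normalize(word)
        codes = transform(apply_position_rules(w, encode(w)))
        o_letters = [ch for ch, c in zip(w, codes) if c == 'O']
        if len(o_letters) >= 3:
            return {''.join(o_letters)}    # root found directly
        return fallback_analyzer(word)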
VI. CONCLUSION AND FUTURE WORK
In this paper, we presented a new approach to morphological analysis. The experiments clearly showed the effectiveness of the approach. The key idea behind this efficiency is the O-Letters Detection algorithm, which reduces the number of false analyses by putting more constraints on the possible analyses.
The first stage of the approach can be considered a light stemmer. We expect this stage to outperform other light stemmers in both performance and reliability; we did not compare it directly to light stemmers, but we did compare it with the Khoja stemmer.
The second stage is divided into several independent steps that provide extra levels of reliability. This stage raises the reliability to a level which outperforms most other morphological analyzers.
Our future work will focus on four points:
- Evaluating our approach in an IR task.
- Finding a better solution for the Ealal problem, which is the main source of wrong solutions in the approach.
- Building IR tools on top of this stemmer.
- Continuing the Arabic language processing pipeline; we plan to build an Arabic POS tagger as a next step.
REFERENCES
[1] M. Aljlayl and O. Frieder. On Arabic Search: Improving the retrieval effectiveness via a light stemming approach. In Proceedings of CIKM'02, VA, USA, 2002.
[2] S. Attar, M. Bawab, and O. Dakkak. Arabic Lexical Database. HIAST, Damascus, 2007.
[3] T. Buckwalter. Buckwalter Arabic Morphological Analyzer Version 1.0. http://www.ldc.upenn.edu/Catalog/CatologEntry.jsp?catologId=LDC2002L49
[4] A. Chen and F. Gey. Building an Arabic Stemmer for Information Retrieval. In TREC 2002.
[5] S. Khoja. APT: An Automatic Arabic Part-of-Speech Tagger. PhD thesis, Lancaster University, Computing Department, 2003.
[6] L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In SIGIR 2002, Tampere, Finland: ACM, 2002.
[7] K. Taghva, R. Elkhoury, and J. Coombs. Arabic Stemming Without A Root Dictionary. 2005.
[8] N. Thabet. Stemming the Quran. 2004.
[10] A. Hassan. Al-Nahw al-Wafi.
[11] S. Al-Afagani. Al-Wajiz fi al-Sarf al-Arabi. http://www.islamguiden.com/arabi/