The 2006 International Arab Conference on Information Technology (ACIT'2006)
Arabic Root Based Stemmer
Mohammed Naji AL-Kabi* and Ronza S. Al-Mustafa**
*Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, mohammednaji@yahoo.com
**Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, ronza_malkawi@yahoo.com
ABSTRACT
This paper presents a new root-based stemming
algorithm for the Arabic language. As in other natural
languages, not all words used in Arabic have
roots: some are borrowed from other languages,
e.g. the word "تلفزيون" (television), and for such words the
stemmer cannot return a correct root because
foreign words have none. The algorithm is based on
affix removal combined with knowledge from structural
linguistics. The implementation and evaluation of the
algorithm show a noticeable improvement in
accuracy relative to previous algorithms.
Keywords: Arabic, Stemming, Root, Negative Suffix,
Negative Prefix, Light Stemming, NLP.
1. INTRODUCTION
The Arabic language is the fifth most widely spoken
language in the world. It belongs to the Semitic family; so
it differs from the Indo-European languages
morphologically, semantically, and syntactically. The
Arabic alphabet contains twenty-eight letters, always
written from right to left in cursive form. Diacritical
marks (harakat, or tashkiil) appear either above or
below the letters, and in many cases play an essential role
in distinguishing, semantically and phonetically, between
two words that have the same characters but
different diacritics. Diacritical marks are used in holy
books, poems, and children's literature; newspapers,
journals and other books for adults are usually printed
without diacritics, which means that many strings are
ambiguous. Most native Arabic words are derived from
verbal roots. Arabized words, on the other hand, are mainly
nouns borrowed from other languages with a slight
phonetic adjustment to suit Arabic pronunciation, and have
no roots [8].
All Arabic words belong to three main categories:
noun, verb or particle. Around 64% of Arabic words are
derived from triliteral verbs (three consonants), but there
are also biliteral verbs (two consonants), quadriliteral
verbs (four consonants), and pentaliteral verbs (five
consonants). Naturally, these verbs are the roots for
which stemming algorithms typically search. This
stemming process excludes words derived from nouns and
particles [9].
A morpheme is the smallest meaningful linguistic unit
that has a semantic interpretation in the grammar of a
language. A stem differs from a root: a
stem is a morpheme, or a set of concatenated
morphemes, that can accept an affix, whereas a root is a
single morpheme that provides the basic meaning of a
word.
Stemming can be useful to information
retrieval systems, text classification systems, text
clustering systems, dictionary automation, text
compression, etc.
A number of authors consider stemming to be a form of
word standardization [12]. Some writers hold that
stemming improves retrieval performance because it
reduces variants of the same root word to a common
concept, and that it also reduces the size of the indexing
structure because the number of distinct index terms is
reduced [3]. Other writers are not convinced of the
benefit of stemming in IR and text mining [3], and
accordingly many search engines do not adopt stemming [3].
Several common types of stemming strategies are
discussed by Frakes: affix removal, table lookup,
successor variety, and n-grams [7]. Affix removal
strategy tries to eliminate the prefixes and suffixes.
The most important part in this strategy is suffix
removal, since most variants of terms are generated by
suffixes.
In Arabic, as in other natural
languages, the stemmer may face the problem of a
negative prefix, where the string that is removed as a prefix
is actually part of the word. If a stemmer
tries to strip "ال", a well-known prefix,
from the following examples, the output will
definitely be wrong: "الله" Allah, "المانيا" Germany,
"الوية" Brigades, "البانيا" Albania, etc. The problem also
affects other prefixes, such as the frequently used
conjunction "و" (and): stripping "و"
from "وفاء" honesty leads to a wrong stem.
The negative prefix problem in Arabic
stemmers is not restricted to the "ال" and "و" prefixes;
it also includes other prefixes such as "لـ", "فالـ", etc.
Light stemming of the term "والي" Governor, for example,
gives a wrong result if the
prefix "والـ" is stripped from the term. The same holds for
the words "كالح" glum, "الله" Allah, and " "
successful, if we strip from them the prefixes "كالـ",
"ـلـ", and "لـ" respectively. Similarly, Arabic stemmers
face the problem of a negative suffix, where the
string that is removed as a suffix is actually part of the
word. If a stemmer tries to strip
"ان", a well-known suffix, from the following
examples, the output will definitely be wrong: "لعمان"
To Amman, "اليابان" Japan, etc. Table 5 in the Appendix
illustrates a number of examples.
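The negative prefix and negative suffix problems can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `strip_prefix` and `strip_suffix` are hypothetical and are not part of the stemmer described in this paper; the words are the examples quoted above and in Table 5.

```python
def strip_prefix(word: str, prefix: str) -> str:
    """Remove `prefix` unconditionally whenever the word starts with it."""
    return word[len(prefix):] if word.startswith(prefix) else word


def strip_suffix(word: str, suffix: str) -> str:
    """Remove `suffix` unconditionally whenever the word ends with it."""
    return word[:-len(suffix)] if word.endswith(suffix) else word


# In each case the stripped string is part of the word itself,
# so the remaining letters are not a valid stem.
print(strip_prefix("وفاء", "و"))      # the initial waw is a root letter here
print(strip_prefix("والي", "وال"))    # only a single letter is left
print(strip_suffix("لعمان", "ان"))    # "ان" belongs to the name Amman
print(strip_suffix("اليابان", "ان"))  # "ان" belongs to the name Japan
```

Because the helpers only look at the surface string, they cannot tell a real affix from letters that happen to look like one, which is the failure the text describes.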
Table lookup is the simplest strategy among the four;
it simply looks up the root of the term in a lookup table.
The performance of this strategy depends heavily on the
number of words (terms) and roots stored in the table: as
the lookup table gets larger, the performance gets higher too.
Large lookup tables might need considerable storage
space. Successor variety is not as straightforward as the
others; it depends on algorithms that are based on
structural linguistics and attempt to determine morpheme
boundaries. N-gram stemming searches for digrams,
trigrams, or longer runs of successive letters that terms
share. This strategy is a term clustering procedure rather
than a stemming procedure.
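To make the n-gram strategy concrete, the sketch below compares two terms by their shared digrams using Dice's coefficient, which is the usual similarity measure for digram-based term clustering. The function names are illustrative assumptions, and the English word pair is chosen only because it is easy to check by hand.

```python
def digrams(term: str) -> set:
    """Return the set of unique pairs of successive letters in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice's coefficient: 2 * shared digrams / (digrams in a + digrams in b)."""
    da, db = digrams(a), digrams(b)
    if not da and not db:
        return 0.0
    return 2 * len(da & db) / (len(da) + len(db))


# "statistics" has 7 unique digrams, "statistical" has 8, and 6 are shared.
print(dice_similarity("statistics", "statistical"))
```

Terms whose similarity exceeds a chosen threshold are grouped together; no stem is ever produced, which is why the strategy is better seen as clustering than as stemming.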
The above two problems (negative prefix and negative
suffix) of Arabic stemmers lead to a wrong grammatical
root, so the accuracy of IR and text mining systems that
rely on these stemmers deteriorates.
The two main problems of stemming have been
described by Chris D. Paice [12]. In the first place, pairs
of etymologically related words sometimes differ sharply
in meaning [12]; consider, for example, "سل" ask, "سلب"
stole, and "سلام" Peace. In the second place, the
transformations involved in adding and removing suffixes
involve numerous irregularities and special cases [12].
Stemming errors are of two kinds: understemming errors,
in which words which refer to the same concept are not
reduced to the same stem, and overstemming errors, in
which words are converted to the same stem even though
they refer to distinct concepts. In designing a stemming
algorithm there is a trade-off between these two kinds of
error.
A light stemmer plays safe in order to avoid
overstemming errors, but consequently leaves many
understemming errors. A heavy stemmer boldly removes
all sorts of endings, some of which are decidedly unsafe,
and therefore commits many overstemming errors [12].
Shereen Khoja described the problems that an Arabic
stemmer might face [9]:
- "If the root contains a weak letter (e.g. "أ" alif, "و"
waw or "ي" yaa), the form of this letter may change
during derivation. To deal with this, the stemmer must
check to see if the weak letter is in the correct form. If
not, the stemmer produces the correct form of this weak
letter, which then gives the correct form of the root."
A triliteral verb in which one of the three root letters is
"أ" alif hamza (a), "و" waw (w) or "ي" yaa (y) is
defined as a weak verb, e.g. " " gave, "وجد"
found, "وضع" put, "وقف" stood, "وعد" promised, "باع"
sold, "جاء" came, "قرأ" read. Weak verbs also include
triliteral verbs in which the second letter is doubled
with a shadda, e.g. "شمّر" prepared. The shadda (gemination
mark, tashdeed) is written above the consonant that is
doubled, and it looks like a small "w". A strong verb is a
triliteral verb that does not contain any of the
three weak letters above.
- "Some words do not have roots. For example
the Arabic equivalents of "نحن" we, "بعد"
after, "تحت" under and so on. If the stemmer
comes across any of these words, it does
nothing."
- "Sometimes a root letter is deleted during
derivation. This is especially true of roots
that have duplicate letters (e.g. the last two
letters are the same), e.g., "دجّج" get dressed,
"دلّل" dandle, "خلّل" souse, "علّل" explained,
"قلّل" reduced, "بلل" wet, etc. The stemmer
can detect this, and return the letter that was
removed."
- "If a root contains a hamza, this
hamza could change form during derivation,
e.g., " " talk, " " stand up, etc. The
stemmer detects this, and returns the original
form of this hamza."
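The distinction between weak and strong triliteral verbs drawn above can be expressed as a small predicate. The sketch below is an illustration only: the function name `is_weak_verb` is hypothetical, and it checks just the three weak letters and the shadda named in the text, so a surface form such as "باع", where the weak letter is written as a bare alif, is outside what it can detect.

```python
WEAK_LETTERS = {"أ", "و", "ي"}   # alif hamza, waw, yaa, as listed in the text
SHADDA = "\u0651"                 # gemination mark (tashdeed)


def is_weak_verb(verb: str) -> bool:
    """True if the verb contains a weak letter or a doubled (shadda) letter."""
    if SHADDA in verb:
        return True
    return any(letter in WEAK_LETTERS for letter in verb)


print(is_weak_verb("وجد"))                 # weak: begins with waw
print(is_weak_verb("قرأ"))                 # weak: ends with alif hamza
print(is_weak_verb("شم" + SHADDA + "ر"))   # weak: doubled middle letter
print(is_weak_verb("كتب"))                 # strong: none of the above
```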
L. S. Larkey and M. E. Connell [11] conducted a
good study based on a modified version of Shereen
Khoja's stemmer. The modified version includes a few
changes to enhance the accuracy of the stemmer.
These changes are summarized as follows:
- If a root is not found, the normalized form
is returned, rather than the
original unmodified word.
- A list of place names is treated as
"unbreakable" words exempt from stemming.
- In addition to the Arabic stop word list
included in the Khoja stemmer, a script was used
to remove stop phrases.
- A light stemmer is used to strip definite
articles (فال، كال، بال، وال، ال، and و) from
the beginnings of normalized words and to
strip 10 suffixes from the ends of words (ات،
ان، ها، ي، ة، ه، ية، يه، ين، and ون).
Table 5 in the Appendix shows that light
stemming leads to wrong results if it is carried out
unconditionally, so we record our reservation about the
last step. Larkey and Connell's stemmer seems to be
better than its parent (the Khoja stemmer).
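The light stemming step can be sketched as follows, using the prefix and suffix lists quoted above. This is not Larkey and Connell's implementation: the function name `light_stem`, the rule of removing at most one prefix and one suffix, and the longest-match order are assumptions made for illustration, and no length conditions are applied.

```python
PREFIXES = ["فال", "كال", "بال", "وال", "ال", "و"]
SUFFIXES = ["ات", "ان", "ها", "ي", "ة", "ه", "ية", "يه", "ين", "ون"]


def light_stem(word: str) -> str:
    """Strip one prefix and one suffix, trying the longest candidates first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("المعلمون"))  # article and plural ending are real affixes
print(light_stem("البركات"))   # plural ending removed as intended
print(light_stem("اليابان"))   # "ان" is part of the name, so the result is wrong
```

The last call shows the reservation recorded in the text: without a condition that protects words such as place names, unconditional stripping removes letters that belong to the word.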
Morphology is the branch of linguistics that
studies the internal structure of
word forms. Semitic languages have a complex
morphology, and so Arabic is a complex
language for stemming. Arabic stemmers have to deal
with affixes (prefixes, infixes, and suffixes), in
addition to diacritical marks (harakat), in order to get
the right root with its appropriate diacritical marks.
Furthermore, an Arabic stemmer has to deal with
Arabized words (foreign words), which have no root and
therefore have to be excluded from stemming.
This study uses morphological patterns to obtain the
triliteral and quadriliteral roots. The algorithm
extracts the root only when the infix letters of a pattern
match the letters at the same positions in the word.
Shereen Khoja is a pioneer in this field, but
unfortunately we could not obtain her original work,
"Stemming Arabic Text", written with her colleague Roger
Garside. Leah S. Larkey, Margaret E. Connell and
others headed a team at the University of Massachusetts,
Amherst that conducted a number of studies building
on Khoja's work. Their work [10] [11] improves on
Khoja's stemmer, but it does not solve the problems
of the negative prefix and negative suffix discussed
above. Al-Kharashi and Al-Sughaiyer [2] present a pattern-based
stemmer for Arabic; Taghva et al. [13]
use the same approach, which differs from Khoja's,
with equivalent performance. Pattern-based stemming
does not use a root dictionary; it
matches the word against a number of Arabic patterns to
extract the root. Chen and Gey [4] conducted a study to
find Arabic roots using a machine translation (MT) based
stemmer. Although this study relies on the Ajeeb machine
translation system, stop word removal, clustering, light
stemming, and morphological analysis, it does not
present a solution to the problems of the negative prefix and
negative suffix. Kareem Darwish [5] shows how to extract
a root from a word by first removing the prefix and
suffix to get a stem, then matching the stem against a
number of templates to get the root. That study
does not state how many templates are used in
the comparisons, nor does it give an algorithm.
Darwish and Oard [6] use an approach similar
to the previous one [5], but give more details about the
prefixes and suffixes being removed. Table 6 shows the
patterns used within our algorithm.
2. THE ALGORITHM
The first step of the Arabic rooter under study is to
normalize the text. Afterward, the stem is matched
against the verbal and noun patterns in
order to obtain the root. To conduct this study, a system
(stemmer) that finds Arabic roots was built using Visual
Basic 6.0. The stemmer keeps a word unchanged if it
fails to find a root, which is the normal case when the
stem is an Arabized word or when it is the name
of a place, such as a continent, region, country, state,
district, city, village, river, mountain, desert, etc.
The gemination mark (tashdeed) ( ّ ), or
"shaddah", is placed above a consonant
letter as a sign of the duplication of that
consonant.
Let T(i) be any term.
Let LenT(i) be the length of each term.
Let n be the number of terms within a document.
Let chr(i) be the character at position i within a term.
Let LenP(j) be the length of pattern j.
Let Infixes_String be a string, generated manually for
each pattern, that consists of the affix letters of that
pattern; e.g., the stem "مسابح" (swimming pools)
matches the pattern "مفاعل", so the
Infixes_String in this case is the
string "ما", where "م" lies in the first
position and "ا" lies in the third
position.
Let T_String be the string formed by the letters of the
word that occupy the same positions as the letters of
Infixes_String in the pattern. To clarify the
idea, suppose we want to find the
root of the stem "مسابح" (swimming
pools). The system has to check this
word against all five-character patterns.
One of these patterns is "تفعيل", for which the
Infixes_String is "تي"
and the T_String is "مب"; the
mismatch is obvious in this case.
When matching the stem against the
pattern "مفاعل", the Infixes_String and the
T_String are both "ما".
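The comparison between Infixes_String and T_String can be sketched in Python as below. The table `PATTERN_INFIX_POSITIONS` and the function names are hypothetical; the table holds only the two patterns used in the example above, with their infix positions entered by hand (1-based), in the same spirit as the manually generated Infixes_String.

```python
# Pattern -> 1-based positions of its infix (non-root) letters.
PATTERN_INFIX_POSITIONS = {
    "مفاعل": (1, 3),
    "تفعيل": (1, 4),
}


def letters_at(text: str, positions) -> str:
    """Concatenate the letters of `text` found at the given 1-based positions."""
    return "".join(text[p - 1] for p in positions)


def match_pattern(stem: str, pattern: str):
    """Return the root if T_String equals Infixes_String, otherwise None."""
    positions = PATTERN_INFIX_POSITIONS[pattern]
    if len(stem) != len(pattern):
        return None
    infixes_string = letters_at(pattern, positions)
    t_string = letters_at(stem, positions)
    if t_string != infixes_string:
        return None
    # The root is whatever remains once the infix positions are removed.
    return "".join(ch for i, ch in enumerate(stem, start=1) if i not in positions)


print(match_pattern("مسابح", "تفعيل"))  # None: T_String "مب" differs from "تي"
print(match_pattern("مسابح", "مفاعل"))  # both strings are "ما", so a root is returned
```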
Table 1 shows how to get Infixes_String for each
of the patterns used.
Table 1: An example of patterns and their infixes,
and the position of each infix
Pattern | Infixes_String | Infix : infix position
فعال | ا | ا : 3
مفعول | مو | م : 1، و : 4
يستفعلن | يستن | ي : 1، س : 2، ت : 3، ن : 7
1. Stop word removal, depending on a list of
1,281 stop words consisting of prepositions,
pronouns, articles and conjunctions.
2. Normalization
2.1 Remove the tatweel (kasheeda) symbol ("ـ")
2.2 Remove punctuation using a list of
punctuation characters
2.3 Remove diacritics using a list of
diacritic characters
3. If LenT(i) ≥ 5 and T(i) begins with (ال or لل) then
Remove the initial definite article (ال، لل)
Else if LenT(i) ≥ 6 and T(i) begins with (بال، فال or كال) then
Remove the initial definite article (بال، فال، كال)
End if
4. If LenT(i) > 4 and T(i) ends with
"اء" then
Replace final "اء" with "ي"
End if
5. Replace initial ( إ ) or ( أ ) with bare alif ( ا )
6. Replace initial ( آ ) with bare alif ( ا )
7. Replace final ( ة ) with ( ه )
8. Replace final ( ى ) with ( ي )
9. For i ← 1 to n do
9.1 If LenT(i) = 3 then
9.1.1 If T(i) ends with the gemination mark (tashdeed)
( ّ ) then Root(T(i)) = chr(1) & chr(2) & chr(2)
Else Root(T(i)) = T(i)
End if
End if
9.2 If LenT(i) ≥ 4 then
9.2.1 For j ← 1 to the number of patterns of length
LenT(i) do
9.2.1.1 If T_String matches Infixes_String
then
9.2.1.1.1 Remove the infix characters
from T(i)
9.2.1.1.2 Replace "ئ" or "ؤ" with "أ"
9.2.1.1.3 Replace "ء" or "ى" with "ي"
9.2.1.1.4 Return Root(T(i))
End if
Next j
9.2.2 If no pattern matched, return the normalized term
End if
Next i
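The normalization and root extraction steps listed above can be sketched in Python as follows. This is not the authors' Visual Basic 6.0 system: the names `normalize`, `extract_root`, and `PATTERNS` are hypothetical, the three patterns are a tiny illustrative subset chosen to agree with the traces in Table 2, stop word removal (step 1) is omitted, and the shadda is deliberately kept during diacritic removal (an assumption) so that step 9.1.1 can still see it.

```python
import re
import string

TATWEEL = "\u0640"
SHADDA = "\u0651"
# Tanween, fatha, damma, kasra and sukun; shadda (U+0651) is kept on purpose.
DIACRITICS = re.compile("[\u064B-\u0650\u0652]")
PUNCTUATION = set(string.punctuation) | {"،", "؛", "؟"}

# Hypothetical subset: pattern -> 1-based positions of its infix letters.
PATTERNS = {
    "مفاعل": (1, 3),
    "مفعله": (1, 5),
    "مفعلين": (1, 5, 6),
}


def normalize(term: str) -> str:
    """Steps 2 to 8 of the listing above."""
    term = term.replace(TATWEEL, "")
    term = "".join(ch for ch in term if ch not in PUNCTUATION)
    term = DIACRITICS.sub("", term)
    # Step 3: conditional removal of the definite article.
    if len(term) >= 5 and term.startswith(("ال", "لل")):
        term = term[2:]
    elif len(term) >= 6 and term.startswith(("بال", "فال", "كال")):
        term = term[3:]
    # Step 4: final alif + hamza becomes yaa.
    if len(term) > 4 and term.endswith("اء"):
        term = term[:-2] + "ي"
    # Steps 5 and 6: initial hamza forms become bare alif.
    if term[:1] in ("إ", "أ", "آ"):
        term = "ا" + term[1:]
    # Steps 7 and 8: final taa marbuta and alif maqsura.
    if term.endswith("ة"):
        term = term[:-1] + "ه"
    if term.endswith("ى"):
        term = term[:-1] + "ي"
    return term


def extract_root(term: str) -> str:
    """Step 9: return the root, or the normalized term if nothing matches."""
    if len(term) == 3:
        if term.endswith(SHADDA):
            return term[0] + term[1] + term[1]
        return term
    for pattern, positions in PATTERNS.items():
        if len(pattern) != len(term):
            continue
        if all(term[p - 1] == pattern[p - 1] for p in positions):
            root = "".join(ch for i, ch in enumerate(term, 1) if i not in positions)
            root = root.replace("ئ", "أ").replace("ؤ", "أ")
            return root.replace("ء", "ي").replace("ى", "ي")
    return term


for word in ["المدارس", "بالمكتبة", "للمعلمين", "تلفزيون"]:
    print(word, "->", extract_root(normalize(word)))
```

The last word, a borrowed one, matches no pattern and is returned unchanged, which mirrors the behaviour described for Arabized words.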
3. EVALUATION
In order to test the accuracy of our algorithm, we selected
a number of words randomly. Table 2 shows a manual
trace of the execution of the above algorithm when
extracting the roots of the selected terms.
Table 3 shows the strengths and weaknesses of the
above algorithm, using a small data set containing 1,827
words. The system failed to analyze 55 words because their
patterns are unknown; this failure is mostly due to foreign
(Arabized) words. The system analyzed the remaining
1,772 words, and the accuracy of
extracting the right roots was 91% (Table 3).
Table 2. Trace of the manual extraction of the
correct root.
Original word T(i) | Normalized T(i) (stem) | T_String | Root(T(i)) | Status
التعليمات | تعليمات | تيات | علم | Right
الميزان | ميزان | ان | ميز | Wrong
الإستثمارية | استثماريه | استايه | ثمر | Right
للمعلمين | معلمين | مين | علم | Right
الإسترحام | استرحام | استا | رحم | Right
سيتركانها | سيتركانها | سيتاها | ترك | Right
للمرشدين | مرشدين | مين | رشد | Right
مدّ | مدّ | - | مدد | Right
ميزان | ميزان | ان | ميز | Wrong
تسائلوا | تسائلوا | تاوا | سأل | Right
المدارس | مدارس | ما | درس | Right
كريم | كريم | ي | كرم | Right
بالمكتبة | مكتبه | مه | كتب | Right
الطائر | طاأر | ا | طأر | Wrong
يستجيبون | يستجيبون | يستون | جيب | Wrong
محيطها | محيطها | ها | حيط | Wrong
Table 3. Accuracy of root extraction for three Arabic
text files
Number of roots extracted correctly | Number of incorrect roots | Words not analyzed | Number of words
130 (87.2%) | 16 (10.8%) | 3 (2%) | 147
215 (87.4%) | 24 (9.8%) | 7 (2.8%) | 244
527 (91%) | 33 (5.7%) | 19 (3.3%) | 579
791 (92.4%) | 39 (4.6%) | 26 (3%) | 857
1663 (91%) | 112 (6.1%) | 55 (3%) | 1827
Figure 1. Statistics for root extraction
Table 4 shows the precision, recall and the
harmonic mean (F-measure), computed with the
following formulas:
\[ \text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Incorrect}} \qquad (1) \]
\[ \text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{UnAnalyzed}} \qquad (2) \]
\[ F = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (3) \]
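As a worked check of the evaluation measures, the sketch below assumes that precision is Correct / (Correct + Incorrect), recall is Correct / (Correct + UnAnalyzed), and the F-measure is their harmonic mean. The function names are illustrative, and the input counts are those of the 579-word file in Table 3 (527 correct, 33 incorrect, 19 not analyzed).

```python
def precision(correct: int, incorrect: int) -> float:
    """Share of analyzed words whose extracted root is correct."""
    return correct / (correct + incorrect)


def recall(correct: int, unanalyzed: int) -> float:
    """Share of correct roots relative to correct plus unanalyzed words."""
    return correct / (correct + unanalyzed)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


p = precision(527, 33)
r = recall(527, 19)
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))
```

Rounded to four decimals, these values agree with the 579-word row reported in Table 4 (precision 0.9411, recall 0.9652, F-measure 0.9530).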
Table 4 shows that the system obtains about 92%
overall precision for the analyzed words. Note that words
that do not match any of the verbal and noun patterns
listed in Table 6 were excluded from the
computation of the accuracy measures, because these
words are foreign words.
Table 4. Accuracy of root extraction for three Arabic
text files
F-measure | Precision (accuracy of analyzed words) | Recall | Number of words
0.9309 | 0.8889 | 0.9771 | 147
0.9322 | 0.8987 | 0.9682 | 244
0.9530 | 0.9411 | 0.9652 | 579
0.9606 | 0.9531 | 0.9682 | 857
0.9442 | 0.9204 | 0.9697 | 1827
4. CONCLUSIONS
In order to increase the accuracy of the system and to
reduce the probability of facing the problems of the negative
suffix and negative prefix, the system does not remove the
prefixes ("فـ"، "و"، "لـ"، "ب") or the suffix ("ـه").
Furthermore, the system uses conditional removal:
when the term length is six or more, the system
removes the prefixes ("وال"، "بال"، "كال"، "فال");
when the term length is less than six, the term
is left unchanged.
As mentioned by Thabet [14], a root-based algorithm
increases word ambiguity, because many word variants
have different meanings, and this affects the accuracy
of IR, text mining, and other systems that rely on root-based
stemmers. Table 5 presents a number of ambiguous cases;
one of these is the term " ", which a reader can interpret
as parents, religion, or debt, since the word is
bare of diacritics and stands on its own rather than within a
sentence. As noted above, diacritics are used to distinguish
words semantically and phonetically.
Arabic stemmers can be used to enhance the
efficiency of a number of systems, such as spell checkers,
information retrieval systems, text mining systems, text
analysis systems, compression systems, etc.
This algorithm is incapable of extracting the Arabic roots
of some imperative verbs that are made up of
one Arabic letter although their roots have three
letters (triliteral verbs), e.g., "عِ", whose root is
"وعي". In addition, the problem of defective roots (weak
roots) is still not solved by this algorithm. Defective roots
are roots that contain vowels ("أ"، "و"، "ي"); they are
classified as irregular roots, since some vowels in these
roots are altered to other vowels or removed in the
derivational process [1], e.g., "رما" and "رمي": these two
words have the same meaning, throw, and both
represent the same root. As future research, we hope to
solve these problems in our next enhancement of
this work.
REFERENCES
[1] Aljlayl, M. and Frieder, O. "On Arabic Search:
Improving the Retrieval Effectiveness via a
Light Stemming Approach". In CIKM'02,
November 4-9, 2002, McLean, Virginia,
USA, pages 340-347. ACM 1-58113-492-4/02/0011.
[2] Al-Kharashi, I.A. and Al-Sughaiyer, I.A.
(2002). "Pattern-based Arabic stemmer". In
Proceedings of the 2nd Saudi Technical
Conference and Exhibition (STCEX2002),
Volume II, pages 238-244, Riyadh, Saudi
Arabia.
[3] Baeza-Yates, R. and Ribeiro-Neto, B. Modern
Information Retrieval. Addison Wesley,
1999.
[4] Chen, A. and Gey, F. 2002. "Building an
Arabic stemmer for information retrieval". In
Proceedings of the Eleventh Text REtrieval
Conference (TREC 2002), National Institute
of Standards and Technology, November.
[5] Darwish, K. 2002. "Building a shallow Arabic
Morphological Analyzer in one day". In
Proceedings of the ACL-02 Workshop on
Computational Approaches to Semitic
Languages, Association for Computational
Linguistics, July.
[6] Darwish, K. and Oard, D. "CLIR Experiments
at Maryland for TREC 2002: Evidence
Combination for Arabic-English Retrieval".
In TREC 2002, Gaithersburg, MD.
[7] Frakes, W.B. "Introduction to Information
Storage and Retrieval Systems", chapter 1,
pages 1-12. Prentice-Hall, 1992.
[8] Kanaan, G., Al-Shalabi, R., AL-Kabi, M.N.,
Jaam, J.M. and Hasnah, A. 2004. "New
Approach for Extracting
Quadriliteral/Quadrilateral Arabic Roots". In
Proceedings of the 1st International Conference
on Information & Communication
Technologies: from Theory to Applications,
ICTTA'04 (Damascus, Syria, April 2004).
IEEE-France.
[9] Khoja, S. Research Interests. Pacific
University, 2043 College Way, Forest Grove,
Oregon 97116,
http://zeus.cs.pacificu.edu/shereen/research.htm,
July 8, 2006.
[10] Larkey, L., Ballesteros, L. and Connell, M.
"Improving Stemming for Arabic Information
Retrieval: Light Stemming and Co-occurrence
Analysis". SIGIR 2002: 275-282, 2002.
[11] Larkey, L.S. and Connell, M.E. "Arabic
information retrieval at UMass in TREC-10". In
TREC 2001.
[12] Paice, C.D. "An evaluation method for stemming
algorithms". In W.B. Croft and C.J. van
Rijsbergen, editors, Proceedings of the 17th
Annual International ACM SIGIR Conference on
Research and Development in Information
Retrieval, pages 69-90. Springer-Verlag, July
1994.
[13] Taghva, K., Elkoury, R. and Coombs, J.
"Arabic Stemming without a root
dictionary". 2005.
www.isri.unlv.edu/publications/isripub/Taghva2005b.pdf
[14] Thabet, N. (2004). "Stemming the Qur'an".
In Proceedings of the Arabic Script-Based
Languages Workshop, COLING-04,
Switzerland, August 2004.
Appendix A:
Table 5: The problem of negative prefixes and
negative suffixes
Full word | Removing the suffix ات | Full word | Removing the suffix ان | Full word | Removing the suffix ون | Full word | Removing the suffix ين
البركات | البرك | الأمان | الأم | بالعون | بالع | الأمين | الأم
التعليمات | التعليم | الإنسان | الإنس | البالون | البال | التامين | التام
الثورات | الثور | الأوان | الأو | بطون | بط | تحسين | تحس
الجماعات | الجماع | الأوطان | الأوط | بلون | بل | حنين | حن
الحملات | الحمل | بركان | برك | التعاون | التعا | الدين | الد
الدورات | الدور | الجِنان | الجِن | الحسون | الحس | الذين | الذ
دوريات | دوري | الحنان | الحن | حنون | حن | سجين | سج
الذات | الذ | خلجان | خلج | الستون | الست | سكين | سك
السلطات | السلط | الريان | الري | سكون | سك | سنتين | سنت
السنوات | السنو | الريحان | الريح | صابون | صاب | سنين | سن
السياسات | السياس | الضمان | الضم | العيون | العي | عين | ع
الشركات | الشرك | عجمان | عجم | قرون | قر | قوانين | قوان
طبقات | طبق | عنوان | عنو | كانون | كان | كدين | كد
القوات | القو | لبنان | لبن | مرهون | مره | لين | ل
لجأت | لج | لعمان | لعم | المليون | الملي | متين | مت
لذوات | لذو | للبنان | للبن | الهرمون | الهرم | مدللين | مدلل
للهواة | للهو | مرجان | مرج | يدرون | يدر | مسكين | مسك
لنزلات | لنزل | الميزان | الميز | يَصِلون، يُصلون | يصل | المعلقين | المعلق
مداخلات | مداخل | نيسان | نيس | مضمون | مضم | معين | مع
النقاشات | النقاش | الهوان | الهو | مسكون | مسك | يمين | يم
وذرات | وذر | اليابان | الياب | مفتون | مفت | |
Table 6: Verbal and noun patterns used within the algorithm
Length | Patterns used
Length 3 patterns |
Length 4 patterns | ل، فتعل
Length 5 patterns | مفتعل، مفعيل، بفعال، لف
Length 6 patterns | بتفعيل، لتفعيل، فع
Length 7 patterns | مست، لفع
Length 8 patterns | لفاعلهما، بفعلتكما، لفعلتكما، بفعلت، فاعلتهما، فاعلتكما
Length 9 patterns | سيفعلانها، ستفعلانها
... The root of word is unanalyzable morphology, but stem is consisting of the 'root' with a slight modification [11]. Generally, stemming techniques can be classified into root-based, light, and lookup table stemming [12]. The root-based stemming technique [13], also called aggressive stemming, constructs a predefined list of all known roots, it removes the prefixes and suffixes from words and compares the result with the list of roots to find the exact root. ...
... ‫,اخ‬ ‫,وا‬ ‫,ذا‬ ‫,ون‬ ‫,وه‬ ‫,ان‬ ‫.)ھ)‬ The look-up table stemming technique [12] builds a table of words attached to roots. Then it searches by the word to find the exact root. ...
Article
Full-text available
ABSTRACT In recent years, there are massive numbers of users who share their contents over wide range of social networks. Thus, a huge volume of electronic data is available on the Internet containing the users’ thoughts, attitudes, views and opinions towards certain products, events, news or any interesting topics. Therefore, sentiment analysis becomes a desirable topic in order to automate the process of extracting the user’s opinions. One of the widely content sharing languages over the social network is Arabic Language. However Arabic language has several obstacles that make the sentiment analysis a challenging problem. Most users share their contents in informal Arabic. Additionally, there are lots of different Arabic dialects. Hence, Arabic sentiment analysis researches is developed slowly compared to other languages such as English. This paper proposes a new hybrid lexicon approach for Arabic sentiment analysis that combines in the same framework both unsupervised and supervised technique. In the unsupervised phase, the polarity of data is extracted by means of Look-up table stemming technique. In the supervised phase, we use the data of the true classified polarity from the unsupervised phase to generate and train a classifier for the further classification of the unclassified data. We test and evaluate the proposed approach using MIKA corpus [1]. The results show that the proposed approach gives better results
... For example, the words playing and played will be reduced to play. However, there is a danger of over-stemming and under-stemming [108]. Over-stemming occurs when two different words are converted to the same stem (e.g., "universal" and "university" are converted to "universe"), whereas under-stemming errors occur when words of the same concept are stemmed to different roots (e.g., the words "data" and "datum" to "dat" and "datu," respectively). ...
Article
Full-text available
The ever-increasing number of Internet users and online services, such as Amazon, Twitter and Facebook has rapidly motivated people to not just transact using the Internet but to also voice their opinions about products, services, policies, etc. Sentiment analysis is a field of study to extract and analyze public views and opinions. However, current research within this field mainly focuses on building systems and resources using the English language. The primary objective of this study is to examine existing research in building sentiment lexicon systems and to classify the methods with respect to non-English datasets. Additionally, the study also reviewed the tools used to build sentiment lexicons for non-English languages, ranging from those using machine translation to graph-based methods. Shortcomings are highlighted with the approaches along with recommendations to improve the performance of each approach and areas for further study and research.
... For this purpose, input is given, and its corresponding output is checked manually. Many statistical and rule-based stemmers are evaluated manually such as Ali et al. (2019) Urdu stemmer, Al-Kabi et al. (2015) Arabic stemmer and Persian stemmer (Taghi-Zadeh et al. 2015). The accuracy of the stemmer by gold standard assessment can be calculated as given in Eq . ...
Article
Full-text available
Text stemming is one of the basic preprocessing step for Natural Language Processing applications which is used to transform different word forms into a standard root form. For Arabic script based languages, adequate analysis of text by stemmers is a challenging task due to large number of ambigious structures of the language. In literature, multiple performance evaluation metrics exist for stemmers, each describing the performance from particular aspect. In this work, we review and analyze the text stemming evaluation methods in order to devise criteria for better measurement of stemmer performance. Role of different aspects of stemmer performance measurement like main features, merits and shortcomings are discussed using a resource scarce language i.e. Urdu. Through our experiments we conclude that the current evaluation metrics can only measure an average conflation of words regardless of the correctness of the stem. Moreover, some evaluation metrics favor some type of languages only. None of the existing evaluation metrics can perfectly measure the stemmer performance for all kind of languages. This study will help researchers to evaluate their stemmer using right methods.
... Although recent studies [3,4,7,9] show that lemmatization is the suitable way to enhance the performances and the efficiency of many ANLP applications, very often NLP systems make use of root-based or stem-based stemming [10,11,12] to cluster words derived from the same stem or root. From an efficiency point of view, relying on a root-based stemming, a NLP algorithm may yield both relevant and not relevant information in their process. ...
Chapter
Lemmatization is a key preprocessing step and an important component for many natural language applications. For Arabic language, lemmatization is a complex task due to Arabic morphology richness. In this paper, we present a new lemmatizer that combines a lexicon-based approach with a machine-learning-based approach to get the lemma solution. The lexicon-based step provides a context-free lemmatization and the most appropriate lemma according to the sentence context is detected using the Hidden Markov Model. The developed lemmatizer evaluations yield to over than 91% of accuracy. This achievement outperforms the state of the art Arabic lemmatizers.
... However, there are several algorithms can simplify extracting roots. These algorithms follow some rules for removing prefixes and suffixes to produce proper stemming, such as the AlKabi [39], Ghawanmeh [40], Hmeidi [41] , Khoja [42] and WSS-Based algorithms [37]. ...
Article
Full-text available
Harvesting Twitter for insight and meaning in what is called sentiment analysis (SA) is a major trend stemming from computational linguistics and AI. Industry and academia are interested in maximizing efficiency while mining text to attain the most currently available data and crowdsourcing opinions. In this study, we present the ATAM model for traffic analysis using the data available on Twitter. The model comprises five components that start with data streaming and collection and ends with the road incident prediction through classification. The classification of data is done using a lexicon-based method. The predicted classes are as follows: safe, needs attention, dangerous, and neutral. The data were collected for three months in the city of Riyadh, Saudi Arabia. The model was applied on 10k tweets with an overall accuracy of the model classifying all four classes of 82%.
... Another work, the Al-Kabi and Al-Mustafa algorithm [13], is based on affix removal. They tested their algorithm on small data sets containing 1,827 words. ...
Article
Full-text available
Non-vocalized Arabic words are ambiguous because they may have different meanings, and therefore may have more than one root. Many Arabic root extraction algorithms have been developed to extract the roots of non-vocalized Arabic words. However, most of them return only one root and produce lower accuracy than reported when they are tested on different datasets. An Arabic root extraction algorithm is urgently needed for applications such as information retrieval systems, indexing, text mining, text classification, data compression, spell checking, text summarization, question answering systems and machine translation. In this work, a new rule-based Arabic root extraction algorithm is developed that focuses on overcoming the limitations of previous works. The proposed algorithm is compared to the algorithm of Khoja, a well-known Arabic root extraction algorithm that produces high accuracy. Testing was conducted on the corpus of Thalji, which was built mainly to test and compare Arabic root extraction algorithms. It contains 720,000 word-root pairs from 12,000 roots, 430 prefixes, 320 suffixes, and 4,320 patterns. The experimental results show that the algorithm of Khoja achieved 63% accuracy, while the proposed algorithm achieved 94%.
... In [9], the authors compared four root extraction stemmers, namely Al-Mustafa [18], Taghva [19], Al-Sarhan [20] and Rabab'ah [21]. Another comparative work [10] studied the accuracy of six recent and/or popular Arabic root finding algorithms that have success rates greater than 90%. ...
Chapter
Using either stems or roots as index terms considerably improves the performance of Arabic Information Retrieval (IR) systems compared to indexing surface words. Many comparative works have tried to determine which of these two indexing approaches is better, but so far neither method has clearly outperformed the other. Each of the two index types performed better under different test circumstances in terms of recall and precision. In this paper, the authors propose a hybrid approach that combines the two indexing units so as to take advantage of both and overcome their shortcomings. Then, based on several combining techniques, the authors assign a weight to each indexing unit and try to find the best weighting values.
... Standard measures such as precision and recall are used to evaluate the effectiveness of the proposed hybrid technique that is used to extract Arabic word roots. Therefore, the following formulas are used to compute each of the above three measures [22]: ...
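The formulas themselves are elided in the excerpt above. As a rough illustration only, the following minimal sketch computes the textbook set-based definitions of precision and recall; it assumes these standard definitions and is not necessarily the exact formulation used in the citing paper. The function name `precision_recall` and the sample root labels are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Textbook set-based precision and recall (assumed definitions).

    retrieved: items the system returned (e.g. roots it extracted)
    relevant:  items that are actually correct (e.g. gold-standard roots)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of returned items that are correct.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of correct items that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 items returned, 3 items correct, 2 in common.
p, r = precision_recall(["r1", "r2", "r3", "r4"], ["r1", "r2", "r5"])
print(p, r)
```

Here two of the four returned items are correct, and two of the three correct items were found.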
Article
Full-text available
Root extraction is one of the main text operations; it reduces a word form to its root. This process aims to overcome the morphological richness problem of the Arabic language. Root extraction gives valuable support to many natural language processing applications such as information retrieval, machine translation, and text summarization. In this research, a hybrid technique to extract Arabic word roots has been developed. The proposed technique depends on an optimization function, an enhancement step that applies a set of non-morphological rules to improve the n-gram technique. The proposed technique is tested using a dataset containing more than 6000 distinct words belonging to 141 different roots. The results show a marked improvement with the hybrid method: the proposed technique correctly extracts about 99% of tripartite strong roots and about 86% of tripartite vowel roots.
... Al-Kabi and Al-Mustafa in [2], Ghwanmeh et al. in [3], Al-Kabi et al. in [4], Taghva et al. in [5], Alshalabi in [6], Al-Shalabi and Evens in [7], Yaseen and Hmeidi in [8], Hmeidi et al. in [9] and most new Arabic root extraction algorithms in the literature have each tested their proposed root extraction algorithm on a different data set and compared their findings with other existing work. However, the data sets that they used did not cover all types of words. ...
Article
A question answering system aims to retrieve precise information from a large collection of documents. This work presents a question answering method applied to Hadith in order to provide an informative answer to the user's query. Hadith encompasses stories about and descriptions of the Prophet Muhammad (PBUH). It also includes the sayings of his companions and their disciples. The problem with current methods is that they fail to capture meaning when comparing a sentence with a user's query; hence there is often a conflict between the extracted sentences and the user's requirements. Our proposed method tackles this problem by: (1) avoiding the extraction of a passage whose similarity to the query is high but whose meaning is different; (2) computing the semantic and syntactic similarity of sentence-to-sentence and sentence-to-query pairs; and (3) expanding the words in both the query and the sentences to address the fundamental problem of term mismatch between sentences and the user's query. Furthermore, to reduce redundant Hadith texts, the proposed method uses a greedy algorithm to impose a diversity penalty on the sentences. The experimental results show that the proposed method improves performance compared with existing methods on Hadith datasets.
Article
Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.
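To make the idea of heuristic light stemming concrete, here is a minimal sketch that strips at most one common prefix and one common suffix from a word without attempting to find its root. The affix lists, the minimum stem length, and the function name `light_stem` are illustrative assumptions; they are not the exact rules of the stemmers described in the abstract above.

```python
# Illustrative affix lists (assumptions for this sketch, not the authors' rules).
# Longer prefixes come first so that e.g. "وال" is tried before "ال" or "و".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")
MIN_STEM_LEN = 2  # hypothetical guard against stripping a word away entirely


def light_stem(word):
    """Strip at most one prefix and one suffix; never search for a root."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("الكتاب"))     # definite article removed
print(light_stem("والمعلمون"))  # conjunction + article and plural suffix removed
```

Because the procedure only trims affixes, derived forms of the same root can still end up with different stems, which is the trade-off against root-based stemming discussed in these abstracts.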
Conference Paper
In natural language, a stem is the morphological base of a word to which affixes can be attached to form derivatives. Stemming is a process of assigning morphological variants of words to equivalence classes such that each class corresponds to a single stem. Different stemmers have been developed for a wide range of languages and for a variety of purposes. Arabic, a highly inflected language with complex orthography, requires good stemming for effective text analysis. Preliminary investigation indicates that existing approaches to Arabic stemming fail to provide effective and accurate equivalence classes when applied to a text like the Qur'an written in Classical Arabic. Therefore, I propose a new stemming approach based on a light stemming technique that uses a transliterated version of the Qur'an in western script.
Article
The paper presents a rapid method of developing a shallow Arabic morphological analyzer. The analyzer will only be concerned with generating the possible roots of any given Arabic word. The analyzer is based on automatically derived rules and statistics. For evaluation, the analyzer is compared to a commercially available Arabic Morphological Analyzer.
Conference Paper
The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections. This, however, does not provide any insights that might help in stemmer optimisation. This paper describes a method in which stemming performance is assessed against predefined concept groups in samples of words. This enables various indices of stemming performance and weight to be computed. Results are reported for three stemming algorithms. The validity and usefulness of the approach, and the problems of conceptual grouping, are discussed, and directions for further research are identified.
Conference Paper
Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.
Conference Paper
We have implemented a root-extraction stemmer for Arabic which is similar to the Khoja stemmer but without a root dictionary. Our stemmer was found to perform equivalently to the Khoja stemmer as well as so-called "light" stemmers in monolingual document retrieval tasks performed on the Arabic TREC-2001 collection. A root dictionary, therefore, does not improve Arabic monolingual document retrieval.
Article
The focus of the experiments reported in this paper was techniques for combining evidence for cross-language retrieval, searching Arabic documents using English queries. Evidence from multiple sources of translation knowledge was combined to estimate translation probabilities, and four techniques for estimating query-language term weights from document-language evidence were tried. A new technique that exploits translation probability information was found to outperform a comparable technique in which that information was not used. Comparative results for three variants of Arabic "light" stemming are also presented. A simple variant of an existing stemming algorithm was found to result in significantly better retrieval effectiveness.
Article
The inflectional structure of a word impacts the retrieval accuracy of information retrieval systems of Latin-based languages. We present two stemming algorithms for Arabic information retrieval systems. We empirically investigate the effectiveness of surface-based retrieval. This approach degrades retrieval precision since Arabic is a highly inflected language. Accordingly, we propose root-based retrieval. We notice a statistically significant improvement over the surface-based approach. Many variant word senses are based on an identical root; thus, the root-based algorithm creates invalid conflation classes that result in an ambiguous query which degrades the performance by adding extraneous terms. To resolve ambiguity, we propose a novel light-stemming algorithm for Arabic texts. This automatic rule-based stemming algorithm is not as aggressive as the root extraction algorithm. We show that the light stemming algorithm significantly outperforms the root-based algorithm. We also show that a significant improvement in retrieval precision can be achieved with light inflectional analysis of Arabic words.
Article
Introduction The University of Massachusetts took on the TREC10 cross-language track with no prior experience with Arabic, and no Arabic speakers among any of our researchers or students. We intended to implement some standard approaches, and to extend a language modeling approach to handle co-occurrences. Given the lack of resources -- training data, electronic bilingual dictionaries, and stemmers, and our unfamiliarity with Arabic, we had our hands full carrying out some standard approaches to monolingual and cross-language Arabic retrieval, and did not submit any runs based on novel approaches. We submitted three monolingual runs and one cross-language run. We first describe the models, techniques, and resources we used, then we describe each run in detail. Our official runs performed moderately well, in the second tier (3rd or 4th place). Since submitting these results, we have improved normalization and stemming, improved dictionary construction, expanded Arabic queries, i