All content in this area was uploaded by Mohammed N. Al-Kabi on Jun 13, 2014
The 2006 International Arab Conference on Information Technology (ACIT'2006)
Arabic Root Based Stemmer
Mohammed Naji AL-Kabi* and Ronza S. Al- Mustafa**
*Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, mohammednaji@yahoo.com
**Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, ronza_malkawi@yahoo.com
ABSTRACT
This paper presents a new root-based stemming algorithm for
the Arabic language. As in other natural languages, not all
words used in Arabic have roots; some are borrowed from other
languages, e.g. the Arabic word for "television", and the
stemmer will fail to find a root for such words because they
have none. The algorithm is based on affix removal together
with knowledge from structural linguistics. Its implementation
and evaluation show a noticeable improvement in accuracy
relative to previous algorithms.
Keywords: Arabic, Stemming, Root, Negative Suffix,
Negative Prefix, Light Stemming, NLP.
1. INTRODUCTION
The Arabic language is the fifth most widely spoken
language in the world. It belongs to the Semitic family; so
it differs from the Indo-European languages
morphologically, semantically, and syntactically. The
Arabic alphabet contains twenty-eight letters, always
written from right to left in cursive form. Diacritical
marks (harakat, tashkiil) appear either above or
below the letters, and play an essential role in many cases
in distinguishing semantically and phonetically between
two identical words with the same characters, but with
different diacritics. Diacritical marks are used in holy
books, poems, and children’s literature; newspapers,
journals and other books for adults are usually printed
without diacritics, which means that many strings are
ambiguous. Most native Arabic words are derived from
verbal roots. Arabized words, on the other hand, mainly
nouns borrowed from other languages with a slight
phonetic adjustment to suit the Arabic pronunciation, have
no roots [8].
All Arabic words belong to three main categories:
noun, verb or particle. Around 64% of Arabic words are
derived from triliteral verbs (three consonants), but there
are also biliteral verbs (two consonants), quadriliteral
verbs (four consonants), and pentaliteral verbs (five
consonants). Naturally these verbs represent the roots for
which stemming algorithms typically search. This
stemming process excludes words derived from nouns and
particles [9].
A morpheme is the smallest meaningful linguistic unit
that has a semantic interpretation in the grammar of a
language. A stem differs from a root: a stem is a morpheme
or a set of concatenated morphemes that can accept an affix,
whereas a root is a single morpheme that provides the basic
meaning of a word.
Stemming might be useful to Information
retrieval systems, text classification systems, text
clustering systems, dictionary automation, text
compression, etc.
A number of authors regard stemming as word
standardization [12]. Some writers consider stemming
useful for improving retrieval performance because it
reduces variants of the same root word to a common
concept, and because it reduces the size of the indexing
structure, since the number of distinct index terms is
reduced [3]. Other writers doubt the value of stemming
in IR and text mining [3]; accordingly, many search
engines do not adopt stemming [3].
Several common types of stemming strategies are
discussed by Frakes: affix removal, table lookup,
successor variety, and n-grams [7]. Affix removal
strategy tries to eliminate the prefixes and suffixes.
The most important part in this strategy is suffix
removal, since most variants of terms are generated by
suffixes.
In Arabic, as in other natural languages, the stemmer
may face the problem of a negative prefix, where the
prefix that is eliminated is actually part of the word and
not really a prefix. If a stemmer tries to strip the
well-known prefix "ال" from the following examples, the
output will definitely be wrong: "الله" (Allah) and the
words for Germany, brigades, and Albania. The problem also
affects other prefixes such as "و" (and), a frequently
used conjunction; e.g., stripping "و" from "وفاء"
(honesty) leads to a wrong stem.
The negative prefix problem in Arabic stemmers is not
restricted to the "ال" and "و" prefixes; it also affects
other prefixes such as "والـ", "كالـ", "للـ", and "فالـ".
For example, light stemming of the term "والي" (governor)
gives a wrong result if the prefix "والـ" is stripped from
it. Similarly, wrong stems result for "كالح" (glum), "الله"
(Allah), and the word for "successful" if the prefixes
"كالـ", "للـ", and "فالـ", respectively, are stripped from
them. Arabic stemmers face a similar problem of negative
suffixes, where the suffix that has been eliminated is part
of the word and not really a suffix. If a stemmer tries to
strip off
the well-known suffix "ان" from the following examples, the
output will definitely be wrong, e.g. "لعمان" (to Amman) and
"اليابان" (Japan). Table 5 in the Appendix illustrates a
number of examples.
Table lookup is the simplest of the four strategies: it
simply looks up the root of the term in a table. Its
performance depends heavily on the number of words (terms)
and roots stored in the table; the larger the table, the
better the performance, although large lookup tables may
need considerable storage space. Successor variety is less
straightforward than the others; it relies on algorithms
based on structural linguistics that attempt to determine
morpheme boundaries. N-gram stemming searches for digrams,
trigrams, or longer sequences of successive letters in a
term. This strategy is a term clustering procedure rather
than a stemming procedure.
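The n-gram strategy can be illustrated with a minimal Python sketch. It assumes the common shared-digram formulation in which two terms are compared by the Dice coefficient over their sets of unique digrams; the function names `digrams` and `dice_similarity` are illustrative and do not come from the cited works.

```python
def digrams(term: str) -> set:
    """Return the set of unique pairs of adjacent letters in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared digrams / (digrams in a + digrams in b)."""
    da, db = digrams(a), digrams(b)
    if not da and not db:
        return 0.0  # terms too short to form any digram
    return 2 * len(da & db) / (len(da) + len(db))


# "مكتب" (office) and "مكتبة" (library) share three of their digrams.
print(round(dice_similarity("مكتب", "مكتبة"), 3))
```

Terms whose similarity exceeds a chosen threshold would be placed in the same cluster, which is why this strategy groups related terms without producing a root.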
The two problems above (negative prefix and negative
suffix) lead Arabic stemmers to a wrong grammatical root, so
the accuracy of IR and text mining systems that rely on
these stemmers deteriorates.
The two main problems of stemming have been described
by Chris D. Paice [12]. First, pairs of etymologically
related words sometimes differ sharply in meaning [12];
consider, for example, "سل" (ask), "سلب" (stole), and
"سلام" (peace). Second, the transformations involved in
adding and removing suffixes involve numerous
irregularities and special cases [12].
Stemming errors are of two kinds: understemming errors,
in which words which refer to the same concept are not
reduced to the same stem, and overstemming errors, in
which words are converted to the same stem even though
they refer to distinct concepts. In designing a stemming
algorithm there is a trade-off between these two kinds of
error.
A light stemmer plays safe in order to avoid
overstemming errors, but consequently leaves many
understemming errors. A heavy stemmer boldly removes
all sorts of endings, some of which are decidedly unsafe,
and therefore commits many overstemming errors [12].
Shereen Khoja addressed the problems that an Arabic
stemmer might face [9]:
"If the root contains a weak letter (e.g. "أ" alif, "و"
waw or "ي" yaa), the form of this letter may change
during derivation. To deal with this, the stemmer must
check to see if the weak letter is in the correct form." If
not, the stemmer produces the correct form of this weak
letter, which then gives the correct form of the root. A
triliteral verb is defined as weak if any of its three root
letters is "أ" alif hamza (a), "و" waw (w), or "ي" yaa (y),
e.g. the word for "gave", "وَجَدَ" found, "وَضَعَ" put,
"وَقَفَ" stood, "وَعَدَ" promised, "بَاعَ" sold, "جاء" came,
and "قَرَأ" read. Weak verbs also include triliteral verbs
whose second letter is doubled with a shadda ( ّ ), e.g.
"شَمّرَ" prepared. The shadda (gemination mark, tashdeed) is
written above the consonant that is doubled, and it looks
like a small "w". A strong verb is a triliteral verb that
contains none of the three weak letters above.
"Some words do not have roots. For example the Arabic
equivalents of "نحن" we, "بعد" after, "تحت" under and so on.
If the stemmer comes across any of these words, it does
nothing."
"Sometimes a root letter is deleted during derivation. This
is especially true of roots that have duplicate letters
(e.g. the last two letters are the same), e.g., "دُجٍجَ" get
dressed, "دَلَّلَ" dandle, "خَلَّلَ" souse, "عَلَّلَ" explained,
"قَلَّلَ" reduced, "بَلَل" wet, etc. The stemmer can detect
this, and return the letter that was removed."
"If a root contains a hamza, this hamza could change form
during derivation, e.g., the words for "talk" and "stand
up". The stemmer detects this, and returns the original
form of this hamza."
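The weak and strong classification described above can be sketched in a few lines of Python. This is an illustration only: the set `WEAK_LETTERS` includes the bare alif "ا" in addition to the three letters named in the text, which is an assumption made so that forms such as "باع" are recognised, and the function names are hypothetical.

```python
# Weak letters named in the text, plus bare alif (an assumption of this sketch).
WEAK_LETTERS = {"أ", "ا", "و", "ي"}


def is_weak_root(root: str) -> bool:
    """A root is treated as weak if any of its letters is a weak letter."""
    return any(letter in WEAK_LETTERS for letter in root)


def has_doubled_last_letter(root: str) -> bool:
    """True when the last two root letters are the same, e.g. "مدد"."""
    return len(root) >= 2 and root[-1] == root[-2]


for root in ("وجد", "قرأ", "كتب", "مدد"):
    print(root, is_weak_root(root), has_doubled_last_letter(root))
```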
L. S. Larkey and M. E. Connell [11] conducted a study
based on a modified version of Shereen Khoja's stemmer.
The modified version includes a few changes to enhance
the accuracy of the stemmer. These changes are summarized
as follows:
- If a root is not found, the normalized form is returned,
  rather than the original unmodified word.
- A list of place names is treated as "unbreakable" words
  exempt from stemming.
- In addition to the Arabic stop word list included in the
  Khoja stemmer, a script is used to remove stop phrases.
- A light stemmer strips definite articles (فالـ, كالـ, بالـ,
  والـ, الـ, and و) from the beginnings of normalized words
  and strips 10 suffixes from the ends of words (ات, ان, ها,
  ي, ة, ه, ية, يه, ين, and ون).
Table 5 in the appendix shows that light stemming leads
to wrong results if it is carried out unconditionally, so
we have reservations about the last step. Larkey and
Connell's stemmer nevertheless seems to be better than its
parent (the Khoja stemmer).
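To make the reservation concrete, the following Python sketch strips the prefixes and suffixes listed above unconditionally. It is a simplified illustration, not Larkey and Connell's implementation: it assumes that at most one prefix and one suffix are removed, tries longer affixes first, and applies no length checks.

```python
# Affixes as listed in the text, ordered so that longer ones are tried first.
PREFIXES = ["فال", "كال", "بال", "وال", "ال", "و"]
SUFFIXES = ["ات", "ان", "ها", "ية", "يه", "ين", "ون", "ي", "ة", "ه"]


def light_stem(word: str) -> str:
    """Strip at most one listed prefix and one listed suffix, unconditionally."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("المعلمين"))  # plausible stem: "معلم"
print(light_stem("وفاء"))      # negative prefix: "و" belongs to the word
print(light_stem("اليابان"))   # negative suffix: "ان" belongs to the word
```

The last two calls show the negative prefix and negative suffix problems: the stripped letters are part of the word itself, so the result is not a meaningful stem.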
Morphology is the branch of linguistics concerned with
the internal structure of word forms. Semitic languages
have a complex morphology, so Arabic is a difficult
language to stem. Arabic stemmers have to deal with
affixes (prefixes, infixes, and suffixes), in addition to
diacritical marks (harakat), in order to get the right root
with its appropriate diacritical marks. Furthermore, an
Arabic stemmer has to deal with
Arabized words (foreign words), which have no root and
therefore have to be excluded from stemming.
This study uses morphological patterns to obtain
triliteral and quadriliteral roots. The algorithm simply
tries to extract the root when the pattern's infixes match
the word's infixes.
Shereen Khoja is a pioneer in this field, but
unfortunately we were unable to obtain her original work,
"Stemming Arabic Text", written with her colleague Roger
Garside. Leah S. Larkey, Margaret E. Connell, and others at
the University of Massachusetts, Amherst conducted a number
of studies that build on Khoja's work. Their work [10] [11]
improves on Khoja's stemmer, but it does not solve the
negative prefix and negative suffix problems discussed
above. Al-Kharashi and Al-Sughaiyer [2] present pattern-based
stemming for the Arabic language; Taghva et al. [13] use the
same approach, which differs from Khoja's, with equivalent
performance. Pattern-based stemming does not use a root
dictionary; it matches the word against a number of Arabic
patterns to extract the root. Chen and Gey [4] conducted a
study to find Arabic roots using a machine translation (MT)
based stemmer. Although this study relies on the Ajeeb
machine translation system, stopword removal, clustering,
light stemming, and morphological analysis, it does not
present a solution to the negative prefix and negative
suffix problems. Kareem Darwish [5] shows how to extract a
root from a word by first removing the prefix and suffix to
get a stem, then matching the stem against a number of
templates to get the root. That study does not state how
many templates are used in the comparisons, nor does it give
an algorithm. Darwish and Oard [6] use an approach similar
to the earlier one [5], but give more details about the
prefixes and suffixes being removed. Table 6 shows the
patterns used within our algorithm.
2. THE ALGORITHM
The first step of the Arabic rooter under study is to
normalize the text. Afterward, the stem is matched against
the verbal and noun patterns in order to obtain the root.
To conduct this study, a system (stemmer) was built in
Visual Basic 6.0 to find Arabic roots. This stemmer keeps a
word unchanged if it fails to find a root, which is the
normal case when the stem is an Arabized word or a place
name, such as the name of a continent, region, country,
state, district, city, village, river, mountain, or desert.
The algorithm uses the following definitions:
- The gemination mark (tashdeed, "shaddah") ( ّ ) is placed
  above a consonant letter as a sign of the duplication of
  that consonant.
- Let T(i) be any term.
- Let LenT(i) be the length of each term.
- Let n be the number of terms within a document.
- Let chr(i) be the character at position i within a term.
- Let LenP(j) be the length of the pattern.
- Let Infixes_String be a manually generated string
  consisting of the affix letters of a pattern, in order.
  For example, the stem "مسابح" (swimming pools) matches the
  pattern "مفاعل", so the Infixes_String in this case is
  "ما", where "م" lies in the first position and "ا" lies in
  the third position.
- Let T_String be the string formed from the characters of
  the word at the positions of the pattern's infixes. To
  clarify the idea, suppose we want to find the root of the
  stem "مسابح" (swimming pools). The system has to check
  this word against all five-character patterns. One of
  these patterns is "تَفعيل", for which the Infixes_String
  is "تي" and the T_String is "مب"; the mismatch is obvious.
  When the stem is matched against the pattern "مفاعل", both
  the Infixes_String and the T_String are "ما".
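The comparison of T_String with Infixes_String can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Visual Basic code: the table `PATTERNS` is a small hypothetical sample holding only the patterns mentioned in this section and in Table 1, with 1-based infix positions, and the names `t_string`, `infixes_string`, and `extract_root` are illustrative. The hamza replacements applied after a match are omitted.

```python
# Hypothetical sample of the pattern table: pattern -> {1-based position: infix letter}.
PATTERNS = {
    "فعال": {3: "ا"},
    "مفعول": {1: "م", 4: "و"},
    "مفاعل": {1: "م", 3: "ا"},
    "تفعيل": {1: "ت", 4: "ي"},
    "يستفعلن": {1: "ي", 2: "س", 3: "ت", 7: "ن"},
}


def infixes_string(infixes: dict) -> str:
    """Concatenate the pattern's infix letters in positional order."""
    return "".join(infixes[pos] for pos in sorted(infixes))


def t_string(stem: str, infixes: dict) -> str:
    """Concatenate the stem's characters found at the infix positions."""
    return "".join(stem[pos - 1] for pos in sorted(infixes))


def extract_root(stem: str):
    """Return the root if a same-length pattern matches, otherwise None."""
    for pattern, infixes in PATTERNS.items():
        if len(pattern) != len(stem):
            continue
        if t_string(stem, infixes) == infixes_string(infixes):
            # Drop the characters at the infix positions; the rest is the root.
            return "".join(ch for pos, ch in enumerate(stem, start=1)
                           if pos not in infixes)
    return None


print(t_string("مسابح", PATTERNS["تفعيل"]))  # "مب", which differs from "تي"
print(extract_root("مسابح"))                 # matches "مفاعل" and yields "سبح"
```

Storing explicit positions, rather than deriving them from the pattern string, avoids ambiguity when an affix letter such as "ل" also serves as a root placeholder in the pattern.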
Table 1 shows how to get the Infixes_String for each of
the patterns used.
Table 1: An example of patterns and their infixes,
and the position of each infix

| Pattern | Infixes_String | Infix : Infix position |
|---|---|---|
| فعال | ا | ا : 3 |
| مفعول | مو | م : 1, و : 4 |
| يستفعلن | يستن | ي : 1, س : 2, ت : 3, ن : 7 |

1. Stop word removal, using a list of 1,281 stop words
   consisting of prepositions, pronouns, articles, and
   conjunctions.
2. Normalization
2.1 Remove the tatweel (kasheeda) symbol ("ـ")
2.2 Remove punctuation, using a list of punctuation
    characters
2.3 Remove diacritics, using a list of diacritic characters
3. If LenT(i) ≥ 6 and T(i) begins with one of (كال، فال، بال) then
      Remove that initial definite article
   Else if LenT(i) ≥ 5 and T(i) begins with one of (ال، لل) then
      Remove that initial definite article
   End if
4. If LenT(i) > 4 and T(i) ends with "اء" then
      Replace the final "اء" with "ي"
   End if
5. Replace initial ( إ ) or ( أ ) with bare alif ( ا )
6. Replace initial ( آ ) with bare alif ( ا )
7. Replace final ( ة ) with ( ه )
8. Replace final ( ى ) with ( ي )
9. For i ← 1 to n do
   9.1 If LenT(i) = 3 then
       9.1.1 If T(i) ends with the gemination mark (tashdeed) ( ّ ) then
                Root(T(i)) = chr(1) & chr(2) & chr(2)
             Else
                Root(T(i)) = T(i)
             End if
       End if
   9.2 If LenT(i) ≥ 4 then
       9.2.1 For j ← 1 to number of patterns of length LenT(i) do
             9.2.1.1 If T_String matches Infixes_String then
                     9.2.1.1.1 Remove the infix characters from T(i)
                     9.2.1.1.2 Replace "ئ" or "ؤ" with "أ"
                     9.2.1.1.3 Replace "ء" or "ى" with "ي"
                     9.2.1.1.4 Return Root(T(i))
                     End if
             Next j
       9.2.2 Return the normalized term (no pattern matched)
       End if
   Next i
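The normalization steps (2.1 to 8) can be sketched in Python as follows. Several details are assumptions of this sketch rather than facts from the paper: the punctuation list, the exact set of diacritics removed (the shadda is kept here because step 9.1.1 tests for it), and checking the three-letter articles before the two-letter ones so that both branches of step 3 are reachable. The name `normalize` is illustrative.

```python
import re
import string

TATWEEL = "\u0640"
# Assumed punctuation list: ASCII punctuation plus common Arabic marks.
PUNCTUATION = string.punctuation + "،؛؟"
# Tanween, short vowels, and sukun; shadda (U+0651) is deliberately kept.
DIACRITICS = re.compile("[\u064B-\u0650\u0652]")
LONG_ARTICLES = ("كال", "فال", "بال")
SHORT_ARTICLES = ("ال", "لل")


def normalize(term: str) -> str:
    """Apply steps 2.1 to 8 of the algorithm to a single term."""
    term = term.replace(TATWEEL, "")                           # 2.1
    term = term.translate(str.maketrans("", "", PUNCTUATION))  # 2.2
    term = DIACRITICS.sub("", term)                            # 2.3
    if len(term) >= 6 and term.startswith(LONG_ARTICLES):      # 3
        term = term[3:]
    elif len(term) >= 5 and term.startswith(SHORT_ARTICLES):
        term = term[2:]
    if len(term) > 4 and term.endswith("اء"):                  # 4
        term = term[:-2] + "ي"
    if term[:1] in ("إ", "أ", "آ"):                            # 5 and 6
        term = "ا" + term[1:]
    if term.endswith("ة"):                                     # 7
        term = term[:-1] + "ه"
    if term.endswith("ى"):                                     # 8
        term = term[:-1] + "ي"
    return term


print(normalize("بالمكتبة"))  # "مكتبه"
print(normalize("للمعلمين"))  # "معلمين"
```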
3. EVALUATION
In order to test the accuracy of our algorithm, we selected
a number of words randomly. Table 2 shows a manual trace of
the execution of the above algorithm as it extracts the
roots of the selected terms.
Table 3 shows the strengths and weaknesses of the above
algorithm, using a small data set containing 1,827 words.
The system failed to analyze 55 words because their
patterns are unknown; this failure is mostly due to foreign
(Arabized) words. The system analyzed the remaining 1,772
words, and the accuracy of extracting the right roots is
91%.
Table 2. Trace of the manual extraction of the
correct root.
Table 3. Accuracy of root extraction for three Arabic
text files
Figure 1. Statistics for root extraction
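Table 4 shows the precision, recall, and harmonic
mean (F-measure), computed using the following formulas:

$$\text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Incorrect}} \tag{1}$$

$$\text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{UnAnalyzed}} \tag{2}$$

$$F = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{3}$$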
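As a worked check of these measures, the following Python sketch computes precision as correct / (correct + incorrect), recall as correct / (correct + unanalyzed), and the F-measure as their harmonic mean. It uses the counts of the 579-word file from Table 3 (527 correct roots, 33 incorrect roots, 19 words not analyzed); the function names are illustrative.

```python
def precision(correct: int, incorrect: int) -> float:
    """Share of analyzed words whose extracted root is correct."""
    return correct / (correct + incorrect)


def recall(correct: int, unanalyzed: int) -> float:
    """Correct roots relative to correct roots plus unanalyzed words."""
    return correct / (correct + unanalyzed)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


p = precision(527, 33)
r = recall(527, 19)
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))  # 0.9411 0.9652 0.953
```

These values agree with the 579-word row of Table 4.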
| Original word T(i) | Normalized T(i) (stem) | T_String | Root(T(i)) | Status |
|---|---|---|---|---|
| التعليمات | تعليمات | تيات | عِلْم | Right |
| الميزان | ميزان | ان | ميز | Wrong |
| الإستثمارية | استثماريه | استايه | ثَمَرْ | Right |
| للمعلمين | معلمين | مين | عَلِمَ | Right |
| الإسترحام | استرحام | استا | رَحِم | Right |
| سيتركانها | سيتركانها | سيتاها | تَرَكَ | Right |
| للمرشدين | مرشدين | مين | رَشَدَ | Right |
| مدّ | مدّ | - | مَدَدَ | Right |
| ميزان | ميزان | ان | مَيَزَ | Wrong |
| تسائلوا | تسائلوا | تاوا | سَأَلَ | Right |
| المدارس | مدارس | ما | دَرَسَ | Right |
| كريم | كريم | ي | كَرُمَ | Right |
| بالمكتبة | مكتبه | مه | كَتَبَ | Right |
| الطائر | طاأر | ا | طَرْأ | Wrong |
| يستجيبون | يستجيبون | يستون | جِيْب | Wrong |
| محيطها | محيطها | ها | مُحِيْط | Wrong |
| Number of roots extracted correctly | Number of incorrect roots | Words not analyzed | Number of words |
|---|---|---|---|
| 130 (87.2%) | 16 (10.8%) | 3 (2%) | 147 |
| 215 (87.4%) | 24 (9.8%) | 7 (2.8%) | 244 |
| 527 (91%) | 33 (5.7%) | 19 (3.3%) | 579 |
| 791 (92.4%) | 39 (4.6%) | 26 (3%) | 857 |
| 1663 (91%) | 112 (6.1%) | 55 (3%) | 1827 |
Table 4 shows that the system obtains about 92% overall
precision for the analyzed words. Words that do not match
any of the verbal and noun patterns illustrated in Table 6
have been excluded from the computation of the accuracy
measures, because these words are foreign words.
Table 4. Accuracy of root extraction for three Arabic
text files
4. CONCLUSIONS
In order to increase the accuracy of the system and to
reduce the probability of encountering the negative suffix
and negative prefix problems, the system does not remove the
prefixes ("فـ", "ب", "لـ", "و") or the suffix ("ـه").
Furthermore, the system uses conditional removal: if the
term length is six or more, the system removes the prefixes
("وال", "بال", "كال", "فال"); otherwise, when the term
length is less than six, the term is left unchanged.
As mentioned by Thabet [14], a root-based algorithm
increases word ambiguity, since many word variants have
different meanings, and this affects the accuracy of IR,
text mining, and other systems that rely on root-based
stemmers. Table 5 presents a number of ambiguous cases. One
of these is a term that the reader can interpret as
parents, religion, or debt, since the word is bare of
diacritics and stands on its own, not within a sentence. As
noted earlier, diacritics are used to distinguish words
semantically and phonetically.
Arabic stemmers can be used to enhance the
efficiency of a number of systems such as, Spell checkers,
Information retrieval systems, Text mining systems, Text
Analysis systems, Compression systems , etc.
This algorithm is incapable of extracting the roots of
some imperative verbs that are made up of a single Arabic
letter although their roots have three letters (triliteral
verbs), e.g., "عِ", whose root is "وعِي". In addition, the
problem of defective (weak) roots is still not solved by
this algorithm. Defective roots are roots that contain the
vowels ("أ", "و", "ي"); they are classified as irregular
roots, since some vowels in these roots are altered to other
vowels or removed in the derivational process [1]. For
example, "رما" and "رمي" have the same meaning, throw, and
both represent the same root. As future research, we hope
to solve these problems in our next enhancement to this
work.
REFERENCES
[1] Aljlayl. M, Frieder. O. "On Arabic Search:
Improving the Retrieval Effectiveness via a
Light Stemming Approach", CIKM 02,
November 4-9, 2002, McLean, Virginia,
USA. Pages 340 -- 347. ACM 1-58113-492-
4/02/0011.
[2] Al-Kharashi, I.A., & Al-Sughaiyer, I.A.
(2002e). "Pattern-based Arabic stemmer". In
Proceedings of the 2nd Saudi Technical
Conference and Exhibition (STCEX2002),
Volume II (pp. 238-244), Riyadh, Saudi
Arabia.
[3] Baeza-Yates, R., & Ribeiro-Neto, Modern
Information Retrieval. Addison Wesley,
1999.
[4] Chen A. and Gey Fredic. 2002. "Building an
arabic stemmer for information retrieval". In
Proceedings of the Eleventh Text REtrieval
Conference (TREC 2002), National Institute
of Standards and Technology, November.
[5] Darwish K. 2002. "Building a shallow Arabic
Morphological Analyzer in one day", In
proceedings of the ACL-02 workshop on
Computational approaches to semitic
languages, Association for Computational
Linguistics , July.
[6] Darwish, K. and D. Oard. "CLIR Experiments
at Maryland for TREC 2002: Evidence
Combination for Arabic-English Retrieval".
In TREC. 2002. Gaithersburg, MD.
[7] Frakes W. B., Introduction to Information
Storage and Retrieval Systems, chapter 1,
pages 1--12. Prentice-Hall, 1992.
[8] Kanaan, G., Al-Shalabi, R., AL-Kabi, M.N.,
Jaam, J.M., Hasnah, A. 2004. "New
Approach for Extracting
Quadriliteral/Quadrilateral Arabic Roots", In
proceedings of 1st International Conference
on Information & Communication
Technologies: from Theory to Applications,
ICTTA'04, (Damascus, Syria, April 2004).
IEEE-France.
[9] Khoja S., Research Interests, Pacific
University, 2043 College Way, Forest Grove,
Oregon 97116,
http://zeus.cs.pacificu.edu/shereen/research.htm,
July 8, 2006.
| F-measure | Precision (accuracy of analyzed words) | Recall | Number of words |
|---|---|---|---|
| 0.9309 | 0.8889 | 0.9771 | 147 |
| 0.9322 | 0.8987 | 0.9682 | 244 |
| 0.9530 | 0.9411 | 0.9652 | 579 |
| 0.9606 | 0.9531 | 0.9682 | 857 |
| 0.9442 | 0.9204 | 0.9697 | 1827 |
[10] Larkey L., Ballesteros L., and Connell M.,
"Improving Stemming for Arabic Information
Retrieval: Light Stemming and Co-occurrence
Analysis," SIGIR 2002: 275-282, 2002.
[11] Larkey L. S., and Connell M. E., "Arabic
information retrieval at UMass in TREC-10". In
TREC 2001.
[12] Paice C.D., "An evaluation method for stemming
algorithms". In W.B. Croft and C.J. van
Rijsbergen, editors, Proceedings of the 17th
Annual International ACM SIGIR Conference on
Research and Development in Information
Retrieval, pages 69-90. Springer-Verlag, July
1994.
[13] Taghva, K., Elkoury, R., and Coombs, J.
"Arabic Stemming without a root
dictionary".2005.
www.isri.unlv.edu/publications/isripub/Tagh
va2005b.pdf
[14] Thabet, N. (2004). “Stemming the Qur’an”.
In Proceedings of Arabic Script-Based
Languages Workshop, COLING-04,
Switzerland, August 2004.
Appendix A:
Table 5: The problem of negative prefixes and
negative suffixes
| Full word | Removing the suffix ات | Full word | Removing the suffix ان | Full word | Removing the suffix ون | Full word | Removing the suffix ين |
|---|---|---|---|---|---|---|---|
| البركات | البرك | الأمان | الأم | بالعون | بالع | الأمين | الأم |
| التعليمات | التعليم | الإنسان | الإنس | البالون | البال | التامين | التام |
| الثورات | الثور | الأوان | الأو | بطون | بط | تحسين | تحس |
| الجماعات | الجماع | الأوطان | الأوط | بلون | بل | حنين | حن |
| الحملات | الحمل | بركان | برك | التعاون | التعا | الدين | الد |
| الدورات | الدور | الجِنان | الجِن | الحسون | الحس | الذين | الذ |
| دوريات | دوري | الحنان | الحن | حنون | حن | سجين | سج |
| الذات | الذ | خلجان | خلج | الستون | الست | سكين | سك |
| السلطات | السلط | الريان | الري | سكون | سك | سنتين | سنت |
| السنوات | السنو | الريحان | الريح | صابون | صاب | سنين | سن |
| السياسات | السياس | الضمان | الضم | العيون | العي | عين | ع |
| الشركات | الشرك | عجمان | عجم | قرون | قر | قوانين | قوان |
| طبقات | طبق | عنوان | عنو | كانون | كان | كدين | كد |
| القوات | القو | لبنان | لبن | مرهون | مره | لين | ل |
| لجأت | لج | لعمان | لعم | المليون | الملي | متين | مت |
| لذوات | لذو | للبنان | للبن | الهرمون | الهرم | مدللين | مدلل |
| للهواة | للهو | مرجان | مرج | يدرون | يدر | مسكين | مسك |
| لنزلات | لنزل | الميزان | الميز | يٌصلون، يَصِلون | يصل | المعلقين | المعلق |
| مداخلات | مداخل | نيسان | نيس | مضمون | مضم | معين | مع |
| النقاشات | النقاش | الهوان | الهو | مسكون | مسك | يمين | يم |
| وذرات | وذر | اليابان | الياب | مفتون | مفت | | |
Table 6: Verbal and noun patterns used within the algorithm
Full word
Pattern's used
Length 3 patterns
Length 4 patterns
لﻌﺘﻓ ل
Length 5 patterns
ﻔﻟ لﺎﻌﻔﺒ لﻴﻌﻔﻤ لﻌﺘﻔﻤ
Length 6 patterns
ﻌﻓ لﻴﻌﻔﺘﻟ لﻴﻌﻔﺘﺒ
Length 7 patterns
ﺘﺴﻤ
ﻌﻔﻟ
Length 8 patterns
ﺘﻠﻌﻔﺒ ﺎﻤﻜﺘﻠﻌﻔﻟ ﺎﻤﻜﺘﻠﻌﻔﺒ ﺎﻤﻬﻠﻋﺎﻔﻟ
ﺎﻤﻜﺘﻠﻋﺎﻓ ﺎﻤﻬﺘﻠﻋﺎﻓ
Length 9 patterns
ﺎﻬﻨﻼﻌﻔﺘﺴ ﺎﻬﻨﻼﻌﻔﻴﺴ