Enhanced Algorithm for Extracting the Root of Arabic Words
DOI: 10.1109/CGIV.2009.10 Conference: Sixth International Conference on Computer Graphics, Imaging and Visualization: New Advances and Trends, CGIV 2009, 11-14 August 2009, Tianjin, China
Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Arabic stemming does not fit the usual mold: most stemming research for other languages relies only on removing prefixes and suffixes, but Arabic words contain infixes as well. This paper introduces an enhanced root-based algorithm that handles all three kinds of affixes (prefixes, suffixes, and infixes) according to the morphological pattern of the word. A series of simulation experiments was conducted to test the performance of the proposed algorithm. The results show that the algorithm extracts the correct roots with an accuracy of up to 95%.
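To make the idea of pattern-driven affix removal concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose rules are not given in this abstract. The affix lists, the pattern list, and the function names are all hypothetical and far smaller than any real system would need. Patterns use the conventional placeholder letters ف, ع, ل for the three root consonants; every other letter in a pattern is treated as an affix letter that must match literally.

```python
# Hypothetical, deliberately tiny resources for illustration only.
PREFIXES = ["ال", "و"]
SUFFIXES = ["ات", "ون", "ة"]
PATTERNS = ["فاعل", "مفعول", "مفعل", "فعال", "فعل"]
ROOT_SLOTS = "فعل"  # placeholder letters standing for the three root consonants


def match_pattern(word, pattern):
    """Return the root if `word` fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = {}
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root[p] = w      # slot position: this letter belongs to the root
        elif w != p:
            return None      # literal affix letter (prefix or infix) must match
    return "".join(root[s] for s in ROOT_SLOTS)


def candidate_stems(word):
    """Yield the word itself plus versions with a prefix and/or suffix removed."""
    stems = [word]
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) - len(pre) >= 3:
            stems.append(word[len(pre):])
    for stem in list(stems):
        for suf in SUFFIXES:
            if stem.endswith(suf) and len(stem) - len(suf) >= 3:
                stems.append(stem[:-len(suf)])
    return stems


def extract_root(word):
    """Try each candidate stem against each pattern; return the first root found."""
    for stem in candidate_stems(word):
        for pattern in PATTERNS:
            root = match_pattern(stem, pattern)
            if root is not None:
                return root
    return None
```

In this sketch, matching against the pattern is what removes the infix: for كاتب, the pattern فاعل treats the ا as a literal letter and keeps only the three slot letters.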
Available from: Attia Nehar
- "Weights and ranks are assigned to letters using a small amount of information about the language (Al-Serhan et al., 2003). In other works (Ghwanmeh et al., 2009; Harmanani et al., 2006; Momani and Faraj, 2007), a rule-based approach was used. For instance, Harmanani et al. (2006) proposed a method in which roots are extracted based on a set of language-dependent rules that are interpreted by a rule engine."
ABSTRACT: In this paper, we address the problems of Arabic text classification and root extraction using transducers and rational kernels. We introduce a new root extraction approach based on Arabic patterns (Pattern Based Stemmer). Transducers are used to model these patterns, and root extraction is done without relying on any dictionary. Using transducers for extracting roots, documents are transformed into finite state transducers. This document representation allows us to use and explore rational kernels as a framework for Arabic text classification. Root extraction experiments are conducted on three word collections and yield 75.6% accuracy. Classification experiments are done on the Saudi Press Agency dataset, and N-gram kernels are tested with different values of N. Accuracy and F1 reach 90.79% and 62.93% respectively. These results show that our approach, when compared with other approaches, is promising, especially in terms of accuracy and F1.
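As a rough illustration of what an N-gram kernel measures, the sketch below computes a plain N-gram count kernel between two token sequences (for example, sequences of extracted roots). This is a simplified explicit-counting version, not the transducer-based rational kernel formulation the abstract describes. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_kernel(x, y, n):
    """Sum, over all n-grams, of (count in x) * (count in y)."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    # Counter returns 0 for n-grams that do not occur in y.
    return sum(count * cy[gram] for gram, count in cx.items())
```

Two documents that share many N-grams get a large kernel value. Changing N trades sensitivity to word order against sparsity, which is presumably why the abstract reports testing several values of N.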
Available from: Hayel Khafajeh
- "Each Arabic word is formed from the root word and a suffix, a prefix or an infix. Many computerized Arabic language applications rely on the roots of words, such as information retrieval systems, text classification, text summarization, auto-translation, data mining, OCR (Ghwanmeh et al., 2009; Yousef et al., 2010) and other applications. The roots of Arabic words can be classified by their vowel letters into two types (Wightwick and Gaafar, 2007): strong roots, which contain no vowel, and vocalic roots, which contain at least one vowel."
ABSTRACT: The Arabic language is distinguished by its morphological richness, which forces those working in Arabic language processing (e.g., information retrieval, document classification, text summarization) to deal with many words that seem different but in fact come from the same root. One way to overcome this problem is to reduce words to their roots. This research provides a new algorithm that returns the roots of Arabic words using an n-gram technique, without morphological rules, in order to avoid the complexity arising from the morphological richness of the language on the one hand and the multiplicity of morphological rules on the other. The proposed algorithm uses a list that contains over 4,500 identical root words.
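The abstract does not say how the n-grams are compared against the root list, so the following Python sketch makes its own assumptions: it uses character bigrams and the Dice coefficient, and picks the listed root most similar to the input word. The three-entry root list and all function names are hypothetical stand-ins.

```python
# Hypothetical stand-in for the root list (the abstract mentions over 4,500 entries).
ROOTS = ["كتب", "درس", "علم"]


def bigrams(word):
    """Set of contiguous character bigrams in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two words."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def closest_root(word, roots=ROOTS):
    """Return the listed root most similar to `word`, or None if nothing overlaps."""
    best = max(roots, key=lambda r: dice(word, r))
    return best if dice(word, best) > 0 else None
```

No morphological rule is consulted here: the word مدرسة maps to درس only because the two strings share the bigrams در and رس.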
Available from: Belal Mustafa Abuata
- "Most studies conclude that stemming of English text is beneficial, but this issue is controversial in studies related to Arabic stemming. Therefore a number of studies such as   concludes and asserts the effectiveness of stemming, while other studies conclude it is harmful and it degrades the performance of the system using it. Also there are studies which conclude that light stemming is better than heavy stemming, and there are others which conclude it is better to use heavy stemming relative to light stemming. "
ABSTRACT: Arabic is a Semitic language used by 5% of people around the world, and it is one of the official UN languages. Natural languages like Arabic and English commonly use words derived from the same root. Researchers in text mining, information retrieval (IR), indexing, machine translation, and natural language processing (NLP) have therefore found it beneficial to extract the stem, base, or root from derived words, since many words typically share a common root or stem. Words derived from the same stem or root usually refer to the same concept. Arabic stemming is not an easy task, since Arabic uses many inflectional forms. Researchers are divided on whether stemming is beneficial in fields such as IR and NLP, since in Arabic the morphological variants of a word are not always semantically related. This study presents the design and implementation of a new Arabic light/heavy stemmer called Arabic Rule-Based Light Stemmer (ARBLS), which is not based on Arabic root patterns. Instead, it depends on mathematical rules and some relations between letters. A series of tests compares the effectiveness of ARBLS with that of another well-known Arabic stemmer. The tests clearly show that ARBLS is more effective than the other tested stemmer.