Enhanced Algorithm for Extracting the Root of Arabic Words.
ABSTRACT Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Stemming in Arabic does not fit the usual mold: most stemming research in other languages relies only on removing prefixes and suffixes from a word, but Arabic words contain infixes as well. This paper introduces an enhanced root-based algorithm that handles all three kinds of affixes (prefixes, suffixes, and infixes) according to the morphological pattern of the word. A series of simulation experiments was conducted to test the performance of the proposed algorithm. The results show that the algorithm extracts the correct roots with an accuracy of up to 95%.
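The abstract does not list the algorithm's rules, so the following is only a minimal sketch of the general idea of pattern-based root extraction, not the authors' method. It assumes a tiny, hypothetical set of prefixes, suffixes, and morphological patterns, in which the letters ف, ع, and ل mark the root positions and every other pattern letter must match the word literally. The names `match_pattern`, `strip_affixes`, and `extract_root` are illustrative.

```python
# Hypothetical, deliberately tiny affix and pattern lists (not from the paper).
PREFIXES = ["ال", "و"]
SUFFIXES = ["ات", "ون", "ة"]
PATTERNS = ["فاعل", "مفعول", "فعال", "مفعل", "فعل"]
ROOT_SLOTS = "فعل"  # placeholder letters marking root positions in a pattern


def match_pattern(word, pattern):
    """Return the root letters if the word fits the pattern, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root.append(w)   # root position: keep the word's letter
        elif w != p:
            return None      # affix position (prefix or infix) must match
    return "".join(root)


def strip_affixes(word):
    """Remove at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def extract_root(word):
    """Try the patterns on the raw word first, then on the stripped stem."""
    stem = strip_affixes(word)
    for candidate in (word, stem):
        for pattern in PATTERNS:
            root = match_pattern(candidate, pattern)
            if root is not None:
                return root
    return stem  # no pattern matched: fall back to the stripped stem


for w in ["كاتب", "مكتوب", "الكاتب", "مكتبة", "كتاب"]:
    print(w, "->", extract_root(w))
```

With these assumed lists, all five sample words reduce to the root كتب. A realistic stemmer would need far larger affix and pattern inventories and rules for weak letters, which this sketch omits.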
- Source available from: Belal Abuata
Conference Proceeding: Building and Benchmarking New Heavy/Light Arabic Stemmer
ABSTRACT: Arabic is a Semitic language used by 5% of people around the world, and it is one of the official UN languages. Natural languages such as Arabic and English commonly use words derived from the same root. Researchers in text mining, information retrieval (IR), indexing, machine translation, and natural language processing (NLP) have therefore found it beneficial to extract the stem, base, or root from derived words, since many words are typically derived from a common root or stem. Words derived from the same stem or root usually share not only an origin but also a concept. Arabic stemming is not an easy task, since Arabic uses many inflectional forms. Researchers are divided on whether stemming is beneficial in fields such as IR and NLP, since in Arabic the morphological variants of a word are not always semantically related. This study presents the design and implementation of a new Arabic light/heavy stemmer called the Arabic Rule-Based Light Stemmer (ARBLS), which is not based on Arabic root patterns. Instead, it depends on mathematical rules and some relations between letters. A series of tests was conducted on ARBLS to compare its effectiveness with that of another well-known Arabic stemmer. The tests clearly show that ARBLS is more effective than the other tested stemmer.
The Fourth International Conference on Information and Communication Systems (ICICS 2013), JUST, Irbid, Jordan; 04/2013
- ABSTRACT: Previous studies on Arabic stemming lack fair evaluation, full descriptions of the algorithms used, or access to the source code of the stemmers and the datasets used to evaluate them. Releasing source code and datasets is an essential step toward enabling researchers to enhance the stemmers currently in use and to verify the results of these studies. This study lays the foundation for a benchmark for Arabic stemmers and presents an evaluation of four heavy (root-based) stemmers for Arabic. The evaluation aims to assess the accuracy of each of the four stemmers and to show the strength of each. The four algorithms are the Al-Mustafa, Al-Sarhan, Rabab’ah, and Taghva stemmers. The accuracy and strength tests used in this study ranked the Rabab’ah stemmer first, followed by the Al-Sarhan, Al-Mustafa, and Taghva stemmers, respectively.
Journal of Information Science 01/2011; 37:111-119. · 1.24 Impact Factor
- ABSTRACT: The information world is rich in documents in different formats and applications, such as databases, digital libraries, and the Web. Text classification supports the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the web. Many text classification studies have been applied to English, Dutch, Chinese, and other languages, whereas fewer have been applied to Arabic. This paper addresses the automatic classification of Arabic text documents, using stemming as part of the preprocessing steps. The results show that, without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy in the two test modes, at 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: SVM accuracy in the two test modes dropped to 84.49% and 86.35%.
01/2011;