Conference Paper

Enhanced Algorithm for Extracting the Root of Arabic Words

DOI: 10.1109/CGIV.2009.10 Conference: Sixth International Conference on Computer Graphics, Imaging and Visualization: New Advances and Trends, CGIV 2009, 11-14 August 2009, Tianjin, China
Source: DBLP

ABSTRACT Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Arabic stemming does not fit the usual mold: most stemming research for other languages relies only on removing prefixes and suffixes, but Arabic words contain infixes as well. This paper introduces an enhanced root-based algorithm that removes all kinds of affixes (prefixes, suffixes, and infixes) according to the morphological pattern of the word. A series of simulation experiments was conducted to test the performance of the proposed algorithm. The results show that the algorithm extracts the correct roots with an accuracy of up to 95%.
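
The abstract does not give the algorithm's rules, so the following is only a minimal sketch of the general idea of pattern-based root extraction, not the authors' method. It assumes a tiny, hypothetical set of prefixes, suffixes, and morphological patterns, in which the letters ف, ع, and ل mark the three root positions and every other pattern letter is an affix or infix that must match literally. All names (`PREFIXES`, `SUFFIXES`, `PATTERNS`, `match_pattern`, `candidate_stems`, `extract_root`) are illustrative.

```python
# Hypothetical, deliberately tiny affix and pattern lists (assumptions for
# illustration only; a real stemmer would need far larger inventories).
PREFIXES = ["ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ة"]
PATTERNS = ["فاعل", "مفعول", "فعال", "مفعل", "فعل"]
ROOT_SLOTS = "فعل"  # placeholder letters marking the three root positions


def match_pattern(word, pattern):
    """Return the root if `word` fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = {}
    for w_ch, p_ch in zip(word, pattern):
        if p_ch in ROOT_SLOTS:
            root[p_ch] = w_ch      # slot letter: capture a root letter
        elif w_ch != p_ch:
            return None            # affix/infix letter must match exactly
    return "".join(root[slot] for slot in ROOT_SLOTS)


def candidate_stems(word):
    """Yield the word itself, then versions with a prefix and/or suffix removed."""
    yield word
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if not (prefix or suffix):
                continue
            remaining = len(word) - len(prefix) - len(suffix)
            if word.startswith(prefix) and word.endswith(suffix) and remaining >= 3:
                yield word[len(prefix):len(word) - len(suffix)]


def extract_root(word):
    """Return the first root found by matching candidate stems to patterns."""
    for stem in candidate_stems(word):
        for pattern in PATTERNS:
            root = match_pattern(stem, pattern)
            if root is not None:
                return root
    return None


if __name__ == "__main__":
    for w in ["كاتب", "مكتوب", "الكاتب", "مكتبة", "كاتبون"]:
        print(w, "->", extract_root(w))
```

The pattern step is what removes infixes: in كاتب the ا sits between root letters, so prefix and suffix stripping alone cannot recover the root كتب. Trying the unstripped word first is a design choice of this sketch that keeps three-letter words such as ولد from losing a root letter that happens to look like a prefix.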

  • ABSTRACT: The amount of Arabic electronic information on the web is growing rapidly. Statistics show that the number of Internet users in the Middle East has increased enormously since 2000, owing to growing awareness of ICT and its importance within Arab countries. This has raised the need for effective methods and techniques for locating and retrieving Arabic content on the web. This paper presents the major Information Retrieval (IR) tools and techniques and highlights a few challenges in this regard.
    GCC Conference and Exhibition (GCC), 2011 IEEE; 01/2011
  • ABSTRACT: The authors evaluate and compare the storage efficiency of different sparse matrix storage structures used as index structures for an Arabic text collection, together with the corresponding sparse matrix-vector multiplication algorithms used for query processing in an Information Retrieval (IR) system. The study covers six sparse matrix storage structures: Coordinate Storage (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Block Coordinate (BCO), Block Sparse Row (BSR), and Block Sparse Column (BSC). The evaluation is based on the storage space each structure requires and the efficiency of its query processing algorithm. The experimental results demonstrate that CSR is more efficient in terms of storage space and query processing time than the other sparse matrix storage structures. Among the point-entry structures (COO, CSR, CSC), CSR requires the least disk space and gives the best query processing time. Among the block-entry structures (BCO, BSR, BSC), BSR requires the least disk space and gives the best query processing time.
    04/2012; 2(2):52-67. DOI:10.4018/ijirr.2012040105
  • ABSTRACT: Arabic is a Semitic language used by 5% of people around the world, and it is one of the official languages of the UN. Natural languages such as Arabic and English commonly use words derived from the same root. Researchers in text mining, information retrieval (IR), indexing, machine translation, and natural language processing (NLP) have therefore found it beneficial to extract the stem, base, or root from derived words, since many words are typically derived from a common root or stem and usually refer to the same concept. Arabic stemming is not an easy task, because Arabic uses many inflectional forms. Researchers are divided on whether stemming is beneficial in fields such as IR and NLP, since in Arabic the morphological variants of a word are not always semantically related. This study presents the design and implementation of a new Arabic light/heavy stemmer called the Arabic Rule-Based Light Stemmer (ARBLS), which is not based on Arabic root patterns. Instead, it depends on mathematical rules and some relations between letters. A series of tests was conducted to compare the effectiveness of ARBLS with that of another well-known Arabic stemmer. The tests show clearly that ARBLS is more effective than the other stemmer tested.
    The fourth International Conference on Information and Communication Systems (ICICS 2013); 04/2013
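
The sparse-matrix abstract in the list above describes query processing as a sparse matrix-vector multiplication over a CSR index. As a rough illustration only, and not the implementation evaluated in that study, the sketch below builds a CSR representation of a small document-term matrix and scores documents against a query vector. The function names (`to_csr`, `csr_matvec`) and the sample weights are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # nonzero weight
                col_idx.append(j)   # its column (term id)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        total = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


if __name__ == "__main__":
    # Rows are documents, columns are terms (made-up weights).
    doc_term = [
        [2, 0, 1],
        [0, 0, 3],
        [0, 1, 0],
    ]
    query = [1, 0, 1]  # the query contains terms 0 and 2
    values, col_idx, row_ptr = to_csr(doc_term)
    print(values, col_idx, row_ptr)
    print(csr_matvec(values, col_idx, row_ptr, query))
```

Only the nonzero weights and their column indices are stored, plus one row pointer per document, which is where the storage saving over a dense matrix comes from when most entries are zero.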