ABSTRACT: The Gender Identification (GI) problem is concerned with determining the gender of the author of a given text based on its contents. The GI problem is one of the authorship profiling problems, which have a wide range of applications in fields such as marketing and security. Due to its importance, extensive research efforts have been invested in the GI problem for different languages. Unfortunately, the same cannot be said about the Arabic language despite its strategic importance and widespread use. In this work, we explore the GI problem for Arabic text as a supervised learning problem. Specifically, we consider and compare two approaches for feature extraction. The first one is the Bag-Of-Words (BOW) approach, while the second one is based on computing features related to sentiments and emotions. One goal of this work is to test the validity of the common stereotype that female authors tend to write in a more emotional way than male authors. Our results show that there is no conclusive evidence that this is true for our dataset.
The 12th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2015); 11/2015
ABSTRACT: The prevalent use of Online Social Networks (OSN), and the anonymity and lack of accountability they inherit from being online, give rise to many problems related to finding the connection between the massive amount of text data on OSN and the people who actually wrote it. Analyzing text data for such purposes is called authorship analysis. This work focuses on one specific type of authorship analysis, which is identifying the author's gender. Gender identification has various applications, from marketing to security. The focus of this work is on Arabic articles. The problem is basically a classification problem, and the current approaches differ in the way they compute the features of each document. However, they all agree on following some "stylometric features" approach. Unlike these works, ours treats this problem as a variation of the Text Classification (TC) problem and follows the Bag-Of-Words (BOW) approach for feature selection. We perform an extensive set of experiments on the feature selection and classification phases, and the results show that such an approach yields surprisingly high results.
The 11th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2014), Doha, Qatar; 11/2014
ABSTRACT: Building an effective stemmer for the Arabic language has always been a hot research topic in the IR field, since Arabic has a structure that is very different from, and more difficult than, that of other languages: it is a very rich language with complex morphology. Many linguistic and light stemmers have been developed for Arabic, but many weaknesses and problems remain. In this paper we introduce a new light stemming technique, compare it with other commonly used stemmers, and show how it improves search effectiveness.
Innovations in Information Technology, 2008. IIT 2008. International Conference on; 01/2009
ABSTRACT: This paper provides an improvement to Arabic Information Retrieval systems. The proposed system relies on the stem-based query expansion method, which adds different morphological variations to each index term used in the query. This method is applied to an Arabic corpus. The roots of the query terms are derived; then, for each derived root, all words in the corpus descended from the same root are collected and grouped into a distinct class. Afterward, each class is refined by co-occurrence analysis (for each pair of terms) to guarantee a good relationship between terms. In addition, we used the whole-word indexing technique to index the corpus. Each term added to the query must be correct and correlated with the original query terms. We show how this technique improves recall (by 20%) and how it affects precision.
ABSTRACT: Many Natural Language Processing (NLP) techniques have been used in Information Retrieval, but the results are not encouraging. Proper names are problematic for cross-language information retrieval (CLIR), and detecting and extracting proper nouns in Arabic is key to improving the effectiveness of such systems. The value of information in a text is usually determined by the proper nouns of people, places, and organizations; to collect this information, these nouns must first be detected. Proper nouns in Arabic do not start with a capital letter as they do in many other languages, such as English, so special treatment is required to find them in a text. Little research has been conducted in this area; most efforts have been based on a number of heuristic rules used to find proper nouns in the text. In this research we use a new technique to retrieve proper nouns from Arabic text, using a set of keywords and particular rules to represent the words that might form a proper noun and the relationships between them.
ABSTRACT: Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Stemming in the Arabic language does not fit the usual mold: in most research on other languages so far, stemming depends only on eliminating prefixes and suffixes from the word, but Arabic words contain infixes as well. In this paper we introduce an enhanced root-based algorithm that handles all kinds of affixes, including prefixes, suffixes, and infixes, depending on the morphological pattern of the word. A series of simulation experiments has been conducted to test the performance of the proposed algorithm. The results show that the algorithm extracts the correct roots with an accuracy rate of up to 95%.
Sixth International Conference on Computer Graphics, Imaging and Visualization: New Advances and Trends, CGIV 2009, 11-14 August 2009, Tianjin, China; 01/2009
ABSTRACT: Root extraction is one of the most important topics in information retrieval (IR), natural language processing (NLP), text summarization, and many other important fields. In the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms dealt with triliteral roots only, and some with fixed-length words only. In this study, a novel approach to the extraction of roots from Arabic words using bigrams is proposed. Two measures are used: the dissimilarity measure called the "Manhattan distance" and Dice's measure of similarity. The proposed algorithm is tested on the Holy Qur'an and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files used contain a wide range of data: the Holy Qur'an contains most of the ancient Arabic words, while the other file contains some modern Arabic words and some words borrowed from foreign languages in addition to the original Arabic words. The results of this study showed that combining N-grams with the Dice measure gives better results than using the Manhattan distance measure.
Journal of the American Society for Information Science and Technology 01/2009; 61(3):583-591. DOI:10.1002/asi.21247 · 1.85 Impact Factor
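As a rough illustration of the two measures compared in the abstract above, the sketch below scores a word against candidate roots using character bigrams. It assumes adjacent-letter bigrams and a tiny hypothetical root list; the paper's exact bigram construction and root lexicon are not given here, and all function names are illustrative.

```python
from collections import Counter


def bigrams(word):
    """Return the adjacent character pairs of `word` (an assumed bigram scheme)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a, b):
    """Dice similarity over unique bigrams: 2 * |A & B| / (|A| + |B|)."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def manhattan(a, b):
    """Manhattan distance between the bigram count vectors of two words."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    return sum(abs(ca[g] - cb[g]) for g in set(ca) | set(cb))


def best_root(word, roots, measure="dice"):
    """Pick the most similar root (Dice) or the least distant root (Manhattan)."""
    if measure == "dice":
        return max(roots, key=lambda r: dice(word, r))
    return min(roots, key=lambda r: manhattan(word, r))


# Hypothetical candidate roots, for illustration only.
ROOTS = ["كتب", "درس"]
print(best_root("كاتب", ROOTS, "dice"))
print(best_root("كاتب", ROOTS, "manhattan"))
```

Dice is a similarity (higher is better), while the Manhattan distance is a dissimilarity (lower is better), which is why the two branches of `best_root` use `max` and `min` respectively.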
ABSTRACT: Text classification is the task of assigning a document to one or more predefined categories based on its contents. This paper presents the results of classifying Arabic-language documents with the KNN classifier in two settings: once using N-grams (namely unigrams and bigrams) for document indexing, and once using the traditional single-term indexing method (bag of words), which supposes that the terms in the text are mutually independent, which is not the case. Results show that using N-grams produces better accuracy than using single terms for indexing: the average accuracy with N-grams is 0.7357, while with single-term indexing the average accuracy is 0.6688.
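A minimal sketch of the comparison described above, assuming word-level n-grams, raw counts as weights, cosine similarity, and majority voting. None of these choices is confirmed by the abstract, and the toy training set is hypothetical English placeholder text rather than the paper's Arabic corpus.

```python
import math
from collections import Counter


def ngram_features(text, n_values=(1, 2)):
    """Count word n-grams for every n in `n_values` (unigrams and bigrams by default)."""
    tokens = text.split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(text, training, k=3, n_values=(1, 2)):
    """Label `text` by majority vote among its k most similar training documents."""
    query = ngram_features(text, n_values)
    scored = sorted(
        ((cosine(query, ngram_features(doc, n_values)), label)
         for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
TRAINING = [
    ("stock market prices fall", "economy"),
    ("central bank raises interest rates", "economy"),
    ("stock prices rise", "economy"),
    ("team wins football match", "sport"),
    ("football team loses final match", "sport"),
    ("coach praises team", "sport"),
]
print(knn_classify("football match tonight", TRAINING))
```

Passing `n_values=(1,)` reduces the features to single terms, which gives the bag-of-words baseline that the abstract compares against.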
ABSTRACT: Much attention has been paid to the relative effectiveness of interactive query expansion (IQE) versus automatic query expansion (AQE). This research shows that a collection-dependent AQE strategy gives better performance than no query expansion. The percentage of queries improved by the AQE strategy is 57%, with average precision equal to 43.2. Compared with the collection-dependent AQE strategy, IQE gives better average precision. The percentage of queries improved by the best IQE decision is 86%, with average precision equal to 44.1. The evaluation reveals that the value of n in the AQE strategy that gave the optimal average precision for the whole query set is equal to one.
Innovations in Information Technology, 2007. Innovations '07. 4th International Conference on; 12/2007
ABSTRACT: The paper describes a new stemmer algorithm to find the roots and patterns of Arabic words based on excessive letter locations. The algorithm locates the triliteral root, the quadriliteral root, and the pentaliteral root. The algorithm is written with the goal of supporting natural language processing programs such as parsers and information retrieval systems. The algorithm has been tested on thousands of Arabic words. The results reveal an accuracy reaching 95%.
Innovations in Information Technology, 2007. IIT '07. 4th International Conference on; 12/2007
ABSTRACT: The paper presents an enhanced, effective, and simple approach to text classification. The approach uses an algorithm to automatically classify documents. The main idea of the algorithm is to select feature words from each document; those words cover all the ideas in the document. The result of this algorithm is a list of the main subjects found in the document. This paper also investigates the effects of Arabic text classification on Information Retrieval. The system was evaluated in two cases based on precision/recall criteria: without Arabic text classification and with Arabic text classification. A series of experiments was carried out to test the algorithm using 242 Arabic abstracts. Additionally, automatic phrase indexing was implemented. Experiments revealed that the system with text classification gives better performance than the system without text classification.
Innovations in Information Technology, 2007. Innovations '07. 4th International Conference on; 12/2007
ABSTRACT: Automatic Text Categorization (ATC) refers to the process of building software tools capable of assigning unseen documents to predefined categories or subjects. This study aims to automatically classify the verses (Ayat, sentences) of the Fatiha and Yaseen Surahs (Chapters) of the Quran according to the classifications of Islamic scholars. Our automatic text categorization is based on the traditional linear classification function (score function). A system (classifier) has been designed and implemented to categorize the different verses in each Surah (Chapter). This system fully normalizes the verses in the first stage, and then the verses are assigned to the classes for which they have the highest score. The categorization process in this paper depends heavily on a specialized corpus of the Fatiha and Yaseen Surahs built by the authors. A corpus of the whole Quran has not been built before, and building a comprehensive corpus of the Holy Quran requires much time and effort. Hence, only a corpus of the Fatiha and Yaseen Surahs was built, and this corpus will be extended to include all of the Quran in future work. This limitation of the corpus restricts the system's categorization to the Fatiha and Yaseen Surahs only. The accuracy of the system can be improved if a more powerful stemmer and corpus are used. This study lays the foundation stone for building a full corpus of the Holy Quran and a classifier of its verses, which can be used to prove the unity of the subject and different verse similarities.
Journal of Applied Sciences 03/2005; 5(3):580-583. DOI:10.3923/jas.2005.580.583
ABSTRACT: This algorithm provides a new method for extracting the quadriliteral Arabic root (a four consonant string) from its morphological derivatives. Our stemming algorithm starts by excluding prefixes and checking the word starting from the last letter back to the first. A temporary vector is used to store the suffix letters being removed, and another vector is used to store the root. Particles and the definite article are removed before the suffix and root are partitioned. The algorithm has been tested on a sample of 145 words derived from quadriliteral Arabic verbs, with 95% accuracy for the initial results.
ABSTRACT: The objective of this research is to study the process of examining documents by computing comparisons between the representation of the information need (the queries) and the representations of the documents. Also, we will automate the process of representing information needs as user profiles by computing the comparison between the user profile and the representations of the documents. We consider an automated process to be successful when it produces results similar to those produced by human comparison of the documents themselves with the actual information need. Thus, we will compare ad-hoc retrieval and filtering retrieval tasks and examine the differences between them in terms of the information retrieval process. We have selected 242 Arabic abstracts that were used by Hmeidi. All these abstracts involve computer science and information systems. We have also designed and built a system to compare two different retrieval tasks: ad-hoc retrieval and filtering retrieval. Here, we define ad-hoc and filtering retrieval systems and illustrate the development strategy for each one. We compare the two tasks on the basis of recall/precision evaluation, system usability, domain search, ranking, construction complexity, and methodology. From this experiment, we conclude that ad-hoc retrieval gives better performance than filtering retrieval. We also consider the advantages of using filtering services in the information retrieval process.
International Journal of Computer Processing Of Languages 09/2004; 17:181-199. DOI:10.1142/S0219427904001073
ABSTRACT: Summary form only given. We have designed and implemented an efficient stop-word removal algorithm for the Arabic language based on a finite state machine (FSM). An efficient stop-word removal technique is needed in many natural language processing applications, such as spelling normalization, stemming and stem weighting, question answering systems, and information retrieval (IR) systems. Most existing stop-word removal techniques are based on a dictionary that contains a list of stop-words; this is very expensive, since searching takes too much time and storing the stop-words requires too much space. The new Arabic stop-word removal technique has been tested using a set of 242 Arabic abstracts chosen from the Proceedings of the Saudi Arabian National Computer Conferences and another set of data chosen from the Holy Qur'an, and it gives impressive results, reaching approximately 98%.
Information and Communication Technologies: From Theory to Applications, 2004. Proceedings. 2004 International Conference on; 05/2004
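To make the FSM idea above concrete, here is a small sketch of a deterministic state machine that recognizes stop-words one character at a time. It is an assumption-laden illustration: the abstract does not describe how the authors' FSM is constructed, so this version simply compiles a short hypothetical stop-word list into states and transitions, and every name in it is illustrative.

```python
def build_fsm(stop_words):
    """Compile stop-words into a deterministic FSM (state 0 is the start state)."""
    transitions = [{}]   # transitions[state][char] -> next state
    accepting = set()    # states that mark the end of a stop-word
    for word in stop_words:
        state = 0
        for ch in word:
            nxt = transitions[state].get(ch)
            if nxt is None:
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[state][ch] = nxt
            state = nxt
        accepting.add(state)
    return transitions, accepting


def is_stop_word(token, fsm):
    """Run the FSM over `token`; accept only if it ends in an accepting state."""
    transitions, accepting = fsm
    state = 0
    for ch in token:
        state = transitions[state].get(ch)
        if state is None:   # no transition: the token cannot be a stop-word
            return False
    return state in accepting


def remove_stop_words(text, fsm):
    """Drop every whitespace-separated token that the FSM accepts."""
    return [t for t in text.split() if not is_stop_word(t, fsm)]


# Hypothetical stop-word list, for illustration only.
FSM = build_fsm(["في", "من", "على", "إلى", "عن"])
print(remove_stop_words("ذهب الولد إلى المدرسة في الصباح", FSM))
```

Each token is checked in time proportional to its own length, independent of how many stop-words the machine encodes.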
ABSTRACT: Summary form only given. We present a new stemming algorithm to extract quadriliteral Arabic roots. The algorithm starts by excluding the prefixes and then checks the word's characters, starting from the last letter and moving backward to the first. A temporary matrix is used to store the suffix letters of the Arabic word, and another matrix is used to store the roots. The partition process is preceded by removing the particle from the source word. Checking the letters of any word includes checking whether the tested letter is included within the general standard Arabic word; if the test is positive, the letter is stored in the temporary matrix, otherwise it is stored in the root matrix. In some cases, mutation of some of the original letters in the word is used in order to store the substitute letters in the root matrix. Finally, the letters in the root matrix are arranged according to their order in the original word. The algorithm has been tested on a sample of 200 randomly generated words derived from quadriliteral Arabic verbs. It achieved a high accuracy rate of 95%.
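The backward scan described above can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: it assumes that the letter test uses the ten "extra" letters of Arabic morphology (collected in the mnemonic سألتمونيها), it uses plain lists in place of the two matrices, and it omits prefix/particle removal and the letter-mutation step. The function name is hypothetical.

```python
# Assumption: the letter test checks membership in the ten "extra" letters
# of Arabic morphology, remembered through the mnemonic word below.
EXTRA_LETTERS = set("سألتمونيها")


def split_root(word):
    """Scan `word` from the last letter to the first, separating root letters
    from affix letters. Returns (root, removed_letters), both in original order."""
    removed_buf, root_buf = [], []
    for letter in reversed(word):
        if letter in EXTRA_LETTERS:
            removed_buf.append(letter)   # candidate affix letter
        else:
            root_buf.append(letter)      # kept as a root letter
    # Restore the order in which the letters appear in the original word.
    root_buf.reverse()
    removed_buf.reverse()
    return "".join(root_buf), "".join(removed_buf)


root, removed = split_root("يدحرجون")
print(root, removed)
```

Because this version drops every extra letter, it would mis-handle roots that themselves contain one of those letters (for example ترجم); the mutation and ordering rules mentioned in the abstract are the kind of refinement that a complete algorithm would need.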