Thesis

Towards An Open Platform For Arabic Language Processing

Abstract

Arabic is widely used across the world, but the lack of linguistic tools and resources keeps it an under-resourced language, which hampers development and research. Many studies of Arabic have been carried out for academic and experimental purposes, but their results have not really been adapted to the development process or to end-user usage, and cannot be integrated into existing systems. The problem under investigation is to develop tools and resources that are open source, multipurpose, and usable by researchers, developers and end users. Our solution, named Adawat, comprises many applications, APIs and corpora, such as a light stemmer, a verb conjugator, a morphological analyzer, a spell checker, a text-to-speech system, the Mishkal diacritizer, a vocalized texts corpus, a synonyms dictionary, and collocations. We mainly use a rule-based approach to build rules and data. These tools are developed to be integrated with existing systems, such as the Hunspell spell checker, used by millions of users in Firefox and LibreOffice, and the eSpeak text-to-speech system. The availability of our tools and resources has had a high impact on new research, which mainly uses the Tashkeela corpus and the Tashaphyne stemmer.
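To make the idea of a rule-based light stemmer concrete, here is a minimal sketch. It is not Tashaphyne's implementation: the affix lists, the minimum stem length, and the function name `light_stem` are all assumptions chosen for illustration.

```python
# Hypothetical affix lists for illustration only; these are NOT the tables
# used by Tashaphyne or any other Adawat tool.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")  # longer prefixes first
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")
MIN_STEM_LEN = 3  # assumed lower bound so short words are not over-stripped


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping at least MIN_STEM_LEN letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ("المكتبة", "والكتاب", "معلمون", "ولد"):
        print(w, "->", light_stem(w))
```

A light stemmer of this kind only removes surface affixes and does not attempt to recover the root, which is why a length guard is needed to protect short words such as "ولد".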
... In the rich morphological system of Arabic, the same word can be joined to various affixes and clitics to generate new vocabulary, which makes synonyms widespread among Arabic words. As a result, Arabic is considered a highly inflectional and derivational language, which makes ambiguity one of the biggest challenges in Arabic NLP compared to many other languages (Khalatia & Al-Romanyb, 2020; Khalil & Osman, 2014; Shaalan et al., 2018; Zerrouki, 2020). ...
... The order of syntactic relations could be Subject-Predicate-Object or Predicate-Subject-Object or Predicate-Object-Subject. These syntactic relation orders are all acceptable sentence structures (Khalatia & Al-Romanyb, 2020; Khalil & Osman, 2014; Maloney & Niv, 1998; Zerrouki, 2020). ...
... This subject could be a definite noun, proper noun, or pronoun in the nominative case, and the predicate is an indefinite nominative noun, proper noun, or adjective that agrees with the subject in number and gender. The predicate can be a prepositional phrase (Zerrouki, 2020). ...
Article
This survey explores the literature on Arabic NLP tasks and Arabic IE applications to analyze state-of-the-art trends, identify research gaps in these fields, and recommend solutions to fill these gaps. Relevant research articles in the targeted fields were gathered from academic search engines and academic databases. These articles were then surveyed to obtain information about research trends, namely the contributions achieved, the methodologies applied, and the technical and linguistic resources used. The review followed systematic review procedures to meet the requirements of high-quality survey studies. The collected and reviewed articles cover different research contributions, for instance morphological resolution among Arabic NLP tasks and Sentiment Analysis (SA) among Arabic IE applications. The findings can be summarized as follows. First, most researchers working on Arabic NLP tasks prefer to contribute to NER and then to morphological resolution tasks, whereas in Arabic IE they prefer to contribute to SA applications and then to question answering applications. Second, most of the reviewed articles applied methodologies, tools, techniques, and algorithms that are not specific to a particular language, such as Machine Learning, Artificial Neural Networks, and Deep Learning algorithms. Lastly, this study provides the first comprehensive assessment of the associations between the domain types and ownership types of dataset sources, as well as the relation between articles' contribution fields and dataset ownership types. It confirms that, in the field of Arabic NLP tasks, the largest number of reviewed articles use existing and available dataset sources, specifically in the Linguistic domain. In the field of Arabic IE applications, by contrast, the largest number of reviewed articles are those whose authors collected and created the dataset sources themselves, also in the Linguistic domain.
Article
Sentiment analysis (SA), also known as opinion mining, is an increasingly important research area. Generally, it helps to automatically determine whether a text expresses a positive, negative or neutral sentiment. It makes it possible to mine the huge and growing resources of shared opinions such as social networks, review sites and blogs. SA is used in many fields and for various languages such as English and Arabic. However, since Arabic is a highly inflectional and derivational language, it raises many challenges, and SA of Arabic text has to handle this complex morphology. To better handle these challenges, we decided to provide the research community and Arabic users with a new efficient framework for Arabic Sentiment Analysis (ASA). Our primary goal is to improve the performance of ASA by exploiting deep learning while varying the preprocessing techniques. To that end, we implement and evaluate two deep learning models, namely a convolutional neural network (CNN) and a long short-term memory (LSTM) model. The framework offers various preprocessing techniques for ASA (including stemming, normalisation, tokenization and stop-word removal). As a result of this work, we first provide a new rich and publicly available Arabic corpus called the Moroccan Sentiment Analysis Corpus (MSAC). Second, the proposed framework demonstrates improvement in ASA. The experimental results show that deep learning models perform better for ASA than classical approaches (support vector machines, naive Bayes classifiers and maximum entropy). They also show the key role of morphological features in Arabic Natural Language Processing (NLP).
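The preprocessing steps named in this abstract (normalisation, tokenization, stop-word removal) can be sketched as follows. This is an illustrative sketch using common Arabic normalisation conventions, not the framework's actual code: the specific rules, the tiny stop-word list, and the function names are assumptions.

```python
import re

# Arabic diacritics (U+064B..U+0652) and the tatweel/kashida (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize(text: str) -> str:
    """Apply a few widely used (assumed) normalisation rules."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # unify alef variants to bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> yeh
    text = text.replace("\u0629", "\u0647")   # teh marbuta -> heh
    return text


# Hypothetical, tiny stop-word list; stored in normalised form so lookups match.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "و", "هذا")}


def tokenize(text: str) -> list[str]:
    """Keep runs of Arabic letters; punctuation, digits and Latin text are dropped."""
    return re.findall(r"[\u0621-\u064A]+", text)


def preprocess(text: str, remove_stop_words: bool = True) -> list[str]:
    tokens = tokenize(normalize(text))
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


if __name__ == "__main__":
    print(preprocess("هذا الفيلمُ جميلٌ جداً"))
```

The `remove_stop_words` flag mirrors the idea of varying the preprocessing configuration between experiments; a stemming step could be toggled in the same way.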
Conference Paper
Diacritization of Arabic text is both an interesting and a challenging problem, with various applications ranging from speech synthesis to helping students learn the Arabic language. As with many other problems in Arabic language processing, the weak efforts invested in it and the lack of available (open-source) resources hinder progress towards a solution. This work provides a critical review of the currently existing systems, measures and resources for Arabic text diacritization. Moreover, it introduces a much-needed free-for-all cleaned dataset that can easily be used to benchmark any work on Arabic diacritization. Extracted from the Tashkeela Corpus, the dataset consists of 55K lines containing about 2.3M words. After constructing the dataset, existing tools and systems are tested on it. The results of the experiments show that the neural Shakkala system significantly outperforms traditional rule-based approaches and other closed-source tools, with a Diacritic Error Rate (DER) of 2.88% compared with 13.78%, which is the best DER for a non-neural approach (obtained by the Mishkal tool).
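As a rough illustration of the metric quoted above, the sketch below computes a character-level Diacritic Error Rate. It assumes one simple definition (the share of non-space characters whose diacritics differ from the reference); published evaluations vary in details such as whether undiacritized letters or case endings are counted, so this is not necessarily the exact measure used in the benchmark. The function names are hypothetical.

```python
# Fathatan .. sukun (U+064B..U+0652), including the shadda.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")


def split_diacritics(text):
    """Return a list of (character, diacritics) pairs, one per base character."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference: str, prediction: str) -> float:
    """Fraction of non-space characters whose diacritics differ (assumed definition)."""
    ref = [(c, m) for c, m in split_diacritics(reference) if not c.isspace()]
    hyp = [(c, m) for c, m in split_diacritics(prediction) if not c.isspace()]
    if [c for c, _ in ref] != [c for c, _ in hyp]:
        raise ValueError("reference and prediction must share the same base characters")
    if not ref:
        return 0.0
    # sorted() makes the comparison insensitive to the order of stacked marks,
    # e.g. shadda+fatha versus fatha+shadda.
    errors = sum(1 for (_, r), (_, h) in zip(ref, hyp) if sorted(r) != sorted(h))
    return errors / len(ref)


if __name__ == "__main__":
    reference = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba, fully diacritized
    prediction = "\u0643\u064E\u062A\u0650\u0628\u064E"  # wrong vowel on the second letter
    print(diacritic_error_rate(reference, prediction))
```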
Article
There is a tremendous number of Arabic text documents available online, and it grows every day, so categorizing these documents has become very important. This paper proposes an approach to enhance the accuracy of Arabic text categorization. It is based on a new feature representation technique that uses a mixture of a bag of words (BOW) and pairs of adjacent words in different proportions. It also introduces a new feature selection technique that depends on Term Frequency (TF), and uses the Frequency Ratio Accumulation Method (FRAM) as a classifier. Experiments are performed without normalization and stemming, with only one of them, and with both. In addition, three data sets covering different categories were collected from online Arabic documents to evaluate the proposed approach. The highest accuracy obtained is 98.61%, achieved with normalization.
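The combination of single words with adjacent-word pairs, scored by accumulated frequency ratios, can be sketched as below. This is a simplified reading of FRAM (each feature contributes its per-category share of occurrences, and the category with the largest accumulated share wins); it leaves out the paper's TF-based feature selection and the mixing proportions, which the abstract does not specify. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def extract_features(tokens):
    """Single words (bag of words) plus pairs of adjacent words."""
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + pairs


def train_fram(labelled_docs):
    """labelled_docs: iterable of (tokens, category).

    Returns a mapping feature -> {category: share of that feature's occurrences}.
    """
    counts = defaultdict(Counter)
    for tokens, category in labelled_docs:
        for feature in extract_features(tokens):
            counts[feature][category] += 1
    ratios = {}
    for feature, per_category in counts.items():
        total = sum(per_category.values())
        ratios[feature] = {c: n / total for c, n in per_category.items()}
    return ratios


def classify(tokens, ratios):
    """Accumulate the ratios of all known features and pick the best category."""
    scores = Counter()
    for feature in extract_features(tokens):
        for category, ratio in ratios.get(feature, {}).items():
            scores[category] += ratio
    return scores.most_common(1)[0][0] if scores else None


if __name__ == "__main__":
    training = [
        (["كرة", "القدم", "مباراة"], "sport"),
        (["سوق", "الأسهم", "ارتفع"], "economy"),
    ]
    model = train_fram(training)
    print(classify(["مباراة", "كرة", "القدم"], model))
```

Normalization or stemming, when enabled, would be applied to the tokens before `extract_features`, which is how the experimental conditions in the abstract could be varied.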
Chapter
The volume of Arabic information is increasing rapidly nowadays, and thus access to the correct information is arguably one of the most difficult research problems facing readers and researchers. Text summarisation systems are used to produce a short text describing the significant portions of the original text. They do so by selecting the most important sentences in several steps: preprocessing, stemming, scoring, and summary extraction. Nevertheless, summarisation systems for the Arabic language are still in their infancy. Therefore, this paper proposes an automatic Arabic text summarisation system, entitled Wajeez, that introduces a new inclusive scoring formula and generates a final summary from several top-ranking sentences. Wajeez was applied to two different datasets, the Essex Arabic Summaries Corpus (EASC) and a manual summary, and its performance was assessed using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) set of metrics. In comparison to two other competing systems, Wajeez performed comparatively well when a title was provided to support summarisation.
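For readers unfamiliar with ROUGE, the simplest member of the family, ROUGE-1, measures unigram overlap between a candidate summary and a reference summary. The sketch below is a minimal illustration on pre-tokenized input; it omits the stemming and stop-word options of the official toolkit, and the function name `rouge_1` is an assumption.

```python
from collections import Counter


def rouge_1(reference_tokens, candidate_tokens):
    """Unigram-overlap recall, precision and F1 between two token lists."""
    ref = Counter(reference_tokens)
    cand = Counter(candidate_tokens)
    # Counter intersection clips each word's matches to its count in both texts.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    reference = "a b c d".split()
    candidate = "a b e".split()
    print(rouge_1(reference, candidate))
```

Recall is the component that gives ROUGE its name: it asks how much of the reference summary is covered by the system output.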
Conference Paper
The present research explored the impact of the aural feedback generated by text-to-speech (TTS) on Arabic language learners. A web-based dictation platform was developed for this purpose. It enables users to have their text read out loud as they write. This aural feedback enables them to detect their mistakes themselves, make the corrections they perceive as necessary, and listen again. A study was carried out in two different primary schools to benchmark the effects of the tool on pupils from the third, fourth and fifth grades. Overall, the test revealed a significant improvement in the pupils' writing skills; the participants were able to locate and correct a high percentage of their mistakes with the help of speech synthesis vocalization. In addition, to assess the extent to which the tool remains effective when applied to different categories of people, a longitudinal study is now being launched with illiterate people and students with visual disabilities. Thus far, the first results are surprisingly promising. Details of the results will be published in another paper.
Article
Associative classification (AC) integrates the task of mining association rules with the classification task to increase the efficiency of the classification process. AC algorithms produce accurate classification and generate easy-to-understand rules. However, AC algorithms suffer from two drawbacks: the large number of classification rules, and the use of different pruning methods that may remove information vital to reaching the right decision. In this paper, a new hybrid AC algorithm (HAC) is proposed. HAC applies the power of the Naïve Bayes (NB) algorithm to reduce the number of classification rules and to produce several rules that represent each attribute value. Two experiments are conducted on an Arabic textual dataset and the standard Reuters-21578 datasets using six different algorithms, namely J48, NB, classification based on associations (CBA), multi-class classification based on association rules (MCAR), expert multi-class classification based on association rules (EMCAR), and fast associative classification algorithm (FACA). The results of the experiments showed that the HAC approach produced higher classification accuracy than MCAR, CBA, EMCAR, FACA, J48 and NB, with gains of 3.95%, 6.58%, 3.48%, 1.18%, 5.37% and 8.05% respectively. Furthermore, on the Reuters-21578 datasets, the results indicated that the HAC algorithm has an excellent and stable performance in terms of classification accuracy and F-measure.