Thesis (PDF available)

Towards An Open Platform For Arabic Language Processing

Authors: Taha Zerrouki

Abstract

Arabic is a widely used language across the world, but the lack of linguistic tools and resources keeps it an under-resourced language, which hinders development and research. Many studies have been carried out on Arabic for academic and experimental purposes, but they were never really adapted to the development process or to end-user usage, and they cannot be integrated into existing systems. The problem under investigation is to develop tools and resources that are open source, multipurpose, and usable by researchers, developers and end-users. Our solution, named Adawat, comprises many applications, APIs and corpora, such as: a light stemmer, a verb conjugator, a morphology analyzer, a spell checker, a text-to-speech system, the Mishkal diacritizer, a vocalized texts corpus, a synonyms dictionary, collocations, etc. We mainly use a rule-based approach to build rules and data. These tools are developed to be integrated with existing systems such as the Hunspell spell checker, used by millions of users in Firefox and LibreOffice, and the eSpeak text-to-speech engine. The availability of our tools and resources has a high impact on new research, which mainly uses the Tashkeela corpus and the Tashaphyne stemmer.
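As a rough illustration of how these tools can be used from code, the sketch below combines two of the open-source packages named above, pyarabic and Tashaphyne, to strip the diacritics from a vocalized word and extract its stem and root. It is a minimal sketch based on the packages' commonly documented entry points (pyarabic.araby.strip_tashkeel and tashaphyne.stemming.ArabicLightStemmer); exact method names may differ between versions.

```python
# Minimal sketch: strip diacritics and light-stem a single Arabic word.
# Assumes `pip install pyarabic tashaphyne`; method names follow the commonly
# documented usage and may vary between package versions.
from pyarabic import araby
from tashaphyne.stemming import ArabicLightStemmer

word = "بِالْكُتُبِ"                   # a vocalized Arabic word ("with the books")

bare = araby.strip_tashkeel(word)      # remove diacritics (tashkeel)

stemmer = ArabicLightStemmer()
stemmer.light_stem(bare)               # segment affixes and stem
print("word :", bare)
print("stem :", stemmer.get_stem())    # stem after stripping prefixes/suffixes
print("root :", stemmer.get_root())    # approximate root extraction
```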
... In the rich affixation system of the Arabic language, the same Arabic word can be joined to various affixes and clitics to generate new vocabulary, which makes synonyms of Arabic words widespread. As a result, Arabic is considered a highly inflectional and derivational language, which makes the problem of ambiguity one of the biggest challenges in Arabic NLP compared to many other languages (Khalatia & Al-Romanyb, 2020; Khalil & Osman, 2014; Shaalan et al., 2018; Zerrouki, 2020). ...
... The order of syntactic relations could be Subject-Predicate-Object or Predicate-Subject-Object or Predicate-Object-Subject. These syntactic relation orders are all acceptable sentence structures (Khalatia & Al-Romanyb, 2020; Khalil & Osman, 2014; Maloney & Niv, 1998; Zerrouki, 2020). ...
... This subject could be a definite noun, proper noun, or pronoun in the nominative case, and the predicate is an indefinite nominative noun, proper noun, or adjective that agrees with the subject in number and gender. The predicate can be a prepositional phrase (Zerrouki, 2020). ...
Article
Full-text available
This survey explored the literature on Arabic NLP tasks and Arabic IE applications to analyze state-of-the-art trends, identify research gaps in these fields, and recommend solutions to fill those gaps. The study gathered relevant research articles in the targeted fields from academic search engines and academic databases, then surveyed them to extract information about research trends: the contributions achieved, the methodologies applied, and the technical and linguistic resources utilized. The review followed systematic review procedure steps to meet the requirements of high-quality survey studies. The collected and reviewed articles cover different research contributions, for instance morphological resolution in the field of Arabic NLP tasks and Sentiment Analysis (SA) applications in the field of Arabic IE applications. The findings of this study can be summarized as follows. First, most researchers in the field of Arabic NLP tasks prefer to contribute to NER and then to morphological resolution tasks, whereas in the field of Arabic IE they prefer to contribute to SA applications and then to Question Answering applications. Second, most of the reviewed articles applied methodologies, tools, techniques, and algorithms that are not specific to a particular language, such as Machine Learning, Artificial Neural Networks, and Deep Learning algorithms. Lastly, this study provides the first comprehensive assessment of the associations between dataset source domain types and dataset source ownership types, as well as the relation between an article's contribution field and the dataset ownership type. It confirms that, in the field of Arabic NLP tasks, the largest number of reviewed articles utilize existing and available dataset sources, specifically linguistic-domain dataset sources, whereas in the field of Arabic IE applications the largest number of reviewed articles are those whose authors collect and create the dataset sources themselves, also in linguistic-domain dataset sources.
... On the other hand, ideas as well as works for Arabic NLP platforms have already been introduced by Jaafar et al. [9], Zerrouki [10] and Sakhr Software. ...
... As described by the authors, SAFAR includes: 1) resources needed for different Arabic NLP services, 2) basic language-level modules, especially those of the Arabic language, namely morphology, syntax and semantics, and 3) applications for Arabic NLP such as Information Retrieval, Question/Answering, Named Entity Recognition, etc. SAFAR has been developed over several years by a whole team, which has integrated several modules and continues developing and adding custom features to its core. Zerrouki [10] introduces Adawat, a tool that integrates many applications, APIs and corpora such as: a light stemmer, verb conjugator, morphology analyzer, spell checker, text-to-speech system, the Mishkal diacritizer, a vocalized texts corpus, a synonyms dictionary, collocations, etc. In Adawat, Zerrouki uses a mainly rule-based approach to build rules and data. ...
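For concreteness, automatic diacritization with the Mishkal component mentioned above is typically driven through a single vocalizer object. The snippet below reflects Mishkal's commonly documented Python usage; treat the exact module path and method names as assumptions that may vary by release.

```python
# Sketch of Mishkal's documented Python entry point; assumes `pip install mishkal`.
# The module path and method names may differ between releases.
import mishkal.tashkeel

vocalizer = mishkal.tashkeel.TashkeelClass()
text = "تطلع الشمس صباحا"
print(vocalizer.tashkeel(text))  # returns the text with predicted diacritics
```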
Article
Full-text available
Cloud computing is more and more debated in the IT industry today, and its evolution is leading the next generation of internet services. Natural Language Processing (NLP) is a field of Artificial Intelligence that focuses on understanding, manipulating and generating human language by machines; NLP therefore sits at the interface between computer science and linguistics and concerns the ability of the machine to interact directly with humans. Arabic NLP is very poor compared to other languages such as English or German, owing to the complexity of the language and the lack of resources. In this work, we propose a new system for Arabic NLP based entirely on the cloud. The system works in two steps: it first uses a bridge between Arabic and another well-developed language (English, for instance) and then applies the features already developed for that language. Hence those features apply not to the Arabic text but to the translated (English) one. In some cases, the result needs to be in Arabic, in which case we use the bridge a second time to translate the English result into Arabic. This can be used either in real NLP systems, such as translation, IR, QA and sentiment analysis, or for validation and comparison purposes, especially by those who work in NLP with other approaches. Experiments were performed on a prototype we developed, and the results obtained are satisfactory for this first version.
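The two-step bridge described in this abstract reduces to a thin pipeline: translate the Arabic input into English, run an existing English NLP service, and optionally translate the result back. The sketch below only illustrates that control flow; translate_ar_en, translate_en_ar and english_sentiment are hypothetical placeholders standing in for whatever cloud translation and NLP services an implementation would actually call.

```python
# Illustration of the Arabic -> English "bridge" pipeline described above.
# The three helpers are hypothetical placeholders; a real system would call
# cloud translation / NLP services instead.

def translate_ar_en(text: str) -> str:
    """Placeholder for an Arabic-to-English machine translation call."""
    raise NotImplementedError("plug in a translation service here")

def translate_en_ar(text: str) -> str:
    """Placeholder for an English-to-Arabic machine translation call."""
    raise NotImplementedError("plug in a translation service here")

def english_sentiment(text: str) -> str:
    """Placeholder for an existing English sentiment analysis service."""
    raise NotImplementedError("plug in an English NLP service here")

def bridged_sentiment(arabic_text: str, result_in_arabic: bool = False) -> str:
    english_text = translate_ar_en(arabic_text)  # step 1: bridge to English
    result = english_sentiment(english_text)     # step 2: reuse English NLP
    if result_in_arabic:                         # optional: bridge back
        result = translate_en_ar(result)
    return result
```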
Preprint
Full-text available
Tashaphyne is a Python package that provides a comprehensive light stemmer and segmentor for the Arabic language. It stands out among other stemmers for its ability to perform stemming and root extraction simultaneously, unlike the Khoja stemmer, ISRI stemmer, Assem stemmer, and Farasa stemmer. Tashaphyne uses a modified finite state automaton that generates all possible segmentations, making it an extremely flexible tool for customizing stemmers without changing the code. Furthermore, Tashaphyne comes with default prefixes and suffixes, and allows for the use of customized lists to handle more complex aspects of stemming. Overall, Tashaphyne is an important contribution to the open-source community for Arabic language processing.
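Since the abstract stresses that Tashaphyne accepts customized affix lists without code changes, that customization might look like the sketch below. The setter names (set_prefix_list and set_suffix_list) reflect the package's documented API but should be checked against the installed version, and the affix lists themselves are arbitrary illustrative choices.

```python
# Sketch: customizing Tashaphyne's affix lists before stemming.
# Assumes `pip install tashaphyne`; setter names may vary between versions.
from tashaphyne.stemming import ArabicLightStemmer

stemmer = ArabicLightStemmer()

# Restrict segmentation to a small, task-specific set of affixes
# (hypothetical lists chosen only for illustration).
stemmer.set_prefix_list(["ال", "و", "ب", "ك", "ف"])
stemmer.set_suffix_list(["ة", "ات", "ين", "ون", "ها"])

stemmer.light_stem("والمدرسة")
print(stemmer.get_prefix(), stemmer.get_stem(), stemmer.get_suffix())
```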
Article
Full-text available
Sentiment analysis (SA), also known as opinion mining, is an increasingly important research area. Generally, it helps to automatically determine whether a text expresses a positive, negative or neutral sentiment, and it enables mining the huge and growing resources of shared opinions such as social networks, review sites and blogs. In fact, SA is used in many fields and for various languages such as English and Arabic. However, since Arabic is a highly inflectional and derivational language, it raises many challenges, and SA of Arabic text must handle this complex morphology. To better handle these challenges, we decided to provide the research community and Arabic users with a new efficient framework for Arabic Sentiment Analysis (ASA). Our primary goal is to improve the performance of ASA by exploiting deep learning while varying the preprocessing techniques. For that, we implement and evaluate two deep learning models, namely convolutional neural network (CNN) and long short-term memory (LSTM) models. The framework offers various preprocessing techniques for ASA (including stemming, normalisation, tokenization and stop-word removal). As a result of this work, we first provide a new rich and publicly available Arabic corpus called the Moroccan Sentiment Analysis Corpus (MSAC). Second, the proposed framework demonstrates improvement in ASA: the experimental results prove that deep learning models perform better for ASA than classical approaches (support vector machines, naive Bayes classifiers and maximum entropy). They also show the key role of morphological features in Arabic Natural Language Processing (NLP).
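To make the modelling side of such a framework concrete, the sketch below builds a generic LSTM sentiment classifier in Keras of the kind evaluated in this line of work. It is not the authors' implementation and does not use the MSAC corpus; the vocabulary size, sequence length and layer sizes are arbitrary illustrative values.

```python
# Generic LSTM sentiment classifier sketch (not the paper's implementation).
# Hyperparameters are arbitrary illustrative values.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000  # assumed vocabulary size after tokenization
MAX_LEN = 100       # assumed maximum sequence length

model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),      # word embeddings
    layers.LSTM(64),                        # sequence encoder
    layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would call model.fit(padded_sequences, labels, ...) on a
# preprocessed (stemmed, normalised, tokenized) Arabic corpus.
```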
Conference Paper
Full-text available
Diacritization of Arabic text is both an interesting and a challenging problem, with applications ranging from speech synthesis to helping students learn the Arabic language. As with many other problems in Arabic language processing, the limited effort invested in this problem and the lack of available (open-source) resources hinder progress towards solving it. This work provides a critical review of the currently existing systems, measures and resources for Arabic text diacritization. Moreover, it introduces a much-needed, freely available cleaned dataset that can easily be used to benchmark any work on Arabic diacritization. Extracted from the Tashkeela corpus, the dataset consists of 55K lines containing about 2.3M words. After constructing the dataset, existing tools and systems are tested on it. The results of the experiments show that the neural Shakkala system significantly outperforms traditional rule-based approaches and other closed-source tools, with a Diacritic Error Rate (DER) of 2.88% compared with 13.78%, which is the best DER for a non-neural approach (obtained by the Mishkal tool).
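The Diacritic Error Rate used in this comparison is essentially the fraction of letters whose predicted diacritics differ from the reference. A self-contained, simplified sketch of that computation is given below; it assumes the reference and prediction share the same undiacritized letters, and published DER variants differ in details such as whether case endings or undiacritized letters are counted.

```python
# Sketch: Diacritic Error Rate (DER) as the fraction of characters whose
# attached diacritics differ between a reference and a predicted text.
# Assumes both strings contain the same base characters in the same order.
HARAKAT = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")  # Arabic diacritic marks

def letter_diacritics(text):
    """Pair each non-diacritic character with the diacritics that follow it."""
    pairs = []
    for ch in text:
        if ch in HARAKAT and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def der(reference, predicted):
    ref, hyp = letter_diacritics(reference), letter_diacritics(predicted)
    errors = sum(1 for r, h in zip(ref, hyp) if r[1] != h[1])
    return errors / max(len(ref), 1)

print(der("ذَهَبَ الوَلَدُ", "ذَهَبُ الوَلَدُ"))  # one wrong diacritic
```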
Conference Paper
Full-text available
The present research explored the impact of the aural feedback generated by text-to-speech (TTS) on Arabic language learners. A web-based dictation platform was developed for this purpose; it enables users to have their text read out loud as they write. This aural feedback allows them to detect their mistakes, make the perceived corrections and listen again. A study was carried out in two different primary schools to benchmark the effects of the tool on pupils in the third, fourth and fifth grades. Overall, the test revealed a significant improvement in the pupils' writing skills; the participants were able to locate and correct a high percentage of their mistakes under speech-synthesis vocalization. Besides, in order to assess the extent to which the tool remains efficient when applied to different categories of users, a longitudinal study is now being launched with illiterate adults and students with visual disabilities. Thus far, the first results are very promising; details of the results will be published in another paper.
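The aural feedback loop described above ultimately amounts to sending the learner's current text to a speech synthesizer. A minimal sketch using the open-source eSpeak NG engine, which ships an Arabic voice, is shown below; the assumption that the espeak-ng binary is installed and on the PATH, and the chosen speaking rate, are environment-dependent.

```python
# Minimal sketch: read an Arabic string aloud with eSpeak NG.
# Assumes the `espeak-ng` binary is installed and provides an "ar" voice.
import subprocess

def speak_arabic(text: str, speed: int = 140) -> None:
    """Send text to eSpeak NG: -v selects the voice, -s the speaking rate."""
    subprocess.run(["espeak-ng", "-v", "ar", "-s", str(speed), text], check=True)

speak_arabic("كتب الطالب الدرس")
```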
Article
Full-text available
There is a tremendous number of Arabic text documents available online, and it grows every day; categorizing these documents therefore becomes very important. In this paper, an approach is proposed to enhance the accuracy of Arabic text categorization. It is based on a new feature representation technique that uses a mixture of a bag of words (BOW) and two adjacent words in different proportions. It also introduces a new feature selection technique that depends on Term Frequency (TF) and uses the Frequency Ratio Accumulation Method (FRAM) as a classifier. Experiments are performed without normalization and stemming, with one of them, and with both of them. In addition, three data sets of different categories have been collected from online Arabic documents to evaluate the proposed approach. The highest accuracy obtained is 98.61%, with the use of normalization.
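The mixture of BOW and two-adjacent-word features in different proportions can be mimicked with standard tooling by weighting a unigram vectorizer and a bigram vectorizer inside one feature union. The sketch below uses scikit-learn and substitutes a multinomial Naive Bayes classifier for the paper's FRAM classifier, so the weights, the classifier and the tiny toy corpus are illustrative assumptions only.

```python
# Sketch: weighted mixture of unigram (BOW) and bigram (two adjacent words)
# features for Arabic text categorization. MultinomialNB stands in for the
# paper's FRAM classifier; weights and toy data are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

features = FeatureUnion(
    [("bow", CountVectorizer(ngram_range=(1, 1))),
     ("bigrams", CountVectorizer(ngram_range=(2, 2)))],
    transformer_weights={"bow": 0.7, "bigrams": 0.3},  # the mixing "proportions"
)
model = Pipeline([("features", features), ("clf", MultinomialNB())])

docs = ["فاز الفريق في المباراة", "ارتفعت أسعار النفط اليوم"]
labels = ["sport", "economy"]
model.fit(docs, labels)
print(model.predict(["خسر الفريق المباراة"]))
```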
Chapter
The volume of Arabic information is increasing rapidly nowadays, and thus access to the correct content is arguably one of the most difficult research problems facing readers and researchers. Text summarisation systems are used to produce a short text describing the significant portions of an original text by selecting the most important sentences, following several steps: preprocessing, stemming, scoring, and summary extraction. Nevertheless, summarisation systems remain in their infancy for the Arabic language. Therefore, this paper proposes an automatic Arabic text summarisation system, entitled Wajeez, that introduces a new inclusive scoring formula which generates a final summary from the top-ranking sentences. Wajeez was applied on two different datasets, the Essex Arabic Summaries Corpus (EASC) and a manual summary, and its performance was assessed using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) set of metrics. In comparison to two other competing systems, Wajeez performed comparatively well when a title is provided to support summarisation.
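As a concrete illustration of the generic extract-by-scoring pipeline (preprocessing, scoring, summary extraction), the sketch below ranks sentences by the summed frequency of their words and keeps the top ones in their original order. It is a deliberately simple word-frequency scorer, not Wajeez's inclusive scoring formula.

```python
# Sketch: naive extractive summarizer scoring sentences by summed word
# frequency. This is NOT Wajeez's scoring formula, only a generic baseline.
import re
from collections import Counter

def summarize(text, num_sentences=2):
    # Very rough preprocessing: split on Arabic/Latin sentence delimiters.
    sentences = [s.strip() for s in re.split(r"[.!؟?\n]+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text))
    # Score each sentence by the total corpus frequency of its words.
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s)), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(sorted(scored, reverse=True)[:num_sentences], key=lambda t: t[1])
    return ". ".join(s for _, _, s in best)
```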
Article
Associative classification (AC) integrates the task of mining association rules with the classification task to increase the efficiency of the classification process. AC algorithms produce accurate classifications and generate easy-to-understand rules. However, AC algorithms suffer from two drawbacks: the large number of classification rules, and the use of pruning methods that may remove information vital to reaching the right decision. In this paper, a new hybrid AC algorithm (HAC) is proposed. HAC applies the power of the Naïve Bayes (NB) algorithm to reduce the number of classification rules and to produce several rules that represent each attribute value. Two experiments are conducted on an Arabic textual dataset and the standard Reuters-21578 datasets using six different algorithms, namely J48, NB, classification based on associations (CBA), multi-class classification based on association rules (MCAR), expert multi-class classification based on association rules (EMCAR), and the fast associative classification algorithm (FACA). The results of the experiments showed that the HAC approach produced higher classification accuracy than MCAR, CBA, EMCAR, FACA, J48 and NB, with gains of 3.95%, 6.58%, 3.48%, 1.18%, 5.37% and 8.05% respectively. Furthermore, on the Reuters-21578 datasets, the results indicated that the HAC algorithm has an excellent and stable performance in terms of classification accuracy and F-measure.
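As a toy illustration of the idea of letting Naive Bayes prune an associative classifier's rule set (this is not the paper's HAC algorithm), the sketch below mines one-token rules of the form token -> class from a tiny hypothetical corpus and keeps only the rules whose class agrees with a Naive Bayes model trained on the same data.

```python
# Toy illustration (NOT the paper's HAC algorithm): mine one-token association
# rules "token -> class" and keep only those on which Naive Bayes agrees,
# which shrinks the rule set. Corpus and labels are hypothetical.
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["فاز الفريق بالمباراة", "تعادل الفريق اليوم",
        "ارتفعت الأسعار اليوم", "انخفضت أسعار النفط"]
labels = ["sport", "sport", "economy", "economy"]

# Candidate rules: for each token, its most frequent class and the rule confidence.
token_class = defaultdict(Counter)
for doc, label in zip(docs, labels):
    for token in set(doc.split()):
        token_class[token][label] += 1

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(docs), labels)

rules = []
for token, counts in token_class.items():
    best_class, support = counts.most_common(1)[0]
    confidence = support / sum(counts.values())
    # Keep the rule only if Naive Bayes predicts the same class for the token alone.
    if nb.predict(vec.transform([token]))[0] == best_class:
        rules.append((token, best_class, round(confidence, 2)))

for rule in rules:
    print(rule)
```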