Mourad Abbas

Al-Tnall Al-Arabi

Research director

About

85
Publications
25,907
Reads
652
Citations
Introduction
My research interests cover the major issues in natural language processing, in particular for the Arabic language and its dialects. We are interested in machine translation, speech recognition, language identification, natural language understanding, and under-resourced languages. We usually use statistical and neural approaches, taking into account the specificities of the Arabic language.
Additional affiliations
September 2019 - February 2021
Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe
Position
  • Managing Director
Description
  • Director of the Research Center affiliated to the Ministry of Higher Education and Scientific Research.
January 2015 - April 2021
Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe
Position
  • Managing Director
June 2000 - December 2013
Independent Researcher
Position
  • Researcher

Publications (85)
Article
This paper presents a complete methodology for building a corpus of authentic SMS messages in Algerian dialect, transcribed in Latin characters or symbols. The linguistic material consists of 6000 SMS messages from the different geographical regions of Algeria (Middle, East and West) corresponding to 42 administrative and geographica...
Conference Paper
Full-text available
Many academics are becoming more interested in Spoken Arabic Dialect Identification. Nonetheless, most under-resourced languages suffer from a lack of data, such as the common Algerian dialect, which provides an intriguing case study. As a result, the purpose of this research is to compare the performance of two techniques for the automated identif...
Preprint
In this paper, we conduct a study to evaluate the performance of statistical and neural methods to classify Arabic Dialects (AD). This evaluation is based on two kinds of corpora. The first one is a corpus named PADIC (Parallel Arabic DIalectal Corpus), which is a multi-dialectal corpus composed of six dialects: two Algerian dialects (of Algiers an...
Conference Paper
In this study, we describe a solution for dealing with the problem of data scarcity in Speech Processing tasks involving low-resource languages, including Automatic Speech Recognition (ASR). This method is based on a set of Data Augmentation (DA) techniques that will be applied to the small corpus that was initially used. This corpus comprises the...
Chapter
This paper presents ALP, an entirely new linguistic pipeline for natural language processing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve common NLP operations of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. Based on this single component, we a...
Article
Full-text available
In this paper, we present a set of experiments aiming to improve the recognition of spoken digits for under-resourced dialects of the Maghrebi region, using a hybrid system. Indeed, integrating a Dialect Identification module into an Automatic Speech Recognition (ASR) system has shown its efficiency in previous works. In order to make the ASR syste...
Conference Paper
Full-text available
In this paper we present a set of experiments performing machine translation related to low-resourced Arabic dialects in addition to a zero-resourced dialect (Berber). For this, we extended the parallel PADIC corpus by adding the Berber dialect corpus and translating manually more than 6000 Arabic sentences. We applied both Rule-based Machine Trans...
Conference Paper
Full-text available
This paper addresses the problem of phone number recognition, taking into account some peculiarities of the dialectal Arabic used in the daily life of Algerian people, such as code-switching and accent variety. Accordingly, we have set up an ASR system designed to cope with these peculiarities, i.e. to recognize sequences of digits t...
Conference Paper
Full-text available
In this paper, we will investigate an empirical term selection method for text categorization, namely Transition Point (TP) technique, and we will compare it to two other widely used methods: Term Frequency (TF) and Document Frequency (DF). For evaluation, we have used the well-known TFIDF technique. Experiments have been conducted by using the Ara...
Book
ICNLSP is an opportunity and a forum for researchers, students, and industry practitioners to exchange ideas and discuss research and trends in the field of Natural Language Processing. Many topics were discussed through the works presented: speech recognition, machine translation, text summarization, sentiment analysis, natural lan...
Book
The workshop aims to draw the attention of researchers in the under resourced languages communities and encourage them to cooperate and intensify efforts to provide solutions and resources for such under resourced languages.
Conference Paper
Full-text available
This paper describes our approach to detecting Sentiment and Sarcasm for Arabic in the ArSarcasm 2021 shared task. Data preprocessing is crucial for successful learning, which is why we applied a set of preprocessing steps to the dataset before training two classifiers, namely Linear Support Vector Classifier (LSVC) and Bidirectional Long S...
Conference Paper
Full-text available
In this paper, we analyze the impact of the weighted concatenation of TF-IDF features on the Arabic Dialect Identification task, as part of our participation in the NADI2021 shared task. This study is performed for two subtasks: subtask 1.1 (country-level MSA) and subtask 1.2 (country-level DA) identification. The classifiers supporting our comparative st...
Article
The term natural language refers to any system of symbolic communication (spoken, signed, or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (N...
Conference Paper
This study presents an analysis of the influence of the right vowel context on the spectra of fourteen Arabic fricatives. The results were validated by recognizing these fricatives with a system based on a Neural Network (Multilayer Perceptron, MLP). The latter showed that the number of fricatives considered, the vowel context a...
Conference Paper
Full-text available
This paper describes our system developed for automatically classifying tweets that mention medications. We used the Decision Tree classifier for this task. We have shown that using some elementary preprocessing steps and TF-IDF n-grams led to acceptable classifier performance. Indeed, the F1-score recorded was 74.58% in the development phase and...
Conference Paper
Full-text available
In this paper, we present a description of our experiments on country-level Arabic dialect identification. A comparison study between a set of classifiers has been carried out. The best results were achieved using the Linear Support Vector Classification (LSVC) model by applying a Random Over Sampling (ROS) process yielding an F1-score of 18.74% in...
Preprint
Full-text available
The term natural language refers to any system of symbolic communication (spoken, signed or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NL...
Conference Paper
Full-text available
In this paper, we present a description of our experiments on Profiling Fake News Spreaders on Twitter based on TFIDF features and morphological processes such as stemming, lemmatization and part-of-speech tagging. A comparison study between a set of classifiers has been carried out. The best results were achieved using the model LSVC which yielded an f...
Conference Paper
In this paper, we describe a novel Combined Automatic Speech Recognition and Language Identification System, which uses both ASR and LI technologies to recognize spoken digits after identifying their language. An in-house corpus was used mainly for both speech-based multi-lingual identification and...
Chapter
In this work, we address the problem of identifying languages based on Voxforge speech corpus. We downloaded corpora for three languages: English, German and Persian from Voxforge. In addition, we recorded two additional corpora, the first one for Modern Standard Arabic (MSA) and the other one for Kabyl, one of the Algerian Berber dialects. To tack...
Chapter
As the most widely used approach to extend a Spoken Language Understanding (SLU) system from one language to another, Machine Translation achieves high performance for English domains, which is not the case for other languages, especially low-resourced ones such as Arabic and its dialects. To avoid the Machine Translation approach, which requires huge parallel corpora, we w...
Conference Paper
Full-text available
This paper describes our system and results on the NSURL 2019 Semantic Question Similarity in Arabic task. We considered the solution to this problem from three points of view, adopting three approaches: lexical, statistical and neural. For the lexical approach we applied a set of text similarity measures from the textdistance tools, where the b...
Book
The workshop aims to draw the attention of researchers in the under resourced languages communities and encourage them to cooperate and intensify efforts to provide solutions and resources for such under resourced languages.
Conference Paper
Full-text available
In this paper, we suggest the generalization of an Arabic Spoken Language Understanding (SLU) system in a multi-domain human-machine dialog. We are interested particularly in domain portability of SLU system related to both structured (DBMS) and unstructured data (Information Extraction), related to four domains. In this work, we used the thematic...
Conference Paper
Full-text available
In this paper, we present ArPod, a new Arabic speech corpus made of Arabic audio podcasts. We built this dataset, mainly for both speech-based multilingual and multi-dialectal identification tasks. It includes two languages: Modern Standard Arabic (MSA) and English, and four Arabic dialects: Saudi, Egyptian, Lebanese and Syrian. A set of supervised...
Chapter
This paper describes a system for classifying Arabic poems according to the eras in which they were written. We used machine learning techniques, applying a set of filters and classifiers. The best results were achieved by using the Multinomial Naïve Bayes (MNB) algorithm, with an accuracy equal to 70.21%, an F1-Score of 68.8% and a...
Preprint
Full-text available
These verses are by the author of al-Shudhur, the Arab chemist Ali ibn Musa ibn Khalaf, that is, Abu al-Hasan ibn Musa ibn Abi al-Qasim (515 - 593 AH / 1121 - 1197 CE). It is as if he anticipated the scientist Rutherford, who in 1911 proposed the model of atomic structure that resembles the solar system. Rutherford considered the atom to consist of a central positive nucleus and negative electrons circling around it in fixed orbits, just as the planets apparently revolve aro...
Conference Paper
Full-text available
This paper describes the solution that we propose for the MADAR 2019 Arabic Fine-Grained Dialect Identification task. The proposed solution utilized a set of classifiers that we trained on character and word features. These classifiers are: Support Vector Machines (SVM), Bernoulli Naive Bayes (BNB), Multinomial Naive Bayes (MNB), Logistic Regression (L...
Conference Paper
In this work, we address the problem of identifying languages based on Voxforge speech corpus. We downloaded corpora for three languages: English, German and Persian from Voxforge. In addition, we recorded two additional corpora, the first one for Modern Standard Arabic (MSA) and the other one for Kabyl, one of the Algerian Berber dialects. To tack...
Conference Paper
As the most widely used approach to extend a Spoken Language Understanding (SLU) system from one language to another, Machine Translation achieves high performance for English domains, which is not the case for other languages, especially low-resourced ones such as Arabic and its dialects. To avoid the Machine Translation approach, which requires huge parallel corpora, we w...
Preprint
This paper presents ALP, an entirely new linguistic pipeline for natural language processing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve common NLP operations of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. Based on this single component, we al...
Presentation
Full-text available
Lemmatization—computing the canonical forms of words in running text—is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to th...
Conference Paper
Full-text available
Lemmatization (computing the canonical forms of words in running text) is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to t...
Conference Paper
Full-text available
In this paper, we investigate a set of methods for textual Arabic Dialect Identification, where we considered word-level and sentence-level approaches. We used three classifiers, namely: Linear Support Vector Machine L-SVM, Bernoulli Naive Bayes BNB and Multinomial Naive Bayes MNB. Then we combined them by using a voting procedure. We carried out e...
Conference Paper
Full-text available
Research on Arabic Dialect Treatment has recently become important in the literature. Although most work on these dialects considers only the messages or the portion of text written in Arabic letters, another style of writing has emerged on social media. This style is known as Arabizi and combines Latin letters and numbers. To address this...
Conference Paper
Word error rates (WER) achieved for Modern Standard Arabic speech recognition systems are much lower than those performed for English. Two essential factors are behind this fact. The first one is the intentional omission of diacritics in the Arabic scripts, and the other one is the agglutinative nature of Arabic which is not taken into account in t...
Conference Paper
Packet loss is a major source of voice distortion in VoIP (Voice over IP). Packet loss may be due to several reasons: routing problems, transmission errors, and network congestion in VoIP communications. To mitigate the effect of these losses on voice quality, PLC (Packet Loss Concealment) mechanisms are introduced in the decoders to reconstruct th...
Article
Full-text available
This paper proposes packet loss concealment (PLC) techniques for recognition of speech coded with the G729 codec, which is widely used in VoIP networks. PLC at a receiver has a substantial effect on the performance of automatic speech recognition (ASR) systems in VoIP (Voice over IP). Many of the standard ITU-T CELP based speech coders, such as t...
Article
Full-text available
Lemmatization—computing the canonical forms of words in running text—is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to th...
Poster
Full-text available
Arabizi is the combination of Latin letters and numbers within the same word. Transliteration is the process of transforming text from one alphabet to another. In our case, it consists of transforming Arabizi into Arabic. Transliteration is considered the first component of automatic translation when users combine between diff...
Poster
Full-text available
Research on Arabic Dialect Treatment has recently become important in the literature. Although most work on these dialects considers only the messages or the portion of text written in Arabic letters, another style of writing has emerged on social media. This style is known as Arabizi and combines Latin letters and numbers. To address this...
Conference Paper
Full-text available
Machine transliteration is a very important research area in the field of machine translation. Neural Machine Transliteration (NMTR) is a new approach to machine transliteration that has shown promising results. However, research on NMTR of Arabic has just begun to give results while no research has been done on neural transliteration of Arabic dial...
Article
Full-text available
Arabic is the official language of all Arab countries; it is used for official speech, newspapers, public administration and school. In parallel, for everyday communication, non-official talks, songs and movies, Arab people use their dialects, which are inspired by Standard Arabic and differ from one Arab country to another. These linguistic...
Conference Paper
Full-text available
In this paper, we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation.
Conference Paper
Full-text available
In this paper, we present an Arabic multi-dialect study including dialects from both the Maghreb and the Middle East, which we compare to Modern Standard Arabic (MSA). Three dialects from the Maghreb are covered by this study, two from Algeria and one from Tunisia, along with two dialects from the Middle East (Syria and Palestine). The resources which have bee...
Conference Paper
Full-text available
The Algerian Arabic dialects are under-resourced languages, which lack both corpora and Natural Language Processing (NLP) tools, although they are increasingly used in written form, especially on social media and forums. We aim through this paper, and for the first time, to build parallel corpora for Algerian dialects, because our ultimate purpose...
Conference Paper
Full-text available
We aim to develop a speech translation system between Modern Standard Arabic and Algiers dialect. Such a system must include a Text-to-Speech module, which itself must include a grapheme-phoneme converter. Algiers dialect is an Arabic dialect affected by most of the problems of Modern Standard Arabic in the NLP area. Furthermore, it could be considered as...
Conference Paper
Full-text available
In this paper we present a statistical approach for automatic diacritization of Algiers dialectal texts. This approach is based on statistical machine translation. We first investigate this approach on Modern Standard Arabic (MSA) texts using several data sources and extrapolated the results on available dialectal texts. For evaluation we used word...
Article
Full-text available
Topic Identification is one of the important keys for the success of many applications. Indeed, there are few works in this field concerning Arabic language because of lack of standard corpora. In this study, we will provide directly comparable results of six text categorization methods on a new Arabic corpus Alwatan-2004. Hence, Topic Unigram Lang...
Article
Topic identification is used in several applications, such as adapting language models for speech recognition and machine translation, focusing on a specific use for search engines, etc. Topic identification consists of assigning one or several topic labels to a flow of textual data. Labels are chosen from a set of topics fixed a priori. In this paper, we...
Article
In this paper, we present a method of topic identification based on computing trigger pairs: TR-classifier (Triggers-based classifier). It is used to identify the topics of texts. The first step is the construction of a vocabulary for each topic. Topic vocabularies are composed of words ranked according to...
Article
Full-text available
This paper studies topic identification for the Arabic language using two methods. The first method is the well-known kNN (k Nearest Neighbors), which is used as a baseline. The second one is the TR-Classifier, mainly based on computing triggers. The experiments show that TR-Classifier has the advantage of giving the best performance compared to...
Conference Paper
Full-text available
Topic identification is based on topic training corpora, which represent the specificities of each topic. It consists in finding the topic(s) treated in a piece of text (paragraph, article,...), among a set of topics. In this paper, we present a new method of topic identification based on computing trigger pairs: TR-classifier (TRiggers-based classif...
Article
Full-text available
This paper can be downloaded from: https://sites.google.com/site/mouradabbas9/publications/publicat
Article
Full-text available
Topic identification studies have been carried out for different languages, but this is not the case for Modern Standard Arabic. This is why this paper focuses on topic identification for this language. Thus, we have used three statistical methods for this task: the TFIDF classifier, the SVM method and the Topic Unigram Language Model. The first st...
Conference Paper
Full-text available
In this paper we present two well-known categorization methods and their use in topic identification for Modern Standard Arabic. The first one is the TFIDF approach, and the second is a Support Vector Machines (SVM) based classifier. To the best of our knowledge, there are few previous works on Arabic topic identification, which is the ta...
Conference Paper
Full-text available
In this paper we present two well-known methods for topic identification. The first one is a TFIDF classifier approach, and the second one is a machine learning approach called Support Vector Machines (SVM). To our knowledge, there are few works on Arabic topic identification. We therefore decided to investigate in this article...
Article
Full-text available
The aim of this study is topic identification by using two methods, in this case, a new one that we have proposed: TR-classifier which is based on computing triggers, and the well-known k Nearest Neighbors. Performances are acceptable, particularly for TR-classifier, though we have used reduced sizes of vocabularies. For the TR-Classifier, each top...
