About
85
Publications
25,907
Reads
652
Citations
Introduction
My research interests include the major issues in natural language processing, in particular for the Arabic language and its dialects. We are interested in machine translation, speech recognition, language identification, natural language understanding, and under-resourced languages.
We usually use statistical and neural approaches, taking into consideration the specificities of the Arabic language.
Publications (85)
This paper presents a complete methodology for building a corpus of authentic SMS messages written in Algerian dialect and transcribed in Latin characters or symbols. Linguistic material comprising 6000 SMS messages from the different geographical regions of Algeria (Middle, East and West), corresponding to 42 administrative and geographica...
Many academics are becoming more interested in Spoken Arabic Dialect Identification. Nonetheless, most under-resourced languages suffer from a lack of data, such as the common Algerian dialect, which provides an intriguing case study. As a result, the purpose of this research is to compare the performance of two techniques for the automated identif...
In this paper, we conduct a study to evaluate the performance of statistical and neural methods to classify Arabic Dialects (AD). This evaluation is based on two kinds of corpora. The first one is a corpus named PADIC (Parallel Arabic DIalectal Corpus), which is a multi-dialectal corpus composed of six dialects: two Algerian dialects (of Algiers an...
In this study, we describe a solution for dealing with the problem of data scarcity in Speech Processing tasks involving low-resource languages, including Automatic Speech Recognition (ASR). This method is based on a set of Data Augmentation (DA) techniques that will be applied to the small corpus that was initially used. This corpus comprises the...
This paper presents ALP, an entirely new linguistic pipeline for natural language processing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve common NLP operations of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. Based on this single component, we a...
In this paper, we present a set of experiments aiming to improve the recognition of spoken digits for under-resourced dialects of the Maghrebi region, using a hybrid system. Indeed, integrating a Dialect Identification module into an Automatic Speech Recognition (ASR) system has shown its efficiency in previous works. In order to make the ASR syste...
In this paper, we present a set of machine translation experiments on low-resourced Arabic dialects in addition to a zero-resourced dialect (Berber). For this, we extended the parallel PADIC corpus by adding the Berber dialect corpus and manually translating more than 6000 Arabic sentences. We applied both Rule-based Machine Trans...
This paper addresses the problem of phone number recognition, taking into account some of the peculiarities of the dialectal Arabic used in the daily life of Algerian people, such as code-switching and accent variety. Accordingly, we have set up an ASR system designed to cope with these peculiarities, i.e. to recognize sequences of digits t...
In this paper, we will investigate an empirical term selection method for text categorization, namely Transition Point (TP) technique, and we will compare it to two other widely used methods: Term Frequency (TF) and Document Frequency (DF). For evaluation, we have used the well-known TFIDF technique. Experiments have been conducted by using the Ara...
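The two baseline criteria named in the abstract above, Term Frequency (TF) and Document Frequency (DF), can be illustrated with a minimal sketch. It assumes whitespace tokenization and a simple top-k cut-off; the function name `select_terms` and its parameters are hypothetical, and the sketch does not reproduce the paper's Transition Point technique or its experimental setup.

```python
from collections import Counter


def select_terms(documents, k, criterion="tf"):
    """Return the k highest-scoring terms under a TF or DF criterion.

    Hypothetical helper for illustration only: tokens are obtained by
    whitespace splitting, which is an assumption, not the paper's setup.
    """
    if criterion not in ("tf", "df"):
        raise ValueError("criterion must be 'tf' or 'df'")
    tf = Counter()  # total occurrences of each term in the corpus
    df = Counter()  # number of documents containing each term
    for doc in documents:
        tokens = doc.split()
        tf.update(tokens)
        df.update(set(tokens))
    scores = tf if criterion == "tf" else df
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


docs = ["a a a b", "b c", "b c"]
print(select_terms(docs, 2, "tf"))
print(select_terms(docs, 2, "df"))
```

In the toy corpus, the term `a` is frequent overall but appears in a single document, so the two criteria rank it differently.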
ICNLSP is an opportunity and a forum for researchers, students, and industry practitioners to exchange ideas and discuss research and trends in the field of Natural Language Processing. Indeed, many topics were discussed through the interesting works presented during the conference: speech recognition, machine translation, text summarization, sentiment analysis, natural lan...
The workshop aims to draw the attention of researchers in the under-resourced language communities and to encourage them to cooperate and intensify efforts to provide solutions and resources for such languages.
This paper describes our approach to detecting Sentiment and Sarcasm for Arabic in the ArSarcasm 2021 shared task. Data preprocessing is crucial for successful learning, which is why we applied a set of preprocessing steps to the dataset before training two classifiers, namely Linear Support Vector Classifier (LSVC) and Bidirectional Long S...
In this paper, we analyze the impact of the weighted concatenation of TF-IDF features on the Arabic Dialect Identification task, as part of our participation in the NADI2021 shared task. This study is performed for two subtasks: subtask 1.1 (country-level MSA) and subtask 1.2 (country-level DA) identification. The classifiers supporting our comparative st...
The term natural language refers to any system of symbolic communication (spoken, signed, or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (N...
This study presents an analysis of the influence of the right vowel context on the spectrum of fourteen Arabic fricatives. The results were validated through the recognition of these fricatives by a system based on a neural network (Multilayer Perceptron, MLP). The latter showed that the number of fricatives considered, the vowel context a...
This paper describes our system developed for automatically classifying tweets that mention medications. We used the Decision Tree classifier for this task. We have shown that using some elementary preprocessing steps and TF-IDF n-grams led to acceptable classifier performance. Indeed, the F1-score recorded was 74.58% in the development phase and...
In this paper, we present a description of our experiments on country-level Arabic dialect identification. A comparison study between a set of classifiers has been carried out. The best results were achieved using the Linear Support Vector Classification (LSVC) model by applying a Random Over Sampling (ROS) process yielding an F1-score of 18.74% in...
The term natural language refers to any system of symbolic communication (spoken, signed or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NL...
In this paper, we describe our experiments on Profiling Fake News Spreaders on Twitter based on TFIDF features and morphological processes such as stemming, lemmatization and part-of-speech tagging. A comparison study between a set of classifiers has been carried out. The best results were achieved using the LSVC model, which yielded an f...
In this paper, we describe a novel technique: a combined Automatic Speech Recognition (ASR) and Language Identification (LI) system that recognizes spoken digits after identifying their language. An in-house corpus was used mainly for both speech-based multi-lingual identification and...
In this work, we address the problem of identifying languages based on the Voxforge speech corpus. We downloaded corpora for three languages from Voxforge: English, German and Persian. In addition, we recorded two further corpora, the first one for Modern Standard Arabic (MSA) and the other one for Kabyl, one of the Algerian Berber dialects. To tack...
As the most widely used approach for extending a Spoken Language Understanding (SLU) system from one language to another, machine translation achieves high performance for English domains, which is not the case for other languages, especially low-resourced ones such as Arabic and its dialects. To avoid the machine translation approach, which requires huge parallel corpora, we w...
This paper describes our system and results on the NSURL 2019 Semantic Question Similarity in Arabic task. We considered the solution to this problem from three points of view, adopting three approaches: lexical, statistical and neural. For the lexical approach, we applied a set of text similarity measures from the textdistance tools, where the b...
In this paper, we suggest the generalization of an Arabic Spoken Language Understanding (SLU) system in a multi-domain human-machine dialog. We are particularly interested in the domain portability of an SLU system related to both structured (DBMS) and unstructured data (Information Extraction), for four domains. In this work, we used the thematic...
In this paper, we present ArPod, a new Arabic speech corpus made of Arabic audio podcasts. We built this dataset mainly for speech-based multilingual and multi-dialectal identification tasks. It includes two languages, Modern Standard Arabic (MSA) and English, and four Arabic dialects: Saudi, Egyptian, Lebanese and Syrian. A set of supervised...
This paper describes a system for classifying Arabic poems according to the eras in which they were written. We used machine learning techniques, applying a set of filters and classifiers. The best results were achieved by using the Multinomial Naïve Bayes (MNB) algorithm, with an accuracy equal to 70.21%, an F1-Score of 68.8% and a...
These verses are by the author of al-Shudhur, the Arab chemist Ali ibn Musa ibn Khalaf, that is, Abu al-Hasan ibn Musa ibn Abi al-Qasim (515-593 AH / 1121-1197 CE).
It is as though he anticipated the scientist Rutherford, who in 1911 proposed a model of atomic structure resembling the solar system. Rutherford considered the atom to consist of a central positive nucleus and negative electrons circling it in fixed orbits, just as the planets apparently revolve around...
This paper describes the solution that we propose for the MADAR 2019 Arabic Fine-Grained Dialect Identification task. The proposed solution uses a set of classifiers that we trained on character and word features. These classifiers are: Support Vector Machines (SVM), Bernoulli Naive Bayes (BNB), Multinomial Naive Bayes (MNB), Logistic Regression (L...
This paper presents ALP, an entirely new linguistic pipeline for natural language processing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve common NLP operations of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. Based on this single component, we al...
Lemmatization—computing the canonical forms of words in running text—is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to th...
Lemmatization, computing the canonical forms of words in running text, is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to t...
In this paper, we investigate a set of methods for textual Arabic Dialect Identification, considering word-level and sentence-level approaches. We used three classifiers, namely Linear Support Vector Machine (L-SVM), Bernoulli Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Then we combined them by using a voting procedure. We carried out e...
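As a rough illustration of combining three classifiers by voting, the sketch below applies hard majority voting to per-classifier label predictions. The abstract does not specify the voting scheme, so the majority rule, the tie-breaking policy, the function name `majority_vote` and the dialect labels are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists from several classifiers, one label per sample.

    `predictions` is a list of equally long label lists, one per classifier.
    Assumption: hard majority voting; ties go to the earliest classifier
    in the list that voted for one of the tied labels.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        winner = next(label for label in labels if counts[label] == top)
        combined.append(winner)
    return combined


# Hypothetical outputs of three classifiers on three sentences.
lsvm = ["ALG", "TUN", "SYR"]
bnb = ["ALG", "SYR", "SYR"]
mnb = ["TUN", "TUN", "PAL"]
print(majority_vote([lsvm, bnb, mnb]))
```

Placing the most trusted classifier first makes its label win whenever all three disagree, which is one simple way to resolve ties.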
Research on Arabic Dialect Treatment has recently become important in the literature. Although most work on these dialects considers only the messages or the portion of text written in Arabic letters, another style of writing has emerged on social media. This style is known as Arabizi and combines Latin letters and numbers. To address this...
Word error rates (WER) achieved by Modern Standard Arabic speech recognition systems are much higher than those achieved for English. Two essential factors are behind this fact. The first one is the intentional omission of diacritics in Arabic script, and the other one is the agglutinative nature of Arabic, which is not taken into account in t...
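For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under the assumption of whitespace tokenization (which is itself a simplification for Arabic); the function name `word_error_rate` is hypothetical and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref
    # and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, many insertions can push WER above 1.0.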
Packet loss is a major source of voice distortion in VoIP (Voice over IP). Packet loss may be due to several reasons: routing problems, transmission errors, and network congestion in VoIP communications. To mitigate the effect of these losses on voice quality, PLC (Packet Loss Concealment) mechanisms are introduced in the decoders to reconstruct th...
This paper proposes a Packet Loss Concealment (PLC) technique for the recognition of speech coded with the G729 codec, which is widely used in VoIP networks. PLC at the receiver has a substantial effect on the performance of automatic speech recognition (ASR) systems in VoIP (Voice over IP). Many of the standard ITU-T CELP based speech coders, such as t...
Arabizi is the combination of Latin letters and numbers within the same word. Transliteration is the process of transforming text from one alphabet to another. In our case, it consists of transforming Arabizi into Arabic. Transliteration is considered the first component of automatic translation when users combine diff...
Machine transliteration is a very important research area in the field of machine translation. Neural Machine Transliteration (NMTR) is a new approach to machine transliteration that has shown promising results. However, research on NMTR of Arabic has only just begun to give results, while no research has been done on neural transliteration of Arabic dial...
Arabic is the official language of all Arab countries; it is used for official speech, newspapers, public administration and school. In parallel, for everyday communication, non-official talks, songs and movies, Arab people use their dialects, which are inspired by Standard Arabic and differ from one Arab country to another. These linguistic...
In this paper, we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and we conduct experiments on cross-dialect Arabic machine translation.
In this paper, we present an Arabic multi-dialect study including dialects from both the Maghreb and the Middle East, which we compare to Modern Standard Arabic (MSA). Three dialects from the Maghreb are covered by this study, two from Algeria and one from Tunisia, along with two dialects from the Middle East (Syria and Palestine). The resources which have bee...
The Algerian Arabic dialects are under-resourced languages, which lack both corpora and Natural Language Processing (NLP) tools, although they are increasingly used in written form, especially on social media and forums. We aim through this paper, and for the first time, to build parallel corpora for Algerian dialects, because our ultimate purpose...
We aim to develop a speech translation system between Modern Standard Arabic and the Algiers dialect. Such a system must include a Text-to-Speech module, which itself must include a grapheme-phoneme converter. The Algiers dialect is an Arabic dialect affected by most of the problems of Modern Standard Arabic in the NLP area. Furthermore, it could be considered as...
In this paper, we present a statistical approach for the automatic diacritization of Algiers dialectal texts. This approach is based on statistical machine translation. We first investigate this approach on Modern Standard Arabic (MSA) texts using several data sources and extrapolate the results to available dialectal texts. For evaluation we used word...
Topic identification is one of the important keys to the success of many applications. However, there are few works in this field concerning the Arabic language because of the lack of standard corpora. In this study, we provide directly comparable results of six text categorization methods on a new Arabic corpus, Alwatan-2004. Hence, Topic Unigram Lang...
Topic identification is used in several applications, such as adapting language models for speech recognition and machine translation, focusing on a specific use for search engines, etc. Topic identification consists of assigning one or several topic labels to a flow of textual data. Labels are chosen from a set of topics fixed a priori. In this paper, we...
In this paper, we present a method of topic identification based on computing trigger pairs: the TR-classifier (Triggers-based classifier). It is used to identify the topics of texts. Hence, the first step is the construction of a vocabulary for each topic. Topic vocabularies are composed of words ranked according to...
This paper focuses on studying topic identification for the Arabic language by using two methods. The first method is the well-known kNN (k Nearest Neighbors), which is used as a baseline. The second one is the TR-Classifier, mainly based on computing triggers. The experiments show that the TR-Classifier has the advantage of giving better performance compared to...
Topic identification is based on topic training corpora, which represent the specificities of each topic. It consists in finding the topic(s) treated in a piece of text (paragraph, article,...), among a set of topics. In this paper, we present a new method of topic identification based on computing trigger pairs: TR-classifier (TRiggers-based classif...
This paper can be downloaded from:
https://sites.google.com/site/mouradabbas9/publications/publicat
Topic identification studies have been carried out for different languages, but this is not the case for Modern Standard Arabic. This is why this paper focuses on topic identification for this language. Thus, we have used three statistical methods for this task: the TFIDF classifier, the SVM method and the Topic Unigram Language Model. The first st...
In this paper, we present two well-known categorization methods and their use in topic identification for Modern Standard Arabic. The first one is the TFIDF approach, and the second is a Support Vector Machines (SVM) based classifier. To the best of our knowledge, there are few previous works on Arabic topic identification, which is the ta...
In this paper, we present two well-known methods for topic identification. The first one is a TFIDF classifier approach, and the second one is a machine-learning-based approach called Support Vector Machines (SVM). To our knowledge, there are few works on Arabic topic identification, so we decided to investigate in this article...
The aim of this study is topic identification using two methods: a new one that we have proposed, the TR-classifier, which is based on computing triggers, and the well-known k Nearest Neighbors. Performance is acceptable, particularly for the TR-classifier, even though we used reduced vocabulary sizes. For the TR-Classifier, each top...