An Algerian dialect: Study and Resources
Salima Harrat, Karima Meftouh, Mourad Abbas, Khaled-Walid Hidouci§ and Kamel Smaili
Ecole Supérieure d'Informatique (ESI), Algiers, Algeria
Badji Mokhtar University, Annaba, Algeria
CRSTDLA, Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe, Algiers, Algeria
§ Ecole Supérieure d'Informatique (ESI), Algiers, Algeria
Campus Scientifique LORIA, Nancy, France
Abstract—Arabic is the official language of all Arab countries; it is used for official speech, newspapers, public administration and school. In parallel, for everyday communication, informal talk, songs and movies, Arab people use their dialects, which derive from Standard Arabic and differ from one Arab country to another. This linguistic phenomenon is called diglossia, a situation in which two distinct varieties of a language are spoken within the same speech community. It is observed throughout the Arab countries: Standard Arabic is widely written but not used in everyday conversation, whereas dialect is widely spoken in everyday life but almost never written. Thus, in the NLP area, much work has been dedicated to written Arabic. In contrast, Arabic dialects were little studied until recently. The first work on these dialects began in the last decade with the Middle-East ones; the dialects of the Maghreb are just beginning to be studied. Compared to written Arabic, dialects are under-resourced languages that suffer from a lack of NLP resources despite their wide use. In this paper we deal with the Arabic dialect of Algeria, a non-resourced language for which no known resource is available to date. We present a first linguistic study introducing its most important features, and we describe the resources that we created from scratch for this dialect.
Keywords—Arabic dialect, Algerian dialect, Modern Standard Arabic, Grapheme to Phoneme Conversion, Morphological Analysis
I. INTRODUCTION
Under-resourced languages are languages that lack resources dedicated to natural language processing. These languages suffer from the unavailability of basic tools such as corpora, monolingual or multilingual dictionaries, and morphological and syntactic analyzers. This lack of resources makes working with these languages a great challenge, especially when dealing with unwritten languages like Arabic dialects. Compared to other under-resourced languages, Arabic dialects present the following additional difficulties:
• Since they are spoken languages, they are not written and there are no established rules for writing them. The same word may have many orthographic forms, all acceptable since there are no writing rules to serve as a reference.
• They are flexible at the grammatical and lexical levels, despite belonging to the Arabic language.
• Besides being different from Arabic, these dialects also differ from each other. For instance, the dialects of the Maghreb differ from those of the Middle East. They may also differ within the same country.
• These dialects are also widely influenced by other languages such as French, English, Spanish, Turkish and Berber.
In Algeria, as in all Arab countries, these dialects are used in everyday conversation. With the advent of the internet, however, they are increasingly used on social networks and forums. They are emerging on the web as a real language of communication because of the ease of communicating in dialect, especially for people with a low level of education. Unfortunately, basic NLP tools for these dialects are not available.
This work is the first part of the TORJMAN¹ project, a speech-to-speech translator between Algerian Arabic dialects and MSA. Unlike Middle-East Arabic dialects, Algerian Arabic dialects are non-resourced languages: they lack all kinds of NLP resources. Consequently, TORJMAN starts from scratch.
In this paper, we describe and extend the resource creation tasks for the Arabic dialect of Algeria that appeared in [1] and [2]. We focus on the Algiers dialect, the Arabic spoken in Algiers (the capital city of Algeria) and its periphery. We chose this dialect because it is the one we know and practice best, being native speakers of it. For convenience, we refer to the Algiers dialect as ALG in the rest of this paper.
This paper is organized as follows. Before dealing with the Algerian dialect, we give in Section II a brief overview of the Arabic language, and in Section III we present different aspects of ALG. The following sections are dedicated to the resources that we created: we detail how we built the first corpus of Algiers dialect (Section IV), then present the ALG grapheme-to-phoneme converter (Section V), which allowed us to obtain a phonetized corpus of Algiers dialect. In Section VI we describe how we created a morphological analyzer for ALG by adapting BAMA [3], the well-known analyzer for MSA. Finally, we conclude by summarizing the main ideas of this work and outlining our future work.
II. ARABIC LANGUAGE
Arabic is a Semitic language used by around 420 million people. It is the official language of about 22 countries. Arabic is a generic term covering three separate groups:
¹ TORJMAN is a national research project entirely financed by the Algerian research ministry; the name means translator or interpreter in English.
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 3, 2016
384 | P a g e
www.ijacsa.thesai.org
• Classical Arabic: principally defined as the Arabic used in the Qur'an and in the earliest literature from the Arabian peninsula, but it also forms the core of much literature up to the present day.
• Modern Standard Arabic: generally referred to as MSA (Alfus'ha in Arabic), it is the variety of Arabic that was retained as the official language in all Arab countries and as a common language. It is essentially a modern variant of Classical Arabic. Standard Arabic is not acquired as a mother tongue; rather, it is learned as a second language at school and through exposure to formal broadcast programs (such as the daily news), religious practice and newspapers [4].
• Arabic dialects: also called colloquial Arabic or vernaculars, these are the spoken varieties of the Arabic language. In contrast to Classical Arabic and MSA, they are not written. These dialects have mixed forms with many variations. They are influenced both by the ancient local tongues and by European languages such as French, Spanish, English and Italian.² Differences between these variants of spoken Arabic throughout the Arab world can be large enough to make them mutually incomprehensible. Hence, given the large differences between dialects, we can consider them as disparate languages depending on the geographical place in which they are practiced. Most of the literature therefore describes Arabic dialects from the viewpoint of an east-west dichotomy [5]:³
  ◦ Middle-East dialects: include the spoken Arabic of the Arabian peninsula (Gulf countries and Yemen), the Levantine dialect (Syria, Lebanon, Palestine and Jordan), the Iraqi dialect, and the Egyptian and Sudanese dialects.
  ◦ Maghreb dialects: spoken mostly in Algeria, Tunisia, Morocco, Libya and Mauritania. Note that Maltese, a form of Arabic dialect, is found mostly in Malta.
In the next section, we focus on a Maghreb dialect from Algeria, more specifically the dialect spoken in Algiers, the capital city of Algeria, and we highlight its main features in contrast to MSA.
III. SPECIFICITIES OF ALGIERS DIALECT
Algiers dialect (ALG) is the dialectal Arabic spoken in Algiers and its periphery. This dialect is different from the dialects spoken in other parts of Algeria. It is not used in schools, television or newspapers, which usually use Standard Arabic or French; it is more likely to be heard in songs, in Algerian homes and on the street. Algerian Arabic is spoken daily by the vast majority of Algerians [7].
Like the other Arabic dialects, ALG simplifies the morphological and syntactic rules of written Arabic. In [8], the author shows how much spoken Arabic differs from written Arabic at various language levels: phonological differences between Classical Arabic and spoken Arabic are moderate (compared to other language-dialect pairs), whereas grammatical differences are the most striking ones. At the lexical level, differences appear as variations in form and as differences in use and meaning.
² The influence of European languages is due to the fact that most of the Arab countries were European colonies during the 19th century.
³ Another classification is given in [6], where rural and Bedouin Arabic dialects are distinguished because of the ethnic and social diversity of Arabic speakers. The author states that Bedouin dialects tend to be more conservative and homogeneous, while urban dialects show more evolutive tendencies.
Indeed, at the phonological level, ALG (naturally) shares most of the features of Arabic. In addition to the 28 consonant phonemes of Arabic⁴ (given in Table I), the ALG consonantal system includes non-Arabic phonemes such as /g/, as in the word ڨاع (all), and the phonemes /p/ and /v/, used mainly in words borrowed from French, such as پومپة (adapted from the French word "pompe", a pump) and ڥاليزة (adapted from the French word "valise", a bag). It should also be noted that the phonemes (ظ) and (ذ) are very rare: most of the time ظ is pronounced /d‘/ (ض) and ذ is pronounced /d/ (د). The same holds for /T/ (ث), which is pronounced /t/ (ت). Note that the last two substitutions are also observed in the Jordanian dialect [9].
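TABLE I: Arabic phonemes using SAMPA⁵
Letter | Phoneme | Letter | Phoneme | Letter | Phoneme
أ | /?/ | ز | /z/ | ق | /q/
ب | /b/ | س | /s/ | ك | /k/
ت | /t/ | ش | /S/ | ل | /l/
ث | /T/ | ص | /s‘/ | م | /m/
ج | /Z/ | ض | /d‘/ | ن | /n/
ح | /x/ | ط | /t‘/ | ه | /h/
خ | /X/ | ظ | /D/ | و | /w/
د | /d/ | ع | /?‘/ | ي | /j/
ذ | /D/ | غ | /G/ | |
ر | /r/ | ف | /f/ | |
ـَ | /a/ | ـِ | /i/ | ـُ | /u/
ا | /a:/ | ي | /i:/ | و | /u:/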
Phonological features of ALG will be detailed further in
this paper (section V).
A. Vocabulary
The Algerian dialect has a vocabulary inspired by Arabic, but the original words have been altered phonologically, with significant Berber substrates and many new words and loanwords borrowed from French, Turkish and Spanish. Even though most of this vocabulary comes from MSA, there is significant variation in vocalization in most cases, and omission or modification of some letters in other cases (mainly the Hamza)⁶. The vocabulary of Algiers dialect includes verbs, nouns, pronouns and particles. A brief description of each category follows.
• Verbs
Some verbs in ALG adopt exactly the same scheme as MSA verbs, with the same vocalization, as in the case of the verbs سمّى (to name) and سلّم (to salute). Other verbs are pronounced differently from the corresponding MSA verbs by adopting different diacritic marks, as in the case of the verb شرب (to drink). Another set of dialect verbs is obtained by the omission or modification of some letters. Table II gives some examples of each of these cases.
⁴ Including three long vowels (ا ā, و w and ي y).
⁵ We use the Speech Assessment Methods Phonetic Alphabet for phoneme representation, http://www.phon.ucl.ac.uk/home/sampa/index.html.
⁶ The Hamza is a letter of the Arabic alphabet representing the glottal stop.
TABLE II: Examples of verb scheme differences between ALG and MSA.
ALG verb | Corresponding MSA verb | Meaning | Situation
سلّم | سلّم | To salute | Same scheme, same diacritic marks
قابل | قابل | To confront | Same scheme, same diacritic marks
شرب | شرب | To drink | Same scheme, different diacritic marks
كتب | كتب | To write | Same scheme, different diacritic marks
جا | جاء | To come | Letter omission or modification
بقى | بقي | To remain | Letter omission or modification
كلا | أكل | To eat | Letter omission or modification
كمّل | أكمل | To finish | Letter omission or modification
Another set of ALG verbs consists of those borrowed from foreign languages, especially French, such as شارجا, which corresponds to the French verb "charger" (to load), or ڨارا, a modification of the verb "garer", which means to park.
• Nouns
ALG nouns can be primitive (not derived from any verbal root) or derived from verbs, as is the case for verbal nouns and participles (active and passive); an example is given in Table III.
TABLE III: Example of ALG nouns derived from a verb.
Verb | Verbal noun | Active participle | Passive participle
باع | بيع | بايع | مبيوع
To sell | Sale | Seller | Sold
We should note that ALG nouns include a large proportion of French words. Most of them result from a wide phonological alteration of the original words, such as موتور ("moteur" in French, motor), لاطونسيون ("la tension", blood pressure) and پوليسي ("policier", policeman). Nouns also include numbers, which represent units, tens, hundreds, etc. From 1 to 10 the numbers are close to MSA (with a different vocalization), except for the numbers 0 and 2: the first is pronounced as in French, /zero/, and the second is زوج, whereas in MSA it is اثنين. From 11 to 19 the pronunciation in ALG differs from MSA: some letters and diacritics change, but the number is easily understood by an Arabic speaker. Numbers greater than 20 are also close to the MSA numbers; only the diacritic marks differ.
• Pronouns
The list of pronouns is a closed list; it contains demonstrative and personal pronouns. Algiers dialect has only one relative pronoun, اللي (that), which is used for feminine, masculine, singular and plural. Tables IV and V give all the pronouns used in ALG. It is important to note that the dual does not exist in ALG: there are no equivalents of the Arabic pronouns أنتما (second person, dual) and هما (third person, dual). Similarly, the feminine plural personal pronouns أنتن and هن, for the second and third person respectively, do not exist.
TABLE IV: Personal pronouns of Algiers dialect.
Person | Singular feminine | Singular masculine | Plural (feminine & masculine)
1st person | انا (I) | انا (I) | حنا (We)
2nd person | انتِ (You) | انتَ (You) | انتوما (You)
3rd person | هي (She) | هو (He) | هوما (They)
TABLE V: Demonstrative pronouns of Algiers dialect.
Singular feminine | Singular masculine | Plural (feminine & masculine)
هادي (This) | دا (This) | دو (These)
هاديك (That) | داك (That) | دوك (Those ones)
• Particles
Particles are used to situate facts or objects relative to time and place. They include different categories such as prepositions (في in, على on, بـ with), coordinating conjunctions (و and, اومبعد after) and quantifiers (كل, كلش, ڨاع all, شوية few).
B. Inflection
Like Arabic, Algiers dialect is an inflected language. Its words are modified to express grammatical categories such as tense, voice, person, number and gender. Depending on the word category, inflection is called conjugation when it applies to a verb, and declension when it applies to nouns, adjectives or pronouns. We present these linguistic aspects of Algiers dialect below.
1) Verb conjugation: As in MSA, verb conjugation in ALG is affected by person (first, second or third), number (singular or plural), gender (feminine or masculine), tense (past, present or future) and voice (active or passive). Like MSA, Algiers dialect uses the following forms:
• The past: Its forms are obtained by adding suffixes for number and gender to the verb root and by changing its diacritic marks (see Table VI for a sample).
TABLE VI: Conjugation of the verb كتب in the past tense.
Person | Pronoun | ALG | MSA | English
1st person | انا | كتبت | كتبتُ | I wrote
1st person | حنا | كتبنا | كتبنا | We wrote
2nd person | انتِ | كتبتي | كتبتِ | You wrote
2nd person | انتَ | كتبت | كتبتَ | You wrote
2nd person | انتوما | كتبتو | كتبتم | You wrote
3rd person | هي | كتبت | كتبتْ | She wrote
3rd person | هو | كتب | كتبَ | He wrote
3rd person | هوما | كتبو | كتبوا | They wrote
• The present and future: The present form of an ALG verb is obtained by affixation, using the prefixes نـ, تـ and يـ and the suffixes ي and و (Table VII). The verb may be preceded by the particle راه (in its inflected form⁷) to express a present continuous tense. The future is formed in the same way as the present (same prefixes and suffixes), but it must be marked by a preceding particle or expression that indicates the future, such as اومبعد (later), غدوا (tomorrow), next month, etc.
TABLE VII: Conjugation of the verb لعب in the present tense.
Person | Pronoun | ALG | MSA | English
1st person | انا | نلعب | ألعب | I play
1st person | حنا | نلعبو | نلعب | We play
2nd person | انتِ | تلعبي | تلعبين | You play
2nd person | انتَ | تلعب | تلعب | You play
2nd person | انتوما | تلعبو | تلعبون | You play
3rd person | هي | تلعب | تلعب | She plays
3rd person | هو | يلعب | يلعب | He plays
3rd person | هوما | يلعبو | يلعبون | They play
• The imperative: It expresses commands or requests and is used only for the second person. It is generally formed by adding the prefix ا and the suffixes ي and و to the verb.
TABLE VIII: Conjugation of the verb خرج in the imperative.
Pronoun | ALG | MSA | English
انتِ | اخرجي | اخرجي | Get out (you, singular, feminine)
انتَ | اخرج | اخرج | Get out (you, singular, masculine)
انتوما | اخرجو | اخرجوا | Get out (you, plural, feminine & masculine)
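The affixation patterns described for the present tense and the imperative can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' tool: the function names `present_alg` and `imperative_alg` and the person labels are hypothetical, the stems are unvocalized, and only regular verbs are covered.

```python
# Hypothetical sketch: ALG present and imperative forms built by plain
# affixation on an unvocalized stem (regular verbs only; diacritics ignored).
NUN, TA, YA, ALIF = "\u0646", "\u062A", "\u064A", "\u0627"  # ن ت ي ا
SUF_I, SUF_U = "\u064A", "\u0648"                          # suffixes ي and و


def present_alg(stem):
    """Return the ALG present-tense forms of `stem`, keyed by person."""
    return {
        "1sg": NUN + stem,
        "1pl": NUN + stem + SUF_U,
        "2sg.f": TA + stem + SUF_I,
        "2sg.m": TA + stem,
        "2pl": TA + stem + SUF_U,
        "3sg.f": TA + stem,
        "3sg.m": YA + stem,
        "3pl": YA + stem + SUF_U,
    }


def imperative_alg(stem):
    """Return the ALG imperative forms of `stem` (second person only)."""
    return {
        "2sg.f": ALIF + stem + SUF_I,
        "2sg.m": ALIF + stem,
        "2pl": ALIF + stem + SUF_U,
    }


if __name__ == "__main__":
    play = "\u0644\u0639\u0628"  # لعب (to play)
    for person, form in present_alg(play).items():
        print(person, form)
```

A future reading would use the same forms preceded by a time expression, which is why the sketch has no separate future function.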
2) Declension: Singular word declension in written Arabic involves three cases, the nominative, the genitive and the accusative, marked respectively by the short vowels ـُ, ـِ and ـَ attached to the end of the word. These three cases indicate the grammatical functions of words. It should also be noted that the doubled vowels (ـٌ, ـٍ, ـً), the tanween, are the case endings corresponding to the same three cases and express nominal indefiniteness. Like all Arabic dialects, ALG has dropped these case endings. The disappearance of final short vowels and the dropping of /h/ under certain conditions in many dialects of Arabic are very significant changes [10]. The same author states in [8]: "Classical Arabic has three cases in the noun marked by endings; colloquial dialects have none." Thus, a major feature of ALG is that, unlike written Arabic, it does not have the three-case declension of singular nouns and adjectives.
⁷ See Section III-B2.
For the plural of singular nouns, ALG has the same plural classes as MSA:
• Masculine regular plural: formed without modifying the word structure, by suffixing the singular word with ين. In written Arabic, by contrast, the masculine regular plural of a noun is obtained by adding the suffix ون (for the nominative) or ين (for both the accusative and the genitive), depending on the grammatical function of the word. For example, the masculine regular plural of the MSA word معلم (teacher) can be معلمون (nominative case) or معلمين (accusative or genitive). In contrast, the ALG word رايح (going), for instance, always takes رايحين as its regular plural, whatever its grammatical function.
• Feminine regular plural: obtained by adding the suffix ات to the word without changing its structure, as in MSA, with a single difference in case endings. In MSA, the feminine regular plural takes the case marks اتُ or اتٌ for the nominative and اتِ or اتٍ for the accusative and genitive, whereas ALG has only one case mark, the Sukun السكون (absence of vowel, whose symbol is ـْ). For example, the plural of the MSA word جميلة is جميلاتُ or جميلاتِ⁸, and the plural of the ALG word شابة is always شابات (both the MSA and the ALG words mean beautiful).
• Broken plural: an irregular form of plural that modifies the structure of the singular word. As in MSA, it follows different rules depending on the word pattern. Like singular words, the MSA broken plural takes the three case endings; in ALG it does not.
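The contrast between the two regular plural classes can be sketched as follows. This is a minimal illustration, not part of the authors' resources: the function names are hypothetical, the words are unvocalized, and the sketch assumes the usual convention that a final ة is dropped before the feminine suffix ات. Broken plurals are not covered.

```python
# Hypothetical sketch of regular plural formation in ALG versus MSA.
# Assumption: a final ta marbuta (ة) is dropped before the suffix ات.
TA_MARBUTA = "\u0629"           # ة
SUFFIX_IN = "\u064A\u0646"      # ين
SUFFIX_UN = "\u0648\u0646"      # ون
SUFFIX_AT = "\u0627\u062A"      # ات


def regular_plural_alg(word, feminine=False):
    """ALG: a single form per gender, whatever the grammatical function."""
    if feminine:
        stem = word[:-1] if word.endswith(TA_MARBUTA) else word
        return stem + SUFFIX_AT
    return word + SUFFIX_IN


def masculine_plural_msa(word, case):
    """MSA: the suffix depends on the case of the noun."""
    if case == "nominative":
        return word + SUFFIX_UN
    if case in ("accusative", "genitive"):
        return word + SUFFIX_IN
    raise ValueError("unknown case: " + case)


if __name__ == "__main__":
    farmer = "\u0641\u0644\u0627\u062D"  # فلاح
    print(regular_plural_alg(farmer))
    print(masculine_plural_msa(farmer, "nominative"))
```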
In Table IX we give an example of each ALG plural category.
TABLE IX: Examples of ALG plural forms.
Plural | ALG singular | ALG plural | MSA singular | MSA plural | English
Regular masculine | فلاح | فلاحين | فلاح | فلاحون / فلاحين | Farmer/Farmers
Case ending | No vowel | ين | ـُ, ـِ, ـَ, ـٌ, ـٍ, ـً | ون / ين |
Regular feminine | طبيبة | طبيبات | طبيبة | طبيباتُ / طبيباتِ | Doctor/Doctors
Case ending | No vowel | ات | ـُ, ـِ, ـَ, ـٌ, ـٍ, ـً | اتُ / اتِ, اتٌ / اتٍ |
Irregular | طير | طيور | طير | طيور | Bird/Birds
Irregular | يوم | أيام / أيامات | يوم | أيام | Day/Days
Case ending | No vowel | No vowel | ـُ, ـِ, ـَ, ـٌ, ـٍ, ـً | ـُ, ـِ, ـَ, ـٌ, ـٍ, ـً |
Another major difference between Algiers dialect and written Arabic is the absence of the dual (a kind of plural that designates two items). In MSA, for example, the dual of ولد (a boy) is ولدان (the word is suffixed with ان or ين depending on the case⁹). In ALG, the dual is generally obtained by the word زوج (two) followed by the plural (feminine or masculine) of the noun or adjective.¹⁰ For example, the dual of ولد is زوج ولاد (two boys).
⁸ جميلاتٌ or جميلاتٍ also.
⁹ ان for the nominative case and ين for both the accusative and the genitive.
C. Syntactic level
1) Declarative form: Word order in an ALG declarative sentence is relatively flexible. In common usage, ALG sentences may begin with the verb, the subject or even the object. The order reflects the importance the speaker gives to each of these entities; usually the sentence begins with the item the speaker wishes to highlight. Table X gives an example of different word orders for the same sentence. It should be noted that the first two forms (SVO, VSO) are the most used in everyday conversation.
TABLE X: Example of word order in an ALG declarative sentence.
Order | Dialect sentence | English
SVO | الولد راح للمسيد | The boy went to school
VSO | راح الولد للمسيد | The boy went to school
OVS | للمسيد راح الولد | The boy went to school
OSV | للمسيد الولد راح | The boy went to school
2) Interrogative form: In Algiers dialect, any sentence can be turned into a question in one of the following ways:
1) It may be uttered with an interrogative intonation, as in راح تقرا؟ (Will you revise?).
2) An interrogative pronoun or particle may be introduced, as in وين راح تقرا؟ (Where will you revise?).
We list in Table XI the most common interrogative particles and pronouns used in the dialect of Algiers. We note in particular the particle ياك, used in questions that accept a yes or no answer.
3) Negative form: The particles ما and ماشي are generally used to express negation. ما is used in both Algiers dialect and MSA, although the form of the negation differs between the two languages, whereas ماشي is specific to ALG. Using these particles, the negative form is obtained in different ways in ALG (Table XII gives examples labeled with the number of each case enumerated below):
¹⁰ An exception is made for words like عينين (two eyes), وذنين (two ears), ...
TABLE XI: Interrogative particles and pronouns in ALG and their equivalents in MSA.
ALG | MSA | English
شكون | من | Who
اش | أي | Which
وين | أين | Where
منين | من أين | From where
واش / واشن | ماذا | What
باش | بماذا | With what
فاش | في ماذا | In what
وقتاش | متى | When
وعلاش | لماذا | Why
كيفاش | كيف | How
شحال | كم | How many
• Negation with the particle ما
1) Adding the affixes ما and ش to conjugated verbs (ما as a prefix and ش as a suffix).
2) A particular case is the particle راه, which is equivalent to the verb to be in the present tense¹¹. The negation is obtained by adding the affixes ما and ش to the particle راه, possibly combined with a personal pronoun.
• Negation with the particle ماشي
3) The particle ماشي can be added at the beginning of a verbal declarative sentence without modifying the sentence.
4) The particle ماشي can be added at the beginning of a verbal declarative sentence by introducing the relative pronoun اللي.
5) In the case of a nominal sentence, ماشي can be added at the beginning of the sentence, with the order of its constituents reversed.
6) ماشي can also be added in the middle of a nominal sentence with no modification.
Table XII illustrates some examples of declarative sentences with their negations.
¹¹ We cannot consider this particle a verb because it cannot be conjugated in any other tense.
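Cases 1, 3 and 6 of the negation patterns can be sketched in Python as simple string operations. This is a rough illustration only: the function names are hypothetical, sentences are treated as whitespace-separated unvocalized words, and the remaining cases (the particle راه, the relative pronoun, constituent reversal) are left out.

```python
# Hypothetical sketch of three ALG negation patterns on unvocalized text.
MA = "\u0645\u0627"                 # ما
SHIN = "\u0634"                     # ش
MACHI = "\u0645\u0627\u0634\u064A"  # ماشي


def negate_verb(verb):
    """Case 1: ما before the conjugated verb and ش attached after it."""
    return MA + " " + verb + SHIN


def negate_verbal_sentence(sentence):
    """Case 3: ماشي placed in front of an unchanged verbal sentence."""
    return MACHI + " " + sentence


def negate_nominal_sentence(subject, predicate):
    """Case 6: ماشي inserted between subject and predicate."""
    return subject + " " + MACHI + " " + predicate


if __name__ == "__main__":
    print(negate_verb("\u0644\u0639\u0628\u062A"))  # لعبت (she played)
```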
TABLE XII: Declarative sentences with their negation.
Case | ALG | MSA | English
1 | لعبت | لعبت | She played
1 | ما لعبتش | لم تلعب / ما لعبت | She didn't play
2 | راهي مريضة | إنها مريضة | She is ill
2 | ما راهيش مريضة | ليست مريضة | She is not ill
3 | هوما كتبو | هم كتبوا | They wrote
3 | ماشي هوما كتبو | ليسوا هم من كتبوا | They are not those who wrote
4 | هوما كتبو | هم كتبوا | They wrote
4 | ماشي هوما اللي كتبو | ليسوا هم الذين كتبوا | They are not those who wrote
5 | الولد مريض | الولد مريض | The boy is ill
5 | ماشي مريض الولد | ليس الولد بمريض | The boy is not ill
6 | الولد مريض | الولد مريض | The boy is ill
6 | الولد ماشي مريض | الولد ليس مريضا | The boy is not ill
IV. CORPUS CREATION
As mentioned above, this work began from scratch: no resource of any kind was available for Algiers dialect. The foundation stone of the work was a corpus that we created by transcribing conversations recorded from everyday life and from some TV shows and movies. This transcription step required conventional writing rules to make the transcribed text homogeneous. Considering that ALG is an Arabic dialect, we adopted the following writing policy: when writing a word in Algiers dialect, we check whether there is an Arabic word close to this dialect word; if there is, we adopt the Arabic spelling for the dialect word, otherwise the word is written as it is pronounced.
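The writing policy can be pictured with the following Python sketch. It is an assumption-laden illustration, not the procedure used for the corpus, which was transcribed by hand: the function name `choose_spelling`, the use of string similarity from the standard `difflib` module as a stand-in for "close", and the 0.8 cutoff are all hypothetical choices.

```python
# Hypothetical sketch of the writing policy: prefer the spelling of a close
# MSA word when one exists, otherwise keep the word as it is pronounced.
# "Close" is approximated here by difflib string similarity (an assumption).
from difflib import get_close_matches


def choose_spelling(dialect_word, msa_lexicon, cutoff=0.8):
    """Return the closest MSA spelling, or the dialect word itself."""
    matches = get_close_matches(dialect_word, msa_lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else dialect_word


if __name__ == "__main__":
    # Toy lexicon in Latin transliteration, for readability only.
    lexicon = ["kataba", "qalam", "madrasa"]
    print(choose_spelling("katab", lexicon))   # a close MSA word exists
    print(choose_spelling("tabla", lexicon))   # no close MSA word
```

In practice the decision rests on a native speaker's judgment of lexical closeness, which a surface similarity score only approximates.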
The transcription step produced a corpus of 6400 sentences, which we then translated into MSA. We thus obtained a parallel corpus of 6400 aligned sentences. Table XIII gives information about the size of this corpus.
TABLE XIII: Parallel corpus description.
Corpus #Distinct words #Words
ALG 8966 38707
MSA 9131 40906
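Counts of this kind can be obtained with a few lines of Python. The sketch below is only an illustration and assumes plain whitespace tokenization; the paper does not state how words were delimited, and the function name `corpus_stats` is hypothetical.

```python
# Hypothetical sketch: number of distinct words and of running words in a
# corpus given as a list of sentences (whitespace tokenization assumed).
def corpus_stats(sentences):
    """Return (number of distinct words, total number of words)."""
    tokens = [token for sentence in sentences for token in sentence.split()]
    return len(set(tokens)), len(tokens)


if __name__ == "__main__":
    toy_corpus = ["a b a", "b c"]
    distinct, total = corpus_stats(toy_corpus)
    print(distinct, total)
```

With the figures of Table XIII, the ratio of distinct words to running words is roughly 0.23 for ALG (8966/38707) and 0.22 for MSA (9131/40906).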
It should be noted that all the tasks described above were done by hand. This was time-consuming, but the result is a clean parallel corpus. Furthermore, the ALG side of this corpus has been vocalized with our diacritizer described in [11] and used to develop the first NLP resources dedicated to an Algerian dialect (to our knowledge). The following sections of this paper describe these resources.
V. GRAPHEME-TO-PHONEME CONVERSION
As pointed out above, the general purpose of the TORJMAN project is a speech translation system between Modern Standard Arabic and Algiers dialect. Such a system must include a Text-to-Speech module, which requires a Grapheme-to-Phoneme converter. We therefore devoted our efforts to developing this converter using the vocalized ALG corpus described earlier.
Grapheme-to-Phoneme (G2P) conversion, or phonetic transcription, is the process that converts the written form of a word into its pronunciation. G2P conversion is not a simple matter, especially for non-transparent languages like English, where a phoneme may be represented by a letter or a group of letters and vice versa. Unlike English, Arabic is considered a transparent language: the relationship between grapheme and phoneme is one to one. This property, however, depends on the presence of diacritics. Lack of vocalization generates ambiguity at all levels (lexical, syntactic and semantic) and consequently at the phonetic level. For example, the phonetic transcription of the word كتب /ktb/ could be /kataba/, /kutiba/, /kutubun/, /kutubi/, /katbin/, etc. Algiers dialect obeys the same rule: without diacritics, grapheme-to-phoneme conversion is a difficult problem.
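The role of diacritics can be illustrated with a toy converter that maps each character of a vocalized word to one phoneme string. This is a minimal sketch covering only the three consonants of the example and the basic vowel marks; the table and the function name `g2p_vocalized` are illustrative assumptions, not the authors' converter.

```python
# Hypothetical sketch: one-to-one grapheme-to-phoneme mapping for a fully
# vocalized word. Only the letters needed for the كتب example are listed.
G2P_TABLE = {
    "\u0643": "k",   # ك
    "\u062A": "t",   # ت
    "\u0628": "b",   # ب
    "\u064E": "a",   # fatha
    "\u064F": "u",   # damma
    "\u0650": "i",   # kasra
    "\u0652": "",    # sukun: no vowel
    "\u064C": "un",  # tanween damma
    "\u064D": "in",  # tanween kasra
}


def g2p_vocalized(word):
    """Concatenate the phoneme of each character; unknown characters are skipped."""
    return "".join(G2P_TABLE.get(ch, "") for ch in word)


if __name__ == "__main__":
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"   # كَتَبَ
    bare = "\u0643\u062A\u0628"                       # كتب, no diacritics
    print(g2p_vocalized(kataba))
    print(g2p_vocalized(bare))
```

With the diacritics present each reading is recovered directly; the bare form only yields the consonant skeleton, which is the ambiguity described above.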
Most work on G2P conversion follows one of two approaches. The first is dictionary-based: a phonetized dictionary contains the correct pronunciation of each word of the language, and G2P conversion reduces to a lookup in this dictionary. The second is rule-based [12], [13], [14]: the conversion is done by applying phonetic rules, which are either deduced from phonological and phonetic studies of the language or learned from a phonetized corpus using a statistical approach based on significant quantities of data [15], [16]. For Algiers dialect, a non-resourced language, a dictionary-based G2P converter is not feasible since no large phonetized dictionary is available. The first intuitive approach, given the lack of resources, is a rule-based one, but the specificities of Algiers dialect (detailed in the next section) led us to a statistical approach in order to take into account all the features of this language.
A. Issues of G2P conversion for Algiers dialect
G2P conversion for Algiers dialect obeys the same rules as for MSA. ALG can be considered a transparent language, since the alignment between grapheme and phoneme is one to one when the input text is vocalized. Unfortunately, the situation is not that simple, because ALG contains many words borrowed from foreign languages, most of which have been phonologically altered and adapted to it. In particular, the vocabulary of this dialect contains many French words used in everyday conversation. Borrowed French words can be divided into two categories: the first includes phonologically altered French words, such as فاميلية ("famille" in French, family); the second includes words uttered as in French, such as سور ("sûr" in French, sure), pronounced /syR/ (/y/ is a French phoneme, not an Arabic one). This last category is a serious problem for G2P conversion, since these words do not obey Arabic pronunciation rules.
TABLE XIV: Examples of French words used in ALG.
Dialect word | Dialect phonetic transcription | French word | English
كوزينة | /ku:sina/ | Cuisine | Kitchen
طابلة | /t‘a:bla/ | Table | Table
كونيكسيون | /kOnnEksjO~/ | Connexion | Connection
دوڥيز | /d@viz/ | Devise | Currency
In the examples of Table XIV, although the first two words are French, they are phonetized as Arabic words: the French phoneme /t/ is replaced by the Arabic phoneme /t‘/ in the word "table". The last two words, on the other hand, are phonetized as French words, since Algiers dialect speakers pronounce them as in French. To take this category of words into account, French phonemes such as /E/, /O~/ and /@/ must be included among the Algiers dialect phonemes.
B. Rule based approach
As stated previously, the rule based approach to G2P
conversion applied to ALG requires a diacritized text, which is
why we used our vocalized ALG corpus. The diacritized text
is converted into its phonetic form by applying the following
rules. It should be noted that most of these rules are also
adopted for Arabic [12], [13] and are applicable only to
Arabic words and phonologically altered foreign words in our
corpus.
Let BS mark the beginning of a sentence, ES
mark the end of a sentence, and BL be a blank character;
let C be a consonant, V a vowel, LC a lunar consonant,
SC a solar consonant, and LV a long vowel. A sample
conversion rule can be written as follows:
LFT + GR + RGT = /PH/
The rule is read as follows: a grapheme GR having as left and
right contexts LFT and RGT respectively, is converted to the
phoneme PH. Left and right contexts could be a grapheme,
a word separator, the beginning or the end of a sentence or
empty.
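As an illustration of this notation, the following Python sketch encodes a rule as a (left context, grapheme, right context, phoneme) record and rewrites each symbol with the first rule that matches. The names `Rule` and `apply_rules` and the romanized toy input are hypothetical and are not taken from the authors' implementation; the single rule shown mirrors rule 11 of the list that follows (n followed by b is pronounced /m/).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

BS, ES = "BS", "ES"  # sentence-boundary markers, as in the notation above


@dataclass(frozen=True)
class Rule:
    """LFT + GR + RGT = /PH/ ; a context of None matches anything."""
    left: Optional[FrozenSet[str]]
    grapheme: str
    right: Optional[FrozenSet[str]]
    phoneme: str  # empty string stands for Null (grapheme not pronounced)


def apply_rules(symbols: List[str], rules: List[Rule]) -> List[str]:
    """Rewrite each symbol with the first rule whose contexts match."""
    out = []
    for i, g in enumerate(symbols):
        left = symbols[i - 1] if i > 0 else BS
        right = symbols[i + 1] if i + 1 < len(symbols) else ES
        for r in rules:
            if (r.grapheme == g
                    and (r.left is None or left in r.left)
                    and (r.right is None or right in r.right)):
                if r.phoneme:
                    out.append(r.phoneme)
                break
        else:
            out.append(g)  # no rule matched: keep the symbol unchanged
    return out


# Toy rule in romanized form (assumption): n + b -> /m/
RULES = [Rule(left=None, grapheme="n", right=frozenset({"b"}), phoneme="m")]
print("".join(apply_rules(list("manbar"), RULES)))
```

Keeping the contexts as sets lets one rule cover classes such as {C, V} or {BL, ES}; a real converter would also need an ordering policy when several rules match the same grapheme.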
We give in the following all the rules that we used for Algiers
dialect G2P (the representation of these rules according to the
sample above is given in the Appendix, Table XXIII).
1) ذ, ظ and ث rules
In Algiers dialect, the letters ذ, ظ and ث are
not used; in most cases they are pronounced as the
graphemes د, ض and ت, respectively.
2) Foreign letters rules
The Algiers dialect alphabet corresponds to the Arabic
alphabet extended with three foreign letters: G, V and P.
3) Definite article ال
• The ا of the definite article ال is not pronounced when
the article is followed by a lunar consonant (which does
not assimilate the ل).
Example: القمر (the moon) = /laqmar/
This rule is the same as in MSA, with the
difference that in MSA the ا is pronounced if
the definite article is at the beginning of the
sentence.
• When the definite article ال is followed by
a solar consonant, the ل is not pronounced
and the consonant following it is doubled
(gemination).
Example: السقف (the roof) = /?assqaf/
• When the definite article ال is preceded by
the long vowel ي and followed by a solar
consonant, the definite article is omitted and
the solar consonant is doubled (gemination).
Example: في الدار = فدّار = /fddAr/
4) Word case endings
The case ending of words in Algiers dialect is the sukun
(absence of a short vowel), so the last consonant of a
word is pronounced without any vowel.
Example: قبل (before) = /qbal/
5) Long vowel rules
When ا, و and ي appear in a word preceded by the
short vowels fatha, damma and kasra, respectively, their
corresponding long vowels are generated.
Examples:
كاس (a cup) = /ka:s/
فول (beans) = /fu:l/
كبير (big) = /kbi:r/
6) Glottal stop rule
In Algiers dialect, when a word begins with a hamza,
its phonetic representation begins with a glottal stop.
At the end of a word, a hamza preceded by ا is not
pronounced.
Example: اسكت (stop talking) = /?askut/ and سماء (sky) = /sm?/
It should be noted that a hamza in the middle of
a word is replaced by the long vowels ا or ي in
Algiers dialect. For example, the Arabic words بئر
(well) and فأس (axe) correspond to /bi:r/ and /fa:s/,
respectively.
7) Alif maqsura rule
Alif maqsura ى (which is always preceded by a fatha)
at the end of a word is realized as the short vowel
/a/.
Example: رمى (he throws) = /rmaa/
8) Alif madda
Alif madda آ is realized as a glottal stop /?/ followed by the long vowel
/a:/.
Example: آمن (he trusts) = /?a:man/
9) Words ending with ة
The ة is not pronounced in Algiers dialect, unlike in
MSA, where it is realized as /t/ or /h/
(depending on the word position).
Example: طفلة (a girl) = /t’afla/
10) Words ending with ه
The ه is not pronounced in Algiers dialect when it is
preceded by a damma.
Example: كتابه (his book) = /kta:bu/
11) Words containing the sequence ن, ب
When a ن is followed by a ب, the ن is pronounced
as /m/.
Example: منبر (a pulpit) = /mambar/
12) Gemination rule
When the shadda appears on a consonant, this consonant
is doubled (geminated).
Example: سكّر (sugar) = /sukkur/
It should be noted that most of these rules could be
applied to other Algerian dialects and to Arabic dialects
close to them, such as Tunisian and Moroccan.
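To make rules 5 and 12 concrete, here is a minimal Python sketch that phonetizes a vocalized word code point by code point. It is an illustration under stated assumptions, not the authors' converter: the consonant map `CONS` is deliberately partial, the function name `phonetize` is hypothetical, and the shadda is assumed to follow its consonant directly and to precede any short vowel. Words are written with Unicode escapes to avoid bidirectional rendering issues.

```python
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SUKUN, SHADDA = "\u0652", "\u0651"
ALIF, WAW, YA = "\u0627", "\u0648", "\u064A"

SHORT = {FATHA: "a", DAMMA: "u", KASRA: "i"}
# Rule 5: short vowel followed by its matching letter gives a long vowel
LONG = {(FATHA, ALIF): "a:", (DAMMA, WAW): "u:", (KASRA, YA): "i:"}
# Partial consonant map covering only the examples below (assumption)
CONS = {"\u0643": "k", "\u0633": "s", "\u0641": "f", "\u0644": "l",
        "\u0628": "b", "\u0631": "r", WAW: "w", YA: "j"}


def phonetize(word: str) -> str:
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        after = word[i + 2] if i + 2 < len(word) else ""
        if ch in SHORT:
            # The letter counts as a long vowel only if it carries no mark itself
            if (ch, nxt) in LONG and after not in SHORT and after != SHADDA:
                out.append(LONG[(ch, nxt)])
                i += 2
                continue
            out.append(SHORT[ch])
        elif ch == SHADDA:
            if out:
                out.append(out[-1])  # rule 12: double the previous consonant
        elif ch != SUKUN:            # sukun produces no phoneme
            out.append(CONS.get(ch, ch))
        i += 1
    return "".join(out)


print(phonetize("\u0641\u064F\u0648\u0644"))              # fa + damma + waw + lam
print(phonetize("\u0633\u064F\u0643\u0651\u064F\u0631"))  # sin, damma, kaf, shadda, damma, ra
```

With these inputs the sketch reproduces the transcriptions /fu:l/ and /sukkur/ given in the rule list; context-dependent rules such as the definite article would need the left and right contexts described earlier.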
Experiment: As indicated above, for the experiment we used
our vocalized ALG corpus, which includes three categories of
words:
1) Arabic words.
2) Phonologically altered French words whose
pronunciation is realized with Arabic phonemes.
3) French words whose pronunciation is realized
with French phonemes.
We applied the phonetization rules described above to the ALG corpus.
In addition to Arabic words, French words of the second
category are correctly phonetized because their phonetic
realization is close to Algiers dialect. For example, the word كوزينة
(kitchen, original French word cuisine), which is a phonologically
altered French borrowing, is correctly converted to
/ku:zina/, while a word of the third category such as كونكسيون
(connection, original French word connexion) is incorrectly
converted to /ku:niksju:n/, since it is actually realized as /kOnnEksj˜O/ with
French phonemes. When these words are taken into account, system accuracy
is 92%. The issue with these words is that we cannot introduce
rules for French words written in Arabic script, since the
relation between Arabic graphemes and French phonemes is
not one to one. For example, the grapheme و in a French
word written in Arabic script could correspond to the French
phonemes /y/, /u/, /O/ or /O/ (see some examples in Table XV).
TABLE XV: Examples of mappings between the Arabic grapheme و and French phonemes.
Dialect word | French phonetic transcription | French word | English
سور | /syʁ/ | Sûre | Sure
پور | /pOʁ/ | Port | Port
سودور | /sudœʁ/ | Soudeur | Welder
C. Statistical Approach
The rule based approach adopted above does not take into
account French words used in ALG that are pronounced as
in French. This issue led us to choose a statistical
approach in order to handle this feature. We use a statistical
machine translation system where the source language is a text
(a sequence of graphemes) and the target language is its phonetic
representation (a sequence of phonemes). This system uses the Moses
package [17], GIZA++ [18] for alignment and SRILM [19] for
language model training. The main motivation for using a
statistical approach is that we can include French phonemes in the
training data. The first component needed to build this system is a
parallel corpus including a text and its phonetic representation.
Since this resource was not available, we created it by
using the rule based converter described above. We proceeded as
follows: we used the rule based system to convert Arabic words
and phonologically altered French words (categories 1 and 2)
to Arabic phonemes. For French words realized with
French phonemes (category 3), we began by identifying them
and transliterating them to their original form in Latin script,
then converted them to French phonemes (using a free French
G2P converter); all these operations were done by hand. For
example, the word كونكسيون is transliterated to connexion, then
converted to /kOnnEksj˜O/.
This system operates at the grapheme and phoneme level:
we split the parallel corpus into individual graphemes and
phonemes, including a special character as word separator in
order to restore the words after the conversion process (see Table
XVI).
TABLE XVI: Examples of aligned graphemes and phonemes (listed in reading order).
Grapheme | ن | fatha | س | sukun | ك | damma | ت | sukun
Phoneme | /n/ | /a/ | /s/ | Null | /k/ | /u/ | /t/ | Null
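The symbol-level preparation described for Table XVI can be sketched as follows. This is a minimal illustration, assuming a separator token `<w>` and whitespace-separated symbols, which is the usual input convention for phrase-based toolkits; the token and the function names `to_symbols` and `from_symbols` are hypothetical and are not specified in the paper.

```python
SEP = "<w>"  # hypothetical word-separator token


def to_symbols(sentence: str) -> str:
    """Split a sentence into space-separated symbols, marking word boundaries."""
    tokens = []
    for i, word in enumerate(sentence.split()):
        if i:
            tokens.append(SEP)
        tokens.extend(word)  # one token per code point (letters and diacritics)
    return " ".join(tokens)


def from_symbols(line: str) -> str:
    """Rebuild words from a symbol sequence produced by the converter."""
    words, current = [], []
    for tok in line.split():
        if tok == SEP:
            words.append("".join(current))
            current = []
        else:
            current.append(tok)
    words.append("".join(current))
    return " ".join(words)


print(to_symbols("ab cd"))
print(from_symbols("k a: s <w> f u: l"))
```

Iterating over code points keeps each diacritic as its own source symbol, which matches the table, where the sukun aligns with Null. On the phoneme side, multi-character symbols such as `u:` must already be separated by spaces before `from_symbols` is applied.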
Experiment: To evaluate the statistical approach, we
split the parallel corpus into three datasets: training data (80%),
tuning data (10%) and testing data (10%). We first tested the
statistical approach on a corpus containing only Arabic words
and phonologically altered French words (categories 1 and 2),
and obtained an accuracy of 93%. We then ran a test
on a corpus including the three word categories, and system
accuracy decreased to 85%. This result is due to the larger number of
hypotheses for each grapheme caused by introducing
French phonemes into the training data. The grapheme و, for
example, is phonetized in some Arabic words (category 1) as
the French phonemes /y/ or /˜O/ instead of the Arabic long vowel
/u:/ (the phoneme /˜O/ replacing /u:n/). Conversely, some
words in category 3 are phonetized with Arabic phonemes,
substituting for example /u:/ for the phonemes /y/, /u/, /O/ or /O/,
and /a:/ for /E/.
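The evaluation protocol above can be sketched in a few lines of Python. The paper does not state how the split was drawn or whether accuracy is counted per word or per phoneme; the shuffled split with a fixed seed and the word-level exact-match metric below are assumptions made for illustration, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(pairs: Sequence[Tuple[str, str]], seed: int = 0):
    """Shuffle (grapheme, phoneme) pairs and split them 80/10/10."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_tune = n * 8 // 10, n // 10
    train = items[:n_train]
    tune = items[n_train:n_train + n_tune]
    test = items[n_train + n_tune:]
    return train, tune, test


def word_accuracy(hypotheses: List[str], references: List[str]) -> float:
    """Fraction of words whose predicted phonetization matches the reference."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    correct = sum(h == r for h, r in zip(hypotheses, references))
    return correct / len(references)


corpus = [(f"w{i}", f"p{i}") for i in range(100)]  # placeholder data
train, tune, test = split_corpus(corpus)
print(len(train), len(tune), len(test))
print(word_accuracy(["ka:s", "fu:l", "kbir"], ["ka:s", "fu:l", "kbi:r"]))
```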
D. Discussion
At first glance, judging by the accuracy rates, we could
conclude that the rule based approach is more effective than the statistical
approach (92% vs 85%). However, the rule based approach does not take
into account French words of category 3; it achieves good
results only for Arabic words and phonologically
altered French words (categories 1 and 2). The results of the statistical approach
must also be interpreted in light of the small amount of training
data. Alternatively, a hybrid approach could be adopted:
instead of using one corpus including all categories of words
for training the statistical G2P converter, we can use two
corpora. The first one, including words of categories 1 and
2, could be processed by the rule based approach. The second
is a parallel corpus including words of category 3 with
their French phonetization, used for training the statistical G2P
converter. Unfortunately, we do not have sufficient data for testing
such a converter, since our corpus includes only about 1k
words of category 3. In terms of resources, this work allowed
us to build a phonetized dictionary for Algiers dialect; to our
knowledge, no such resource is available at this time.
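The hybrid scheme suggested in the discussion amounts to routing each word to one of two converters. The sketch below is only an illustration: the paper does not say how category 3 words would be detected automatically (they were identified by hand), so the lookup in a known word list, the function name `hybrid_g2p` and the stand-in converters are all assumptions.

```python
from typing import Callable, Set


def hybrid_g2p(word: str,
               category3: Set[str],
               rule_g2p: Callable[[str], str],
               stat_g2p: Callable[[str], str]) -> str:
    """Send category 3 words to the statistical model, all others to the rules."""
    return stat_g2p(word) if word in category3 else rule_g2p(word)


# Stand-in converters, for illustration only
def rule_stub(word: str) -> str:
    return f"rule({word})"


def stat_stub(word: str) -> str:
    return f"stat({word})"


french_words = {"connexion"}  # hypothetical category 3 lexicon
print([hybrid_g2p(w, french_words, rule_stub, stat_stub)
       for w in ["qbal", "connexion"]])
```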
VI. MORPHOLOGICAL ANALYZER FOR ALGERIAN DIALECT
A. Related works
Compared to MSA, there are few Morphological
Analysers (MA) dedicated to Arabic dialects. Works
in this area can be divided into two categories. The first
includes MAs that are built from scratch, such as in [20]
and [21]; the second includes works that attempt to adapt
existing MSA morphological analysers to Arabic dialects.
The latter trend has been adopted for several dialects since it is less time
consuming. In [22], the authors used BAMA (Buckwalter Arabic
Morphological Analyser) [3] by extending its affix table with
Levantine/Egyptian dialectal affixes. The same approach is
adopted in [23], where a list of dialectal affixes (belonging to
four Arabic dialects) was added to the Al-Khalil [24] affix list.
The authors of [25] converted the ECAL (Egyptian Colloquial
Arabic Lexicon) to the SAMA (Standard Arabic Morphological
Analyzer) representation [26]. For Tunisian dialect, the authors of [27]
adapted the Al-Khalil MA: they created a lexicon by converting
MSA patterns to Tunisian dialect patterns and then extracting
specific roots and patterns from a training corpus that they
created.
B. Adopted Approach
To build an MA for Algiers dialect, we decided to adapt
BAMA, since this is not time consuming and BAMA
is widely used. BAMA is based on a
dictionary of three tables containing Arabic stems, suffixes
and prefixes, and three compatibility tables defining the relations
between stems, prefixes and suffixes. BAMA is
adapted by populating these tables with dialect data.
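The role of the six tables can be illustrated with a short Python sketch of a BAMA-style lookup: every split of a word into prefix, stem and suffix is tried, and a split is accepted only if the three parts are in the lexicon tables and their categories are pairwise compatible. This is a simplified sketch, not BAMA's code; the romanized entries, the category labels and the length limits are assumptions chosen for the example.

```python
from typing import Dict, List, Set, Tuple

Lexicon = Dict[str, List[str]]   # surface form -> morphological categories
Compat = Set[Tuple[str, str]]    # allowed category pairs


def analyze(word: str, prefixes: Lexicon, stems: Lexicon, suffixes: Lexicon,
            ab: Compat, ac: Compat, bc: Compat,
            max_prefix: int = 4, max_suffix: int = 6) -> List[Tuple[str, str, str]]:
    """Return every prefix+stem+suffix split licensed by the six tables."""
    analyses = []
    n = len(word)
    for i in range(min(max_prefix, n - 1) + 1):              # prefix = word[:i]
        for j in range(max(i + 1, n - max_suffix), n + 1):   # stem = word[i:j]
            pre, stem, suf = word[:i], word[i:j], word[j:]
            for pc in prefixes.get(pre, []):
                for sc in stems.get(stem, []):
                    for fc in suffixes.get(suf, []):
                        # prefix-stem, prefix-suffix and stem-suffix checks
                        if (pc, sc) in ab and (pc, fc) in ac and (sc, fc) in bc:
                            analyses.append((pre, stem, suf))
    return analyses


# Toy tables in romanized form (hypothetical entries and category names)
PREFIXES = {"": ["Pref-0"], "y": ["IVPref-y"]}
STEMS = {"ktb": ["IV"]}
SUFFIXES = {"": ["Suff-0"], "w": ["IVSuff-w"]}
AB = {("IVPref-y", "IV")}
AC = {("IVPref-y", "IVSuff-w"), ("IVPref-y", "Suff-0")}
BC = {("IV", "IVSuff-w"), ("IV", "Suff-0")}

print(analyze("yktbw", PREFIXES, STEMS, SUFFIXES, AB, AC, BC))
```

Because all linguistic knowledge sits in the tables, adapting the analyser to a dialect means editing table entries rather than the lookup procedure.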
C. Building the dialect dictionary
We built the dialect dictionary according to the following
principle: in order to exploit the BAMA dictionary, we kept from
it all entries that also belong to ALG, with some modifications
(for example, the MSA prefixes ي, ت and ب are used in ALG,
so we kept them as ALG prefixes). In addition, we deleted all
entries that are not suitable for Algiers dialect. Moreover, we
created entries that are purely dialectal and never
existed in the MSA dictionary.
1) Affixes tables: For the affix tables, affixes common to
MSA and ALG are kept (in the prefix and suffix tables),
whereas all other MSA affixes that do not belong to the dialect
were deleted. Some dialect affixes that do not exist
in MSA were added to the affix tables. Note that when an affix
is deleted, all complex affixes in which it occurs are also deleted.
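The cascading deletion of complex affixes can be sketched as a filter over a table whose entries list their component morphemes. The table layout, the `form/FUNCTION` labels and the function name `prune_affixes` are assumptions made for this example and do not reproduce BAMA's file format.

```python
from typing import Dict, Set, Tuple

AffixTable = Dict[str, Tuple[str, ...]]  # surface form -> component morphemes


def prune_affixes(table: AffixTable, deleted: Set[str]) -> AffixTable:
    """Drop every entry that contains at least one deleted component."""
    return {form: parts for form, parts in table.items()
            if not (set(parts) & deleted)}


# Toy prefix table in romanized form (hypothetical entries)
prefixes = {
    "f": ("f/CONJ",),
    "b": ("b/PREP",),
    "Al": ("Al/DET",),
    "bAl": ("b/PREP", "Al/DET"),
    "fbAl": ("f/CONJ", "b/PREP", "Al/DET"),
}
kept = prune_affixes(prefixes, {"f/CONJ"})
print(sorted(kept))
```

Identifying components by function rather than by surface form matters here, since the same letter can be removed in one function and added back in another.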
1) Prefixes table: We kept some prefixes unchanged, like the
prefixes ي and ت that precede imperfect verbs (for
the third person singular masculine and feminine,
respectively). We eliminated purely MSA prefixes (prefixes that
cannot belong to Algiers dialect)
and all complex prefixes in which they appear, such as
the prefix س (expressing the future when it precedes
imperfect verbs) and the prefix ف (conjunction; see footnote 13).
Some examples are given in Table XVII.
TABLE XVII: Examples of kept, deleted and added prefixes
in ALG prefixes table.
Kept pref. Description
K,K
Imperfect Verb Prefix(sing.,third person,masc.,fem.)
È@ Noun Prefix (definite article)
H
.,ÈPreposition Prefix
Del. pref. Description
¬Conjunction Prefix
Future Imperfect Verb Prefix
ÈAJ.
¯Conj.Pre.+Preposition Pre.+Definite Art. Pre.
Add. pref. Description
¬Preposition Prefix
ÈA
¯Preposition Pre.+Definite Art. Pre.
áK
Perfect verb pre. (past voice, (sing., masc.) and (plu, masc/fem.))
á
KPerfect verb pre. (past voice, (sing. fem.)
2) Suffixes table: We also eliminated all MSA suffixes
not used in Algiers dialect, mainly:
• Suffixes related to the dual, both feminine and
masculine,
• Feminine plural suffixes,
• All word case ending suffixes.
All complex suffixes in which they appear were also
deleted. Likewise, we added dialectal suffixes like the
suffix ش for negation and all complex suffixes that
must be included with it.
We also integrated a set of suffixes to take into
account the various spellings of dialect words, which
are not normalized. An example is the suffix و, which
can express the plural (feminine and masculine) at
the end of a verb, or a possessive pronoun at the end
of a noun, exactly like the MSA suffix ه. We give in
Table XVIII a set of examples of each case.
2) Stems table: The dialect stems table was populated from the
lexicon of the Algiers dialect corpus and from the MSA stems included in
BAMA. We used a part (85%, 9170 distinct words) of our
ALG corpus for creating dialect stems; the remaining 15%
(1618 distinct words) is used for testing.
Stems from ALG corpus lexicon
First, we extracted a list of nouns easily identifiable
by the suffix ة and the definite article ال (used only with nouns).
We deleted these two affixes from all extracted words, then
from the obtained list of words we created stem entries according
to BAMA. Next, the rest of the corpus was analysed and
classified into three sets: function words, verbs and nouns
(which do not include the ة and ال affixes), and converted to stems
according to BAMA stem categories. Note that we
added some stem categories to take into account all dialectal
features. For example, in MSA the perfect verb stem category
[Footnote 13: ف as an MSA conjunction prefix has been deleted (since it does not exist in ALG), and ف as a preposition prefix has been created.]
TABLE XVIII: Examples of kept, deleted and added suffixes in ALG suffixes table.
Kept Aff. Description
áK
Accusative/genitive noun Suffix(masc.,plu.)
H@ Noun Suffix(fem.,plu.)
Hperfect verb suffix (fem.,sing)
Del. suff. Description
àPerfect/Imperfect Verb Suffix(subject, plu., fem.)
ßPerfect/Imperfect Verb Suffix(subject, dual., fem/masc., 2nd person)
AÒë Perfect/Imperfect Verb Suffix(direct object, dual., fem/masc., 3rd person)
àð Nominative Noun Suffix (masc.,plu.)
à@ Nominative Noun Suffix (masc.,dual)
áë Perfect/Imperfect Verb Suffix(direct object, plu., fem.)
áê
KPerfect Verb Suffix(subject sing.,2nd person,masc.,direct object, plu., 3rd person, fem.)
Add. suff. Description
Perfect/Imperfect Verb Negation Suffix
Òë Perfect/Imperfect Verb Negation Suffix (direct object,plu., 3nd person, masc./fem)
Ò» Perfect/Imperfect Verb Negation Suffix (direct object,plu., 2nd person, masc./fem.)
ðPer./Imp. Verb Suffix(direct object,plu.,masc.,fem.)
with the pattern É
ª
¯covers the three persons, the two genders,
the single, the dual and plural; just relative suffixes are added
to it to have its different inflected forms. In ALG, we split this
stem category into two distinct stems:É
ª
¯and ɪ
¯to cover
all perfect verbs inflected forms, in Table XIX we give an
example related to the stem ©ÖÞ(to hear).
TABLE XIX: Example of splitting a MSA stem to two
Dialectal stems.
Eng. pro. Dia pro. Dia. verb Dia. stem MSA pro. MSA verb MSA stem
She ùë
I
ª
ÖÞ©
ÖÞùë
I
ªÖÞ
©ÖÞ
They A
Óñë ñª
ÖÞÑë @ñªÖÞ
He ñë ©
ÖÞ©
ÖÞñë
©ÖÞ
We AJë A
ÖÞ
ám'AJªÖÞ
Exploiting MSA BAMA stems
1) Verbs
The main idea for creating ALG verb stems from
MSA stems is using verbs pattern. For example the
verbs having ALG pattern É
ª
¯are in most cases
Arabic verbs with the patterns
É
ª
¯,
É
ª
¯or
ɪ
¯. Some
other ALG verbs keep the same pattern as in MSA
like verbs with the patterns É
ª
¯
From stems table, we extracted all perfect verbs hav-
ing the patterns É
ª
¯,É
ª
¯,ɪ
¯and É
ª
¯. After that,
the verbs having the three first patterns are converted
to Algiers dialect pattern by changing diacritic marks
to É
ª
¯while the verbs corresponding to pattern É
ª
¯
are kept as they are (since this pattern is used in
ALG). At this stage, we constructed a set of Arabic
verb stems having dialect pattern, we analysed them
and eliminated all stems that are not used in ALG.
We give in Table XX some examples.
We proceed as explained above for other patterns as
É
ª
®
K,É
«A
®
K,É
«A
¯,É
ª
®
J@. It should be noted that,
TABLE XX: Examples of converted stems from MSA to ALG.
Stems ALG Dialect MSA English
H
.
 H
.
 H
.
He beat
H
. H
.
 H
.Q
åHe drunk
ÈYK.È
Y
K.È
Y
K.He changed
Q.»Q
.
»Q
.
»He grew
we constructed imperfect verb stems and command
verb stems from the ALG perfect verb stems that we
created as described above.
2) Nouns
We kept all proper nouns from MSA stems table
because it contains an important number of entries
related to countries, currencies, personal nouns,... We
analysed all other types of words and kept from them
those existing in ALG by modifying diacritics, adding
or deleting one or more letters.
3) Function words
We deleted all function words that do not exist in
ALG like relative pronouns and personal pronouns
related to the dual and feminine plural, then we
translated remaining ones to ALG.
Note that we introduced dialect stems with the non-Arabic
letters ڨ (G), ڤ (V) and پ (P) into the stems table and we modified
the BAMA code to handle words containing these letters. Also,
since every stem entry in BAMA contains an English gloss,
when creating a dialect entry we added the corresponding Arabic word to the
English gloss, so that each dialect entry is associated with an
English and an Arabic gloss.
After creating the affix and stem tables for ALG, the compatibility
tables of BAMA were updated according to the data included
in these tables.
D. Experiment
As mentioned above, we tested our MA on the Algiers
dialect corpus; the test set contains 1618 distinct words
extracted from 600 randomly chosen sentences. We consider
that a word is correctly analysed if it is correctly decomposed
into prefix+stem+suffix and if all the features related to them
are correct (POS, gender, number, person). We first
tested the MA with stems extracted only from the ALG corpus
lexicon, then we introduced stems created from the MSA stems
table. The results are listed in Table XXI.
TABLE XXI: Results of the ALG morphological analyser.
Results | ALG corpus stems | MSA stems + ALG corpus stems
# Analysed words | 703 | 1115
Percentage | 43% | 69%
# Unanalysed words | 915 | 503
Percentage | 57% | 31%
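The percentages in Table XXI follow directly from the counts and the test set size of 1618 distinct words, as the short check below shows. The function name `coverage` is hypothetical; the counts are those reported in the table.

```python
from typing import Tuple

TOTAL = 1618  # distinct words in the test set


def coverage(analysed: int, total: int = TOTAL) -> Tuple[int, int, int]:
    """Return (unanalysed count, % analysed, % unanalysed), rounded."""
    unanalysed = total - analysed
    return (unanalysed,
            round(100 * analysed / total),
            round(100 * unanalysed / total))


for label, analysed in [("ALG corpus stems", 703),
                        ("MSA stems + ALG corpus stems", 1115)]:
    print(label, coverage(analysed))
```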
We examined the words for which no analysis was given
by the morphological analyser (see Table XXII); the most frequent
cases are:
French words which do not exist in the stem table like
ú
æJ

Q
K (électricité, electricity), or words like PñJ
Jm.'@
(ingénieur, engineer) and ðQÒJ
JË@ (numéro, number)
that are included in stems table but with an other
orthography (respectively PñJ
J
m.'@and ðQ
ÒJ
JË@). The
same case is observed for nouns written with long
vowel @in the end instead of
èsuch as ACK
(place).
We noticed also that some words are written with
missed letters like the word A
Ë@ which appears in
stems table as ZA
Ë@. The same case is noticed for
úÍA
¯(he said to me) instead of ú
ÍA
¯or úÍ ÈA
¯or ñÊ
J
¯
(I said to him) instead of ñÊ
¯or ñË
¯.
Some unanalyzed words are also proper nouns.
TABLE XXII: Examples of unanalyzed words.
Unanalyzed word Corresponding stem English
HAKQ
K@
IKQ
K@ Internet
YªJ.Ó@ YªJ.Óð@ After
Q
AÖßQ
KQ
ÖßQ
KTrimester
àñ
®J
ÊJ
K
àñ
®J
Ê
KPhone
VII. CONCLUSION
This paper summarizes a first attempt to work on Algerian
Arabic dialects, which are non-resourced languages. These
dialects lag behind other dialects of the Middle East,
to which several works have been dedicated, producing
many NLP tools. The presented work is the first part of a
larger project on speech translation between MSA and Algerian
dialects. In this first part we focus on the dialect spoken in
Algiers and its periphery. We began with a study presenting its
main features, then we introduced the resources that we
created from scratch. This process was expensive in terms of
time and human effort, but the results were worth it. We obtained a
cleaned corpus of Algiers dialect aligned to MSA; this corpus
is the first parallel corpus to date that includes an Algerian dialect.
We also presented the Grapheme-to-Phoneme converter
that we created for Algiers dialect, which combines a rule based
approach with a statistical approach. The accuracy of
the G2P converter is about 85%. In terms of corpus resources,
this task enabled us to transcribe the ALG corpus into a phonetic
form. We also proposed a morphological analyser for ALG that
we adapted from the well known BAMA, dedicated to MSA.
We reached an accuracy rate of 69% when evaluating it on
a dataset extracted from the ALG corpus. Our future work, before
developing a statistical machine translation system, is to extend
the corpus we created to other Algerian Arabic dialects and
to adapt all the tools dedicated to ALG to these dialects.
ACKNOWLEDGEMENT
This work has been supported by the PNR (Projet National
de Recherche) of the Algerian Ministry of Higher Education and
Scientific Research.
REFERENCES
[1] S. Harrat, K. Meftouh, M. Abbas, and K. Smaili, “Building resources
for algerian arabic dialects,” in Proceedings of Interspeech, 2014, pp.
2123–2127.
[2] S. Harrat, K. Meftouh, M. Abbas, and K. Smaili, “Grapheme to phoneme
conversion: An arabic dialect case,”
in Proceedings of 4th International Workshop On Spoken Language
Technologies For Under-resourced Languages SLTU, 2014, pp. 257–
262.
[3] T. Buckwalter, “Buckwalter arabic morphological analyzer version 1.0,”
Linguistic Data Consortium LDC2002L49, 2002.
[4] K. Kirchhoff, J. Bilmes, S. Das, N. Duta, M. Egan, G. Ji, F. He,
J. Henderson, D. Liu, M. Noamany, P. Schone, R. Schwartz, and
D. Vergyri, “Novel approaches to arabic speech recognition: Report
from the 2002 johns-hopkins summer workshop,” in Proceedings of
IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing,(ICASSP ’03), vol. 1, April 2003, pp. I–344–I–347.
[5] R. Hetzron, The Semitic Languages, ser. Routledge language
family descriptions. Routledge, 1997. [Online]. Available:
https://books.google.com/books?id=nbUOAAAAQAAJ
[6] J. C. Watson, The phonology and morphology of Arabic. Oxford
university press, 2007.
[7] A. Boucherit, L’Arabe parlé à Alger. ANEP Edition, 2002.
[8] C. A. Ferguson, “Diglossia,” Word, vol. 15, pp. 325–340, 1959.
[9] F. H. Amer, B. A. Adaileh, and B. A. Rakhieh, “Arabic diglossia:
A phonological study,” Argumentum 7, Debreceni Egyetemi Kiadó,
Tanulmány, pp. 19–36, 2011.
[10] C. A. Ferguson, “Two problems in arabic phonology,” Word, vol. 13,
pp. 460–478, 1957.
[11] S. Harrat, M. Abbas, K. Meftouh, and K. Smaili, “Diacritics restoration
for arabic dialect texts,” in Proceedings of Interspeech, 2013, pp. 125–
132.
[12] M. Alghamdi, H. Almuhtasab, and M. Alshafi, “Arabic phonological
rules,” Journal of King Saud University: Computer Sciences and Infor-
mation (in Arabic), vol. 16, pp. 1–25, 2004.
[13] Y. A. El-Imam, “Phonetization of arabic: rules and algorithms,” Com-
puter Speech Language, vol. 18, no. 4, pp. 339–373, 2004.
[14] M. Zeki, O. O. Khalifa, and A. Naji, “Development of an arabic
text-to-speech system,” in International Conference on Computer and
Communication Engineering (ICCCE). IEEE, 2010, pp. 1–5.
[15] P. Taylor, “Hidden markov model for grapheme to phoneme conversion,”
in Proceedings of Interspeech, 2005, pp. 1973–1976.
[16] K. U. Ogbureke, P. Cahill, and J. Carson-Berndsen, “Hidden markov
models with context-sensitive observations for grapheme-to-phoneme
conversion.” in Proceedings of Interspeech, 2010, pp. 1105–1108.
[17] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico,
N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar,
A. Constantin, and E. Herbst, “Moses: Open Source Toolkit for Statistical
Machine Translation,” Proceedings of the Annual Meeting of the
Association for Computational Linguistics, demonstration session, pp.
177–180, 2007.
[18] F. J. Och and H. Ney, “A systematic comparison of various statistical
alignment models,” Computational Linguistics, volume 29, number 1,
pp. 19–51, 2003.
[19] A. Stolcke, “Srilm – an Extensible Language Modeling Toolkit,” in
Proceedings of Interspeech, Denver, USA, 2002, pp. 901–904.
[20] N. Habash and O. Rambow, “Magead: A morphological analyzer and
generator for the arabic dialects,” in Proceedings of the 21st Interna-
tional Conference on Computational Linguistics and the 44th annual
meeting of the Association for Computational Linguistics. Association
for Computational Linguistics, 2006, pp. 681–688.
[21] M. Altantawy, N. Habash, and O. Rambow, “Fast yet rich morphological
analysis,” in Proceedings of the 9th International Workshop on Finite
State Methods and Natural Language Processing. Association for
Computational Linguistics, 2011, pp. 116–124.
[22] W. Salloum and N. Habash, “Dialectal to standard arabic paraphrasing
to improve arabic-english statistical machine translation,” in Proceed-
ings of the First Workshop on Algorithms and Resources for Modelling
of Dialects and Language Varieties. Association for Computational
Linguistics, 2011, pp. 10–21.
[23] K. Almeman and M. Lee, “Towards developing a multi-dialect morpho-
logical analyser for arabic,” in 4th International Conference on Arabic
Language Processing, 2012, pp. 19–25.
[24] A. Boudlal, A. Lakhouaja, A. Mazroui, A. Meziane, M. O. A. O.
Bebah, and M. Shoul, “Alkhalil morpho sys: A morphosyntactic analysis
system for arabic texts,” in Proceedings of 7th International Computing
Conference in Arab ACIT, 2011.
[25] N. Habash, R. Eskander, and A. Hawwari, “Morphological analyzer
for egyptian arabic,” in Proceedings of the Twelfth Meeting of the
Special Interest Group on Computational Morphology and Phonology
SIGMORPHON. Association for Computational Linguistics, 2012, pp.
1–9.
[26] D. Graff, M. Maamouri, B. Bouziri, S. Krouna, S. Kulick, and T. Buckwalter,
“Standard arabic morphological analyzer (SAMA) version 3.1,”
Linguistic Data Consortium LDC2009E73, 2009.
[27] I. Zribi, M. E. Khemakhem, and L. H. Belguith, “Morphological
analysis of tunisian dialect,” in International Joint Conference on
Natural Language Processing, 2013, pp. 992–996.
Appendix
TABLE XXIII: Algiers dialect rules for G2P conversion.
# | Rule title | Rule
1 | ذ, ظ and ث rules | {C, V} + ذ + {C, V} = /d/
  |  | {C, V} + ظ + {C, V} = /d0/
  |  | {C, V} + ث + {C, V} = /T/
2 | Foreign letters rules | {C, V} + ڨ + {C, V} = /g/
  |  | {C, V} + ڤ + {C, V} = /v/
  |  | {C, V} + پ + {C, V} = /p/
3 | Definite article ال | {LC} + ال + {BL, BS} = /l/ + /LC/
  |  | {SC} + ال + {BL, BS} = /?a/ + /SC/ + /SC/
4 | Words case-ending | {BL, ES} + C + {C, V} = /C/
5 | Long vowel rules | {C + ا} + fatha + {C} = /a:/
  |  | {C + و} + damma + {C} = /u:/
  |  | {C + ي} + kasra + {C} = /i:/
6 | Glottal stop rule | {C, V} + ا + {BS, BL} = /?/
  |  | {BL, ES} + ء + {ا} = /Null/
7 | Alif maqsura rule ى | {BL, ES} + ى + {fatha + C} = /a/
8 | Alif madda آ | {C} + آ + {C} = /?a:/
9 | Words ending with ة | {BL, ES} + ة + {C, V} = /Null/
10 | Words ending with ه | {BL, ES} + ه + {damma} = /Null/
11 | Words containing the sequence ن, ب | {ب} + ن + {C, V} = /m/
12 | Gemination rule | {V} + shadda + {C} = /CC/
396 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 3, 2016
... In these Dialect families, we will find also subfamilies. In the case of the Algerian dialect, the work of [19] classify Algerian dialects into four groups: ...
... Also we found in several cases the use of the letter ‫د‬ d, instead of ‫ذ‬ ð, which is a characteristic of the Dialect of Algiers the capital of Algeria [19], as the case in the comment ‫و‬ ‫المهمة‬ ‫له‬ ‫اسندة‬ ‫الدى‬ ‫المسؤل‬ ‫حال‬ ‫هو‬ ‫هدا‬ ‫حفيظ‬ ‫يا‬ ‫شكرا‬ ‫فشل‬ ‫الدى‬ ‫الزمن‬ ‫عن‬ ‫يدافع‬ ‫و‬ ‫يطبل‬ ‫و‬ ‫ينتقد‬ ‫ان‬ ‫اجل‬ ‫من‬ ‫ينتقد‬ ‫اصبح‬ ‫مر‬ ‫الوقائع‬ ‫بتحديهم‬ ‫المجد‬ ‫يصنعون‬ ‫رجال‬ ‫هناك‬ ‫ولكن‬ šukrã ya HafiyĎ hada huwa HaAl Almasŵul Al~adi Ȃus.nidaħ. lahu Almuhim~aħ wa fašil. ...
... 4 www.ennaharonline.com (2) while (IAA <= 100% ) do (3) read (URL); (4) Page =load (URL); (5) while (there are comments in Page) do (6) Extract the following Comment (7) if (Comment in Data_base) then (8) Delete Comment ; (9) Else (10) Add Comment to the Data_base; (11) end if (12) end while (13) MODEL the phenomenon (14) ANNOTATE (15) Calculate New_IAA (16) if New_IAA <= IAA then (17) go to MODEL (18) end if (19) PROCESS (20) TRAIN And TEST (21) EVALUATE (22) if (results sufficient) (23) Break; (24) Else (25) REVISE (26) end if (27) end while (28) end ...
Preprint
In this paper, we propose our enhanced approach to create a dedicated corpus for Algerian Arabic newspapers comments. The developed approach has to enhance an existing approach by the enrichment of the available corpus and the inclusion of the annotation step by following the Model Annotate Train Test Evaluate Revise (MATTER) approach. A corpus is created by collecting comments from web sites of three well know Algerian newspapers. Three classifiers, support vector machines, na{\"i}ve Bayes, and k-nearest neighbors, were used for classification of comments into positive and negative classes. To identify the influence of the stemming in the obtained results, the classification was tested with and without stemming. Obtained results show that stemming does not enhance considerably the classification due to the nature of Algerian comments tied to Algerian Arabic Dialect. The promising results constitute a motivation for us to improve our approach especially in dealing with non Arabic sentences, especially Dialectal and French ones.
... In these Dialect families, we will find also subfamilies. In the case of the Algerian dialect, the work of [19] classify Algerian dialects into four groups: ...
... Also we found in several cases the use of the letter ‫د‬ d, instead of ‫ذ‬ ð, which is a characteristic of the Dialect of Algiers the capital of Algeria [19], as the case in the comment ‫و‬ ‫المهمة‬ ‫له‬ ‫اسندة‬ ‫الدى‬ ‫المسؤل‬ ‫حال‬ ‫هو‬ ‫هدا‬ ‫حفيظ‬ ‫يا‬ ‫شكرا‬ ‫فشل‬ ‫الدى‬ ‫الزمن‬ ‫عن‬ ‫يدافع‬ ‫و‬ ‫يطبل‬ ‫و‬ ‫ينتقد‬ ‫ان‬ ‫اجل‬ ‫من‬ ‫ينتقد‬ ‫اصبح‬ ‫مر‬ ‫الوقائع‬ ‫بتحديهم‬ ‫المجد‬ ‫يصنعون‬ ‫رجال‬ ‫هناك‬ ‫ولكن‬ šukrã ya HafiyĎ hada huwa HaAl Almasŵul Al~adi Ȃus.nidaħ. lahu Almuhim~aħ wa fašil. ...
... 4 www.ennaharonline.com (2) while (IAA <= 100% ) do (3) read (URL); (4) Page =load (URL); (5) while (there are comments in Page) do (6) Extract the following Comment (7) if (Comment in Data_base) then (8) Delete Comment ; (9) Else (10) Add Comment to the Data_base; (11) end if (12) end while (13) MODEL the phenomenon (14) ANNOTATE (15) Calculate New_IAA (16) if New_IAA <= IAA then (17) go to MODEL (18) end if (19) PROCESS (20) TRAIN And TEST (21) EVALUATE (22) if (results sufficient) (23) Break; (24) Else (25) REVISE (26) end if (27) end while (28) end ...
Article
In this paper, we propose an enhanced approach to creating a dedicated corpus of comments from Algerian Arabic newspapers. The approach extends an existing one by enriching the available corpus and adding an annotation step that follows the Model, Annotate, Train, Test, Evaluate, Revise (MATTER) methodology. The corpus is created by collecting comments from the websites of three well-known Algerian newspapers. Three classifiers (support vector machines, naïve Bayes, and k-nearest neighbors) were used to classify comments into positive and negative classes. To identify the influence of stemming on the results, classification was tested with and without stemming. The results show that stemming does not considerably improve classification, owing to the nature of Algerian comments, which are tied to the Algerian Arabic dialect. These promising results motivate us to improve our approach, especially in dealing with non-Arabic sentences, particularly dialectal and French ones.
... It is understood by the Tunisians and Libyans very well. This rings true to the argument of Harrat et al. [35] that dialects are morphologically and syntactically simplified, especially in the regions where one dialect coincides with another. ...
... The Sudanese dialect lends similarity to the Egyptian dialect in expressing lexical items that feature minimization. According to [35], differences in lexical items among dialects are marked with variations in form. This indicates that even though dialects differ in their morphological structure, they still represent the same meaning. ...
Article
Recent years have witnessed the development of different computational approaches to the study of linguistic variation and regional dialectology in different languages, including English, German, Spanish and Chinese. These approaches have proved effective in dealing with large corpora and making reliable generalizations about the data. In Arabic, however, much of the work on regional dialectology is so far based on traditional methods; therefore, it is difficult to provide a comprehensive mapping of the dialectal variations of all the colloquial dialects of Arabic. This study therefore proposes a computational statistical model for mapping linguistic variation and regional dialectology in Colloquial Arabic through Twitter, based on the lexical choices of speakers. The aim is to explore the lexical patterns for generating regional dialect maps as derived from Twitter users. The study is based on a corpus of 1,597,348 geolocated Twitter posts. Using principal component analysis (PCA), data were classified into distinct classes and the lexical features of each class were identified. Results indicate that the lexical choices of Twitter users can be used effectively for mapping regional dialect variation in Colloquial Arabic. Keywords: Colloquial Arabic; computational statistical model; lexical patterns; linguistic mapping; principal component analysis (PCA)
... Concerning Algerian Dialect, Harrat et al. [11, 12] and Guellil and Azouaou [13] have created morpho-grammatical analyzers allowing POS-tagging. ...
Article
Nowadays, Tunisian Dialect (TD) is widely used in social media such as Facebook and Twitter to communicate and express opinions. However, the absence of a TD translator prevents comprehension and communication for non-Tunisian users. Thus, building a Tunisian translator becomes an important task to overcome these problems. In this article, our main objective is to create a translator from TD to Modern Standard Arabic (MSA). In this context, we carry out an in-depth linguistic study in order to better choose the translation rules that are applicable to TD sentences. This study aims to ensure the best quality of the resulting MSA sentences. Our proposed method is based on a linguistic approach. It consists of the elaboration of a set of dictionaries and the construction of inflectional, morphological, and syntactic grammars using finite-state transducers. The method is then implemented and tested with new technologies provided by the NooJ linguistic platform. To evaluate the constructed tool, we apply it to two different test corpora containing more than 15,000 words. The obtained results are promising and highlight our proposed method.
... This study likens the AD and the GL. Detailed descriptions from [16,17] tackle the phonetic composition of both AD and GL. ...
Chapter
Monitoring emotions through speech is crucial in medical psychology in addition to emotional health. Angry speech automatic detection can be expedient in several healthcare applications, e.g., (i) the estimation of the level of stress and (ii) the incorporation of intelligence to nursing care robots. Profound knowledge of the linguistic and acoustic characteristics of emotional speech assist categorizing angry talking. This research manuscript elaborates on the audile distinction among neutral as well as hostile talking by (i) probing the variation of prosodic features, such as pitch (F0), energy (E), and duration (D), in the Algerian dialect (AD), while (ii) comparing it with the German language (GL). The authors recommend a tactic to quantify the separation among irate and neutral states exploring subsets of emotional speech corpora of AD and GL. The authors identified a noteworthy dissimilarity between AD and GL regarding the deviations of neutral and angry prosodic features.
Article
Embeddings are very popular representations that allow computing semantic and syntactic similarities between linguistic units from text co-occurrence matrix. Units can vary from character n-grams to words, including more coarse-grained units such as sentences and documents. Recently, multi-level embeddings combining representations from different units have been proposed as an alternative to single-level embeddings to account for the internal structure of words (i.e., morphology) and help systems to generalise well over out of vocabulary words. These representations, either pre-trained or learned, have shown to be quite effective, outperforming word-level baselines in several NLP tasks such as machine translation, part of speech tagging and named entity recognition. Our aim here is to contribute to this line of research proposing for the first time in Arabic NLP an in-depth study of the impact of various subwords configurations ranging from character to character n-grams (including word) for social media text classification. We propose several neural architectures to learn character, subword and word embeddings, as well as a combination of these three levels, exploring different composition functions to obtain the final representation of a given text. To evaluate the effectiveness of these representations, we perform extrinsic evaluations on three text classification tasks (sentiment analysis, emotion detection and irony detection) while accounting for different Arabic varieties (Modern Standard Arabic, dialects (Levantine and Maghrebi)). For each task, we experiment with well-known dialect-agnostic and dialect-specific datasets, including those that have been recently used in shared tasks to better compare our results with those reported in previous studies on the same datasets. 
The results show that the multi-level embeddings we propose outperform current static and contextualised embeddings as well as best performing state of the art models in sentiment and emotion detection. In addition, we achieve competitive results in irony detection. Our models are also the most productive across dialects observing that different dialects require different composition configurations. We finally show that these performances tend to increase when coupling the multi-level representations with task-specific features.
Thesis
This work is dedicated to statistical machine translation for poorly resourced languages. We are interested in Arabic dialects, which represent the daily language of all Arab peoples. These dialects differ from one Arab country to another, and even within the same country several dialect variants coexist. Because of their oral and non-standard nature, these dialects represent a challenge in NLP. In machine translation, they are difficult to translate because of the lack of resources of all kinds, in particular the monolingual and especially the parallel corpora necessary for training. In this thesis, we address this issue with particular attention to the Algerian dialect and more precisely to the Algiers dialect. A parallel multi-dialect corpus, PADIC (Parallel Arabic Dialect Corpus), has been created; it is an important textual resource which includes, so far, six Arabic dialects in addition to Modern Standard Arabic. This corpus was the subject of an analytical study to highlight the relationships among the dialects and between them and Standard Arabic. By means of the PADIC corpus, we tackled the problem of statistical machine translation between the different dialect pairs and Standard Arabic. Several results have been obtained and all point to the difficulty of translating dialects. In addition, several tools dedicated to the Algiers dialect have been produced in the framework of this thesis. The problem of code-switching was also addressed, and an identification tool was implemented using machine learning techniques.
Preprint
Monitoring emotions through speech is crucial in medical psychology in addition to emotional health. Angry speech automatic detection can be expedient in several healthcare applications, e.g., (i) the estimation of the level of stress and (ii) the incorporation of intelligence to nursing care robots. Profound knowledge of the linguistic and acoustic characteristics of emotional speech assist categorizing angry talking. This research manuscript elaborates on the audile distinction among neutral as well as hostile talking by (i) probing the variation of prosodic features, such as pitch (F0), energy (E), and duration (D), in the Algerian dialect (AD), while (ii) comparing it with the German language (GL). The authors recommend a tactic to quantify the separation among irate and neutral states exploring subsets of emotional speech corpora of AD and GL. The authors identified a noteworthy dissimilarity between AD and GL regarding the deviations of neutral and angry prosodic features.
Thesis
The Internet is becoming an increasingly important need in daily life, and the large mass of Internet users now produce content rather than only consume it. It is now usual for consumers to look for others' feelings about their experiences on the web before making even a simple decision: buying a product or service, taking a position on current events, or choosing a candidate in an election. Sentiment analysis, or opinion mining, a subfield of natural language processing (NLP), uses data mining techniques to extract opinions from subjective texts. Its focus is to help people benefit from the opinionated text available on the web in their decision making. In the last two decades, important sentiment analysis studies have been conducted on Indo-European languages, especially English, which are considered resource-rich languages. Arabic is the most widely used language in the Semitic family, both in daily conversation and in user-generated Internet content. Despite the large number of Arabic speakers and Internet users, studies on Arabic sentiment analysis are still insufficient. In this thesis we are interested in analyzing sentiment in reviews from Algerian Arabic daily newspapers. Such work requires efficient, well-prepared, and suitable resources. From our study of the literature, we found that resources for Arabic are sparse, and most of those available relate to movie reviews, a different domain from ours, i.e., newspaper reviews. We start by creating a dedicated corpus for this work, ARAACOM (ARAbic Algerian Corpus for Opinion Mining), from three Arabic Algerian newspaper websites.
Then a set of experimental studies is carried out using three well-known machine learning algorithms: Support Vector Machines (SVM), Naïve Bayes (NB), and K-Nearest Neighbors (KNN). The same experiments are run a second time with OCA (Opinion Corpus for Arabic) in order to compare results. The results obtained are promising, and further studies are left for future work.
Book
The monitoring of emotions using speech is crucial in medical psychology and emotional health. The automatic detection of angry speech can be useful in several healthcare applications. One can mention, (i) the estimation of the level of stress and (ii) the incorporation of intelligence to nursing care robots. To achieve this goal, a deep knowledge of the linguistic and acoustic properties of emotional speech is needed. In this paper, we study the acoustic contrast between neutral and angry speech by (i) investigating the variation of prosodic features (pitch (F0), energy (E), and duration (D)) in the Algerian Dialect (AD) and (ii) comparing it with the German Language (GL). We propose a methodology that allows quantifying the degree of separation between anger and neutral states using subsets of emotional speech corpora of AD and GL. We found a significant difference between AD and GL in the variation of neutral and angry prosodic features.
Conference Paper
In this paper, we address the problem of the morphological analysis of an Arabic dialect. We propose a method to adapt an Arabic morphological analyzer for the Tunisian dialect (TD). In order to do that, we create a lexicon for the TD. The creation of the lexicon is done in two steps. The first step consists in adapting a Modern Standard Arabic (MSA) lexicon. We adapted a list of MSA derivation patterns to TD. The second step consists in improving the resulting lists of patterns and roots by using TD-specific roots and patterns. The proposed method has been tested and has achieved an F-measure of 88%.
Article
In this paper we address the problem of the analysis of multi-dialect Arabic morphology. Our method is based on the synthesis of two methods. The first method is linguistically based, using an adopted Modern Standard Arabic (MSA) morphological analyser to first deal with dialect prefixes and suffixes and then analyse the words. This method improves accuracy on dialect words by 69%. The second method involves segmenting the word and then using 'the web as corpus' to estimate the frequency of different segment combinations, which are used to guess the correct base form. The overall synthesis is shown to have 94% accuracy on a corpus of Arabic dialects.
Conference Paper
The Algerian Arabic dialects are under-resourced languages, which lack both corpora and Natural Language Processing (NLP) tools, although they are increasingly used in written form, especially on social media and forums. In this paper we aim, for the first time, to build parallel corpora for Algerian dialects, because our ultimate purpose is to achieve Machine Translation (MT) between Modern Standard Arabic (MSA) and Algerian dialects (AD), in both directions. We also propose language tools to process these dialects. First, we developed a morphological analysis model of dialects by adapting BAMA, a well-known MSA analyzer. Then we propose a diacritization system, based on an MT process, which restores the vowels in dialect corpora. Finally, we report results on machine translation between MSA and Algerian dialects.
Conference Paper
In this paper we present a statistical approach for automatic diacritization of Algiers dialectal texts. This approach is based on statistical machine translation. We first investigate this approach on Modern Standard Arabic (MSA) texts using several data sources and extrapolated the results on available dialectal texts. For evaluation we used word and diacritization error rates and also precision and recall.
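The word and diacritization error rates mentioned in this abstract can be illustrated with a short Python sketch. The definitions used here are assumptions, since the abstract does not spell them out: the diacritization error rate is taken as the fraction of base letters whose attached diacritics differ from the reference, and the word error rate as the fraction of words containing at least one such letter. The function names are hypothetical, and the sketch assumes that reference and hypothesis share the same base letters.

```python
import unicodedata


def split_diacritics(word):
    """Return (base_letter, diacritics) pairs for one word.

    Arabic diacritics are Unicode combining marks, so each one is
    attached to the base letter that precedes it.
    """
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def diacritization_error_rates(reference, hypothesis):
    """Return (DER, WER) under the assumed definitions above."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty with equal word counts")
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_units = split_diacritics(ref_word)
        hyp_units = split_diacritics(hyp_word)
        if [b for b, _ in ref_units] != [b for b, _ in hyp_units]:
            raise ValueError("base letters differ between the two texts")
        # Compare sorted marks so the order of stacked diacritics is ignored.
        errors = sum(
            sorted(rm) != sorted(hm)
            for (_, rm), (_, hm) in zip(ref_units, hyp_units)
        )
        letters += len(ref_units)
        letter_errors += errors
        word_errors += errors > 0
    return letter_errors / letters, word_errors / len(ref_words)


if __name__ == "__main__":
    ref = "\u0643\u064E\u062A\u064E\u0628\u064E \u0648\u064E"
    hyp = "\u0643\u064F\u062A\u064E\u0628\u064E \u0648\u064E"
    print(diacritization_error_rates(ref, hyp))
```

In the usage example, one of the four base letters carries a wrong vowel, so one of the two words counts as erroneous.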
Conference Paper
Research on text-to-speech (TTS) technology has attracted the interest of researchers in many languages, a consequence of the wide range of applications in which TTS is implemented. However, Arabic, spoken by millions of people as an official language in 24 different countries, has gained less attention than other languages, despite its religious value for more than 1.6 billion Muslims worldwide. These facts show the need for a high-quality, small-size, and completely free Arabic TTS system that allows future improvements. The vowelized written text of Arabic carries the pronunciation rules with limited exceptions, so a rule-based system with an exception dictionary for words that fail those letter-to-phoneme rules may be a much more reasonable approach. This paper develops a rule-based hybrid text-to-speech synthesis system that combines formant and concatenation techniques with acceptable naturalness. The simulation results show good quality in handling word, phrase, and sentence levels compared to other available Arabic TTS systems. The accuracy of the overall system is 96%. Further improvements are needed for stressed syllable position and intonation.
Conference Paper
Most tools and resources developed for natural language processing of Arabic are designed for Modern Standard Arabic (MSA) and perform terribly on Arabic dialects, such as Egyptian Arabic. Egyptian Arabic differs from MSA phonologically, morphologically and lexically and has no standardized orthography. We present a linguistically accurate, large-scale morphological analyzer for Egyptian Arabic. The analyzer extends an existing resource, the Egyptian Colloquial Arabic Lexicon, and follows the part-of-speech guidelines used by the Linguistic Data Consortium for Egyptian Arabic. It accepts multiple orthographic variants and normalizes them to a conventional orthography.
Book
This book is the first comprehensive account of the phonology and morphology of Arabic. It is a pioneering work of scholarship based on the author's research in the region. Arabic is a Semitic language spoken by some 250 million people in an area stretching from Morocco in the West to parts of Iran in the East. Apart from its great intrinsic interest, the importance of the language for phonological and morphological theory lies, as the author shows, in its rich root-and-pattern morphology and its large set of guttural consonants. Dr Watson focuses on two eastern dialects, Cairene and San'ani. Cairene is typical of an advanced urban Mediterranean dialect and has a cultural importance throughout the Arab world; it is also the variety learned by most foreign speakers of Arabic. San'ani, spoken in Yemen, is representative of a conservative peninsula dialect. In addition the book makes extensive reference to other dialects as well as to classical and Modern Standard Arabic. The volume opens with an overview of the history and varieties of Arabic, and the position of Arabic within Semitic. Dialectal differences and similarities are discussed in successive chapters which cover the phoneme system and the representation of phonological features; the syllable and syllabification; word stress; derivational morphology; inflectional morphology; lexical phonology; and post-lexical phonology. The Phonology and Morphology of Arabic will be of great interest to Arabists and comparative Semiticists, as well as to phonologists, morphologists, and linguists more generally.