
A Conventional Orthography for Algerian Arabic

Abstract

Algerian Arabic is an Arabic dialect spoken in Algeria characterized by the absence of writing resources and standardization, hence it is considered as an under-resourced language. It differs from Modern Standard Arabic on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. In this paper, we present a conventional orthography for Algerian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA), demonstrated for Egyptian and Tunisian Arabic. We explain the design principles of Algerian CODA and provide a detailed description of its guidelines.
Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 69–79,
Beijing, China, July 26-31, 2015. © 2014 Association for Computational Linguistics
Houda Saadane (1) and Nizar Habash (2)
(1) Univ. Grenoble Alpes, LIDILEM, Grenoble, France
GEOLSemantics & Consulting, Paris, France
houda.saadane@e.u-grenoble3.fr
(2) New York University Abu Dhabi, United Arab Emirates
nizar.habash@nyu.edu
1 Introduction
The Arabic language today is characterized by a complex state of polyglossia. Modern Standard Arabic (MSA) is the official variety of Arabic, used primarily in written and formal contexts. There is also a large number of dialects whose dominant features are noticeable to Arabic-speaking people. The Arabic dialects differ from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. MSA is classified as a high variety, as it is highly normalized and standardized. It is generally considered a prestigious, valued and official language; hence it is used in the media and in education. Dialectal Arabic (DA) varieties are considered low varieties, with less normalization and standardization. They are used in daily life, interviews and informal conversations. Algerian Arabic (henceforth, ALG) belongs to the Western group of Arabic dialects and is spoken in Algeria. ALG differs from other Arabic dialects, whether neighboring or distant, in a number of specific features. In addition to MSA and DA, foreign languages, particularly French and English, have increasingly become part of the Arabic spoken on a daily basis.
With the emergence of the Internet and social media, ALG (and other DAs) have become the language of informal online communication, for instance emails, blogs, discussion forums, SMS, etc. Most Arabic natural language processing (NLP) tools and resources were developed specifically for MSA. Corresponding tools for processing ALG are not as mature and sophisticated as those for MSA. This is because work on ALG began only recently and the results and resources produced so far are limited. To address this problem, some solutions propose applying NLP tools designed for MSA directly to ALG. This approach is interesting but yields significantly lower performance. It is therefore necessary to develop solutions and build resources dedicated to processing ALG.
In this paper, we present a basic building block of ALG processing that is necessary for building efficient NLP tools and applications: a standard, common orthographic convention dedicated to ALG. The proposed standard extends the work of Habash et al. (2012a), who proposed a Conventional Orthography for Dialectal Arabic (CODA). CODA is designed for developing computational models of Arabic dialects, and Habash et al. (2012a) provided a detailed description of its guidelines as applied to Egyptian Arabic (EGY).
The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we present a historical overview of ALG. In Section 4, we highlight the linguistic differences between ALG and MSA, EGY and Tunisian Arabic (TUN) in order to motivate some of our ALG CODA decisions. In Section 5, we present the ALG CODA guidelines.
2 Related works
The study and processing of dialects is a recent research area that has progressively attracted considerable attention, especially with the explosion of public communication on the Internet. There is now strong interest in developing new tools to process and exploit the large quantities of dialectal resources (oral communications, web, social networks, etc.). However, Arabic dialects lack standardization and normalization, which is why much effort is needed to modernize Arabic orthography and to develop orthographies for Arabic dialects.
Maamouri et al. (2004) developed a set of rules for Levantine dialects. These rules define transcription guidelines and annotation conventions for conversational Levantine Arabic. Habash et al. (2012a) proposed a conventional orthography for Dialectal Arabic (CODA), demonstrated for the Egyptian dialect. This work is inspired by the Linguistic Data Consortium (LDC) transcription guidelines. However, CODA is intended for general-purpose writing and abstracts away from many phonological variations, whereas the LDC guidelines are dedicated to transcription and thus focus more on phonological variations in sub-dialects. A proposal for transcribing the Algerian dialect was developed in (Harrat et al., 2014), where a set of transcription rules was defined and a grapheme-to-phoneme converter for this dialect was presented. Grapheme-to-phoneme (G2P) conversion, or phonetic transcription, is the process of converting the written form of a word to its pronunciation; hence this technique focuses only on phonological variations.
To remedy the lack of resources and tools dedicated to processing ALG, (Harrat et al., 2014) built parallel corpora for Algerian dialects, with the ultimate purpose of achieving Machine Translation (MT) between Modern Standard Arabic (MSA) and Algerian dialects (AD) in both directions. They also proposed language tools to process these dialects. First, they developed a morphological analysis model for the dialects by adapting BAMA, a well-known MSA analyzer. Then they proposed a diacritization system, based on an MT process, that restores the vowels in dialectal corpora. Finally, they reported results on machine translation between MSA and Algerian dialects.
In the same vein, (Harrat et al., 2015) present an Arabic multi-dialect study covering dialects from both the Maghreb and the Middle East, which they compare to Modern Standard Arabic (MSA). Three Maghreb dialects are covered by this study: two from Algeria, namely the dialect of Annaba (ANB), spoken in the east of Algeria, and the dialect of Algiers (ALG), spoken in the capital; and one from Tunisia, the dialect of Sfax (TUN), spoken in the south of Tunisia. The study also covers two dialects from the Middle East (Syria and Palestine). The resources, which were built from scratch, have led to a multi-dialect parallel corpus.
Furthermore, (Zribi et al., 2014) extended the CODA guidelines to the Tunisian dialect, and (Jarrar et al., 2014) adapted them to the Palestinian dialect. In addition, the authors of Egyptian and Tunisian CODA encourage the adaptation of CODA to other Arabic dialects in order to create linguistic resources. Following this advice, we extend the CODA guidelines to ALG in this paper.
3 Algerian Arabic: Historical Overview
Arabic speakers have Arabic dialects, or vernaculars, as their mother tongues. These dialects can be divided into two large families: the Western group (the Maghreb, or North African group) and the Eastern group (the Mashriq). The Algerian dialect, denoted ALG, belongs to the Western group and is spoken in Algeria. This dialect is also called daArjaħ¹, jazaAyriy or dziyriy, simply meaning "Algerian". These variations generally do not create barriers to understanding the dialect. In addition to ALG, the Algerian population also speaks Berber, but in different proportions: ALG is used by 70% to 80% of the population, whereas Berber is the mother tongue of 25% to 30% of the population. Berber is used mainly in the center of Algeria (Algiers and Kabylie), the east of Algeria (Béjaia and Sétif), the Aures (Chaoui) and the Mzab (north of the Sahara), and it is used by the Tuareg in the south of the Sahara (Hoggar mountains). Even though ALG is spoken by the population of Algeria, estimated at 40 million people, it varies according to the geographic location of its speakers.
¹ Arabic transliteration is presented in the Habash-Soudi-Buckwalter (HSB) scheme (Habash et al., 2007). Phonological transcriptions will be presented between /…/ but we will use the HSB consonant forms when possible to minimize confusion from different symbol sets.
This dialect cannot be presented as a homogeneous linguistic system; it has many varieties. According to (Derradji et al., 2002), four varieties of ALG can be distinguished: i) Oranais, the variety spoken in the west of Algeria, from the Moroccan border to Ténès; ii) Algérois, a widespread variety covering the central zones of Algeria up to Béjaia; iii) Rural, whose speakers are located in the east of Algeria, e.g., in Constantine, Annaba or Sétif; and iv) Sahara, the dialect of the population of the south of Algeria. ALG is also the language used in the press, television, social communication, Internet exchanges, SMS, etc. Official communication, both read and written, is the only domain in which ALG is not used.
Furthermore, we note that ALG has been enriched by the languages of the groups that colonized or governed the Algerian population over the course of the country's history. Among these languages we can cite Turkish, Spanish, Italian and, more recently, French. This enrichment, which takes the form of foreign words in the dialect, has contributed to the emergence of many varieties of ALG from one region to another, and the resulting language mixture has produced a quite complex linguistic situation. Indeed, this mixture has been studied by many sociolinguists, e.g., (Morsly, 1986; Ibrahimi, 1997; Benrabah, 1999; Arezki, 2008). They describe the linguistic landscape of Algeria as “multilingual” or “polyglossic”, where multiple languages and language varieties coexist. In other words, ALG is a good example of a complex sociolinguistic situation (Morsly, 1986).
Historically, Berber was the native language of the population of the Maghreb in general and Algeria in particular before the Islamic conquest, which introduced Arabic into all aspects of life. Centuries of rule by various foreign powers introduced vocabulary from Turkish, Spanish and finally (and most dominantly today) French. French colonization tried to impose the French language as the only means of communication during its 132-year control of Algeria. This situation caused a significant decline in the Arabic language, characterized by increased French influence and the introduction of some other languages like Italian and Spanish due to migratory flows from Europe (Ibrahimi, 2006). The influence of these languages on ALG manifests itself in frequent code-switching without any phonological adaptation in daily conversations, particularly from French, e.g., “lycée”, “salon”, “quartier”, “normal”, etc.
4 Comparison among Algerian, Egyptian, Tunisian and Standard Arabic
There are many differences among ALG, EGY, TUN and MSA at several levels: phonological, morphological, orthographic and lexical. In this section we present some of the differences that are most important in distinguishing these varieties of Arabic. We refer the reader to (Habash, 2010) for further details and discussion.
4.1 Phonological Variations
We give in the following list the major phonological differences between ALG and both MSA and EGY:
The consonant corresponding to MSA (ق) /q/ is one of the sounds that deserve special attention. It has many pronunciations in Algerian Arabic, which vary across the regions, cities and localities of Algeria. The pronunciation of /q/ can be realized as [q], [g], [ʔ] or [k]:
• Uvular stop [q]: as in the Moroccan and Tunisian dialects, this pronunciation is found in ALG in various localities, including some urban cities such as Algiers and Constantine.
• Palatal sound [g]: this sound is also used in both the Moroccan and Tunisian dialects. In Algeria, it is used in some cities such as Annaba and Sétif, as well as in the Bedouin dialects, where it is widely employed.
• Glottal stop [ʔ]: this sound is used in the city of Tlemcen, in the same manner as in the Egyptian dialect.
• Postpalatal [k]: this sound is a particularity of ALG that is not found in the other North African dialects. It is used in rural localities and in some areas such as Kabylia, Jijel, Msirda and Trara.
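The regional realizations of /q/ listed above can be summarized as a small lookup table. The following is only an illustrative sketch: the names QAF_REALIZATIONS and realization_of_qaf are our own, the locality keys are written without diacritics for convenience, and the table covers only the localities named in the text.

```python
# Realizations of MSA /q/ by locality, as listed in the text.
# The names used here are illustrative and not part of the CODA guidelines.
QAF_REALIZATIONS = {
    "q": ["Algiers", "Constantine"],
    "g": ["Annaba", "Setif", "Bedouin dialects"],
    "ʔ": ["Tlemcen"],
    "k": ["Kabylia", "Jijel", "Msirda", "Trara"],
}


def realization_of_qaf(locality):
    """Return the realization of /q/ listed for a locality, or None if unlisted."""
    for sound, places in QAF_REALIZATIONS.items():
        if locality in places:
            return sound
    return None
```

A locality that the text does not mention simply returns None, so the table makes no claim about the rest of the country.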
We note that, among the dialects that do not use the glottal stop, there are some exceptional words that are pronounced the same way regardless of the dialect. This is the case of the word bagraħ ‘cow’, which is always pronounced with the palatal sound: bagra.
The pronunciation of the consonant (ج) /j/ also varies with the location or the group of speakers in North Africa. It is pronounced [dj] in Algiers and most of central Algeria, as in the word ndjaH ‘success’, but when /j/ precedes the consonant (د) /d/ it is pronounced with the allophone [j], as in the word jdid ‘new’. In Egypt this consonant is pronounced /g/. For Tunisian, Tlemcenian and eastern Algerian speakers, it is realized as /j/, or as /z/ when the word contains the consonant (س) /s/ or (ز) /z/, as in the words ğibs or djibs ‘plaster’, which becomes zebs, and ςadjuwz ‘old woman’, which becomes ςzuwz.
The MSA consonant (غ) /γ/ is assimilated in different ways by different groups of speakers. In the eastern Algerian Sahara, e.g., in M'sila and BouSaâda, /γ/ is assimilated to (ق) /q/; for instance, the words γaAliy ‘expensive’ and sγayraħ ‘small’ are pronounced /qaAliy/ and /sqayra/, respectively. Sometimes it is assimilated to (خ) /x/, as among Tunisian and eastern Algerian speakers, e.g., the word for ‘washed’ is pronounced /xssel/ or /γssel/.
The interdental MSA consonant (ث) /θ/ can be pronounced as (ت) /t/ in both ALG and EGY, e.g., the word θuwm ‘garlic’ is pronounced /tuwm/. But it is also pronounced /θ/ in some urban Algerian dialects, as in the word θuwm; as (ف) /f/ in the nomadic dialects of Mostaganem, where for instance the word θaAniy ‘also’ is pronounced faAniy; or as (س) /s/ in some cases in EGY, for example the word θaAbit ‘fixed’ is pronounced saabit. Another MSA interdental consonant with special pronunciations is (ذ) /ð/. In EGY, it can be pronounced (د) /d/, as in the word ðhab ‘gold’, pronounced dhab, or (ز) /z/, for instance the word for ‘clever’ is realized as zakiy. In ALG, however, the consonant /ð/ is pronounced either /ð/ or /d/. For instance, the word for ‘arm’ can be pronounced ðraAς or draAς. Moreover, in some regions of Algeria, like Mostaganem, this consonant is realized as /v/, e.g., the word ðhab ‘gold’ is pronounced vhab.
The glottal stop phoneme that appears in many MSA words is realized in ALG in different ways:
• The glottal stop is replaced by a long vowel: this pronunciation is also present in the TUN and EGY dialects. Examples: faÂs /fa’s/ → /fa:s/ faAs ‘pickaxe’, Diŷb /Di’b/ → /Di:b/ diyb ‘wolf’, and muŵmin /mu’men/ → /mumin/ muwmin ‘believer’.
• The glottal stop disappears: the glottal stop is simply dropped when pronouncing the word. This form is also used in the TUN and EGY dialects. For instance: zarqaA’ /zarqa:’/ → /zarqa:/ zarqA ‘blue’.
• The glottal stop is replaced by a semi-vowel /w/ or /y/: this pronunciation is found in the ALG and TUN dialects but not in EGY. Examples: /Âak~al/ ‘to feed’ → wuk~al, and /Âams/ ‘yesterday’ → yaAmas.
• The glottal stop is replaced by /l/: this form is likewise used in the ALG and TUN dialects but not in EGY. Examples: /Âafςa/ ‘snake’ → /lafςa/, and /ÂaarD/ ‘earth’ → /larD/. We note that these examples are also exceptions in which the same form is used for both the definite and the indefinite.
• The glottal stop is replaced by /h/: unlike EGY, the ALG and TUN dialects use this form in some cases, as in the words Âaj~aAlaħ /Âajja:la/ ‘widow’ → hajjaAlaħ /hajja:la/, and Âam~aAlaA /Âamma:laA/ ‘however’ → ham~aAlaA /hamma:laA/.
Unlike the Egyptian dialect, the Algerian dialect elides many short vowels in unstressed contexts. This feature also characterizes the other Maghreb dialects. For example, MSA jamal ‘camel’ (and EGY /gamal/) becomes ALG /jmal/. This feature also helps to distinguish the Maghreb dialects from EGY: it gives rise to a sequence of two consonants at the beginning of the word, which yields the verb pattern ‘fςal’ in ALG instead of ‘faςal’ in EGY, e.g., the MSA verb /qatal/ ‘he killed’ (and EGY /’atal/) becomes ALG /qtal/.
The MSA diphthongs /ay/ and /aw/ are generally reduced uniformly to /i:/ and /u:/. For example, /HayT/ ‘wall’ becomes ALG /Hi:T/, and /lawn/ ‘color’ becomes ALG /lu:n/. We note that this reduction is found among younger speakers; older speakers still retain the diphthongs in some words and contexts, for instance the word for ‘horse’ is still pronounced /ςawd/ by some older speakers.
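The diphthong reduction described above can be sketched as a simple rewrite over phonemic strings. This is a minimal illustration under our own assumptions: the function name reduce_diphthongs is hypothetical, the input uses the transcription found in the examples, and the rule is applied only when the glide is not followed by a short vowel. It models the younger speakers' pronunciation and ignores the words in which older speakers retain the diphthong.

```python
import re


def reduce_diphthongs(phonemic):
    """Rewrite /ay/ as /i:/ and /aw/ as /u:/ when no short vowel follows the glide."""
    def replace(match):
        return "i:" if match.group(1) == "y" else "u:"

    # (?![aiu]) keeps sequences where y/w starts the next syllable untouched.
    return re.sub(r"a([yw])(?![aiu])", replace, phonemic)
```

Applied to the examples in the text, the sketch maps HayT to Hi:T and lawn to lu:n.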
Another feature of ALG, shared with TUN, is the pronunciation of MSA /a:/: in some words it is realized as /e:/ and in others it remains /a:/. For example, the word /jam:al/ ‘beauty’ is pronounced with /a:/, whereas the word /jme:l/ ‘camels’ is realized with /e:/.
4.2 Morphological Variations
ALG also has some morphological features that differ from those of MSA and are closer to those of the other Maghreb dialects. These features consist essentially of a simplification of some inflections and the inclusion of new clitics, as follows:
As regards inflection, in ALG, as in other Arabic dialects, case endings on nouns and mood endings on verbs are lost. The indicative mood is used by default, and the other moods are not used. Moreover, the dual and the feminine plural have disappeared; they are assimilated to the masculine plural. For example, the word šakartun~a ‘they (fem.pl.) thank’ is normalized in ALG to škar-tuwA ‘they thank’. In addition, the first and second person singular are conjugated in the same way in the dialect, e.g., in MSA we say šakartu ‘I thank’ and šakarta ‘you thank’; these two forms are normalized in ALG to the single form škart ‘I/you thank’. This simplification can lead to some ambiguities in ALG.
ALG modifies the internal form of verbs when inflecting them in the imperfective. It geminates the first radical and moves the vowel of the second radical onto it. This modification applies only in the plural and in the 2nd person feminine singular. For example, in ALG the verb ‘to thank’ in the 3rd person masculine singular is yu-škur (he is thanking), and in the 3rd person masculine plural it is yuš~ukr-uwA (they are thanking), whereas in EGY the latter has the form yuškur-uwA. To support this statement, we refer to (Souag, 2005), who states: “As is common in Algeria, when normal short vowel elision would lead to another short vowel being in an open syllable, we have slight lengthening on the first member so as to change the stress: yaDrab ‘he hits’ → yaD~arbuwA ‘they hit’, rukba ‘knee’ → ruk~ubtiy ‘my knee’; this gemination need not occur, however, if the consonant to be geminated is one of the sonorants r, , l, n, although for younger speakers it often does. I have the impression that these compensatory geminates are not held as long as normal geminates; this needs further investigation.”
Otherwise, ALG, like the other Arabic dialects, uses only the suffix /yn/ to form the regular plural. However, ALG elides the short vowels in plural forms, as in the following examples: mulHad ‘unbeliever’, pl. mulHdiyn; muhandis ‘engineer’, pl. muhandsiyn. Some dialects, like EGY, do not elide the short vowel; for instance, the plural of muhandis ‘engineer’ in EGY is muhandisiyn. There are some exceptions, such as the active participle pattern [1A2i3] → [1A23-iyn] (Gadalla, 2000), where this elision applies in all dialects, as in the word SaAyim ‘fasting’ → SaAymiyn.
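The plural formation with short-vowel elision can be illustrated with a few lines of code operating on HSB transliteration. This is only a sketch under simplifying assumptions of our own: the function name alg_regular_plural is hypothetical, and the rule elides a short vowel (a, i, u) only when it directly precedes the final consonant of the stem, which is enough to cover the examples above.

```python
SHORT_VOWELS = "aiu"


def alg_regular_plural(stem):
    """Form the ALG regular plural of a stem given in HSB transliteration.

    A short vowel standing directly before the final consonant is elided,
    then the plural suffix iyn is attached.
    """
    if len(stem) >= 3 and stem[-2] in SHORT_VOWELS and stem[-1] not in SHORT_VOWELS:
        stem = stem[:-2] + stem[-1]
    return stem + "iyn"
```

For muhandis the sketch yields muhandsiyn, whereas the EGY form muhandisiyn would be obtained by attaching the suffix without elision.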
Cohen (1912) describes the emphatic suffix /–tiyk/ as a characteristic of the Muslim dialect of Algiers; it is attached to adverbs ending in –a, as in the words gana ‘also’, which becomes ganaAtiyk, and zaςma ‘supposedly’, which becomes zaςmaAtiyk.
In addition to the form [Aista12a3], which exists in the different dialects, ALG introduces a new variant of this form. This variant is [ssa-12a3], and it is used essentially by speakers in the west of Algeria (Marçais, 1902). For example, the verb Aistaklaf ‘take care of’ can also be realized as ssaklaf or saklaf.
Another feature of ALG is the insertion of the vowel /i:/ between the stem and the consonantal suffixes in the perfect form of primary geminate verbs, e.g., the MSA verb šad~a/šadadtu ‘he/I pulled’ becomes šad~/šad~iyt in ALG. This feature is also present in the other Arabic dialects.
The passive voice in Classical Arabic is expressed through vowel changes rather than verb derivation, but in ALG, as in many Arabic dialects, the passive form is obtained by prefixing the verb with one of the following elements:
• t- / tt-, for example: tabnaý ‘it was built’, ttarfad ‘it was lifted’
• n-, for instance: nftah ‘it opened’
• tn- or nt-, e.g., ntkal ‘was edible’, tnaqtal ‘to be killed’. We note that this last element is specific to ALG.
ALG uses the particle «n» for the first person singular, like the other Maghreb dialects. This particle is generally absent from the Mashriq dialects such as EGY, where «a» is used instead, as shown in the following example: /naktab/ ‘I write’ in ALG, while its EGY equivalent is /Aaktib/.
Like several other dialects (EGY and TUN), ALG includes clitics that are reduced forms of MSA words, e.g., the demonstrative proclitic ha+, which strictly precedes the definite article Al+ and is related to the MSA demonstrative pronouns haðaA and haðihi, e.g., (MSA → ALG) haðihi AldunyaA → haAldinyaA ‘this life’.
Several dialects include the proclitic ςa+, a reduced form of the preposition /ςalaý/ ‘on/upon/about/to’. For example, (MSA → ALG) /ςalaý AlTaAwilaħ/ → ςaAlmaAydaħ ‘on the table’. The same holds for the proclitics f+ and m+, which are reduced forms of the prepositions fiy ‘in’ and min ‘from’, respectively.
Also, several dialects include the non-MSA negation circum-clitic mA+ … +š. For example, mA qriyteš ‘I haven’t read’.
Furthermore, ALG has lost almost all of the nominal dual forms, which are replaced by the word zuwdj /zu:dj/ ‘two’ followed by the plural form, e.g., (MSA → ALG) kitaAbayn → zuwdj ktub ‘two books’.
4.3 Orthographic Variations
The orthographic variation in the writing of Arabic dialect words is due to two reasons: i) the lack of an orthographic standard for Arabic dialects, because these varieties are not codified and normalized, and ii) the phonological differences between MSA and the Algerian dialect (ALG). Dialectal words can be spelled phonologically or etymologically, using their corresponding MSA form. This creates inconsistency among dialect writers. For example, the word for ‘gold’ can be written dhab or ðhab. In addition, in some cases the spelling reflects either the underlying morphology or a regular phonological assimilation, e.g., Tuwmuwbiyl ‘cars’ is also written as Tuwnuwbiyl, AismaAςiyl ‘Ismaël’ is also written as AismaAςiyn, and min baςd ‘after’ is also written as mim baςd. Furthermore, these different spellings can lead to semantic confusion, e.g., šrbw may be šarbuwA ‘they drank’ or šarbuh ‘he drank it’. Finally, shortened long vowels can be spelled long or short, for instance, šAfw+hA / šfw+hA ‘they saw her’, and majaAbaš / mAjaAbaš ‘he didn’t bring’.
4.4 Lexical Variations
As presented in Section 3, the Algerian dialect, like other Arabic dialects, has been influenced over the centuries by other languages such as Berber, Turkish, Italian, Spanish and French. Table 1 shows some examples of borrowed words² in ALG.

Translation    Transliteration   Origin
a tortoise     Fakruwn           Berber
Moustache      šliAγam           Berber
a throat       Qarjuwmaħ         Berber
Socks          tqaAšiyr          Turkish
a drunkard     sukaArjiy         Turkish
Feast          Zardaħ            Turkish
Party          fiyšTaħ           Italian
Foul           Zablaħ            Italian
Money          Suwrdiy           Italian
a week         siymaAnaħ         Spanish
Sneakers       Spardiynaħ        Spanish
a school       Sukwiylaħ         Spanish
Table          TaAblaħ           French
Phone          Tiyliyfuwn        French
Nurse          Farmliy           French
Table 1: The origin and the meaning of some borrowed words used in ALG.
² We refer to (Guella, 2011) for more examples.
5 Algerian Arabic CODA Guidelines
In this section we present a mapping of the CODA convention for the Algerian dialect. The CODA convention, its goals and its principles are described in detail in (Habash et al., 2012a). An example of Algerian CODA is presented in Table 5.
5.1 CODA Guiding Principles
We summarize the main CODA design elements (Habash et al., 2012a; Eskander et al., 2013):
• CODA is an internally consistent and coherent convention for writing Dialectal Arabic.
• CODA is created for computational purposes.
• CODA uses the Arabic script.
• CODA is intended as a unified framework for writing all Arabic dialects.
• CODA aims to strike an optimal balance between maintaining a level of dialectal uniqueness and establishing conventions based on MSA-DA similarities.
CODA is designed respecting many principles:
1. CODA is an Ad Hoc convention which uses
only the Arabic script characters including
the diacritics used for writing MSA.
2. CODA is consistent as it associates to each
DA word a unique orthographic form that
represents its phonology and morphology.
3. CODA uses and extends the basic MSA or-
thographic decisions (rules, exceptions and
ad hoc choices), e.g., using Shadda for pho-
nological gemination or spelling the definite
article morphemically.
4. CODA generally preserves the phonological
form of dialectal words given the unique
phonological rules of each dialect (e.g.,
vowel shortening), and the limitations of
Arabic script (e.g., using a diacritic and a
glide consonant to write a long vowel).
5. CODA preserves DA morphology and syn-
tax.
6. CODA is easy to learn and write.
7. The CODA principles are the same for all
the dialects, however each dialect will have
its proper CODA map. This unique map re-
spects the phonology and the morphology of
the considered dialect.
8. CODA is not a purely phonological repre-
sentation. Text in CODA can be read per-
fectly in dialect given the specific dialect
and its CODA map.
5.2 Algerian CODA
As stated above, the CODA principles apply to all dialects, but with a specific map for each dialect. In this section we present the map of the Algerian dialect (ALG) to CODA by summarizing the specific CODA guidelines for ALG. As the default, we chose the variant of ALG used in the media. This variant represents the dialect of the capital city, Algiers. ALG CODA follows the same orthographic rules as MSA, with the following exceptions and extensions.
5.3 Phonological Extensions
Long Vowels In ALG CODA the long vowel /e:/, which does not exist in MSA, is written as ay or iA depending on its MSA cognate: ay or aA, respectively. In MSA orthography the sequence iA is not possible, so it can safely be used for words whose MSA cognates have aA. This choice is suitable since the basic undiacritized form of the word is preserved, for instance, daAr /da:r/ ‘turn’ and diAr /de:r/ ‘do’. This extension is also present in Tunisian CODA but not in Egyptian CODA.
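The spelling rule for /e:/ can be sketched as a function of the MSA cognate. This is a minimal illustration, not part of the guidelines themselves: the function name write_long_e is hypothetical, the input is assumed to be the HSB transliteration of an MSA cognate, and the caller is assumed to know that the ALG word is pronounced with /e:/.

```python
def write_long_e(msa_cognate):
    """Spell an ALG word pronounced with /e:/, given its MSA cognate in HSB.

    A cognate with aA is written with iA; a cognate with ay keeps ay.
    """
    if "aA" in msa_cognate:
        # Only the diacritic changes, so the undiacritized form is preserved.
        return msa_cognate.replace("aA", "iA", 1)
    if "ay" in msa_cognate:
        return msa_cognate
    raise ValueError("cognate contains neither aA nor ay")
```

With the cognate daAr, the sketch produces diAr, the CODA spelling given above for /de:r/ ‘do’.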
Vowel Shortening As in EGY and TUN CODA, ALG long vowels are written in their long form even when they are shortened in pronunciation, as happens in certain cases such as when affixes and clitics are added. For example, mA jAb+hA+š ‘he did not bring her’ and tquwl lhm /tqullhum/ ‘you tell them’ (not tqulhm). Vowel shortening also applies to words with two long vowels. Phonologically, in DA, even if two long vowels are written, only one is pronounced long in a word; in other words, there should be only one stressed syllable in each phonological word. For instance, SaAymiyn ‘fasting’ (not Saymiyn).
5.4 Phono-Lexical Exceptions
The Algerian "qaf" The letter (ق) /q/ is used to represent the four following consonants: /q/, /g/ (as in TUN), /k/ and /ʔ/ (as in EGY). Table 2 gives some examples of exceptional pronunciation for /g/.

CODA         Pronunciation   English
baqraħ       /bagra/         Cow
qaAnaAtiyk   /ga:na:ti:k/    so …
qiAwriy      /ge:wriy/       foreign
Table 2: ALG exceptional pronunciation examples

Consonant with Multiple Pronunciations In ALG we use the MSA forms to write consonants with multiple pronunciations. The MSA form used has to be the closest to the consonant in question if the word has a corresponding MSA cognate. We give some examples in Table 3. Like TUN CODA, ALG CODA has more variations than those addressed in EGY CODA, since for the latter the efforts were focused on Cairene Arabic. Hence, ALG seems to have more MSA-like pronunciations, where the MSA spelling is simply the same as the ALG one.
Hamza Spelling A word whose MSA cognate contains a Hamza may not be spelled in ALG CODA in the same way as that cognate. In other words, the glottal stop is spelled phonologically. This feature is also present in EGY and TUN CODA. However, when the Hamza is pronounced in ALG, we apply the same MSA spelling rules. Furthermore, the glottal stop phoneme, which appears in many MSA words, has disappeared in ALG, as in the words fAs 'pickaxe' (not like MSA faÂs) and Diyb 'wolf' (not like MSA Diŷb). In addition, word-initial Hamzated Alif does not appear in ALG CODA, e.g., AlAarD /larD/ ‘earth’ (not larD).
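As a rough illustration of the Hamza rule, the sketch below replaces the Hamzated letters of the transliteration with their bare counterparts. The function name `drop_hamza` and the mapping table are assumptions made for this example and are not part of the CODA guidelines; the sketch also deliberately ignores the case where the Hamza is still pronounced in ALG and therefore keeps its MSA spelling.

```python
import unicodedata

# Hypothetical mapping (not part of the CODA guidelines): Hamzated letters of
# the transliteration and the bare letters that replace them when the glottal
# stop is not pronounced.
HAMZA_MAP = str.maketrans({
    "\u00c2": "A",  # Â, Alif with Hamza above
    "\u01cd": "A",  # Ǎ, Alif with Hamza below
    "\u0175": "w",  # ŵ, Waw with Hamza
    "\u0177": "y",  # ŷ, Ya with Hamza
    "'": None,      # stand-alone Hamza is dropped
})


def drop_hamza(word: str) -> str:
    """Spell a transliterated word without its unpronounced Hamza."""
    # Normalise first so that composed and decomposed accents behave alike.
    return unicodedata.normalize("NFC", word).translate(HAMZA_MAP)


for msa_like in ["Diŷb", "AlmrÂħ", "ÂyAm"]:
    print(msa_like, "->", drop_hamza(msa_like))
```

A real normalizer would need lexical information to decide, word by word, whether the Hamza is pronounced; a character map alone cannot make that decision.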
Definite Article If the word contains the article Al, we must distinguish between the sun letters and the moon letters. With the sun letters, the "L" is silent and the letter that follows is doubled (gemination) both in pronunciation and in writing, e.g., AlnnhAr 'day' (not AnnhAr). Conversely, with the moon letters, the ‘A’ is not pronounced, the "L" of the article is pronounced and the letter that follows is not doubled, neither in pronunciation nor in writing, e.g., Alqmar ‘the moon’ (not lqmar) (Saadane and Semmar, 2012; Biadsy et al., 2009).
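The spelling rule for the definite article can be sketched as a small function. The name `attach_article` is hypothetical, and the set of sun letters is an assumption of this sketch: it is the standard MSA set of fourteen letters written in the transliteration used here, which may not match ALG usage in every region.

```python
# Assumed set of sun letters (standard MSA inventory, transliterated).
SUN_LETTERS = frozenset(
    ["t", "θ", "d", "ð", "r", "z", "s", "š", "S", "D", "T", "Ď", "l", "n"]
)


def attach_article(stem: str) -> str:
    """Prefix the definite article Al+ to a transliterated noun stem."""
    if stem and stem[0] in SUN_LETTERS:
        # Sun letter: the letter after the article is doubled in writing.
        return "Al" + stem[0] + stem
    # Moon letter: the article is written as is, with no doubling.
    return "Al" + stem


print(attach_article("nhAr"))  # sun letter n
print(attach_article("qmar"))  # moon letter q
```

With the two examples from the text, the function yields AlnnhAr and Alqmar respectively.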
N of Number Construct ALG CODA adds the phoneme /n/ after some numerals in construct cases, e.g., sTaAšn TaAblaħ ‘16 tables’, whereas the number 16 on its own is pronounced sTaAš. This exception applies to Number Construct forms in which a number between 11 and 19 precedes a singular noun. This property also holds in TUN CODA.
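A minimal sketch of this rule follows, assuming that the numeric value of the numeral is known to the caller. The function name `number_construct` and its parameters are hypothetical.

```python
def number_construct(numeral: str, value: int, noun: str) -> str:
    """Join a transliterated numeral and the singular noun that follows it.

    `value` is the number the numeral stands for; it is passed in by the
    caller because this sketch has no lexicon of numerals.
    """
    if 11 <= value <= 19:
        # Numerals from 11 to 19 take a final n before the counted noun.
        numeral += "n"
    return f"{numeral} {noun}"


print(number_construct("sTaAš", 16, "TaAblaħ"))
```

For the example in the text, the call returns sTaAšn TaAblaħ, while the bare numeral stays sTaAš.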
5.5 Morphological Extensions
Attached clitics ALG, like many other dialects, uses almost all the attached clitics found in MSA. Its attached clitics include the definite article Al+, the future particle proclitic Ha+ (used in the east of Algeria, e.g., in the city of Annaba), the coordinating conjunction w+, and the negation particle enclitic +š. In addition, ALG uses new attached clitics that are reduced forms of their MSA counterparts, e.g., ς+, m+, h+, f+. Table 4 illustrates some of these clitics with the word wikliynaAhaAlkum ‘and we have eaten your food’.
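The tokenization of this word can be represented with a small data structure. The sketch below assumes the segmentation given in Table 4 (proclitic wi+, stem kliy, suffix +naA, enclitics +haA, +l and +kum); the class name `Segmentation` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segmentation:
    """Segments of one word, listed in the order in which they are read."""
    proclitics: List[str] = field(default_factory=list)
    stem: str = ""
    suffixes: List[str] = field(default_factory=list)
    enclitics: List[str] = field(default_factory=list)

    def surface(self) -> str:
        # Attached clitics are written without spaces around the stem.
        return "".join(
            [*self.proclitics, self.stem, *self.suffixes, *self.enclitics]
        )


word = Segmentation(
    proclitics=["wi"],
    stem="kliy",
    suffixes=["naA"],
    enclitics=["haA", "l", "kum"],
)
print(word.surface())
```

Concatenating the segments in reading order gives back the attached form wikliynaAhaAlkum.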
Separated Clitics The spelling rule for the indirect object enclitics and the negation proclitic mA is preserved in the ALG CODA map. This map writes the negation particle and the indirect object as separate tokens, each set off from the verb by a space, e.g., mA jAb lkumš /ma+jab+lkum+š/ 'he did not bring (it) to you'.
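The spacing convention can be sketched as follows. The function `negate` is a hypothetical helper that only handles the spacing described here: the proclitic mA and the indirect object are separate tokens, and the enclitic +š attaches to the last token.

```python
def negate(verb: str, indirect_object: str = "") -> str:
    """Write a negated verb as: mA <verb> [<indirect object>]š."""
    tokens = ["mA", verb]
    if indirect_object:
        # The indirect object is separated from the verb by a space.
        tokens.append(indirect_object)
    # The negation enclitic attaches to the last token of the phrase.
    tokens[-1] += "š"
    return " ".join(tokens)


print(negate("jAb", "lkum"))
print(negate("jAbhA"))
```

The first call reproduces mA jAb lkumš from the text; the second shows a verb with an attached direct object and no indirect object.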
5.6 Lexical Exceptions
The ALG CODA, like the TUN and EGY ones, contains a list of Algerian dialect words that have a specific ad hoc spelling. This spelling may be inconsistent with the CODA map introduced above, and these words are commonly spelled in several different ways. These exceptions include, for instance:
The demonstratives haðuwk (not haðukaħ) ‘that’ and hakðaA ‘like this’ (not haAkðaA, hakdaA or haAkdaA)
The expression 'I know', which is written as the phrase ςlaý baAliy (not ςambaAliy, ςan baAliy or ςlabaAliy)
CODA       Pronunciations                          English
ςjuwz      /ςadju:z/, /ςzu:z/, /ςju:z/             old woman
θaAniy     /fa:niy/, /θa:niy/                      also
Sadr       /sadr/, /Sadr/                          chest
qahwaħ     /qahwa/, /gahwa/, /kahwa/, /’ahwa/      coffee
γsal       /γsal/, /xsal/                          he washed
γaliy      /γaa:li/, /qaa:li/                      expensive
faAsdaħ    /fa:zda/, /fa:sda/                      corrupt
ðhab       /ðhab/, /dhab/, /vhab/                  gold
hbaT       /hbaT/, /HbaT/                          he descended
Table 3: Examples of multiple pronunciations in ALG.
Enclitics      Suffixes    Stem    Proclitics
kum  l  haA    naA         kliy    wi
Table 4: Tokenization of the word wikliynaAhaAlkum
The adverbs zaςmaħ (not zaςma) ‘supposedly’, Durkaħ (not Durka) ‘now’, gaAnaħ (not gaAna) ‘also’
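One simple way to apply such a list is a lookup that is consulted before any regular spelling rule. The sketch below is only an illustration: the dictionary holds just the variants listed in this section, and the names `LEXICAL_EXCEPTIONS` and `normalize_exception` are hypothetical.

```python
# Variant spelling -> CODA spelling, restricted to the examples listed above.
LEXICAL_EXCEPTIONS = {
    "haðukaħ": "haðuwk",
    "haAkðaA": "hakðaA",
    "hakdaA": "hakðaA",
    "haAkdaA": "hakðaA",
    "ςambaAliy": "ςlaý baAliy",
    "ςan baAliy": "ςlaý baAliy",
    "ςlabaAliy": "ςlaý baAliy",
    "zaςma": "zaςmaħ",
    "Durka": "Durkaħ",
    "gaAna": "gaAnaħ",
}


def normalize_exception(form: str) -> str:
    """Return the listed CODA spelling, or the form unchanged if not listed."""
    return LEXICAL_EXCEPTIONS.get(form, form)


print(normalize_exception("Durka"))
print(normalize_exception("haAkdaA"))
```

Forms that are not in the list pass through unchanged, so the regular rules of the map can still be applied to them afterwards.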
In addition, under the influence of words borrowed and integrated from other languages, such as French, Berber or Italian, new phonemes like /g/, /p/ or /v/ have emerged. These phonemes express sounds that do not exist in MSA; in CODA, we use the Arabic characters for /q/, /b/ and /f/ to write g, p and v, respectively. For example, jaAfiAl ‘detergent’, kaAvi ‘stupid’, puwpiyaħ ‘doll’, qiyduwn ‘handlebar’.
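The correspondence between the borrowed phonemes and the CODA letters can be written as a character map. This is a sketch under the assumption that the input is a pronunciation-based transliteration in which g, p and v stand for the borrowed sounds; the input spellings giyduwn and jaAviAl and the function name `map_foreign_phonemes` are hypothetical.

```python
# Borrowed phoneme -> letter used to write it in CODA (from the text above).
FOREIGN_TO_CODA = str.maketrans({"g": "q", "p": "b", "v": "f"})


def map_foreign_phonemes(word: str) -> str:
    """Rewrite g, p and v with the letters for q, b and f."""
    return word.translate(FOREIGN_TO_CODA)


print(map_foreign_phonemes("giyduwn"))
print(map_foreign_phonemes("jaAviAl"))
```

The two calls give qiyduwn and jaAfiAl, the spellings shown in the examples for ‘handlebar’ and ‘detergent’.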
6 Conclusions and Future Work
We presented in this paper a set of guidelines towards a conventional orthography for Algerian Arabic. We discussed the various challenges of working with Algerian Arabic and how we address them. In the future, we plan to use the developed guidelines to annotate collections of Algerian Arabic texts, as a first step towards developing resources and tools for Algerian Arabic processing.
Acknowledgment
The first author was supported by the DGCIS (Ministry of Industry) and DGA (Ministry of Defense): RAPID Project 'ORELO', referenced by N°142906001. The second author was supported by DARPA Contract No. HR0011-12-C-0014. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of DARPA. We would like to thank Bilel Gueni, Emad Mohamed and Djamel Belarbi for helpful feedback.
Raw Text
mrHbA bkm fy plAtw HSħ brnAmj AlxT lHmr lnhAr Alywmħ wlly ytzAmn mς ςyd AlmrÂħ. ǍnšA' Allh gAς AlnsA' Aly rAhm yšwfw fynA ǍnšA' Allh ÂyAm sςydh wjmylħ fHyAthm. ǍnšA' Allh ythnAw b mAlyhm, b wAldyhm wwlAdhm. qbl mnrwhw llmwDwς ntAς Alywmħ wAly xSSnAh llmrÂħ f AljzAyr wkyfAš rAhy ςAyšħ xlwnA nrhbw bAlDywf tς lbrnAmj.
CODA
mrHbA bkm fy blAtw HSħ brnAmj AlxT AlHmr lnhAr Alywm wAlly ytzAmn mς ςyd AlmrAħ, AnšA Allh qAς AlnsA Aly rAhm yšwfwA fynA AnšA Allh AyAm sςydħ wjmylħ fHyAthm. AnšA Allh ythnAwA bmAlyhm, bwAldyhm wwlAdhm. qbl mA nrwhwA llmwDwς tAς Alywm wAlly xSSnAh llmrAħ fAljzAyr wkfAš rAhy ςAyšħ xlwnA nrhbwA bAlDywf tAς AlbrnAmj.
English
Hello everyone, and welcome to today's episode of the show « The Red Line », which coincides with Women's Day. God willing, all the women who are watching us will have happy and beautiful days in their lives. God willing, they will rejoice in their families, parents and children. Before addressing the topic of the day, which we devote to women in Algeria and how they live, let's welcome our program's guests.
Table 5: An example sentence in ALG
References
Abdenour Arezki. 2008. Le rôle et la place du français dans le système éducatif algérien. Revue du Réseau des Observatoires du Français Contemporain en Afrique, (23), 21-31.
Mohamed Benrabah. 1999. Langue et pouvoir en Algérie: Histoire d'un traumatisme linguistique. Seguier Editions.
Fadi Biadsy, Nizar Habash and Julia Hirschberg. 2009. Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules. The 2009 Annual Conference of the North American Chapter of the ACL, pages 397–405, Boulder, Colorado.
Marcel Cohen. 1912. Le parler arabe des Juifs d’Alger. Champion: Paris.
Yacine Derradji, Valéry Debov, Ambroise Queffélec, Dalila S. Dekdouk and Yasmina C. Benchefra. 2002. Le français en Algérie : lexique et dynamique des langues, Ed. Duclot, AUF, 2002, 590 p.
Ramy Eskander, Nizar Habash, Owen Rambow and Nadi Tomeh. 2013. Processing Spontaneous Orthography. In Proceedings of Conference of the North American Association for Computational Linguistics (NAACL), Atlanta, Georgia.
Charles A. Ferguson. 1959. Diglossia. Word-Journal of the International Linguistic Association, 1959, vol. 15, no 2, p. 325-340.
Hassan A. Gadalla. 2000. Comparative Morphology of Standard and Egyptian Arabic (Vol. 5). Lincom Europa.
Noureddine Guella. 2011. Emprunts lexicaux dans des dialectes arabes algériens. Synergies Monde arabe, 8, 81-88.
Nizar Habash, Abdelhadi Soudi and Tim Buckwalter. 2007. On Arabic Transliteration. Book Chapter. In Arabic Computational Morphology: Knowledge-based and Empirical Methods. Editors Antal van den Bosch and Abdelhadi Soudi.
Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies, Graeme Hirst, editor. Morgan & Claypool Publishers.
Nizar Habash, Mona Diab and Owen Rambow. 2012a. Conventional Orthography for Dialectal Arabic. In: Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul.
Nizar Habash, Ramy Eskander and Abdelati Hawwari. 2012b. A Morphological Analyzer for Egyptian Arabic. In the Proceedings of the Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON) in the North American chapter of the Association for Computational Linguistics (NAACL), Montreal, Canada.
Salima Harrat, Karima Meftouh, Mourad Abbas and Kamel Smaïli. 2014. Grapheme To Phoneme Conversion-An Arabic Dialect Case. In Spoken Language Technologies for Under-resourced Languages.
Salima Harrat, Karima Meftouh, Mourad Abbas and Kamel Smaili. 2014. Building Resources for Algerian Arabic Dialects. Corpus (sentences), 4000(6415), 2415.
Salima Harrat, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz Saad, and Kamel Smaili. 2015. Cross-Dialectal Arabic Processing. In Computational Linguistics and Intelligent Text Processing (pp. 620-632). Springer International Publishing.
Khawla T. Ibrahimi. 1997. Les Algériens et leur (s) langue (s): éléments pour une approche sociolinguistique de la société algérienne. Éds. El Hikma.
Khawla T. Ibrahimi. 2006. L’Algérie: coexistence et concurrence des langues. L’Année du Maghreb, (I), 207-218.
Mustafa Jarrar, Nizar Habash, Diyam Akra and Nasser Zalmout. 2014. Building a Corpus for Palestinian Arabic: a Preliminary Study. ANLP 2014, 18.
Mohamed Maamouri, Tim Buckwalter and Christopher Cieri. 2004. Dialectal Arabic telephone speech corpus: Principles, tool design, and transcription conventions. In NEMLAR International Conference on Arabic Language Resources and Tools, Cairo (pp. 22-23).
William Marçais. 1902. Le dialecte arabe parlé à Tlemcen: grammaire, textes et glossaire (Vol. 26). E. Leroux.
Philippe Marçais. 1956. Le parler arabe de Djidjelli: Nord constantinois, Algérie (Vol. 16). Librairie d'Amérique et d'Orient Adrien-Maisonneuve.
Dalila Morsly. 1986. Multilingualism in Algeria. The Fergusonian Impact: In Honor of Charles A. Ferguson on the Occasion of His, 65.
Houda Saadane, Aurélie Rossi, Christian Fluhr and Mathieu Guidère. 2012. Transcription of Arabic names into Latin. In Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), 2012 6th International Conference on (pp. 857-866). IEEE.
Houda Saadane and Nasredine Semmar. 2013. Transcription des noms arabes en écriture latine. Revue RIST| Vol, 20(2), 57.
Lameen Souag. 2005. Notes on the Algerian Arabic dialect of Dellys. Estudios de dialectología norteafricana y andalusí, 9, 1-30.
Ines Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Belguith, and Nizar Habash. 2014. A Conventional Orthography for Tunisian Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.