Machine Translation for Arabic Dialects (Survey)
Salima Harrat (a), Karima Meftouh (b), Kamel Smaili (c)
(a) École Supérieure d’Informatique (ESI), École Normale Supérieure de Bouzaréah (ENSB), Algeria
(b) Badji Mokhtar University, Annaba, Algeria
(c) Campus Scientifique LORIA, France
Abstract
Arabic dialects, also called colloquial Arabic or vernaculars, are spoken
varieties of Standard Arabic. These dialects have mixed forms with many
variations due to the influence of ancient local tongues and of other
languages such as European ones. Many of these dialects are mutually
incomprehensible. Until recently, Arabic dialects were not written and were
used only in spoken form. Nowadays, with the advent of the internet and
mobile telephony technologies, these dialects are increasingly used in
written form. Indeed, this kind of communication has brought everyday
conversations into a written format. This allows Arab people to use their
dialects, which are their actual native languages, for expressing their
opinions on social media, for chatting, texting, etc. This growing use
opens a new research direction for Arabic natural language processing
(NLP). In this paper, we focus on machine translation in the context of
Arabic dialects. We provide a survey of recent research in this area. For
each study, we give a detailed description of the adopted approach and its
most relevant contribution.
Keywords: Arabic dialect, Modern Standard Arabic, Machine translation
Preprint submitted to Elsevier August 23, 2017
1. Introduction
Arabic dialects are informal spoken languages used all over the Arab
countries. They are used in everyday life, in contrast to modern standard
Arabic (MSA), which is used in official speeches, newspapers, school, etc.
This coexistence of two variants of a language in the same community is
known as diglossia, which is defined in (Ferguson, 1959) as: “A relatively
stable language situation in which, in addition to the primary dialects of
the language, there is a very divergent highly codified superposed variety,
the vehicle of a large and respected body of written literature which is
learned largely by formal education and is used for most written and formal
spoken purposes, but is not used by any sector of the community for
ordinary conversation”. This linguistic phenomenon exists in all Arab
countries. Furthermore, in the last decade these dialects have emerged in
social networks, SMS, TV shows, etc., and are increasingly used even in
written form. This usage generates new needs in the NLP area. Indeed, these
dialects are under-resourced in terms of NLP tools, and the tools built for
MSA are not adapted to process them.
In this paper, we focus on machine translation of Arabic dialects. This
area has become an interesting research field because of the many
challenges to overcome. In fact, Arabic dialects differ from Standard
Arabic at the phonological, lexical, morphological and syntactic levels. On
the one hand, they simplify a wide range of written Arabic rules¹; on the
other hand, they add new rules² that generate a lot of complexity. In
addition, these dialects (especially Maghrebi ones) are influenced by other
languages such as French, Spanish, Turkish and Berber. Besides being
different from Standard Arabic, these dialects also differ from each other;
even within the same country they are not the same.
¹ The dual form, for example, as well as the feminine plural form used in
standard Arabic do not exist in most Arabic dialects.
² Standard Arabic has a strong case system where most cases are denoted by
diacritics. In Arabic dialects, there is no grammatical case, which
generates more syntactic ambiguities than in MSA. Also, verb negation in
dialects is more complex than in MSA: the negation circumfix is placed
around the verb together with all its prefixes and its suffixed direct and
indirect object pronouns.
2. NLP challenges for Arabic dialects
Arabic dialects, despite their wide use, are under-resourced languages that
lack basic NLP tools. Except for some work dedicated to Middle Eastern
dialects (mostly Egyptian), they have not been studied enough from an NLP
perspective. Most MSA resources and tools are not adapted to them and do
not take their features into account. The reader can refer to (Habash,
2010), a comprehensive survey on Arabic NLP that focuses on MSA but
includes many interesting notes on practical issues concerning Arabic
dialects, or to (Shoufan and Al-Ameri, 2015), where the authors report all
available tools and resources recently produced for these dialects. One of
the main issues of Arabic dialects is that they have no conventional
orthography. Their wide use (because of the advent of Internet
technologies) produces large volumes of data which are difficult to exploit
and require significant pre-processing.
In the area of machine translation, research efforts on Arabic dialects are
still at an early stage. Rule-based approaches are difficult to envisage
because dedicated tools are unavailable for most of these dialects. Indeed,
these approaches are being used less and less in MT systems because they
are time consuming and require substantial linguistic resources. MT systems
based on them are also difficult to maintain: adding new linguistic
features involves updating existing rules or adding new ones. For Arabic
dialects, these approaches are even more problematic. These dialects are
not traditionally written and have no strong theoretical linguistic studies
that could support such an approach. In addition, they differ from one Arab
country to another, and significant variations exist even within the same
country, so no rule-based MT system could take all the related features
into account. On the other hand, data-driven approaches are also hard to
consider due to the lack of resources such as parallel and even monolingual
corpora. In the context of statistical machine translation (SMT), Arabic
dialects lack bi-texts of a reasonable size that would allow efficient SMT
systems to be built readily.
It should be noted that this issue does not arise only for Arabic dialects;
it also concerns several other under-resourced languages, and many research
activities focus on machine translation in the context of under-resourced
or non-resourced languages. The main idea of these contributions is to
exploit the proximity between an under-resourced language and the closest
related resourced language: Cantonese–Mandarin (Zhang, 1998), Czech–Slovak
(Hajič et al., 2000), Turkish–Crimean Tatar (Altintas and Cicekli, 2002),
Irish–Scottish Gaelic (Scannell, 2006), Indonesian–English using Malay
(Nakov and Ng, 2012) and Standard Austrian German–Viennese dialect (Haddow
et al., 2013).
3. Machine translation related to Arabic dialects
In this section we present the most important studies dedicated to machine
translation of Arabic dialects. We first introduce research on machine
translation between modern standard Arabic and its dialects. Then, we focus
on MT between foreign languages and Arabic dialects. In this context, we
point out that all contributions mainly concern English (as we will see
later). We attempt to draw a clear picture of each study by describing its
approach, the data used and the results achieved. We will show that most of
them exploit the proximity between these dialects and MSA, and attempt to
use available MSA resources to deal with Arabic dialects.
3.1. Translating between MSA and Arabic dialects
Bakr et al. (2008) presented a generic approach for converting an Egyptian
colloquial Arabic sentence into a vocalized MSA sentence. They combined a
statistical approach, to automatically tokenize and tag Arabic sentences,
with a rule-based approach for creating the target diacritized MSA
sentence. The work was evaluated on a dataset of 1K Egyptian dialect
sentences (800 for training and 200 for testing). The system achieved an
accuracy of 88% for converting dialect words to MSA words, and 78% for
placing these words in their correct order.
Elissa (Salloum and Habash, 2012) is a rule-based machine translation
system from Arabic dialects to MSA. It handles Levantine, Egyptian, Iraqi,
and to a lesser degree Gulf Arabic. After identifying dialectal words in a
source sentence, Elissa produces MSA paraphrases using the ADAM dialectal
morphological analyzer (Salloum and Habash, 2011), morphological transfer
rules and dialect-MSA dictionaries. These paraphrases are used to form an
MSA lattice that is passed through a language model (LM) for n-best
decoding, from which the best MSA translations are selected. No evaluation
was provided in that paper.³
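The lattice-and-language-model step can be illustrated with a minimal
sketch. It is not Elissa's implementation: the paraphrase options are
placeholders, a hand-written bigram table stands in for a trained MSA
language model, the function name best_path is hypothetical, and only the
single best path is returned rather than an n-best list.

```python
def best_path(lattice, bigram_logprob, unk=-10.0):
    """Viterbi search: choose one option per slot to maximize the bigram LM score."""
    # best[word] = (score, path) of the best partial path ending in that word
    best = {"<s>": (0.0, [])}
    for options in lattice:
        new_best = {}
        for word in options:
            candidates = [
                (score + bigram_logprob.get((prev, word), unk), path + [word])
                for prev, (score, path) in best.items()
            ]
            new_best[word] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Hypothetical lattice: each slot lists candidate MSA paraphrases (placeholders).
lattice = [["w1a", "w1b"], ["w2a"], ["w3a", "w3b"]]
# Hypothetical bigram log-probabilities standing in for an MSA language model.
bigram_logprob = {
    ("<s>", "w1a"): -2.0,
    ("<s>", "w1b"): -1.0,
    ("w1a", "w2a"): -3.0,
    ("w1b", "w2a"): -0.5,
    ("w2a", "w3a"): -0.7,
}
print(best_path(lattice, bigram_logprob))
```

Unseen bigrams receive a fixed penalty (unk), which is a simplification of
the back-off a real language model would apply.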
Mohamed et al. (2012) presented a rule-based approach to produce Colloquial
Egyptian Arabic (CEA) from modern standard Arabic. They applied it to the
Part-Of-Speech (POS) tagging task, for which accuracy improved from 73.24%
to 86.84% on unseen CEA text and the percentage of Out-Of-Vocabulary (OOV)
words decreased from 28.98% to 16.66%.
Al-Gaphari and Al-Yadoumi (2012) used a rule-based approach to convert the
Sanaani dialect to MSA. Their system reached an accuracy of 77.32% when
tested on a Sanaani corpus of 9386 words.
³ Elissa is evaluated later in (Salloum and Habash, 2013).
Hamdi et al. (2013) presented a translation system between MSA and Tunisian
dialect verbal forms. The work is based on deep morphological
representations of roots and patterns, which are an important feature of
Arabic and its variants (dialects). The approach is similar to those used
in (Mohamed et al., 2012), (Sawaf, 2010) and (Salloum and Habash, 2013) but
is characterized by a deep morphological representation based on MAGEAD
(Habash and Rambow, 2006), a morphological analyzer and generator for the
Arabic dialects. The system translates in both directions (MSA to Tunisian
dialect and vice versa). It reached a recall of 84% from dialect to MSA and
80% in the opposite direction.
For translating Moroccan dialect to MSA, a rule-based approach relying on a
language model was used in (Tachicart and Bouzoubaa, 2014). The system is
based on morphological analysis with the Alkhalil morphological analyzer
(Boudlal et al., 2010), adapted and extended with Moroccan dialect affixes,
and on a bilingual dictionary (built from television production scripts and
data collected from the web). After an identification step which separates
dialectal data from MSA, the text is analyzed and segmented into annotated
dialect units. These units are linked to one or more corresponding MSA
units by using the bilingual dictionary. In the generation step, MSA
phrases are produced and then passed to a language model to produce the
most fluent MSA sentences (no evaluation was given for this work).
Sadat et al. (2014) provided a framework for translating Tunisian dialect
text from social media into MSA. The work is based on a bilingual lexicon
created for this context. It adopts a set of grammatical mapping rules with
a disambiguation step which relies on an MSA language model to select the
best translation phrases. It should be noted that the translation system is
word-based. It achieves a BLEU (Papineni et al., 2002) score of 14.32 on a
test set of 50 Tunisian dialect sentences (the reference was made by hand).
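Since most results in this survey are reported as BLEU scores or BLEU
differences, the following minimal sketch recalls how the metric is
computed. It assumes a single reference per sentence, pre-tokenized input,
a 0-100 scale and no smoothing; the function names are illustrative, and
the evaluation scripts used by the surveyed studies may differ in
tokenization and other details.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it occurs
            # in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: hypotheses shorter than the references are penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

hyp = [["a", "b", "c", "d"]]
ref = [["a", "b", "c", "d", "e"]]
print(corpus_bleu(hyp, ref))
```

In the toy call, every n-gram of the hypothesis matches the reference, so
the score is lowered only by the brevity penalty.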
Meftouh et al. (2015) presented PADIC, a multi-dialect Arabic corpus that
includes MSA, Maghrebi dialects (Algerian and Tunisian) and Levantine
dialects (Palestinian and Syrian). Unlike other contributions, several
experiments were performed with different SMT systems between all pairs of
languages (MSA and dialects). The authors analyzed the impact of the
language model on machine translation by varying the smoothing techniques
and by interpolating it with a larger one. The best translation results
were achieved between the dialects of Algeria, which is not surprising
since they share a large part of their vocabulary. It was also shown that
the performance of machine translation between Palestinian and Syrian was
relatively high because of the closeness of the two dialects. Concerning
MSA, the best machine translation results were achieved with the
Palestinian dialect.
In Table 1, we summarize all the work cited above in terms of the dialects
concerned and the translation direction.
Table 1: MT work between Arabic dialects and MSA (source and target languages)
Work                               Source                                   Target
(Bakr et al., 2008)                Egyptian                                 MSA
(Salloum and Habash, 2012)         Levantine, Egyptian, Iraqi, Gulf Arabic  MSA
(Mohamed et al., 2012)             MSA                                      Egyptian
(Al-Gaphari and Al-Yadoumi, 2012)  Sanaani (Yemenite)                       MSA
(Hamdi et al., 2013)               Tunisian                                 MSA
                                   MSA                                      Tunisian
(Tachicart and Bouzoubaa, 2014)    Moroccan                                 MSA
(Sadat et al., 2014)               Tunisian                                 MSA
(Meftouh et al., 2015)             Algerian, Tunisian,                      MSA
                                   Syrian and Palestinian
                                   MSA                                      Algerian, Tunisian,
                                                                            Syrian and Palestinian
3.2. Translating between Arabic dialects and foreign languages
Sawaf (2010) built a hybrid MT system combining statistical and rule-based
approaches. This system translates from Arabic dialects (spontaneous and
noisy text from broadcast transmissions and web content) to English using
MSA as a pivot language. Dialect texts are normalized into MSA using simple
character-based rules that convert words into the most similar MSA words;
the text is then analyzed by a dialect-specific and an MSA morphological
analyzer. The results are fed into a dialect normalization decoder that
relies on language models and a lexicon. The work deals with a set of
Arabic dialects: Levantine (Lebanese, North Syria, Damascus, Palestine and
Jordan), Gulf Arabic (Northern Iraq, Baghdad, Southern Iraq, Gulf,
Saudi-Arabia, and Southern Arabic Peninsula), Nile Region (Egypt and Sudan)
and Maghreb Arabic (Libya, Morocco and Tunisia). The results showed that
hybrid MT performs better than statistical MT and rule-based MT, and that
normalizing and processing the text (both training and test corpora)
improves translation quality in terms of BLEU by 2% for Web text and about
1% for broadcast news/conversations.
In (Salloum and Habash, 2011), the authors improved an Arabic-English SMT
system by producing MSA paraphrases for OOV dialectal words and
low-frequency words through a light-weight rule-based approach. They
created ADAM (Arabic Dialect Morphological Analyzer) by extending the
well-known BAMA (Buckwalter, 2004) with Levantine/Egyptian dialectal
affixes and clitics. In addition to ADAM, they used a set of hand-written
morpho-syntactic transfer rules. This allows generating paraphrases that
are input as a lattice to a state-of-the-art phrase-based SMT system. This
last point is the main difference between this work and the one presented
above (Sawaf, 2010): Sawaf's system produces a unique MSA version for a
dialect word, whereas this one produces multiple MSA paraphrases (or
alternative normalizations). Two SMT systems were built within this work,
trained under two different data conditions: an MSA(only)–English parallel
corpus (12M words on the Arabic side) and a large (MSA&Dialect)–English
parallel corpus (64M words on the Arabic side). When evaluated on a blind
test set, the SMT system trained on the large corpus and using ADAM and
transfer rules outperformed the baseline system (an SMT system trained on
the same data) by 0.56 absolute BLEU.
The same authors presented in (Salloum and Habash, 2013) a manual
evaluation of Elissa (cited above). It showed that 93% of the MSA sentences
produced by Elissa were correct. In addition, Elissa was used for pivoting
through MSA in a dialect-English SMT system, whose BLEU score improved by
between 0.6% and 1.4%.
Sajjad et al. (2013) provided a dialectal Egyptian Arabic to English
statistical machine translation system. They converted Egyptian to MSA by
applying a character-level transformational model (including morphological,
phonological and spelling changes) learned from Egyptian-MSA word pairs.
The MT system built on the adapted parallel data showed an improvement in
translation quality. The transformation reduces the OOV word rate from 5.2%
to 2.6% and improves the BLEU score by 1.87 points, whereas adapting large
MSA/English parallel data reduces the OOV rate significantly, to 0.7%, and
leads to an absolute BLEU increase of 2.73 points.
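The OOV rate that several of these studies try to reduce can be computed as
in the sketch below. It assumes a token-level definition (the share of
running test words unseen in training), which the survey does not state
explicitly; the vocabulary, the test tokens and the dialect-to-MSA mapping
are placeholders, and the normalize helper is hypothetical.

```python
def oov_rate(test_tokens, training_vocab):
    """Percentage of running test tokens that never occur in the training data."""
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in training_vocab)
    return 100.0 * unseen / len(test_tokens)

def normalize(tokens, dialect_to_msa):
    """Replace dialectal tokens by an MSA equivalent when one is known."""
    return [dialect_to_msa.get(tok, tok) for tok in tokens]

# Placeholder data: "x" and "y" are dialectal words unseen in training.
training_vocab = {"a", "b", "c"}
test_tokens = ["a", "x", "b", "y"]
dialect_to_msa = {"x": "c"}  # hypothetical learned mapping

print(oov_rate(test_tokens, training_vocab))
print(oov_rate(normalize(test_tokens, dialect_to_msa), training_vocab))
```

Mapping a dialectal word to an in-vocabulary MSA word lowers the rate, which
is the effect the conversion step aims for before translation.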
Salloum et al. (2014) explored the impact of sentence-level dialect
identification, used with various linguistic features, on machine
translation performance. They attempted to optimize the selection of
outputs produced by different MT systems given an input text including a
mixture of dialects and MSA. The study concerns machine translation from
Arabic dialects, namely Egyptian and Levantine, to English. Four MT systems
were used for this purpose. The first three are SMT systems trained on
different corpora⁴: dialect-English (5M tokenized words of Egyptian and
Levantine), MSA-English (57M tokenized words) and dialect+MSA-English (62M
tokenized words). The fourth is an MSA-pivoting system that combines a
dialect-to-MSA MT system (Salloum and Habash, 2013) and an Arabic-English
SMT system. This last system is trained on the dialect+MSA-English corpus
augmented with the dialect-English corpus whose dialectal side has been
preprocessed with the dialect-to-MSA MT system cited above (Salloum and
Habash, 2013). The size of this training corpus is 67M. We note that the
MSA-pivoting system produces the best BLEU score among all systems; it is
the first baseline system.
⁴ Similar to (Zbib et al., 2012) discussed further below.
In this work, the same MT algorithms are used for training, tuning and
testing each MT system, but each system is trained on a different dataset
(as we saw above) in terms of the degree of dialectness of the source
language. An interesting approach was adopted in this research: instead of
looking for the best-performing MT system, the authors tried to identify
automatically the most suitable MT system for a given sentence. They assume
that these systems complement each other and that combining their
selections could lead to better overall performance. A baseline MT system
selection based on binary classification was built using a sentence-level
dialect identifier (Elfardy and Diab, 2013). This baseline selection system
decides which MT system to use among the four described above. According to
the authors, the best configuration is to select the MSA-English system for
sentences tagged as MSA and the MSA-pivoting system for sentences tagged as
dialectal.
The main contribution of this work is an MT selection system created using
machine learning techniques trained only on source-language features to
select the MT system that should translate each sentence in the test set.
This selection system is a Naive Bayes Classifier (NBC) with four classes
corresponding to the four MT systems. The training data of the classifier
is a set of 5562 sentences, each labeled with the MT system that produced
the highest sentence-level BLEU score. The NBC uses a set of basic
features: token-level features, which use language models, MSA and
dialectal morphological analyzers and a dialectal lexicon to decide whether
each word is MSA, dialectal, both, or OOV; and perplexity features, namely
the perplexity of the sentence computed with each of the two language
models (MSA and dialect). In addition, the classifier uses some extended
features extracted from the cited dialect-MSA MT system, such as sentence
length (in words), number of punctuation marks, and number of words written
in Latin script. Another set of extended features includes the sentence
perplexity computed on the source side of the training data of each of the
four MT systems. Using the NBC to predict the best MT system for
translating a sentence improved the BLEU score by 1% over the best score
recorded for a single MT system (the MSA-pivoting system). It also
outperforms the baseline selection system by 0.6% BLEU.
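The selection scheme can be sketched as follows. This is only an
illustration under strong assumptions: the two discretized features, the
sentence-level BLEU values, the system names and the class NaiveBayesSelector
are all hypothetical, and the toy data exercises two of the four classes.
The actual feature set described above is richer and includes continuous
perplexity values.

```python
import math
from collections import Counter, defaultdict

def oracle_label(bleu_by_system):
    """Training label: the system whose output had the highest sentence-level BLEU."""
    return max(bleu_by_system, key=bleu_by_system.get)

class NaiveBayesSelector:
    """Categorical Naive Bayes with add-alpha smoothing over discrete features."""

    def fit(self, feature_dicts, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for feats, label in zip(feature_dicts, labels):
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def _log_posterior(self, label, feats):
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for name, value in feats.items():
            counts = self.value_counts[(label, name)]
            numerator = counts[value] + self.alpha
            denominator = sum(counts.values()) + self.alpha * max(len(self.values[name]), 1)
            score += math.log(numerator / denominator)
        return score

    def predict(self, feats):
        return max(self.class_counts, key=lambda label: self._log_posterior(label, feats))

# Hypothetical training sentences: discretized features plus per-system sentence BLEU.
train = [
    ({"dialectal_words": "many", "latin_script": "no"},
     {"dialect": 20.0, "msa": 5.0, "dialect+msa": 18.0, "msa-pivot": 25.0}),
    ({"dialectal_words": "few", "latin_script": "no"},
     {"dialect": 8.0, "msa": 30.0, "dialect+msa": 27.0, "msa-pivot": 26.0}),
    ({"dialectal_words": "many", "latin_script": "yes"},
     {"dialect": 15.0, "msa": 3.0, "dialect+msa": 14.0, "msa-pivot": 17.0}),
    ({"dialectal_words": "few", "latin_script": "no"},
     {"dialect": 6.0, "msa": 22.0, "dialect+msa": 21.0, "msa-pivot": 20.0}),
]
features = [feats for feats, _ in train]
labels = [oracle_label(scores) for _, scores in train]
selector = NaiveBayesSelector().fit(features, labels)
print(selector.predict({"dialectal_words": "many", "latin_script": "no"}))
```

At test time only source-side features are needed, which matches the
constraint that the selector must choose a system before any translation is
scored.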
Jeblee et al. (2014) presented an SMT system that translates (in contrast
to all other research efforts) from English to an Arabic dialect by
pivoting through MSA. The translator is based on a core SMT system trained
on a parallel English-MSA corpus (5M sentence pairs); the output of this
system is translated into the Egyptian dialect using dialect and domain
adaptation systems. It should be noted that, for the adaptation systems,
the authors created a three-way parallel corpus (English, MSA and Egyptian
dialect) of 100k sentences by using a rule-based method. For ease of
reading, we refer to the sides of this corpus as Eng-100k, MSA-100k and
Egy-100k. Two variants of the adaptation system were presented. In the
first variant, the English side (Eng-100k) of the tri-parallel corpus is
translated to MSA with the core SMT system (we call the result
MSA-100k-trans). This dataset is used with the Egyptian side (Egy-100k) of
the corpus as training data for translating from MSA to Egyptian. An
English test set is translated to MSA (using the core English-MSA SMT
system), and the result is then translated into the Egyptian dialect using
the SMT system trained on the parallel corpus (MSA-100k-trans, Egy-100k).
The second variant includes two adaptation steps: the first adapts the MSA
output of the core system to the domain of the MSA side of the tri-parallel
corpus, and the second translates the MSA output of the domain adaptation
system into Egyptian Arabic. An English test set is translated to MSA with
the core SMT system, and the result is then translated by the first
adaptation system, trained on (MSA-100k-trans, MSA-100k). The output of
this step is then translated into Egyptian by the second adaptation system,
trained on (MSA-100k, Egy-100k). The main result of this work is that MT
quality can be increased by using domain adaptation between MSA and the
Egyptian dialect, in the same way as adapting between different domains of
the same language. Furthermore, using MSA as a pivot and then adapting to
the dialect could improve MT performance.
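The data flow of the two variants can be made explicit with a short sketch.
The translation systems are replaced by placeholder functions, and the names
build_training_pairs and pivot_translate are hypothetical; only the way the
training pairs are derived and the stages are chained follows the
description above.

```python
def build_training_pairs(tri_corpus, core_translate):
    """Derive the adaptation training sets from (English, MSA, Egyptian) triples."""
    variant1, domain_adapt, dialect_adapt = [], [], []
    for eng, msa, egy in tri_corpus:
        msa_trans = core_translate(eng)        # one sentence of MSA-100k-trans
        variant1.append((msa_trans, egy))      # variant 1: MSA-100k-trans -> Egy-100k
        domain_adapt.append((msa_trans, msa))  # variant 2, step 1: MSA-100k-trans -> MSA-100k
        dialect_adapt.append((msa, egy))       # variant 2, step 2: MSA-100k -> Egy-100k
    return variant1, domain_adapt, dialect_adapt

def pivot_translate(sentence, *stages):
    """Apply translation stages left to right (English -> MSA -> ... -> Egyptian)."""
    for stage in stages:
        sentence = stage(sentence)
    return sentence

# Placeholder for the core English-MSA system; a real system would be trained SMT.
def core(sentence):
    return "msa~(" + sentence + ")"

pairs = build_training_pairs([("e1", "m1", "g1")], core)
print(pairs)
```

With trained systems, variant 1 would be called as pivot_translate(s, core,
msa_to_egy) and variant 2 as pivot_translate(s, core, domain_adapt_sys,
dialect_adapt_sys), where the last three names are again hypothetical.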
Al-Mannai et al. (2014) proposed unsupervised morphological segmentation
for Arabic dialects to improve machine translation quality. The study
concerned Qatari Arabic to English SMT. It was shown that segmentation with
Morfessor (Siivola et al., 2007), an unsupervised morphological segmenter,
improves translation quality compared to a system without any segmentation
or to a system using Arabic Treebank (ATB) segmentation. In addition, a
multi-dialectal word segmentation model was trained on the Arabic part of a
parallel corpus including Qatari Arabic, Egyptian, Levantine, MSA and
English. This segmented corpus was used to train the Qatari Arabic to
English SMT system, and the BLEU score increased by 1.5 points compared to
a baseline system which does not use segmentation. In the other direction,
a preliminary SMT system was trained to translate English to Qatari Arabic
using the same parallel corpus without segmentation and by training the
language model with other dialect corpora. The best system shows an
absolute improvement of 0.22 BLEU over the baseline system, which uses only
the Arabic side of the Qatari Arabic corpus for language model (LM)
training.
Durrani et al. (2014) improved Egyptian-to-English translation quality by
handling OOV words. They first convert Egyptian to MSA, using a large
monolingual language model to score the MSA candidates for Egyptian OOV
words (via a stack-based search with a beam-search algorithm). These
candidates are obtained mainly through spelling correction and by
suggesting synonyms based on context. The MSA results are then translated
to English via an SMT system. They showed that spelling-based correction
could improve the BLEU score by 1.7 points over the baseline system that
translates unedited Egyptian into English. This work introduced an
interesting idea for mapping Egyptian words into MSA by applying a
convolution model using English as a pivot; the model relies on two corpora
of 8.5K parallel Egyptian-English sentences and 300K MSA-English sentences.
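The spelling-based part of this idea can be sketched as candidate generation
by edit distance followed by a frequency-based ranking. This is a simplified
stand-in, not the authors' method: unigram counts replace the large language
model and beam search, the romanized words and counts are invented for
illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def msa_candidates(oov_word, msa_unigram_counts, max_dist=1, top_k=3):
    """Rank in-vocabulary MSA words close to the OOV word by distance, then frequency."""
    near = [(edit_distance(oov_word, word), -count, word)
            for word, count in msa_unigram_counts.items()]
    near = [c for c in near if c[0] <= max_dist]
    return [word for _, _, word in sorted(near)[:top_k]]

# Invented romanized vocabulary with counts, for illustration only.
counts = {"kitab": 50, "katib": 20, "kalb": 5}
print(msa_candidates("ktab", counts))
```

A context-aware language model would then choose among the candidates in
the sentence, which is the role of the search described above.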
BOLT Project⁵. DARPA launched the Broad Operational Language Translation
(BOLT) program (2011-2014) to create new techniques for automated
translation and linguistic analysis that can be applied to the informal
genres of text and speech common in online and in-person communication in
English, Chinese and Egyptian Arabic. BOLT has three technical areas:
developing algorithms and integrated systems to support translation, data
collection, and evaluation. Under this program, in (Zbib et al., 2012), two
parallel corpora, Levantine-English (1.1M words) and Egyptian-English (380K
words), were built by translating into English parts extracted from a large
corpus of Arabic web texts. Classification by dialect and translation were
done using Amazon’s Mechanical Turk. The authors performed several
experiments on an SMT system using these corpora in addition to an
MSA-English parallel corpus (150M tokens on the Arabic side). It was shown
that morphological segmentation (using the MADA morphological analyzer
(Habash and Rambow, 2005)) uniformly improves translation quality. The work
also studied the impact of dialectal training data size on MT performance.
The authors show that a system trained on the combined dialectal-MSA data
is likely to give the best performance, since informal Arabic data is
usually a mixture of dialectal Arabic and MSA. Another interesting result
concerns pivoting through MSA versus translating directly from dialect into
English (the experiment was performed for Levantine only). In a first
experiment, the performance of the system improves by 2.3 BLEU points when
pivoting through MSA; but when more dialectal data (400k words) is added to
the training set, direct translation becomes better than mapping to MSA,
despite the significantly lower OOV rate with MSA mapping.
⁵ http://www.darpa.mil/program/broad-operational-language-translation
Aminian et al. (2014) dealt with OOV words in the context of an Arabic to
English SMT system. They adopted an approach that normalizes dialectal
words to MSA words by using AIDA⁶ (Elfardy et al., 2014) and MADAMIRA⁷
(Pasha et al., 2014) to identify dialectal Arabic OOV words and replace
them with their MSA equivalents. When tested on a blind test set, this
approach improved SMT quality by 0.4% and 0.3% absolute BLEU for AIDA and
MADAMIRA, respectively.
⁶ A dialect identification tool that identifies and classifies dialectal
words at the token and sentence levels.
⁷ A morphological analysis and disambiguation system for MSA and Egyptian
dialect.
Within the BOLT program, (Aransa, 2015) focused on Arabic dialect to
English translation, especially for the Egyptian dialect. Several
techniques were implemented, such as adapting SMT systems to the Egyptian
dialect, since the training corpora available in the context of the BOLT
project contain MSA and several dialects (Egyptian, Levantine and Iraqi).
The performance of the system was improved by treating the different
dialects as different domains. An example of an adaptation technique is
instance weighting of translation models, which improves translation
quality by giving more weight to Egyptian than to MSA or other Arabic
dialects. It should be noted that the systems were also adapted by using
data selection techniques, because the training data include various genres
(News, Web, Discussion forums, SMS/CHAT). Data selection techniques consist
of selecting the relevant sentences from monolingual corpora to improve and
adapt the language models, or selecting the most relevant sentences from
the bilingual corpora to improve the translation models. Another way of
improving system performance and translation quality was morphological
segmentation, for which several segmentation schemes were evaluated.
Furthermore, proper noun transliteration was performed in order to deal
with out-of-vocabulary words and decrease the OOV rate.
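Data selection of this kind can be sketched as follows. The survey does not
say which selection criterion was used, so the cross-entropy difference
between an in-domain and a general language model is an assumption made
here for illustration; the add-one smoothed unigram models, the placeholder
tokens and the function names are likewise hypothetical, and sentences are
assumed to be non-empty.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Add-one smoothed unigram model returned as a token -> probability function."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab_size)

def cross_entropy(sentence, lm):
    """Average negative log-probability per token (in nats)."""
    return -sum(math.log(lm(tok)) for tok in sentence) / len(sentence)

def select_sentences(pool, in_domain, general, keep):
    """Keep the pool sentences that look most in-domain relative to general data."""
    lm_in, lm_gen = unigram_lm(in_domain), unigram_lm(general)
    ranked = sorted(pool, key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen))
    return ranked[:keep]

# Placeholder corpora: "d*" tokens are typical of the target dialect, "g*" of general data.
in_domain = [["d1", "d2"], ["d1", "d3"]]
general = [["g1", "g2"], ["g1", "g3"]]
pool = [["g1", "g2"], ["d1", "d2"], ["d1", "g1"]]
print(select_sentences(pool, in_domain, general, keep=2))
```

The same ranking can be applied to the source side of a bilingual corpus
when the goal is to adapt the translation model rather than the language
model.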
Table 2: MT work between Arabic dialects and English (source/target and MSA pivoting)
Work                         Source                                   MSA pivoting  Target
(Sawaf, 2010)                Levantine, Gulf Arabic,                  Yes           English
                             Egyptian, Sudanese,
                             Libyan, Moroccan, Tunisian
(Salloum and Habash, 2011)   Levantine, Egyptian                      Yes           English
(Zbib et al., 2012)          Levantine, Egyptian                      No            English
(Salloum and Habash, 2013)   Levantine, Egyptian, Iraqi, Gulf Arabic  Yes           English
(Sajjad et al., 2013)        Egyptian                                 Yes           English
(Jeblee et al., 2014)        English                                  Yes           Egyptian
(Al-Mannai et al., 2014)     Qatari                                   No            English
(Durrani et al., 2014)       Egyptian                                 Yes           English
(Aminian et al., 2014)       Egyptian                                 Yes           English
(Salloum et al., 2014)       Levantine, Egyptian                      Yes           English
(May et al., 2014)           Egyptian                                 No            English
(Aransa, 2015)               Egyptian                                 No            English
(Van der Wees et al., 2016)  Egyptian                                 No            English
Recently, as regards the script used in dialectal texts, a new research
line has opened up for Arabic dialect MT. It concerns Arabizi, also known
as Romanized Arabic or Arabish. Arabizi is a non-standard writing system
that uses Latin characters⁸ to write Arabic dialects. It is widely used in
social media communication such as Facebook, Twitter and YouTube, in chat
rooms and in SMS. Arabizi is a mixture of transliteration and transcription
mappings. It does not obey strict rules: it differs from one dialect to
another, and even within the same dialect community it differs from one
user to another. Although it has no standard form, a large amount of
Arabizi data is generated by everyday communication (social media, SMS,
etc.). Thus, Arabizi creates new needs in the area of dialect NLP and
brings new challenges, especially for machine translation. It should be
noted that the NIST OpenMT15⁹ evaluation competition focused on informal
data genres (SMS/Chat and Conversational Telephone Speech (CTS)) in Arabic
dialect, specifically Egyptian, and Mandarin Chinese.¹⁰ The task consisted
of translating from Egyptian dialect and Mandarin Chinese into English.¹¹
It is worth noting that the Egyptian dialect data within this campaign is a
mixture of texts in both Arabic script and Arabizi.
⁸ Including letters and numbers.
⁹ Open Machine Translation 2015.
¹⁰ https://www.nist.gov/sites/default/files/documents/itl/iad/mig/OpenMT15_EvalPlan_v0-9.pdf
¹¹ Official evaluation results of NIST OpenMT15 are available at
ftp://jaguar.ncsl.nist.gov/mt/mt2015/openmt15results.html
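To make the mix of letters and digits in Arabizi concrete, the sketch below
applies a naive greedy character mapping. The table is a small, hypothetical
subset of commonly cited correspondences (for example 3 for ع and 7 for ح);
actual usage varies by dialect and by writer, and this is not the method of
any system surveyed here.

```python
# Hypothetical, partial mapping; real Arabizi usage is far less regular.
ARABIZI_TO_ARABIC = {
    "kh": "خ", "sh": "ش",
    "2": "ء", "3": "ع", "5": "خ", "7": "ح",
    "a": "ا", "b": "ب", "i": "ي", "k": "ك",
    "l": "ل", "m": "م", "n": "ن", "s": "س", "t": "ت",
}

def deromanize(word, table=ARABIZI_TO_ARABIC):
    """Greedy longest-match transliteration; unknown characters pass through unchanged."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])  # no mapping known: keep the character as is
            i += 1
    return "".join(out)

print(deromanize("7abibi"))
```

The output for "7abibi" keeps a letter for every Latin vowel, whereas the
usual Arabic spelling حبيبي omits the short vowel. A fixed table cannot
capture this, which is one reason deromanization is treated as a modeling
problem rather than a lookup.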
In this respect, May et al. (2014) presented an SMT system that translates
informal Egyptian dialect, including Arabizi, into English. The authors
created a deromanization module (which converts Arabizi to Arabic script)
whose output is translated into English by an SMT system trained on
informal Arabic/English parallel and monolingual data (from DARPA BOLT).
Their deromanization approach uses character-based weighted finite-state
transducers (wFSTs) (Mohri, 1997) with a 5-gram character-based language
model of Arabic dialect (learned from 5.4M words). A character-based
language model is used instead of a word-based one to avoid OOV words.
Three methods were tested for building the Arabizi-to-Arabic script wFST:
(1) manually by human experts,¹² (2) automatically by using machine
translation and (3) a hybrid method combining the first two. The first
method consists in asking a native Arabic speaker to generate probabilistic
character sequence pairs that encode the wFST transitions, whereas the
automatic method is an SMT system trained on a corpus of 863 Arabizi/Arabic
dialect (Arabic script) word pairs (where the word pairs are treated as
sentence pairs and characters are treated as words). According to the
authors, the automatic method produces more correspondences than the manual
method, and sequence pairs with longer context, but it also generates a set
of noisy pairs that are useless. Another drawback of this method is that it
does not generate vowel-dropping sequence pairs (which the first method
takes into account). The hybrid method uses the sequence pairs generated by
the SMT system whose Arabizi side is shorter than three characters, in
addition to the vowel-dropping sequence pairs from the manual wFST. To
evaluate both the deromanization module and the Arabizi-English SMT system,
the authors used two Arabizi-English parallel corpora of 7,794 and 27,901
aligned sentences, with reference deromanizations of the Arabizi side of
each corpus. For the deromanization module, the automatic and hybrid
methods outperform the manual one, and the hybrid approach is slightly
better than the automatic approach. The SMT scores track the deromanization
results. The SMT system using the automatically learned wFST outperforms
the one using the manual wFST (BLEU scores of 15.1 and 13.2 versus 12.0 and
8.9, respectively). In addition, the SMT system using the hybrid approach
(BLEU scores of 15.3 and 13.4) slightly outperforms the one using the
automatic approach (15.1 and 13.2).

¹² Familiar with finite-state machines.
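The data preparation behind the automatic method and the selection rule of
the hybrid method can be illustrated with a minimal Python sketch. This is
not the authors' code: the function names, the toy sequence pairs and the
reading of "vowel-dropping" as an Arabizi vowel mapped to an empty string
are assumptions made for illustration only.

```python
def to_char_tokens(word):
    """Split a word into space-separated characters, so that a phrase-based
    SMT toolkit would treat every character as a 'word'."""
    return " ".join(word)


def char_level_corpus(word_pairs):
    """Turn (Arabizi, Arabic-script) word pairs into character-level
    'sentence' pairs."""
    return [(to_char_tokens(src), to_char_tokens(tgt)) for src, tgt in word_pairs]


def hybrid_pairs(smt_pairs, manual_vowel_drop_pairs, max_len=3):
    """Keep SMT-learned sequence pairs whose Arabizi side is shorter than
    max_len characters, then add the manual vowel-dropping pairs."""
    short = [pair for pair in smt_pairs if len(pair[0]) < max_len]
    extra = [pair for pair in manual_vowel_drop_pairs if pair not in short]
    return short + extra


# Toy data (hypothetical, not taken from the paper).
word_pairs = [("7abibi", "حبيبي")]
corpus = char_level_corpus(word_pairs)

smt_pairs = [("7", "ح"), ("sh", "ش"), ("ou", "و"), ("bib", "بيب")]
manual_vowel_drop_pairs = [("a", ""), ("i", "")]  # assumed form of vowel dropping
hybrid = hybrid_pairs(smt_pairs, manual_vowel_drop_pairs)
```

In this toy run the three-character pair is discarded and the two manual
vowel-dropping pairs are appended. The weights attached to each pair in a
real wFST are left out of the sketch.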
Van der Wees et al. (2016) attempted to improve Arabizi-to-English machine
translation by using an Arabizi-to-Arabic script converter that does not
require human knowledge (experts or native Arabic speakers). This converter
was incorporated into a phrase-based SMT system whose performance is
comparable to that achieved after human transliteration. The work uses
several resources: a large Arabic dialect-English parallel corpus (1.75M
sentence pairs with 52.9M Arabic tokens); a small Arabizi-dialect (Arabic
script)-English tri-text (10K parallel sentences belonging to the SMS and
chat genres¹³), from which 1,788 parallel Arabizi-dialect (Arabic script)
sentences were split into two test sets for evaluation; and an
Arabizi-English parallel corpus¹⁴ crawled from a variety of web pages (10K
sentence pairs with 180K Arabizi tokens). The first step of deromanization
is the generation of transliteration candidates, which is done by a
character mapping module¹⁵ following the phrase-based SMT paradigm. Since
the generated candidates may include character sequences that are not
actual Arabic words, they are compared to a large Arabic dialect vocabulary
(200K distinct words) and the OOV candidates are eliminated. This operation
reduced the number of candidates for a given Arabizi word by 50% and also
excluded Arabizi words with character repetitions.¹⁶ After the candidate
generation and filtering steps, an ambiguous Arabizi-to-dialect (Arabic
script) lexicon is created. This lexicon and a 3-gram Arabic dialect
language model (trained on the source side of the available dialect (Arabic
script)-English parallel corpora) are passed to a contextual disambiguation
process using srilm-disambig¹⁷ in order to search for the best
transliteration of each Arabizi sentence. At this stage (which we call the
first variant of deromanization), the WERs (Word Error Rates) recorded for
the two test sets were 46.4% and 50.8%. To improve these results, the
authors exploited transliterated word pairs extracted from the
Arabizi-dialect (Arabic script)-English tri-text described above. They
added them to the transliteration lexicon used by the contextual
disambiguation, prioritizing them with a high score (0.9 versus 0.1 for the
other transliteration candidates). This step (the second variant of
deromanization) reduced the WERs by almost half (to 25.7% and 27.9%) on the
two test sets. The transliteration module was incorporated into an in-house
phrase-based SMT system trained on the collection of dialect (Arabic
script)-English corpora described above (1.75M parallel sentences with
52.9M Arabic tokens) and a 5-gram English language model. In addition, the
Arabizi-English corpus of web-crawled user comments was used to train a
small SMT system whose phrase translation and phrase reordering models were
merged into the models of the main SMT system. This increases the chance of
translating a non-transliterated Arabizi word directly with the
Arabizi-English models. For the two variants of the transliteration module,
the SMT system was evaluated using BLEU. The best BLEU scores are recorded
for the transliteration module that uses character-level mapping with
contextual disambiguation augmented by word pairs (second variant): 8.68
and 10.32 on the two test sets, versus 7.46 and 9.42 for the first variant.
Table 2 provides a summary of the MT work listed above with regard to the
dialects concerned, translation direction and pivoting through MSA.

¹³ LDC catalog number: LDC2013E125, data set released for the most recent
NIST OpenMT.
¹⁴ This resource was created in the context of this work, but the authors
did not give any details about how they proceeded.
¹⁵ The mapping of Arabic letters to Arabizi character sequences uses the
publicly available character table described in
http://en.wikipedia.org/wiki/Arabic_chat_alphabet
¹⁶ Character repetition is widely used in social media networks, SMS and
chat in order to lay emphasis on the word in which the repetition appears.
¹⁷ http://www.speech.sri.com/projects/srilm/manpages/disambig.1.html
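The lexicon construction described for Van der Wees et al. (2016) can be
sketched as follows. This is a simplified illustration under stated
assumptions, not the authors' implementation: the function names and toy
words are hypothetical, and the contextual disambiguation step with the
language model is not reproduced. Only the 0.9 and 0.1 scores and the WER
figures come from the text above.

```python
def build_lexicon(candidates, vocabulary, attested_pairs, high=0.9, low=0.1):
    """Filter transliteration candidates against a dialect vocabulary and
    score them: attested word pairs get `high`, other candidates `low`."""
    lexicon = {}
    for arabizi, words in candidates.items():
        kept = {w: low for w in words if w in vocabulary}  # drop OOV candidates
        if kept:
            lexicon[arabizi] = kept
    for arabizi, arabic in attested_pairs:
        # Word pairs observed in parallel data are added with priority.
        lexicon.setdefault(arabizi, {})[arabic] = high
    return lexicon


def relative_reduction(before, after):
    """Relative error reduction, e.g. between two WER values in percent."""
    return (before - after) / before


# Toy data (hypothetical): the second candidate of each word is not in the
# vocabulary and is therefore filtered out.
candidates = {
    "7abibi": ["حبيبي", "هبيبي"],
    "kbir": ["كبير", "قبير"],
}
vocabulary = {"حبيبي", "كبير"}
attested_pairs = [("7abibi", "حبيبي")]

lexicon = build_lexicon(candidates, vocabulary, attested_pairs)
wer_gain_set1 = relative_reduction(46.4, 25.7)
wer_gain_set2 = relative_reduction(50.8, 27.9)
```

With the WERs reported above, the relative reduction is about 0.45 on both
test sets, that is, slightly less than half of the errors are removed by the
second variant.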
MuDMaT. Another project dedicated to machine translation of Arabic dialects
is the MuDMaT project (Multi-Dialect Machine Translation) (Sadat, 2015),
supported by NSERC.¹⁸ MuDMaT spans the period 2014-2017. It aims to build
MT systems between Maghreb dialects (Algerian, Moroccan and Tunisian), MSA
and French using a hybrid approach. According to the author, a
demonstration of rule-based machine translation from Tunisian dialect to
MSA and French was achieved.

All the work cited above relates to text machine translation. For speech
translation, there are no relevant projects dedicated to Arabic dialects
except those funded by DARPA, such as the TRANSTAC¹⁹ project (Hsiao et al.,
2006), a predecessor program to BOLT which deals with MT between Iraqi
dialect and English. The goal of TRANSTAC is the rapid development of
bidirectional translation systems that allow speakers of different
languages to communicate in real-world tactical situations. Several
prototype systems were developed for the military and medical screening
domains to enable conversations with local speakers of Iraqi Arabic,
Mandarin, Farsi, Pashto, and Thai. Some research was dedicated to
evaluating the MT performance of Iraqi Arabic-English translators, such as
(Condon et al., 2010) and (Condon et al., 2008). In the same context, IBM
MASTOR (Gao et al., 2006) is a speech-to-speech translation system that
translates spontaneous free-form speech in real time on both laptops and
hand-held PDAs for two language pairs, English-Mandarin Chinese and
English-Arabic dialect.

¹⁸ Natural Sciences and Engineering Research Council of Canada.
¹⁹ The Spoken Language Communication and Translation System for Tactical Use.

4. Discussion
We presented above a set of recent machine translation studies dedicated
to Arabic dialects. This research has been described in terms of the
approach used, data configuration and relevant results. In the following,
we sum up the most significant findings of these contributions:

The limited number of covered languages shows that MT for Arabic dialects
is just beginning. Indeed, all contributions are dedicated to translating
between dialects, MSA and English. There is only one work which translates
to French but, unfortunately, no results are available for it. In terms of
translation direction, most contributions translate from dialects to MSA
or English, whereas very little work uses the dialect as the target
language. This may be explained by the fact that using a dialect as the
target language of an SMT system, for example, requires a large amount of
clean data in order to build reliable language models. Even rule-based MT
systems require adapted tools (morphological, syntactic and semantic
generators). Such resources are still unavailable for most Arabic dialects.
With regard to the dialects covered (see Table 3), it is clear that Middle
Eastern dialects are the most studied, especially Egyptian (spoken in the
most populous Arab country²⁰), followed by Levantine, whereas Maghrebi
dialects are less represented. For other dialects such as Kuwaiti,
Bahraini, Omani and Mauritanian, no work in this field was found.

²⁰ The current population of Egypt is 94,899,254, based on the latest
United Nations estimates.
Table 3: Arabic dialects concerned by MT research

| Dialect | Translation work between MSA and dialects | Translation work between dialects and English |
|---|---|---|
| Egyptian | (Bakr et al., 2008), (Salloum and Habash, 2012), (Mohamed et al., 2012) | (Sawaf, 2010), (Zbib et al., 2012), (Salloum and Habash, 2013), (Sajjad et al., 2013), (Jeblee et al., 2014), (Aminian et al., 2014), (Durrani et al., 2014), (Salloum et al., 2014), (May et al., 2014), (Aransa, 2015), (Van der Wees et al., 2016) |
| Levantine | (Salloum and Habash, 2012), (Meftouh et al., 2015) | (Sawaf, 2010), (Zbib et al., 2012), (Salloum and Habash, 2013), (Salloum et al., 2014) |
| Tunisian | (Hamdi et al., 2013), (Sadat et al., 2014), (Meftouh et al., 2015) | (Sawaf, 2010) |
| Iraqi | (Salloum and Habash, 2012) | (Sawaf, 2010), (Salloum and Habash, 2013) |
| Gulf Arabic | (Salloum and Habash, 2012) | (Sawaf, 2010), (Salloum and Habash, 2013) |
| Moroccan | (Tachicart and Bouzoubaa, 2014) | (Sawaf, 2010) |
| Sanaani (Yemenite) | (Al-Gaphari and Al-Yadoumi, 2012) | - |
| Algerian | (Meftouh et al., 2015) | - |
| Sudanese | - | (Sawaf, 2010) |
| Libyan | - | (Sawaf, 2010) |
| Qatari | - | (Al-Mannai et al., 2014) |
In terms of methodology, the rule-based approach with morphological
analysis is the most widely used method for translating between MSA and
dialects. In addition, most work exploits bilingual lexicons and relies on
relatively small language models (see the data description in Table 4)
compared to those used for standard languages. We can see that Egyptian
and, to a lesser degree, Levantine are ahead of the other dialects. Recent
work on Moroccan, Tunisian and Yemenite dialects adopts almost the same
approach that was used in the first studies of Egyptian and Levantine. It
is clear that when no relevant corpora are available, the rule-based
approach is adopted despite its drawbacks.
Table 4: MT work between Dialects and MSA: Approaches, data description and results

| Work | Best results | Approach | Data description |
|---|---|---|---|
| (Bakr et al., 2008) | Accuracy: 88% | Statistical tokenization & tagging + Rule-based transformation | 1k sentences |
| (Salloum and Habash, 2012) | Accuracy: 93.15% | Rule-based approach + LM | 300 sentences |
| (Mohamed et al., 2012) | POS tagging evaluation, Accuracy: 73.24% | Rule-based approach | 100 user comments |
| (Al-Gaphari and Al-Yadoumi, 2012) | Accuracy: 77.32% | Rule-based approach | 9386 words |
| (Hamdi et al., 2013) | Accuracy: Tunisian-to-MSA 84%, MSA-to-Tunisian 80% | Rule-based approach (deep morphological representation of data) | Parallel Tunisian/MSA corpus of 1500 sentence pairs; Dev/test set 750 sentence pairs |
| (Sadat et al., 2014) | BLEU score: 14.32 | Rule-based approach + Bilingual lexicon + LM | 50 sentences |
| (Tachicart and Bouzoubaa, 2014) | - | Rule-based approach + Bilingual lexicon + LM | - |
| (Meftouh et al., 2015) | A set of BLEU scores | Statistical approach | 6-side parallel corpus of 6400 sentences; Dev/test set 500 sentences for each corpus |
For machine translation between Arabic dialects and English, the dominant
methodology is to hybridize rule-based and statistical approaches,
especially in the earliest work (see Table 5). The SMT systems are trained
on large MSA/English corpora in addition to relatively smaller dialectal
corpora. The rule-based methods rely on morphological analysis and transfer
rules to normalize dialectal words into MSA words (Sawaf, 2010; Salloum and
Habash, 2011, 2013; Sajjad et al., 2013). Other work uses domain adaptation
techniques, treating dialect adaptation as a domain adaptation problem
(Sajjad et al., 2013; Jeblee et al., 2014; Al-Mannai et al., 2014; Aransa,
2015). The availability of some parallel corpora makes this research
direction possible. Furthermore, the availability of new tools for dialect
identification (at word and sentence levels) has a positive impact on
machine translation performance, as shown in (Aminian et al., 2014; Salloum
et al., 2014). Indeed, in the latter work, identifying whether a sentence
is dialectal or MSA guides the selection of the MT system to use. It must
also be stressed that the impact of segmentation has been shown in most
work: it improves MT scores significantly.
An important point related to Arabic dialect MT is the use of MSA as a
pivot language when translating to or from English. As mentioned above,
exploiting the proximity between close languages has been used in NLP
research dedicated to under-resourced languages. The idea is to adapt
existing resources of a resource-rich language to process an
under-resourced language, particularly in the context of standard languages
and their dialects. This research direction has been adopted in Arabic
dialect NLP, especially for machine translation. We observe that the first
efforts were (naturally) dedicated to translating between dialects and MSA,
probably with a view to reaching other standard languages. Pivoting through
MSA has been used in a majority of contributions, which state that it
improves MT quality. The exception is (Zbib et al., 2012), which shows that
increasing the dialect training data improves MT performance more than
pivoting through MSA does, although it notes that the OOV rate is lower
with MSA mapping. The authors concluded that differences in genre between
MSA and dialects make vocabulary coverage insufficient and that considering
the domain is an important research direction. An interesting idea was
introduced by Salloum et al. (2014), who combined four MT systems, among
them a system that pivots through MSA and a system that translates directly
from dialect to English. Using machine learning techniques, they select the
appropriate MT system (among the four) to translate each sentence according
to its dialect level. Consequently, they continue to benefit from MSA
mapping whenever possible. In this way, the MT systems form a whole and
complement each other.
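The idea of selecting an MT system per sentence with a learned classifier
can be sketched in Python. The sketch below is a generic multinomial Naive
Bayes classifier with add-one smoothing, not the system of Salloum et al.
(2014). It assumes that each sentence has already been reduced to
word-level tags ("da" for dialectal, "msa" for MSA) by a hypothetical
dialect identification tool, and it uses only two made-up system labels
instead of four.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSelector:
    """Multinomial Naive Bayes with add-one smoothing over token features."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.token_counts[label].update(sentence.split())
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in sentence.split():
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical): tag sequences paired with the MT system
# assumed to have produced the best translation for that sentence.
train_sentences = ["da da da msa", "da da da da", "msa msa msa msa", "msa msa da msa"]
train_labels = ["direct", "direct", "msa_pivot", "msa_pivot"]

selector = NaiveBayesSelector().fit(train_sentences, train_labels)
choice_dialectal = selector.predict("da da msa")
choice_standard = selector.predict("msa msa msa")
```

In this toy setting a mostly dialectal sentence is routed to the direct
system and a mostly MSA sentence to the pivoting system. A realistic
selector would use richer features than these two tags.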
Another big challenge for Arabic dialect MT is Arabizi (also known as
Arabish or Romanized Arabic). Indeed, a large amount of user-generated data
from social networks is a mixture of dialect written in both Arabic and
Roman script. Given their size, these data could be an important source of
dialect corpora if they are processed. It is in this context that the most
recent MT work on Arabic dialects attempts to deal with translation from
Arabizi to English. However, despite its wide use, Arabizi is still a new
research direction: few studies are dedicated to it and they concern only
the Egyptian dialect. Work on other Arabic dialects is at a preliminary
stage. The contributions related to Arabizi presented in this survey are
based on an SMT system built on top of a deromanization module that
converts Arabizi texts to Arabic script. The importance of deromanization
is evident: both papers presented above show that its accuracy correlates
with MT scores. We expect that future work will attempt to bypass the
deromanization step when more parallel corpora including Arabizi become
available. Arabic-script pivoting and direct translation will then
certainly be tested. In this respect, direct translation from Arabizi to
English or French will probably reduce the complexity of two serious
problems in Arabic dialect MT: proper noun translation and code-switching.
Since Arabizi uses Roman script, there is no need to translate proper
nouns, or even the English or French words²¹ included in dialect text in
the case of code-switching.
With regard to data, we notice a significant lack of textual resources
dedicated to dialects. All research efforts have to deal with this issue.
We can see that MSA-dialect parallel corpora are fewer than English-dialect
ones (see the data descriptions in Tables 4 and 5). This is because MT
projects between Arabic dialects and English receive more funding, mainly
through the BOLT project. Yet, even with this funding, the corpora
including dialects are smaller than those of standard languages
(MSA/English, for example). Also, in terms of coverage, Egyptian and
Levantine remain the best-resourced dialects. It is worth noting that a
large part of several MT efforts is dedicated to producing dialect
resources.

²¹ In the Middle East, Arab people switch between dialect, MSA and English,
whereas in the Maghreb code-switching is observed between dialect, MSA and
French.
Table 5: MT work between Dialects and English: Approaches, data description and results

| Work | Best results (BLEU) | Approach | Data description |
|---|---|---|---|
| (Sawaf, 2010) | Broadcast News 36.4; Web content 42.1 | SMT + Rule-based approach | Training/test (Dialect/English): Broadcast News 14.3M / 12.4K sentences; Web content 38.5K / 547 sentences |
| (Salloum and Habash, 2011) | 37.8 | SMT + Rule-based approach | Training (MSA/English): 32M words (MSA side) (LDC2007E103); (Dialect & MSA-English) 64M words (MSA side); dev & test sets of 1496 & 1568 sentences |
| (Zbib et al., 2012) | Egyptian 20.66; Levantine 19.29 | SMT + Morpho. segmentation | Training (dialect/English): 180k sentence pairs (1.1M Levantine, 380k Egyptian, English 2.3M words); Training (MSA/English): 8M MSA-English sentence pairs |
| (Salloum and Habash, 2013) | Dev10 set: 39.13; Levantine test set: 10.54; Egyptian test set: 19.59 | SMT + Rule-based approach | Training (MSA/English): 64M words (MSA side); Dev10 test set 1568 sentences (audio dev data, DARPA GALE program); Levantine test set 2728 sentences (Zbib et al., 2012); Egyptian test set 1553 sentences (BOLT program) |
| (Sajjad et al., 2013) | 16.96 | Character-level transformational model + SMT + Data adaptation | Dialect/English parallel corpus of 38k sentences (Zbib et al., 2012), Training: 32k sentences, Test: 4k sentences; Training (MSA/English): 200k sentences from (LDC2004T17, LDC2004E72, & parallel corpora of the GALE program) |
| (Jeblee et al., 2014) | 42.9 | SMT + Domain adaptation + Dialect adaptation | Training set: English/MSA 5M parallel sentences (NIST 2012); Test set 1313 (NIST MT09); a 100k artificial tri-parallel Egyptian-MSA-English corpus |
| (Al-Mannai et al., 2014) | 15.2 | SMT + segmentation + Adapting MSA and other dialects | Qatari Arabic/English corpus (Elmahdy et al., 2014); Training set: 12k sentences; Test set: 1k sentences |
| (Durrani et al., 2014) | 23.72 | Egyptian-to-MSA decoder + MSA-to-English decoder | Gale-dev10 set and Bolt Egyptian (tahyyes dev set) |
| (Aminian et al., 2014) | AIDA: 25.9; MADAMIRA: 28.8 | SMT + Dialectal words identification and replacement | Training set: MSA side 29M tokenized words and Dialect side 5M DA tokenized words; Dialect test set (BOLT-arz-test): 1065 sentences (LDC2012E30), 16177 tokenized words; MSA test set (MT09-test): 1445 sentences (LDC2010T23), 40858 tokenized words |
| (Salloum et al., 2014) | 33.5 | Sentence-level dialect identification + SMT selection using Naive Bayes Classifier | Training set: Dialect/English parallel corpus of 5M tokenized words (BOLT); MSA/English parallel corpus of 57M tokenized words; NBC training: 2562 sentences; Dev set/Test set of 1802 & 1804 sentences |
| (May et al., 2014) | 15.3; 13.4 | Deromanization system + SMT | Training of the deromanization system: Dialect corpus of 5.4M words & 863 Arabizi/Arabic word pairs; Dialect/English parallel corpus (BOLT); Two test sets of 7794 & 27901 sentences |
| (Aransa, 2015) | A set of scores: BLEU and (Ter - Bleu)/2 (Servan and Schwenk, 2011) | SMT + Language & translation models adaptation + Segmentation schemes + Proper nouns transliteration | Different datasets of (training/dev/test): Discussion forum; SMS/Chat system; Conversational telephone speech (CTS) transcript |
| (Van der Wees et al., 2016) | 8.68; 10.32 | Deromanization system + SMT | Dialect/English parallel corpus of 1.75M sentence pairs; Arabizi-Arabic-English corpus of 10K sentences (LDC2013E125); Arabizi-English corpus of 10K sentences (180K Arabizi tokens); 1788 pairs of sentences split into two test sets |
5. Conclusion
The above findings draw a picture of machine translation in the context of
Arabic dialects. We can observe that dialects are emerging as real
languages and that any NLP tools and resources dedicated to MSA should take
them into account. Machine translation for Arabic dialects is still an
immature area of research. There is still a long way to go and several
important issues need to be solved. The dialects themselves, as they are
presented in all the research work, are classified by country or by region:
Levantine, Egyptian, Algerian, Tunisian, etc. This classification
considerably simplifies the real linguistic situation across Arab
countries. In fact, each Arab country has multiple dialect varieties with
specific features. MT systems dedicated to dialects have to deal with all
these variants. In addition, the wide use of Arabizi in social networks
generates new challenges that also need to be addressed.

Another issue that has to be taken into account is code-switching: Arab
people switch in their conversations between dialect, Arabic and other
languages, especially in the Maghreb where people tend to use French,
Arabic, dialect and even Berber. This code-switching is a challenge for
dialect MT. It should also be noted that, for Maghreb dialects, French
words could be an important source of OOV words; handling OOV words must
take this into account, since MSA pivoting or normalizing Maghreb dialect
words to MSA could be insufficient. In the same vein, the fast evolution of
dialects needs to be considered for machine translation. Indeed, new
dialectal words appear every day and are adopted by people spontaneously
without any official or academic validation.

As regards resources, one way to get parallel data is to use an iterative
approach that produces artificial dialectal data from available dialect MT
systems by post-editing their output. Another interesting track is to
investigate comparable corpora for producing parallel corpora for training
machine translation systems. This has already been done for natural
languages, for example in (Jehl et al., 2012) and (Hewavitharana and Vogel,
2011) for the Arabic-English pair, (Cettolo et al., 2010) for
English-German and Arabic-English, (Munteanu and Marcu, 2006) for
Romanian-English, and (Tillmann and Xu, 2009) for Spanish-English and
Portuguese-English. This approach is feasible for Arabic dialects by using
social networks, which are a rich source containing a huge quantity of data
expressed in dialects. Unfortunately, these noisy data require considerable
pre-processing, such as dialect identification, morphological analysis with
specific tools, cleaning the data by eliminating non-exploitable fragments,
and writing normalization.
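As one small example of the writing normalization mentioned above, the
emphatic character repetitions that are common in social media text can be
collapsed with a regular expression. The Python sketch below is only a
heuristic chosen for illustration: the function name and the threshold of
three repeated characters are assumptions, and this is not a procedure
taken from any of the surveyed systems.

```python
import re

# A run of three or more identical characters (any script).
_REPEAT = re.compile(r"(.)\1{2,}")


def collapse_repetitions(text):
    """Reduce every run of three or more identical characters to a single
    character; runs of two are left untouched."""
    return _REPEAT.sub(r"\1", text)


normalized = collapse_repetitions("7abiiiibi kbiiir")
```

Here "7abiiiibi kbiiir" becomes "7abibi kbir". Because doubled letters are
kept, ordinary spellings with two identical characters are not altered.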
References
C. A. Ferguson, Diglossia, Word 15 (1959) 325–340.
N. Y. Habash, Introduction to Arabic natural language processing, Synthesis
lectures on human language technologies 3 (1) (2010) 1–187, ISSN 1573-
0573.
A. Shoufan, S. Al-Ameri, Natural Language Processing for Dialectical
Arabic: A Survey, in: Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics (ACL), the Arabic Natural
Language Processing workshop (ANLP), 36–48, 2015.
X. Zhang, Dialect MT: A Case Study between Cantonese and Mandarin,
in: Proceedings of the 36th Annual Meeting of the Association for Com-
putational Linguistics (ACL) and 17th International Conference on Com-
putational Linguistics (COLING), Volume 2, Montreal, Quebec, Canada,
1460–1464, 1998.
J. Hajič, J. Hric, V. Kuboň, Machine translation of very close languages, in:
Proceedings of the 6th Conference on Applied natural language processing
(ANLC), Association for Computational Linguistics, 7–12, 2000.
K. Altintas, I. Cicekli, A machine translation system between a pair of closely
related languages, in: Proceedings of the 17th International Symposium
on Computer and Information Sciences (ISCIS), 192–196, 2002.
K. P. Scannell, Machine translation for closely related language pairs, in:
Proceedings of the Workshop Strategies for developing machine translation
for minority languages, Citeseer, 103–109, 2006.
P. Nakov, H. T. Ng, Improving statistical machine translation for a resource-
poor language using related resource-rich languages, Journal of Artificial
Intelligence Research (4) (2012) 179–222.
B. Haddow, A. H. Huerta, F. Neubarth, H. Trost, Corpus development for
machine translation between standard and dialectal varieties, in: Proceed-
ings of Adaptation of Language Resources and Tools for Closely Related
Languages and Language Variants, 7–14, 2013.
H. A. Bakr, K. Shaalan, I. Ziedan, A hybrid approach for converting written
Egyptian colloquial dialect into diacritized Arabic, in: Proceedings of the
6th International Conference on Informatics and Systems (INFOS). Cairo
University, 2008.
W. Salloum, N. Habash, Elissa: A Dialectal to Standard Arabic Machine
Translation System, in: 24th International Conference on Computational
Linguistics (COLING), 385–392, 2012.
W. Salloum, N. Habash, Dialectal to Standard Arabic paraphrasing to im-
prove Arabic-English statistical machine translation, in: Proceedings of
the First Workshop on Algorithms and Resources for Modelling of Di-
alects and Language Varieties, Association for Computational Linguistics,
10–21, 2011.
W. Salloum, N. Habash, Dialectal Arabic to English Machine Translation:
Pivoting through Modern Standard Arabic, in: Proceedings of the 2013
Conference of the North American Chapter of the Association for Com-
putational Linguistics (NAACL): Human Language Technologies (HLT),
348–358, 2013.
E. Mohamed, B. Mohit, K. Oflazer, Transforming Standard Arabic to
Colloquial Arabic, in: Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics (ACL): Short Papers - Volume 2,
176–180, 2012.
G. Al-Gaphari, M. Al-Yadoumi, A method to convert Sanaani accent to
Modern Standard Arabic, International Journal of Information Science
and Management (IJISM) 8 (1) (2012) 39–49.
A. Hamdi, R. Boujelbane, N. Habash, A. Nasr, The effects of factorizing root
and pattern mapping in bidirectional Tunisian-Standard Arabic machine
translation, in: MT Summit, 2013.
H. Sawaf, Arabic dialect handling in hybrid machine translation, in: Pro-
ceedings of the Conference of the Association for Machine Translation in
the Americas (AMTA), Denver, Colorado, 2010.
N. Habash, O. Rambow, MAGEAD: A morphological analyzer and gener-
ator for the Arabic dialects, in: Proceedings of the 21st International
Conference on Computational Linguistics (COLING) and the 44th annual
meeting of the Association for Computational Linguistics (ACL), 681–688,
2006.
R. Tachicart, K. Bouzoubaa, A hybrid approach to translate Moroccan Ara-
bic dialect, in: Proceedings of the 9th International Conference on Intelli-
gent Systems: Theories and Applications (SITA-14), IEEE, 1–5, 2014.
A. Boudlal, A. Lakhouaja, A. Mazroui, A. Meziane, M. O. A. o. Bebah,
M. Shoul, Alkhalil morpho sys1: A morphosyntactic analysis system for
Arabic texts, in: Proceedings of the International Arab Conference on
Information Technology, ACIT, 2010.
F. Sadat, F. Mallek, M. Boudabous, R. Sellami, A. Farzindar, Collabora-
tively Constructed Linguistic Resources for Language Variants and their
Exploitation in NLP Application, the case of Tunisian Arabic and the So-
cial Media, in: Proceedings of the Workshop on Lexical and Grammatical
Resources for Language Processing, Association for Computational Lin-
guistics and Dublin City University, 102–110, 2014.
K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A Method for Au-
tomatic Evaluation of Machine Translation, in: Proceedings of the 40th
Annual Meeting on Association for Computational Linguistics (ACL), 311–
318, 2002.
K. Meftouh, S. Harrat, S. Jamoussi, M. Abbas, K. Smaili, Machine Transla-
tion Experiments on PADIC: A Parallel Arabic DIalect Corpus, in: Pro-
ceedings of the 29th Asia Conference on Language, Information and Com-
putation (PACLIC), 26–34, 2015.
T. Buckwalter, Buckwalter Arabic Morphological Analyzer Version 2.0.
Linguistic Data Consortium, University of Pennsylvania, 2002. LDC Catalog
No.: LDC2004L02, Tech. Rep., ISBN 1-58563-324-0, 2004.
H. Sajjad, K. Darwish, Y. Belinkov, Translating Dialectal Arabic to English,
in: Proceedings of the 51st Annual Meeting of the Association for Com-
putational Linguistics (ACL), Sofia, Bulgaria, 1–6, 2013.
W. Salloum, H. Elfardy, L. Alamir-Salloum, N. Habash, M. Diab, Sentence
Level Dialect Identification for Machine Translation System Selection, in:
Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (ACL), 772–778, 2014.
R. Zbib, E. Malchiodi, J. Devlin, D. Stallard, S. Matsoukas, R. Schwartz,
J. Makhoul, O. F. Zaidan, C. Callison-Burch, Machine translation of Ara-
bic dialects, in: Proceedings of the 2012 Conference of the North American
Chapter of the Association for Computational Linguistics (NAACL): Hu-
man Language Technologies (HLT), 49–59, 2012.
H. Elfardy, M. T. Diab, Sentence Level Dialect Identification in Arabic, in:
Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), 456–461, 2013.
S. Jeblee, W. Feely, H. Bouamor, A. Lavie, N. Habash, K. Oflazer, Domain
and Dialect Adaptation for Machine Translation into Egyptian Arabic,
in: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Workshop on Arabic Natural Language
Processing (ANLP), 196–206, 2014.
K. Al-Mannai, H. Sajjad, A. Khader, F. Al Obaidli, P. Nakov, S. Vogel,
Unsupervised Word Segmentation Improves Dialectal Arabic to English
Machine Translation, in: Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Workshop on Arabic
Natural Language Processing (ANLP), 207–216, 2014.
V. Siivola, M. Creutz, M. Kurimo, Morfessor and variKN machine learning
tools for speech and language technology, in: Proceedings of the Annual
Conference of the International Speech Communication Association (In-
terspeech), 1549–1552, 2007.
N. Durrani, Y. Al-Onaizan, A. Ittycheriah, Improving Egyptian-to-English
SMT by Mapping Egyptian into MSA, in: Proceedings of 15th Interna-
tional Conference on Computational Linguistics and Intelligent Text Pro-
cessing (CICLing), 271–282, 2014.
N. Habash, O. Rambow, Arabic tokenization, part-of-speech tagging and
morphological disambiguation in one fell swoop, in: Proceedings of the
43rd Annual Meeting on Association for Computational Linguistics (ACL),
573–580, 2005.
M. Aminian, M. Ghoneim, M. Diab, Handling OOV Words in Dialectal Ara-
bic to English Machine Translation, in: Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), Work-
shop Language Technology for Closely Related Languages and Language
Variants (LT4CloseLang), 99–108, 2014.
H. Elfardy, M. Al-Badrashiny, M. Diab, AIDA: Identifying code switching
in informal Arabic text, in: Proceedings of the First Workshop on Computa-
tional Approaches to Code Switching, 94–101, 2014.
A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash,
M. Pooleery, O. Rambow, R. M. Roth, MADAMIRA: A fast, comprehensive
tool for morphological analysis and disambiguation of Arabic, in: Pro-
ceedings of the Language Resources and Evaluation Conference (LREC),
Reykjavik, Iceland, 2014.
W. Aransa, Statistical Machine Translation of the Arabic Dialect, Ph.D.
thesis, Université du Maine (Le Mans), doctoral school STIM, 2015.
J. May, Y. Benjira, A. Echihabi, An Arabizi-English social media statistical
machine translation system, in: Proceedings of the 11th Conference of the
Association for Machine Translation in the Americas (AMTA), 329–341,
2014.
M. Van der Wees, A. Bisazza, C. Monz, A Simple but Effective Approach to
Improve Arabizi-to-English Statistical Machine Translation, in: Proceed-
ings of the International Conference on Computational Linguistics (COL-
ING), Workshop on Noisy User-generated Text (WNUT), 43–50, 2016.
M. Mohri, Finite-State Transducers in language and speech processing, Com-
putational linguistics 23 (2) (1997) 269–311.
F. Sadat, Multi-Dialect Machine Translation (MuDMat), in: Proceedings
of the 18th Annual Conference of the European Association for Machine
Translation (EAMT), Antalya, Turkey, 226, 2015.
R. Hsiao, A. Venugopal, T. Köhler, Y. Zhang, P. Charoenpornsawat, A. Zoll-
mann, S. Vogel, A. W. Black, T. Schultz, A. Waibel, Optimizing compo-
nents for handheld two-way speech translation for an English-Iraqi Arabic
system, in: Proceedings of the Annual Conference of the International
Speech Communication Association (Interspeech), 2006.
S. Condon, D. Parvaz, J. Aberdeen, C. Doran, A. Freeman, M. Awad, Eval-
uation of machine translation errors in English and Iraqi Arabic, Tech.
Rep., 2010.
S. Condon, J. Phillips, C. Doran, J. Aberdeen, D. Parvaz, B. Oshika,
G. Sanders, C. Schlenoff, Applying Automated Metrics to Speech Transla-
tion Dialogs, in: Proceedings of the International Conference on Language
Resources and Evaluation (LREC), 2008.
Y. Gao, L. Gu, B. Zhou, R. Sarikaya, M. Afify, H.-K. Kuo, W.-z. Zhu,
Y. Deng, C. Prosser, W. Zhang, L. Besacier, IBM MASTOR System:
Multilingual Automatic Speech-to-speech Translator, in: Proceedings of
the 2006 Conference of the North American Chapter of the Association
for Computational Linguistics (NAACL): Human Language Technologies
(HLT), the Workshop on Medical Speech Translation (MST), 57–60, 2006.
M. Elmahdy, M. Hasegawa-Johnson, E. Mustafawi, Development of a TV
Broadcasts Speech Recognition System for Qatari Arabic, in: Proceedings
of the International Conference on Language Resources and Evaluation
(LREC), Reykjavik, Iceland, 2014.
C. Servan, H. Schwenk, Optimising multiple metrics with MERT, The Prague
Bulletin of Mathematical Linguistics 96 (2011) 109–117.
L. Jehl, F. Hieber, S. Riezler, Twitter translation using translation-based
cross-lingual retrieval, in: Proceedings of the 7th workshop on statistical
machine translation, Association for Computational Linguistics, 410–421,
2012.
S. Hewavitharana, S. Vogel, Extracting Parallel Phrases from Comparable
Data, in: Proceedings of the 4th Workshop on Building and Using Com-
parable Corpora: Comparable Corpora and the Web, Association for Com-
putational Linguistics, Portland, Oregon, 61–68, 2011.
M. Cettolo, M. Federico, N. Bertoldi, Mining parallel fragments from compa-
rable texts, in: Proceedings of the 7th International Workshop on Spoken
Language Translation (IWSLT), 227–234, 2010.
D. S. Munteanu, D. Marcu, Extracting parallel sub-sentential fragments from
non-parallel corpora, in: Proceedings of the 21st International Conference
on Computational Linguistics (COLING) and the 44th annual meeting
of the Association for Computational Linguistics (ACL), Association for
Computational Linguistics, 81–88, 2006.
C. Tillmann, J.-m. Xu, A simple sentence-level extraction algorithm for com-
parable data, in: Proceedings of the 2009 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics (NAACL):
Human Language Technologies (HLT), Companion Volume: Short Papers,
Association for Computational Linguistics, 93–96, 2009.