ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 142 (2018) 2–13
www.elsevier.com/locate/procedia
10.1016/j.procs.2018.10.456
1877-0509 © 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.
The 4th International Conference on Arabic Computational Linguistics (ACLing 2018),
November 17-19 2018, Dubai, United Arab Emirates
A Lexical Distance Study of Arabic Dialects
Kathrein Abu Kwaik a,*, Motaz Saad b, Stergios Chatzikyriakidis a, Simon Dobnik a
aCLASP, Department of Philosophy, Linguistics and Theory of Science, Ölof Wijksgatan 6, Gothenburg, 412 55, Sweden
bThe Islamic University of Gaza, Gaza, Palestine
Abstract
Diglossia is a very common phenomenon in Arabic-speaking communities, where the spoken language is different from both Classical Arabic (CA) and Modern Standard Arabic (MSA). The spoken language is characterised as a number of dialects used in everyday communication as well as informal writing. In this paper, we highlight the lexical relation between MSA and Dialectal Arabic (DA) in more than one Arabic region. We conduct a computational cross-dialectal lexical distance study to measure the similarities and differences between the dialects and MSA. We exploit several methods from Natural Language Processing (NLP) and Information Retrieval (IR), such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI) and the Hellinger Distance (HD), and apply them to different Arabic dialectal corpora. We measure the overlap among all the dialects and compute the frequencies of the most frequent words in every dialect. The results are informative and indicate that the Levantine dialects are very similar to each other and, furthermore, that Palestinian appears to be the closest to MSA.
© 2018 Kathrein Abu Kwaik, Motaz Saad, Stergios Chatzikyriakidis, Simon Dobnik. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.
Keywords: Diglossia; Lexical Distance; Vector Space Model; Latent Semantic Indexing; Hellinger Distance
1. Introduction
The number of native Arabic speakers in the world varies from 290 million, according to UNESCO¹, to 313 million, according to Ethnologue². There are three varieties of the Arabic language: Classical Arabic, Modern Standard Arabic (MSA), and the Arabic dialects (colloquial Arabic). Classical Arabic (CA) is the form of the Arabic language used in Umayyad and Abbasid literary texts from the 7th century AD to the 9th century AD. The orthography of the Quran was not developed for the standardized form of Classical Arabic [1]. MSA is the official language used for education, news, politics, religion and, in general, in any type of formal setting. The colloquial dialects are used in everyday communication as well as informal writing, e.g. in social media [2].
* Corresponding author.
E-mail address: kathrein.abu.kwaik@gu.se
¹ https://en.unesco.org/news/world-arabic-language-day-2017-looking-digital-world
² Simons, Gary F. and Charles D. Fennig (eds.). 2018. Ethnologue: Languages of the World, Twenty-first edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
As a result of this situation, diglossia, a case where two distinct varieties of a language are spoken within the same speech community [3], is a very common phenomenon in Arabic-speaking communities. In some parts of the Arabic-speaking world, more than two varieties are spoken within the same community. For example, this is the case in North African communities like Morocco, where Arabic, Berber, French, English and Spanish are spoken within the same speech community [4]. In a diglossic situation, the standard formal language assumes the role of the High variety (H), while the other languages or dialects act as the Low variety (L) [5]. MSA is so different from the colloquial dialects that they are in some cases not mutually intelligible. The differences are clearly evident in all linguistic aspects: pronunciation, phonology, morphology, lexicon, syntax and semantics. However, the degree to which the individual dialects differ with respect to these aspects has not yet been quantitatively measured.
In this paper, we focus on measuring the lexical distance between MSA and the Arabic dialects using natural language processing techniques, tools and text corpora. We use various distance metrics, such as the Vector Space Model (VSM) based on word distributions over documents, as is common in Information Retrieval (IR) [6], Latent Semantic Indexing (LSI) [7], and a divergence-based measure, the Hellinger Distance (HD) [8]. We hope that this study will shed light on the similarities and differences between the varieties and therefore inform our future work on building NLP tools and applications for these domains, in particular how these can be ported.
To the best of our knowledge, our work is the most extensive effort to measure the distance or similarity across Arabic dialects using natural language processing tools and text corpora.
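One of the simplest measurements of this kind, the vocabulary overlap between two corpora, can be sketched as a Jaccard index over the sets of unique words. The snippet below is a minimal illustration (not the authors' implementation), assuming whitespace-tokenised text; the transliterated sentences are invented examples:

```python
def jaccard_overlap(corpus_a, corpus_b):
    """Lexical overlap between two corpora: the number of shared
    unique words divided by the total number of unique words."""
    vocab_a = set(corpus_a.split())
    vocab_b = set(corpus_b.split())
    if not (vocab_a | vocab_b):
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Toy transliterated snippets (hypothetical data):
msa = "kyf HAlk Al ywm"
levantine = "kyfk Al ywm"
print(jaccard_overlap(msa, levantine))  # 2 shared words / 5 unique words = 0.4
```

A higher value indicates a larger shared vocabulary; in the experiments below this idea is applied to much larger dialectal corpora.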
2. Related Work
Several approaches have been used to measure the distance between European languages [9,10], Indian dialects [11], and similar languages [12,13]. These approaches can be classified according to the type of linguistic representations they investigate: characters, terms and documents. Lexical similarity measures operate on string sequences at the character level and on corpora of texts at the term level. Table 1 shows the most popular approaches in the literature. There is not much research on measuring the lexical closeness and divergence between Arabic and its dialects.
Table 1. Summary of the most commonly used approaches for measuring similarity between texts

Character level:
- Longest Common Substring: measures the length of the longest contiguous sequence of characters shared by the strings under comparison [14,15].
- Levenshtein distance: measures the minimum number of insertions, deletions and substitutions needed to transform one string into another [16,17].
- N-gram models: can be used in different ways to estimate similarity or dissimilarity; one of the most effective approaches is to build n-gram models for language identification and measure the perplexity of n-grams [18].
- Dynamic programming: used for biological sequence comparison, e.g. the Needleman-Wunsch and Smith-Waterman algorithms [19].

Term level:
- Vector space models: represent documents as vectors of word frequencies and then apply vector comparison measures to compare the vectors of different documents [20].
- Cosine similarity: measures the cosine of the angle between two vectors as a similarity indicator [21].
- Divergence distances: for example, the Kullback-Leibler, Hellinger and Manhattan distances, used to measure the divergence between probability distributions [22,23].
- Jaccard similarity: measures the number of overlapping strings over the number of unique strings between texts to indicate their similarity [24].
- Latent Semantic Indexing: words that are close in meaning will occur frequently in similar positions in the text [25].
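As a concrete illustration of the character-level measures in Table 1, the Levenshtein distance can be computed with a standard dynamic programme. This is a generic sketch, not code from any of the cited studies; the example strings are invented transliterations:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string s into string t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))       # distances between "" and prefixes of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# Two hypothetical renderings of "how are you":
print(levenshtein("kyfk", "kyf Halk"))  # → 4
```

Normalising the raw count by the length of the longer string gives a distance in [0, 1] that can be averaged over word lists, as in the dialect studies discussed below.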
Abunasser [16] compares five Arabic varieties (MSA, Gulf, Levantine, Egyptian and Moroccan) in terms of lexical and pronunciation variation. He relies on the Swadesh list [26] and the concept of non-cognate words to measure the amount of linguistic variation between the dialects. As the Swadesh list is a phonological list rather than a lexicon, the author collected the data from two male speakers for each dialect. The Swadesh list was adapted to an MSA list using two modern Arabic dictionaries (ālmwrd [27] and qāmws ābn āyās [28]). To rule out the chance of lexical ambiguity, a context sentence for each lexical item was provided. Thus, the
distance between dialects is measured based on the percentage of non-cognates in the MSA Swadesh list. Moreover, he employs the Levenshtein distance to compute the distance between lexical items at the phonemic level, based on the IPA transcription of the words in the Swadesh list. He concludes that Gulf and Levantine are the closest dialects to MSA, followed by Egyptian, while Moroccan is the farthest. The most significant limitation of this experiment is the data collection: the speakers, their gender and the geographical coverage were limited to only two male speakers per dialect. Also, the two modern dictionaries used to translate the Swadesh list to the corresponding MSA list were authored by Levantine authors, which might bias the MSA list towards Levantine to some degree. Finally, while the intention is to measure lexical variation, the study uses a phonemic representation of words, which may also reveal other, more subtle, non-lexical differences.
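The aggregation step in this kind of study can be sketched as averaging a length-normalised edit distance over an aligned word list. The following is an illustrative reconstruction under that assumption, not Abunasser's code; the transcriptions are invented stand-ins for IPA data:

```python
from itertools import product

def edit_distance(s, t):
    # Standard Levenshtein dynamic programme (full matrix, for clarity).
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i, j in product(range(1, len(s) + 1), range(1, len(t) + 1)):
        d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                      d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[len(s)][len(t)]

def dialect_distance(words_a, words_b):
    """Average normalised edit distance over two aligned word lists:
    0 means identical lists, values near 1 mean very different lists."""
    total = 0.0
    for a, b in zip(words_a, words_b):
        total += edit_distance(a, b) / max(len(a), len(b), 1)
    return total / len(words_a)

# Hypothetical transcriptions of three Swadesh-style concepts:
msa      = ["qalb", "kabiir", "maa"]
egyptian = ["ʔalb", "kibiir", "majja"]
print(round(dialect_distance(msa, egyptian), 3))  # → 0.272
```

Under this scheme, comparing each dialect's list against the MSA list yields the kind of per-dialect ranking the study reports.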
Meftouh et al. [22] present PADIC (Parallel Arabic Dialect Corpus). It includes five dialects: two Algerian (from the cities of Algiers and Annaba), one Tunisian and two Levantine dialects (Palestinian and Syrian). The authors present a linguistic analytical study of PADIC where they run experiments on every pair of dialects and MSA, including:
- identifying the most frequent words in each dialect;
- computing the percentage of common lexical units both at the document and the sentence level, to emphasize the relation between the dialects and MSA; and
- measuring cross-language divergence in terms of the Hellinger distance, to determine which language is closer to which.
The experiments show that the Palestinian dialect is the closest to MSA, followed by Tunisian and Syrian, whereas the Algerian dialects are the most different. The results are expected, as they demonstrate that Tunisian is closer to Algerian than to Palestinian and Syrian. In addition, the closest dialects according to the distance measurements are the Algerian dialects on the one hand and Palestinian and Syrian on the other. Even though the results are reasonable, the corpus has a shortcoming: it was manually translated from Algerian conversations into MSA and then into the other dialects by one native speaker of each dialect, which introduces several biases.
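The Hellinger distance used in such experiments can be computed directly from unigram relative frequencies. A minimal sketch, assuming whitespace-tokenised text (the example sentences are invented):

```python
from collections import Counter
from math import sqrt

def hellinger(text_a, text_b):
    """Hellinger distance between the unigram distributions of two
    texts: 0 for identical distributions, 1 for disjoint vocabularies."""
    freq_a, freq_b = Counter(text_a.split()), Counter(text_b.split())
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    vocab = set(freq_a) | set(freq_b)
    # HD(P, Q) = (1/sqrt(2)) * sqrt( sum_w (sqrt(p_w) - sqrt(q_w))^2 )
    s = sum((sqrt(freq_a[w] / total_a) - sqrt(freq_b[w] / total_b)) ** 2
            for w in vocab)
    return sqrt(s) / sqrt(2)

# Hypothetical transliterated snippets:
print(round(hellinger("kyf HAlk", "kyf HAlk"), 6))    # identical → 0.0
print(round(hellinger("kyf HAlk", "šlonak zyn"), 6))  # disjoint → 1.0
```

Because the distance is bounded in [0, 1] and symmetric, it is convenient for ranking which variety is closest to which, as in the PADIC study.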
Rama et al. [12] present a computational classification of the Gondi dialects, which are spoken in central India, by applying tools from dialectometry and phylogenetics. They use multilingual word lists for 210 concepts at 46 sites where Gondi is the dominant dialect. They rely on the Glottolog classification as a gold standard to evaluate their results. To be able to compute aggregate distances, they use the IPA to convert the word lists to pronunciation data. The Levenshtein distance and Long Short-Term Memory neural networks are used as dialectometry methods to measure the distance between every pair of words on the list. Moreover, they also apply Bayesian analysis to cognate data as a phylogenetic method. They find that the phylogenetic methods perform best when compared to the gold standard classification.
Ruette et al. [20] measure the distance between Belgian and Netherlandic Dutch using two similarity measures in the Vector Space Model (VSM). They apply the two methods to a Dutch corpus collected from two registers (quality newspapers and Usenet) and topics related to politics and the economy. They exploit the profile-based approach (where the frequencies of pre-selected words are compared across speakers' data) in addition to the text categorization method. For the profile-based approach they implement the City-Block distance as a straightforward descriptive distance measure. The text categorization method, on the other hand, uses TF-IDF weighted document vectors and measures distance as the complement of cosine similarity.
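The TF-IDF-plus-cosine setup can be sketched as follows. This is an illustrative reimplementation under common definitions of TF-IDF, not code from the cited study, and the toy documents are invented:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for tokenised documents,
    using raw term frequency and idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: tf * log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine_distance(u, v):
    """Complement of cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

docs = [["kyf", "HAlk"],             # hypothetical dialect A sample
        ["kyf", "HAlk", "alywm"],    # hypothetical dialect B sample
        ["šlonak"]]                  # hypothetical dialect C sample
vecs = tfidf_vectors(docs)
print(cosine_distance(vecs[0], vecs[1]))
```

Note that a word occurring in every document receives an IDF weight of zero, so the distance is driven by the discriminative vocabulary, which is exactly what one wants when comparing closely related varieties.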
3. Qualitative differences between MSA and DA
Arabic is characterized by its rich morphology and vocabulary. For instance, the Arabic word wsyʿṭyk means "and he will give you" in English, so one word in Arabic may correspond to five words in English [29], which makes comparison between languages/dialects challenging. This is true for both MSA and dialectal Arabic. However, MSA and DA have a number of differences that make it difficult to apply state-of-the-art MSA natural language processing tools to DA. Previous attempts to do so have resulted in very low performance due to the significant differences between the varieties: [30] report that over one third of Levantine verbs cannot be analysed using an MSA morphological analyser. The degree of variation between MSA and dialectal Arabic depends
on the specific dialect of Arabic. MSA and dialectal Arabic differ to varying degrees phonologically, orthographically, morphologically, syntactically, lexically and semantically [31,32]. In this section we describe some qualitative differences between MSA and the dialects, based on our observation of examples.
3.1. Orthographical and Phonological Differences
Dialectal Arabic (DA) does not have an established standard orthography like MSA. Mostly, the Arabic script is used to write DA, but in some cases, e.g. in Lebanese, the Latin alphabet is used for writing short messages or posting on social media. For example, kyfk / "how are you" is represented as Keifk. Another example is the pronunciation of dialectal words containing the letter q, which depends on the dialect and the region. For instance, Palestinian speakers from rural and urban regions pronounce it as a glottal stop /ʔ/ or as /k/, while Bedouins pronounce it as /g/. The word qāl / "say" is pronounced, and sometimes written, as qāl, kāl, ʔāl or ḡāl [33].
3.2. Morphological Differences
The dialects, like MSA and other Semitic languages, make extensive use of particular morphological patterns in addition to a large set of affixes (prefixes, suffixes, or infixes) and clitics. There are therefore some important differences between MSA and dialectal Arabic in terms of morphology, arising from the way these clitics, particles and affixes are used [34]. Some examples are illustrated in Tables 2 and 3.
Table 2. Examples of morphological differences

Example | Dialect word | Dialect | MSA | English
Using multiple words together | kyfk | Levantine | kyf ḥālk | How are you?
 | mʿlš | Egyptian | lā yhm | Does not matter
Sharing the stem with different affixes | mbdrsš | Palestinian | lā ydrs | He does not study
 | mā bydrs | Syrian | |
 | mbydrsš | Egyptian | |
The future markers ḥ, rāḥ | ḥylʿb | Palestinian | swf ylʿb | He will play
 | rāḥ ylʿb | Palestinian | |
Clitic b for the present tense | byākl | Egyptian | yʾākl | He is eating
 | ʿm btṭbḫ | Syrian | ʾanā ʾaṭbḫ | I am cooking
3.3. Syntactic Dierences
Syntactically, MSA and DA are very similar with some dierences regarding word order. For example, the OVS
and OSV word orders are most commonly used in MSA while in dialects other word order patterns can be found.
For example, in Levantine SVO is most commonly used, while in Maghrebi VSO is used to a great extent [35].
Furthermore, in dialectal Arabic it is common to use masculine plural or singular forms instead of dual and feminine
plural forms [36].
3.4. Lexical and Semantic dierences
Many DA words are borrowed from a variety of other languages like Turkish, French, English, Hebrew, Persian
and others depending on the speaker contact with these languages. Table 4shows some of the borrowed words. New
Kathrein Abu Kwaik et al. / Procedia Computer Science 142 (2018) 2–13 5
distance between dialects is measured based on the percentage of non-cognates in the MSA Swadesh list. Moreover, he employs Levenshtein distance to compute the distance between lexical items at the phonemic level, based on the IPA transcription of the words in the Swadesh list. He concludes that Gulf and Levantine are the closest dialects to MSA, followed by Egyptian, while Moroccan is the farthest. The most significant limitation of this experiment is how the data were collected: only two male speakers per dialect were recorded, which also restricts the gender balance and geographical coverage. Also, the two modern dictionaries used to translate the Swadesh list into the corresponding MSA list were authored by Levantine authors, which might bias the MSA list towards Levantine to some degree. Finally, although the intention is to measure lexical variation, the study uses a phonemic representation of words, which may also reveal other, more subtle non-lexical differences.
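The Levenshtein distance used in these studies is the standard dynamic-programming edit distance; a minimal sketch over plain symbol sequences (the studies themselves operate on IPA transcriptions):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence s into sequence t."""
    prev = list(range(len(t) + 1))          # distances from "" to t[:j]
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]
```

For example, levenshtein("qal", "gal") is 1 (a single substitution), mirroring the q/g pronunciation variants discussed in Section 3.1.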
Meftouh et al. [22] present PADIC (Parallel Arabic Dialect Corpus). It includes five dialects: two Algerian (from the cities of Algiers and Annaba), one Tunisian and two Levantine (Palestinian and Syrian). The authors present an analytical linguistic study of PADIC in which they run experiments on every pair of dialects and on each dialect paired with MSA, including:
- identifying the most frequent words in each dialect;
- computing the percentage of common lexical units at both the document and the sentence level, to characterise the relation between the dialects and MSA; and
- measuring cross-language divergence in terms of the Hellinger distance, to determine which language is closer to which.
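The Hellinger distance in the last item compares the unigram frequency distributions of two corpora; a minimal sketch, assuming each corpus is given as a flat list of tokens:

```python
from math import sqrt
from collections import Counter

def hellinger(words_a, words_b):
    """Hellinger distance between the unigram distributions of two
    tokenised corpora: 0 for identical distributions, 1 for disjoint ones."""
    ca, cb = Counter(words_a), Counter(words_b)
    na, nb = sum(ca.values()), sum(cb.values())
    total = sum((sqrt(ca[w] / na) - sqrt(cb[w] / nb)) ** 2
                for w in set(ca) | set(cb))
    return sqrt(total / 2)
```

Because the measure works on relative frequencies, corpora of different sizes can be compared directly.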
The experiments show that the Palestinian dialect is the closest to MSA, followed by Tunisian and Syrian, whereas the Algerian dialects are the most different. The results are as expected: they demonstrate that Tunisian is closer to Algerian than to Palestinian and Syrian. In addition, the closest dialects according to the distance measurements are the Algerian dialects on the one hand and Palestinian and Syrian on the other. Even though the results are reasonable, the corpus has a shortcoming: it was manually translated from Algerian conversations to MSA, and onwards to the other dialects, by one native speaker of each dialect, which introduces several biases.
Rama et al. [12] present a computational classification of the Gondi dialects, which are spoken in central India, applying tools from dialectometry and phylogenetics. They use multilingual word lists for 210 concepts at 46 sites where Gondi is the dominant dialect, and rely on the Glottolog classification as a gold standard to evaluate their results. To compute aggregate distances, they use the IPA to convert the word lists to pronunciation data. Levenshtein distance and Long Short-Term Memory neural networks are used as dialectometric methods to measure the distance between every pair of words on the lists. Moreover, they also apply Bayesian phylogenetic analysis to the cognate data. They find that the phylogenetic methods perform best when compared to the gold standard classification.
Ruette et al. [20] measure the distance between Belgian and Netherlandic Dutch using two similarity measures in the Vector Space Model (VSM). They apply the two methods to a Dutch corpus collected from two registers (quality newspapers and Usenet) and from topics related to politics and economy. They exploit a profile-based approach (where the frequencies of pre-selected words are compared across speakers' data) in addition to a text-categorisation method. For the profile-based approach they use the City-Block distance as a straightforward descriptive distance measure. The text-categorisation method, on the other hand, represents documents with tf-idf weights and measures distance as the complement of cosine similarity.
3. Qualitative differences between MSA and DA
Arabic is characterized by its rich morphology and vocabulary. For instance, the Arabic word wsyʿṭyk means "and he will give you" in English, so one word in Arabic may correspond to five words in English [29], which makes the comparison between languages/dialects challenging. This is true for MSA as well as dialectal Arabic. However, MSA and DA have a number of differences that make it difficult to apply state-of-the-art MSA natural language processing tools to DA. Previous attempts to do so have resulted in very low performance due to the significant differences between the varieties. [30] report that over one third of Levantine verbs cannot be analysed using an MSA morphological analyser. The degree of variation between MSA and dialectal Arabic depends on the specific dialect of Arabic. MSA and dialectal Arabic differ to different degrees phonologically, orthographically, morphologically, syntactically, lexically and semantically [31,32]. In this section we describe some qualitative differences between MSA and the dialects based on our observation of examples.
3.1. Orthographical and Phonological Differences
Dialectal Arabic (DA) does not have an established standard orthography like MSA. Mostly, Arabic script is used to write DA, but in some cases, e.g. in Lebanese, the Latin alphabet is used for writing short messages or posting on social media. For example, kyfk / "how are you" is represented as Keifk. Another example is the pronunciation of dialectal words containing the letter q, which depends on the dialect and region. For instance, Palestinian speakers from rural and urban regions pronounce it as /ʔ/ (glottal stop) or /k/, while Bedouins pronounce it as /g/. The word qāl / "say" is pronounced, and sometimes written, as qāl, kāl, ʾyāl or ǧāl [33].
3.2. Morphological Differences
Dialects, like MSA and other Semitic languages, make extensive use of particular morphological patterns in addition to a large set of affixes (prefixes, suffixes, or infixes) and clitics. There are therefore some important differences between MSA and dialectal Arabic in terms of morphology, owing to the way these clitics, particles and affixes are used [34]. Some examples are illustrated in Tables 2 and 3.
Table 2. Examples of morphological differences

Phenomenon | Dialect word | Dialect | MSA | English
Using multiple words together | kyfk | Levantine | kyf ḥālk | How are you?
Using multiple words together | mʿlš | Egyptian | lā yhm | Does not matter
Sharing the stem with different affixes | mbdrsš | Palestinian | lā ydrs | He does not study
Sharing the stem with different affixes | mā bydrs | Syrian | lā ydrs | He does not study
Sharing the stem with different affixes | mbydrsš | Egyptian | lā ydrs | He does not study
The future markers ḥ, rāḥ for MSA swf ("will") | ḥylʿb | Palestinian | swf ylʿb | He will play
The future markers ḥ, rāḥ for MSA swf ("will") | rāḥ ylʿb | Palestinian | swf ylʿb | He will play
Clitic b for the present tense | byākl | Egyptian | yʾakl | He is eating
Clitic b for the present tense | ʿm btṭbḫ | Syrian | ʾanā ʾaṭbḫ | I am cooking
3.3. Syntactic Differences
Syntactically, MSA and DA are very similar, with some differences regarding word order. For example, the OVS and OSV word orders are most commonly used in MSA, while other word-order patterns can be found in the dialects: in Levantine, SVO is most common, while in Maghrebi, VSO is used to a great extent [35]. Furthermore, in dialectal Arabic it is common to use masculine plural or singular forms instead of dual and feminine plural forms [36].
3.4. Lexical and Semantic Differences
Many DA words are borrowed from a variety of other languages, like Turkish, French, English, Hebrew, Persian and others, depending on the speakers' contact with these languages. Table 4 shows some of the borrowed words. New lexical items appear mostly in the dialects and not in MSA, as shown by the example in Table 5. Another thing to note is that dialects and MSA share words, but with different meanings. For example, the word dwl means "these" in Egyptian but "countries" in MSA.

Table 3. Differences in negation between the dialects

MSA: ʾaʿrf ("know"); negated: lā ʾaʿrf ("don't know")

Dialect | Negated form
Palestinian | mš ʿārf
Jordanian | mš ʿārf
Syrian | mā bʿrif
Lebanese | mā bʿrif
Egyptian | mʿrfš
Algerian | mš nʿrf
Tunisian | mlbʿālyš, mnyš ʿārf
Gulf | mdry
Iraqi | mā ʾadry
Table 4. Examples of words borrowed from other languages

Word | Origin | MSA | English
ṭrbyzh | Turkish | ṭāwlh | Table
bndwrh | Italian | ṭmāṭm | Tomatoes
ʾastād̲ | Persian | mdrs | Teacher
twf | Hebrew | ǧyd | Good
ʾafwkādw | French | mḥāmy | Lawyer
tlyfwn | English | hātf | Telephone
Table 5. Examples of new lexical items in the dialects

MSA: ālʾān ("now")

Dialect | Word
Levantine | hlʾa, hlqyt
Bedouin | hlḥyn
Saudi Arabia | dḥyn
Iraqi | hālwqt
Libyan | twā
Tunisian | twh
Algerian | twā
Egyptian | dlwqty, dlwqt
4. Quantitative differences between MSA and DA
4.1. Arabic Corpora
Ferguson [3] was the first to define the term diglossia. He identified and defined the most important features for understanding the difference between the official variety (H) and the informal varieties (L). One of these features is the lexicon. In his own words: "A striking feature of diglossia is the existence of many paired items, one H and one L, referring to fairly common concepts frequently used in both H and L, where the range of meaning of the two items is roughly the same, and the use of one or the other immediately stamps the utterance or written sequence as H or L."
In this work, we examine several existing Arabic corpora, so that we can include as much dialectal data as possible. Table 6 shows the corpora we use and the dialectal data they contain. Table 7 shows statistics for each corpus, where |d| is the number of documents (sentences) in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words).
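The |d|, |w| and |v| statistics can be computed directly; a small sketch, assuming each corpus is represented as a list of tokenised documents:

```python
def corpus_stats(documents):
    """Return |d| (documents/sentences), |w| (running words) and
    |v| (vocabulary size) for a corpus of tokenised documents."""
    tokens = [w for doc in documents for w in doc]
    return {"|d|": len(documents),
            "|w|": len(tokens),
            "|v|": len(set(tokens))}
```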
Table 6. List of Arabic corpora used to investigate the differences between dialects

Corpus Name | Type | Dialects | Description
PADIC (Parallel Arabic Dialect Corpus) | Parallel | MSA, Algerian, Tunisian, Palestinian, Syrian | Collected from Algerian chats and conversations, which were translated to MSA and then to the other dialects.
Multi-dialectal Arabic parallel corpus | Parallel | MSA, Egyptian, Syrian, Palestinian, Tunisian, Jordanian | Originally built on Egyptian dialect sentences extracted from an Egyptian-English corpus; translated to the remaining dialects by four translators.
SDC (Shami Dialect Corpus) | Non-parallel | Palestinian, Syrian, Jordanian, Lebanese | Collected from different sources on the Internet: social media, blogs, stories and public figures.
WikiDocs Corpus | Comparable | MSA, Egyptian | Contains comparable documents from Wikipedia.
The two Algerian dialects are the basis of the Parallel Arabic Dialect Corpus (PADIC), which was collected from daily conversations, movies and TV shows presented in the Annaba and Algiers dialects. The two corpora were transcribed by hand and then translated to MSA; MSA then served as the pivot language for constructing the Syrian, Palestinian and Tunisian versions. The authors adopt Arabic script to write dialectal words: if a dialectal word also exists in MSA, it is written as in MSA without any change; otherwise, it is written as it is uttered. Some consider these rules drawbacks of the corpus, since they bias the dialects towards MSA and make the translated sentences dependent on the annotators [22]. The corpus is not considered fully representative of every dialect due to the limited number of translators: only two translators were involved for the Levantine dialects, while for Tunisian there were 20 speakers, all from the south of Tunisia, where the dialect is close to Standard Arabic.
The Multi-dialectal Arabic parallel corpus is built on the English-Egyptian corpus [37], where the Egyptian sentences were selected as the starting point for the new parallel corpus. Five translators, one for every dialect, were asked to translate the Egyptian corpus to the Palestinian, Jordanian, Syrian and Tunisian dialects, while Egyptian speakers translated the corpus to the corresponding MSA [38]. Using the Egyptian sentences as the pivot makes the corpus heavily influenced and biased by the Egyptian dialect, which is clearly shown in our results in the following sections.
The WikiDocs corpus is extracted from Arabic Wikipedia articles and their corresponding Egyptian Wikipedia articles [39]. It should be noted that many of the Egyptian articles are not detailed, as most contain only one or two sentences. This is in contrast to the MSA articles, which cover each subject in full detail. The Shami Dialect Corpus (SDC) is collected from different domains, like social life, sports, housework, cooking, etc., and from resources such as personal blogs, social-media posts by public figures, and stories written in DA. It focuses on public figures from Levantine countries. It is not a parallel corpus, thus the measures are computed over the whole corpus and not per document [32].
Table 7. Statistics of the corpora used

PADIC:               MSA    PA     ALG    SY     TN
|d|                  6.4K   6.4K   6.4K   6.4K   6.4K
|w|                  51K    51K    48K    49K    48K
|v|                  9.4K   9.6K   9.4K   10K    10.6K

SDC:                 PA     JO     SY     LB
|d|                  21K    32K    48K    16K
|w|                  0.35M  0.47M  0.7M   0.2M
|v|                  56K    69K    63K    34K

Multi-dialect corpus: MSA   PA     JO     SY     TN     EG
|d|                  1K     1K     1K     1K     1K     1K
|w|                  11.9K  10.5K  9.7K   11.5K  10.6K  10.9K
|v|                  4.4K   4K     3.6K   4K     3.8K   4.5K

WikiDocs corpus:     MSA    EG
|d|                  459K   16K
|w|                  83.5M  2.18M
|v|                  4.7M   293.5K
In what follows, we exploit various approaches to the lexicon to precisely characterise the difference between MSA and the other Arabic dialects in terms of lexical distance. The type of corpus affects the way we implement each measure, as follows:
- for parallel and comparable corpora, the comparison is at the document (sentence) level, and the average is then taken at the corpus level;
- for non-parallel and non-comparable corpora, the comparison is at the corpus level, given that the data belong to the same domain.
In all experiments we used Python to implement the algorithms, together with the Gensim library for some of the methods. As the corpora are already preprocessed, we did not do any further preprocessing.³ In the next subsections, we present the measures we use in our experiments.
4.2. Lexical Sharing and Overlapping
The Jaccard index is a measure of how similar two sets are. Given that the dialects share many words, we compute the percentage of vocabulary that overlaps between the dialects according to Equation 1. Table 8 presents the similarity overlap across dialects. Palestinian shows the highest vocabulary overlap with MSA in PADIC, and comes second after Egyptian in the Multi-dialect corpus. The measurement on the SDC shows reasonable overlap across the Levantine dialects, while in the comparable corpus the overlap between MSA and Egyptian does not exceed 0.1.

JaccardIndex(A, B) = |A ∩ B| / |A ∪ B|    (1)
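Equation 1 translates directly into code over the two vocabularies:

```python
def jaccard_index(vocab_a, vocab_b):
    """Jaccard index of two vocabularies (Equation 1):
    |A intersect B| / |A union B|."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)
```

Applied to a parallel corpus, the index is computed per sentence pair and then averaged; for non-parallel corpora it is computed once over the full vocabularies.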
Table 8. Percentage of vocabulary overlap between dialects

PADIC:               ALG    TN     SY     PA
MSA                  0.1    0.14   0.14   0.19
PA                   0.13   0.14   0.25
SY                   0.12   0.16
TN                   0.17

Multi-dialect corpus: EG    JO     TN     SY     PA
MSA                  0.21   0.14   0.13   0.15   0.16
PA                   0.23   0.25   0.18   0.24
SY                   0.23   0.26   0.18
TN                   0.18   0.18
JO                   0.21

SDC:                 LB     JO     SY
PA                   0.15   0.21   0.19
SY                   0.16   0.2
JO                   0.16

WikiDocs corpus:     EG
MSA                  0.1
4.3. Vector Space Model (VSM)
The VSM approach can be broken down into three steps. First, document indexing, where each document is represented by its content-bearing words, which in turn form a document-term vector; the VSM represents all documents as vectors in a high-dimensional space in which each dimension corresponds to a term in the document collection [7]. Secondly, term weighting, where a weighting scheme computes a weight for each term in the document vector; the most common scheme weights the frequency of occurrence of a term by its inverse document frequency (tf-idf). Finally, a similarity coefficient is computed between each pair of vectors to indicate a ranking of documents [40].

³ It is possible that the preprocessing techniques that have been applied to the different corpora might affect their comparison, which is an unfortunate limitation of our approach in terms of the implications for language use in general.
We use the VSM to measure the similarity between the dialects and MSA by comparing the terms in their documents or sentences. Clearly, not all words in a dialect or a document are equally important. Most current approaches remove all stop words during the preprocessing phase. However, we have decided to index all words, as many of the stop words act like function words and are therefore distinctive of certain dialects. In order to overcome the out-of-dictionary problem, we build a vector model for each pair of dialects: for the first dialect (e.g. MSA), we build a vector model with the tf-idf weighting scheme, and the second dialect is treated as a query vector against the first. Since spatial closeness corresponds to conceptual similarity (words that are used in the same documents are similar), we measure the cosine similarity, a symmetric measure, between the main vector model (first dialect) and the query vector (second dialect); what a vector represents in each case depends on the kind of corpora we are comparing, as explained above. Table 9 presents the similarity across dialects for all corpora.
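The tf-idf weighting and cosine comparison described above can be sketched in plain Python (the actual experiments used Gensim; this is a simplified stand-in that keeps all words, including stop words, and uses a +1-smoothed idf so that terms occurring in every document are not zeroed out):

```python
from collections import Counter
from math import log, sqrt

def tfidf(docs):
    """tf-idf vectors (as dicts) for a list of tokenised documents."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    return [{w: tf * (1.0 + log(n / df[w])) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = sqrt(sum(wt * wt for wt in u.values()))
    nv = sqrt(sum(wt * wt for wt in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Comparing an MSA sentence with its dialectal counterpart then reduces to cosine(vecs[0], vecs[1]); for parallel corpora this is done per sentence pair and averaged.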
Table 9. Similarity across dialects for all corpora based on the VSM

PADIC:               ALG    TN     SY     PA
MSA                  0.27   0.38   0.37   0.5
PA                   0.38   0.47   0.63
SY                   0.34   0.41
TN                   0.44

Multi-dialect corpus: EG    JO     TN     SY     PA
MSA                  0.5    0.38   0.37   0.4    0.4
PA                   0.59   0.66   0.48   0.62
SY                   0.63   0.7    0.5
TN                   0.49   0.47
JO                   0.56

SDC:                 LB     JO     SY
PA                   0.84   0.86   0.77
SY                   0.81   0.9
JO                   0.84

WikiDocs corpus:     EG
MSA                  0.4
The results show that the Palestinian dialect is close to MSA in both PADIC and the Multi-dialect corpus, with similarities of 0.5 and 0.4 respectively, while the Tunisian and Algerian dialects are the furthest from MSA. Moreover, on the SDC we observe high similarity between the individual Levantine dialects; for example, Jordanian is the closest to Palestinian, which coincides with informal observations by native speakers of both dialects.
It is worth mentioning that the Egyptian dialect records the highest similarity with MSA in the Multi-dialect corpus, as we expected: the corpus is biased towards Egyptian, since Egyptian was the pivot language when the corpus was built, and this is reflected in all the measures used here. However, a corresponding pivot-language bias between Algerian and MSA is not observed in the PADIC corpus, as these are the least similar varieties.
4.4. Latent Semantic Indexing (LSI)
Unlike the VSM and other retrieval methods, LSI can address the problems of synonymy and polysemy among words. It analyzes the documents in order to represent the concepts they contain: LSI maps the vector space into a new, compressed space by reducing the dimensions of the term matrix using Singular Value Decomposition (SVD). Through SVD, the main associative patterns and trends are extracted from the document space and the noise is discarded; in other words, it produces the best possible reconstruction of the document matrix with the most valuable information [7]. We exploit the LSI model to measure the similarity between the dialects. We build the model on all the dialects (the full corpus) and test it on one dialect in each run. The model outputs the similarity between the test dialect and every dialect used to build the model. Table 10 shows the similarities among the Arabic dialects for all corpora.
Palestinian appears close to MSA only in PADIC, whereas the Tunisian dialect shows a close relation to MSA in both corpora. In addition, it is obvious that the relations between the dialects in the Levantine corpus (SDC) are very strong, as is the relation between Algerian and Tunisian. These results reflect artefacts of the LSI model, which connects the data according to topics and clusters.
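The LSI pipeline described above (term-document matrix, truncated SVD, similarities in the latent space) can be sketched with NumPy; this is a simplified stand-in for the Gensim LSI model used in the experiments, and the number of latent dimensions k is an assumed parameter:

```python
import numpy as np

def lsi_doc_similarity(term_doc, k=2):
    """Project the columns (documents) of a term-document count matrix
    into a k-dimensional latent space via truncated SVD, and return the
    matrix of pairwise cosine similarities between documents."""
    u, s, vt = np.linalg.svd(np.asarray(term_doc, dtype=float),
                             full_matrices=False)
    docs = (np.diag(s[:k]) @ vt[:k]).T            # one row per document
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.clip(norms, 1e-12, None)     # avoid division by zero
    return docs @ docs.T
```

Documents that draw on the same latent topics receive a similarity near 1 even when their exact word choices differ, which is why LSI groups the data by topics and clusters rather than by surface vocabulary alone.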
Kathrein Abu Kwaik et al. / Procedia Computer Science 142 (2018) 2–13 9
Author name /Procedia Computer Science 00 (2018) 000–000 7
In what follows, we exploit various approaches to the lexicon to precisely clarify the dierence between MSA and
other Arabic dialects in term of lexical distance. The type of corpora aects the way we implement each measure as
follows:
for parallel and comparable corpora: the comparison is at the document (sentence) level, then the average is
taken at a corpus level;
for non-parallel and non-comparable corpora: the comparison is at the corpus level, given that the data belong
to the same domain.
In all experiments we have used Python as a programming language to implement the algorithms and used the Gensim
library for some methods. As the corpora are already preprocessed, we did not do any further pre-processing.3In the
next subsections, we present the measures what we use in our experiments.
4.2. Lexical Sharing and Overlapping
Jaccard Index is a measure of how similar two data sets are. Given that dialects share many words, we compute the
percentage of vocabularies that overlap between these dialects according to Equation 1. Table 8presents the similarity
overlap across dialects. Palestinian is the most similar to MSA, that coming after the Egyptian dialect, with the
highest percentage of vocabulary overlap in both parallel corpora. The measurement on the SDC shows a reasonable
overlapping across the Levantine dialects, while in the comparable corpus the overlapping between the MSA and the
Egyptian does not exceed the 0.1.
JaccardIndex(A,B)=|AB|
|AB|(1)
Table 8. Percentage of vocabulary overlapping between dialects
PADIC Multi-dialect corpus
ALG TN SY PA EG JO TN SY PA
MSA 0.1 0.14 0.14 0.19 MSA 0.21 0.14 0.13 0.15 0.16
PA 0.13 0.14 0.25 PA 0.23 0.25 0.18 0.24
SY 0.12 0.16 SY 0.23 0.26 0.18
TN 0.17 TN 0.18 0.18
JO 0.21
SDC WikiDocs corpus
LB JO SY EG
PA 0.15 0.21 0.19 MSA 0.1
SY 0.16 0.2
JO 0.16
4.3. Vector Space Model (VSM)
VSM is broken down into three steps. First, document indexing where each document is represented by the content
bearing words which, in turn, are represented as a document-terms vector. VSM represents all documents as vectors
in a high dimensional space in which each dimension of the space corresponds to a term in the document collection
[7]. Secondly, term weighting where a weighting schema is used to compute the term weightings for each term in
the represented vector (document). The most common weighting schema is to employ the frequency of occurrence
3It is possible that the preprocessing techniques that have been used on dierent corpora might aect their comparison, which is an unfortunate
limitation of our approach in terms of the implications for language use in general.
8Author name /Procedia Computer Science 00 (2018) 000–000
expressed as a ration between frequency and inverse document frequency (tf-idf). A similarity coecient is then
computed between each pair of vectors to indicate a ranking of documents [40].
We utilize the VSM to measure the similarity between dialects and MSA by comparing the terms in their documents or sentences. Clearly, not all words in a dialect or a document are equally important. Most current approaches remove all the stop words during the preprocessing phase. However, we have decided to index all words, as many of the stop words act like function words and are therefore distinctive of certain dialects. In order to overcome the out-of-dictionary problem we build a vector for each pair of dialects. For the first dialect (MSA), we build a vector model and employ the tf-idf weighting scheme. The second dialect is treated as the query vector compared to the first dialect. Spatial closeness corresponds to conceptual similarity: words that are used in the same documents are similar. We therefore measure the cosine similarity, a symmetric measure, between the main vector model (first dialect) and the query vector (second dialect); what a vector represents in each case depends on the kind of corpora we are comparing, as explained above. Table 9 presents the similarity across dialects for all corpora.
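As a concrete sketch of this pipeline (not the authors' implementation), the snippet below builds smoothed tf-idf vectors over the shared vocabulary of two samples and scores them with cosine similarity. The romanized token lists are invented for illustration; stop words are deliberately kept, as in the study.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf vectors over the shared vocabulary of `docs` (lists of tokens).
    idf is smoothed as 1 + log(N/df), so terms occurring in every document
    still carry some weight."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    return [[Counter(d)[t] * (1.0 + math.log(n / df[t])) for t in vocab]
            for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two invented, romanized token samples standing in for a dialect pair.
dialect_a = "fy almdrsh alwld katb aldrs".split()
dialect_b = "b almdrsh alwld katb alwzyfh".split()
va, vb = tfidf_vectors([dialect_a, dialect_b])
print(round(cosine(va, vb), 2))
```

The more tokens two samples share (and the rarer the shared tokens across the collection), the higher the cosine score, which is the intuition behind the dialect comparisons in Table 9.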
Table 9. Similarity across dialects for all corpora based on VSM

PADIC                 ALG    TN     SY     PA
MSA                   0.27   0.38   0.37   0.5
PA                    0.38   0.47   0.63
SY                    0.34   0.41
TN                    0.44

Multi-dialect corpus  EG     JO     TN     SY     PA
MSA                   0.5    0.38   0.37   0.4    0.4
PA                    0.59   0.66   0.48   0.62
SY                    0.63   0.7    0.5
TN                    0.49   0.47
JO                    0.56

SDC                   LB     JO     SY
PA                    0.84   0.86   0.77
SY                    0.81   0.9
JO                    0.84

WikiDocs corpus       EG
MSA                   0.4
The results show that the Palestinian dialect in both PADIC and the Multi-dialect corpus is close to MSA, with 0.5 and 0.4 similarity respectively, while the Tunisian and Algerian dialects are the furthest from MSA. Moreover, on SDC we can demonstrate a high similarity between individual Levantine dialects. For example, Jordanian is the closest to Palestinian, which seems to coincide with informal observations by native speakers of both dialects.
It is worth mentioning that the Egyptian dialect records the highest similarity with MSA in the Multi-dialect corpus, as we previously expected. The corpus is biased towards the Egyptian dialect, as Egyptian was the pivot language when the corpus was built. This is reflected in all the measures used here. However, the bias of the pivot language is not reflected between Algerian and MSA in the PADIC corpus, as these are the least similar varieties.
4.4. Latent Semantic Indexing (LSI)
Unlike VSM and other retrieval methods, LSI can address the problem of synonymy and polysemy among words.
It analyzes the documents in order to represent the concepts they contain. LSI tries to map the vector space into a new
compressed space by reducing the dimensions of the term-document matrix using Singular Value Decomposition (SVD). By
using SVD, the main associative patterns and trends are extracted from the document space and the noise is ignored.
In other words, it makes the best possible reconstruction of the document matrix with the most valuable information
[7]. We exploit the LSI model to measure the similarity between the dialects. We build the model with all the dialects
(full corpus) and test it on one dialect in each run. The model outputs the similarity between the test dialect and every
dialect used to build the model. Table 10 shows the similarities among the Arabic dialects for all corpora.
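To make the SVD step concrete, here is a self-contained toy sketch (a hand-rolled power iteration, not the model used in the experiments). Three invented documents are mapped into a 2-dimensional latent space; documents about the same topic end up close in that space even though they do not share all their terms.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lsi_doc_vectors(A, k=2, iters=500):
    """Map the columns (documents) of the term-document matrix A into a
    k-dimensional latent space via power iteration with deflation on the
    Gram matrix A^T A. Document i becomes (sigma_1*v_1[i], ..., sigma_k*v_k[i])."""
    m, n = len(A), len(A[0])
    G = [[sum(A[t][i] * A[t][j] for t in range(m)) for j in range(n)]
         for i in range(n)]
    docs = [[] for _ in range(n)]
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]          # asymmetric start vector
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
        sigma = math.sqrt(max(lam, 0.0))
        for i in range(n):
            docs[i].append(sigma * v[i])
        # Deflate: remove the component just found before extracting the next one.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return docs

# Toy term-document counts (rows = terms, columns = 3 documents): documents
# 0 and 1 are about the "school" topic, document 2 is about a different one.
A = [[2, 0, 0],   # school
     [1, 1, 0],   # teacher
     [0, 2, 0],   # lesson
     [0, 0, 2],   # car
     [0, 0, 1]]   # road
d0, d1, d2 = lsi_doc_vectors(A, k=2)
print(cosine(d0, d1), cosine(d0, d2))
```

In the reduced space the two school documents are nearly identical while the third is close to orthogonal, which is the effect LSI exploits when comparing dialect corpora.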
Palestinian appears to be close to MSA only in PADIC, whereas the Tunisian dialect shows a close relation to MSA
in both corpora. In addition to this, it is obvious that the relation between the dialects in the Levantine corpus (SDC)
is very strong, as is the relation between Algerian and Tunisian. These results reflect artefacts of the LSI model, which connects the data according to topics and clusters.
10 Kathrein Abu Kwaik et al. / Procedia Computer Science 142 (2018) 2–13
Table 10. Similarity across dialects for all corpora based on LSI

PADIC                 ALG    TN     SY     PA
MSA                   0.68   0.75   0.69   0.75
PA                    0.78   0.82   0.85
SY                    0.74   0.74
TN                    0.82

Multi-dialect corpus  EG     JO     TN     SY     PA
MSA                   0.72   0.37   0.75   0.4    0.41
PA                    0.82   0.88   0.63   0.9
SY                    0.7    0.94   0.59
TN                    0.74   0.55
JO                    0.73

SDC                   LB     JO     SY
PA                    0.84   0.86   0.77
SY                    0.81   0.9
JO                    0.84

WikiDocs corpus       EG
MSA                   0.8
4.5. Hellinger Distance
We are interested in measuring the divergence between the dialects. Here we use the Hellinger Distance (HD), which measures the difference between two probability distributions [22]. In this work we use Latent Dirichlet Allocation (LDA) to model a vector of discrete probability distributions over topics and measure the distance between the dialects under comparison. LDA is a very common technique used to uncover topics in the data [41]. For simplicity, a Bag of Words (BOW) model is used to represent the data from our corpora. LDA gives us a probability distribution over a specified number of unknown topics, and thus performs a soft clustering of the documents over their words. HD is then used to measure the distance between these topic distributions and those of new documents. The greater the distance, the less similar the dialects, and vice versa.
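The distance itself is straightforward to compute: HD(P, Q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2). The sketch below applies this formula to invented topic distributions standing in for LDA output; the numbers are made up for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same
    support: 0 for identical distributions, 1 for disjoint support."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)

# Hypothetical 5-topic LDA distributions for three dialect samples.
msa = [0.40, 0.30, 0.15, 0.10, 0.05]
pal = [0.35, 0.30, 0.20, 0.10, 0.05]
alg = [0.05, 0.10, 0.15, 0.30, 0.40]

print(hellinger(msa, pal))   # small: similar topic profiles
print(hellinger(msa, alg))   # larger: diverging topic profiles
```

A small distance (similar topic profiles) corresponds to high similarity between the varieties, mirroring how Table 11 is read.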
Table 11 shows the distance between the dialects across all corpora. Palestinian is less dissimilar from MSA than the rest of the dialects in PADIC. Even though in the Multi-dialect corpus the distances to MSA of all dialects except Egyptian are quite close, Tunisian seems to be the closest to MSA. While the Levantine dialects in SDC are all very similar to each other, the Jordanian and Syrian dialects are the closest to each other, and the Palestinian and Lebanese dialects are the most dissimilar.
Table 11. Distance between dialects for all corpora based on Hellinger Distance

PADIC                 ALG    TN     SY     PA
MSA                   0.91   0.83   0.77   0.77
PA                    0.73   0.64   0.58
SY                    0.87   0.81
TN                    0.72

Multi-dialect corpus  EG     JO     TN     SY     PA
MSA                   0.01   0.77   0.76   0.78   0.78
PA                    0.52   0.34   0.77   0.55
SY                    0.53   0.54   0.72
TN                    0.35   0.69
JO                    0.51

SDC                   LB     JO     SY
PA                    0.26   0.18   0.23
SY                    0.25   0.1
JO                    0.2

WikiDocs corpus       EG
MSA                   0.73
4.6. Frequent words and Correlation Coefficient
This step consists of two parts. First, we extract the 30 most frequent words in each dialect; then we collect those words that appear in all dialects and calculate the Pearson correlation coefficient among them with respect to their frequency, as shown in Table 12.
Table 12. The Pearson correlation coefficient between dialects in PADIC and SDC

PADIC                 ALG    TN     SY     PA
MSA                   0.76   0.92   0.67   0.85
PA                    0.97   0.95   0.86
SY                    0.83   0.71
TN                    0.92

SDC                   LB     JO     SY
PA                    0.31   0.42   -0.05
SY                    0.13   0.74
JO                    0.47
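The correlation computation over shared frequent words can be sketched as follows; the frequency counts are invented for illustration, not drawn from the corpora.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented counts of five shared frequent words in two dialects.
freq_d1 = [120, 95, 60, 44, 30]
freq_d2 = [110, 90, 70, 40, 25]
print(round(pearson(freq_d1, freq_d2), 2))   # close to 1: similar usage profiles
```

A coefficient near 1 means the two dialects use their shared frequent words with similar relative frequencies; values near 0 (e.g. PA-SY in SDC) indicate very different usage profiles.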
The result shows a high correlation for the frequent words between MSA and Tunisian, followed by the Palestinian dialect in PADIC. This sheds light on the different usage of frequent words across dialects. For example, Palestinian speakers say fy ālmdrsh / "at the school" while Syrian speakers say bālmdrsh.
For the words that are not shared and have not been included in the correlation experiment, we have calculated the
Term Frequency (TF) as in Equation 2.
TF(t) = (Number of times term t appears in a dialect) / (Total number of terms in the dialect)    (2)
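Equation 2 amounts to the following computation; the toy token sample is invented for illustration.

```python
from collections import Counter

def term_frequencies(tokens):
    """TF(t) as in Equation 2: count of t divided by the total number of
    tokens in the dialect sample."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}

sample = "ya ya habibi fy fy fy almdrsh".split()   # 7 invented tokens
tf = term_frequencies(sample)
print(tf["fy"])   # "fy" occurs 3 times out of 7 tokens
```

Multiplying these values by 100 gives the TF% figures reported per dialect in Table 13.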
As we have already mentioned, we have not eliminated stop words from the corpora, as these keywords are discriminative and representative of each dialect and hence can be used to build a dialectal lexicon. Table 13 shows the 20 most frequent words in PADIC.4
5. Conclusion
In this paper, we estimate the degree of similarity and dissimilarity between MSA and DA on the one hand, and across dialects of Arabic on the other. Different measures have been exploited, such as VSM, LSI and HD, as well as simple measures like vocabulary overlap, the correlation coefficient and Jaccard similarity. More than one corpus has been used. In particular, PADIC, the Multi-Dialect corpus, SDC and Wiki-Docs were used, which include MSA, Levantine dialects, Egyptian and dialects from North Africa. This was done in order to minimise the bias of any of the individual corpora and to address the question of the degree of text representativeness. Most of the measurements used indicate that the Levantine dialects are in general the closest to MSA, while the North African dialects are the farthest. Although the results show some differences due to the nature of the corpora, in general, the results are homogeneous. For example, it is expected that the Egyptian dialect appears very close to MSA in the Multi-Dialect corpus. This is, as mentioned earlier, due to a strong bias of the specific corpus towards the Egyptian dialect, given that it was built from an Egyptian corpus and then translated into other dialects and MSA.
We have shown the degree of convergence between the dialects of the Levant and their linguistic overlap, to such an extent that in some cases it seems impossible to distinguish between them in writing without the presence of phonological information or without adding accent diacritic marks.
It is very clear that we have a new variety, i.e. an informal written dialect, which differs from the spoken dialects. Even if some dialects appear close to each other based on speakers' intuitions, there may be differences in the written form due to the lack of accent diacritics. The reverse is also true: some dialects appear closer lexically in their written form, given that a big part of their vocabulary overlaps, but in their spoken form they are not that close.
This study can be seen as a basis for building Natural Language Processing tools for dialectal processing by adapting what already exists for MSA and focusing on areas of similarity and degrees of difference. The study is the
4 The full tables for all corpora can be found at https://github.com/GU-CLASP/DAdistance
Table 13. The percentage of the most frequent words in PADIC. For each variety (MSA, Palestinian, Syrian, Tunisian, Algerian), the table lists the 20 most frequent words in romanized transliteration together with their TF%; the top entries are MSA lā (1.96), Palestinian ālly (0.84), Syrian bs (0.98), Tunisian bā (0.85) and Algerian ly (1.14).
most extensive of its kind concerned with measuring similarities and differences in Arabic and dialectal Arabic, and represents a basis for new similar investigations focusing on other criteria such as phonological distance, morphological distance and semantic distance. In the future, we plan to employ other methods of measuring similarity and distance based on the semantics of the words, e.g. word embedding techniques such as Word2Vec. In this way, one can extract words that are related in terms of their lexical semantics and use them in automatic machine translation tools for the languages and dialects investigated.
References
[1] Shah, Mustafa, The Arabic language, Routledge, 2008.
[2] Versteegh, Kees, The Arabic language, Edinburgh University Press, 2014.
[3] Ferguson, Charles A., Diglossia, Word 15 (2) (1959) 325–340.
[4] Zouhir, Abderrahman, Language situation and conflict in Morocco, in: Selected Proceedings of the 43rd Annual Conference on African Linguistics, ed. Olanike Ola Orie and Karen W. Sanders, 2013, pp. 271–277.
[5] Jabbari, MJ., Diglossia in Arabic – a comparative study of the Modern Standard Arabic and colloquial Egyptian Arabic, Global Journal of
Human Social Sciences 12 (8) (2012) 23–46.
[6] Clark, Stephen, Vector space models of lexical meaning, in: Lappin, Shalom and Fox, Chris (Eds.), Handbook of Contemporary Semantics, second edition, Wiley-Blackwell, 2015, Ch. 16, pp. 493–522.
[7] Kumar, Ch Aswani, M Radvansky, and J Annapurna, Analysis of a Vector Space Model, Latent Semantic Indexing and formal concept analysis
for Information Retrieval, Cybernetics and Information Technologies 12 (1) (2012) 34–48.
[8] González-Castro, Víctor, Rocío Alaiz-Rodríguez, and Enrique Alegre. Class distribution estimation based on the Hellinger distance, Information Sciences 218 (2013) 146–164.
[9] Chiswick, Barry R and Paul W Miller. Linguistic distance: A quantitative measure of the distance between English and other languages, Journal
of Multilingual and Multicultural Development 26 (1) (2005) 1–11.
[10] Heeringa, Wilbert, Jelena Golubovic, Charlotte Gooskens, Anja Schüppert, Femke Swarte, and Stefanie Voigt. Lexical and orthographic dis-
tances between Germanic, Romance and Slavic languages and their relationship to geographic distance, Phonetics in Europe: Perception and
Production (2013) 99–137.
[11] Sengupta, Debapriya and Goutam Saha. Study on similarity among Indian languages using language verification framework, Advances in
Artificial Intelligence 2015 (2015) 2.
[12] Rama, Taraka, Çağrı Çöltekin, and Pavel Sofroniev, Computational analysis of Gondi dialects, in: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2017, pp. 26–35.
[13] Houtzagers, Peter, John Nerbonne, and Jelena Prokić, Quantitative and traditional classifications of Bulgarian dialects compared, Scando-Slavica 56 (2) (2010) 163–188.
[14] Aminul Islam and Diana Inkpen, Semantic text similarity using corpus-based word similarity and string similarity, ACM Transactions on
Knowledge Discovery from Data (TKDD) 2 (2) (2008) 10.
[15] Robert W Irving and Campbell B Fraser. Two algorithms for the longest common subsequence of three (or more) strings, in: Annual Symposium
on Combinatorial Pattern Matching, Springer, 1992, pp. 214–229.
[16] Abunasser, Mahmoud Abedel Kader. Computational measures of linguistic variation: A study of Arabic varieties, Ph.D. thesis, University of
Illinois at Urbana-Champaign (2015).
[17] Navarro, Gonzalo. A guided tour to approximate string matching, ACM computing surveys (CSUR) 33 (1) (2001) 31–88.
[18] Kondrak, Grzegorz. N-gram similarity and distance, in: International symposium on string processing and information retrieval, Springer,
2005, pp. 115–126.
[19] Needleman, Saul B and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two
proteins, Journal of molecular biology 48 (3) (1970) 443–453.
[20] Ruette, Tom, Dirk Speelman, and Dirk Geeraerts. Measuring the lexical distance between registers in national varieties of Dutch, Aletheia, Publicações da Faculdade de Filosofia da Universidade Católica Portuguesa, 2011.
[21] Anna Huang. Similarity measures for text document clustering, in: Proceedings of the sixth New Zealand computer science research student
conference (NZCSRSC2008), Christchurch, New Zealand, 2008, pp. 49–56.
[22] Harrat, Salima, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz Saad, and Kamel Smaili. Cross-dialectal Arabic processing, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2015, pp. 620–632.
[23] Bigi, Brigitte. Using Kullback-Leibler distance for text categorization, in: European Conference on Information Retrieval, Springer, 2003, pp.
305–319.
[24] Niwattanakul, Suphakit, Jatsada Singthongchai, Ekkachai Naenudorn, and Supachanun Wanapu. Using of Jaccard coefficient for keywords similarity, in: Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1, 2013.
[25] Sebastiani, Fabrizio. Machine Learning in automated text categorization, ACM computing surveys (CSUR) 34 (1) (2002) 1–47.
[26] Swadesh, Morris. Salish internal relationships, International Journal of American Linguistics 16 (4) (1950) 157–167.
[27] Baalbaki, Munir. Al-Mawrid: an English-Arabic dictionary (ālmwrd: qāmws ānǧlyzy ʿrby), Dār al-ʿIlm lil-Malāyīn, Beirut, 1982.
[28] Elias, Elias Antoon and Ed E Elias. Elias’ modern dictionary, Arabic-English, (1983).
[29] Saad, Motaz. Fouille de documents et d’opinions multilingue, Ph.D. thesis, Université de Lorraine (2015).
[30] Habash, Nizar and Owen Rambow. A morphological analyzer and generator for the Arabic dialects, in: Proceedings of the 21st International
Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, 2006, pp. 681–688.
[31] Dasigi, Pradeep and Mona T Diab. Towards identifying orthographic variants in dialectal Arabic., in: IJCNLP, 2011, pp. 318–326.
[32] Qwaider, Chatrine, Motaz Saad, Stergios Chatzikyriakidis, and Simon Dobnik. Shami: A Corpus of Levantine Arabic Dialects, in: Proceedings
of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association
(ELRA), Miyazaki, Japan, 2018.
[33] Jarrar, Mustafa, Nizar Habash, Diyam Akra, and Nasser Zalmout. Building a corpus for Palestinian Arabic: A preliminary study, in: Proceedings
of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 18–27.
[34] Habash, Nizar, Mona T Diab, and Owen Rambow. Conventional orthography for dialectal Arabic, in: LREC, 2012, pp. 711–718.
[35] Meftouh, Karima, Salima Harrat, Salma Jamoussi, Mourad Abbas, and Kamel Smaili. Machine translation experiments on PADIC: A parallel
Arabic Dialect Corpus, in: The 29th Pacific Asia conference on language, information and computation, 2015.
[36] Darwish, Kareem, Hassan Sajjad, and Hamdy Mubarak. Verifiably effective Arabic dialect identification, in: EMNLP, 2014, pp. 1465–1468.
[37] Zbib, Rabih, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F Zaidan, and Chris
Callison-Burch. Machine Translation of Arabic Dialects, in: Proceedings of the 2012 conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human language technologies, Association for Computational Linguistics, 2012, pp. 49–59.
[38] Bouamor, Houda, Nizar Habash, and Kemal Oflazer. A Multidialectal Parallel Corpus of Arabic, in: LREC, 2014, pp. 1240–1245.
[39] Saad, Motaz and Basem O Alijla. WikiDocsAligner: An off-the-shelf Wikipedia documents alignment tool, in: Information and Communication Technology (PICICT), 2017 Palestinian International Conference on, IEEE, 2017, pp. 34–39.
[40] Larson, Ray R. Introduction to Information Retrieval, Journal of the American Society for Information Science and Technology 61 (4) (2010)
852–853.
[41] Blei, David M, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation, Journal of Machine Learning research 3 (Jan) (2003)
993–1022.
Kathrein Abu Kwaik et al. / Procedia Computer Science 142 (2018) 2–13 13
Author name /Procedia Computer Science 00 (2018) 000–000 11
Table 13. The percentage of the most frequent words in PADIC
MSA Palestinian Syrian Tunisian Algerian
Word TF% Word TF % Word TF % Word TF% Word TF%
l¯
a1.96
 ¯
ally 0.84 bs 0.98

b¯
0.85
ly 1.14
֓an 1.44  ¯
anh 0.83
¯
ay 0.92
 ¯
aly 0.78
 w¯
1
 lm 0.81 bs 0.81  ֒m0.89 
¯
ayh 0.73 
¯
ayh 0.82
ly 0.7

¯
ayš 0.8
šw 0.88 l¯
a0.72 
t¯
a֒0.79
n֒m0.7
 0.79  rh
.0.85  ¯
am¯
a0.59
ky 0.67
 hd
¯¯
a0.65  ¯
ah 0.77
šy 0.73
 k¯
an 0.46  l¯
al¯
a0.59
 m¯
ad
¯¯
a0.47 l¯
a0.65 l¯
a0.7
 hd
¯¯
a0.44   w¯
ah
.d0.48
֓il¯
a0.45
 hd
¯¯
a0.64  ¯
anw 0.51
tw 0.37  wl¯
a0.4
 hl 0.45
 ¯
ašy 0.55  mw 0.48
 ֒l¯
0.35
 r¯
any 0.38

d
¯lk 0.42 lm¯
a0.53
 knyr 0.47
 h
.t¯
a0.34

b¯
0.37
 lkn 0.42  hw 0.5 lm¯
a0.45

b¯
ahy 0.33
 h
.t¯
a0.36
 lk 0.39

 ֒š¯
an 0.45
 ¯
ally 0.44  hw 0.31  w¯
allh 0.34
 ֒ndm¯
a0.39 
hyk 0.44 
hyk 0.37
 wqt 0.31  hw 0.32

qlt 0.83
 ¯
ad
¯¯
a0.44  h¯
ad 0.36
 mwš 0.29  r¯
ah
.0.32

֓id
¯¯
a0.35
 ktyr 0.4  ¯
allh 0.35   w¯
ah
.d0.25 b¯
als
.h
.0.32
 lh¯
a0.34  ¯
allh 0.36

lyš 0.34 0.25  dwk 0.31
 hn¯
ak 0.41
 hd
¯h0.35
 ¯
ad
¯¯
a0.31
bršh 0.25 
kym¯
a0.31
 ¯
allh 0.32
zy 0.34
 mtl 0.31
 ¯
ally 0.24 brk 0.3
 lh 0.32

lyš 0.33
 ֒n0.29
šy 0.23
 r¯
ahy 0.3

šy֓0.32
 ¯
any 0.3
 k¯
an 0.28  wl¯
a0.23 ¡¢s
.h
.0.29
most extensive of its kind concerned with measuring similarities and dierences in Arabic and dialectal Arabic, and
represents a basis for new similar investigations, focusing on other criteria such as phonological distance, morpho-
logical distance and semantic distance. In the future, we plan to employ other methods of measuring similarity and
distance based on the semantics of the words, e.g. word embedding techniques with Word2Vec. In this way, one can
extract dierent words in terms of their lexical relatedness, and use them in automatic machine translation tools for
the languages and dialects investigated.
References
[1] Shah, Mustafa., The Arabic language, Routledge, 2008.
[2] Versteegh, Kees, The Arabic language, Edinburgh University Press, 2014.
[3] Ferguson, Charles A., Diglossia, word 15 (2) (1959) 325–340.
[4] Zouhir, Abderrahman., Language situation and conflict in Morocco, in: Selected Proceedings of the 43rd Annual Conference on African
Linguistics, ed. Olanike Ola Orie and Karen W. Sanders, 2013, pp. 271–277.
[5] Jabbari, MJ., Diglossia in Arabic – a comparative study of the Modern Standard Arabic and colloquial Egyptian Arabic, Global Journal of
Human Social Sciences 12 (8) (2012) 23–46.
[6] Clark, Stephen, Vector space models of lexical meaning, in: Lappin, Shalom and FoxS, Chris (Eds.). Handbook of Contemporary Semantics –
second edition, Wiley – Blackwell, 2015, Ch. 16, pp. 493–522.
[7] Kumar, Ch Aswani, M Radvansky, and J Annapurna, Analysis of a Vector Space Model, Latent Semantic Indexing and formal concept analysis
for Information Retrieval, Cybernetics and Information Technologies 12 (1) (2012) 34–48.
[8] González-Castro, Víctor, Rocío Alaiz-Rodríguez, and Enrique Alegre. Class distribution estimation based on the Hellinger distance, Informa-
tion Sciences 218 (2013) 146–164.
[9] Chiswick, Barry R and Paul W Miller. Linguistic distance: A quantitative measure of the distance between English and other languages, Journal
of Multilingual and Multicultural Development 26 (1) (2005) 1–11.
12 Author name /Procedia Computer Science 00 (2018) 000–000
[10] Heeringa, Wilbert, Jelena Golubovic, Charlotte Gooskens, Anja Schüppert, Femke Swarte, and Stefanie Voigt. Lexical and orthographic dis-
tances between Germanic, Romance and Slavic languages and their relationship to geographic distance, Phonetics in Europe: Perception and
Production (2013) 99–137.
[11] Sengupta, Debapriya and Goutam Saha. Study on similarity among Indian languages using language verification framework, Advances in
Artificial Intelligence 2015 (2015) 2.
[12] Rama, Taraka, Ça˘
grı Çöltekin, and Pavel Sofroniev, Computational analysis of Gondi dialects, in: Proceedings of the Fourth Workshop on NLP
for Similar Languages, Varieties and Dialects (VarDial), 2017, pp. 26–35.
[13] Houtzagers, Peter, John Nerbonne, and Jelena Proki´
c, Quantitative and traditional classifications of Bulgarian dialects compared, Scando-
Slavica 56 (2) (2010) 163–188.
[14] Aminul Islam and Diana Inkpen, Semantic text similarity using corpus-based word similarity and string similarity, ACM Transactions on
Knowledge Discovery from Data (TKDD) 2 (2) (2008) 10.
[15] Robert W Irving and Campbell B Fraser. Two algorithms for the longest common subsequence of three (or more) strings, in: Annual Symposium
on Combinatorial Pattern Matching, Springer, 1992, pp. 214–229.
[16] Abunasser, Mahmoud Abedel Kader. Computational measures of linguistic variation: A study of Arabic varieties, Ph.D. thesis, University of
Illinois at Urbana-Champaign (2015).
[17] Navarro, Gonzalo. A guided tour to approximate string matching, ACM computing surveys (CSUR) 33 (1) (2001) 31–88.
[18] Kondrak, Grzegorz. N-gram similarity and distance, in: International symposium on string processing and information retrieval, Springer,
2005, pp. 115–126.
[19] Needleman, Saul B and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two
proteins, Journal of molecular biology 48 (3) (1970) 443–453.
[20] Ruette, Tom, Dirk Speelman, and Dirk Geeraerts. Measuring the lexical distance between registers in national variaties of Dutch, Aletheia,
Publicações da Faculdade de Filosofia da Universidade Católica Portuguesa, 2011.
[21] Anna Huang. Similarity measures for text document clustering, in: Proceedings of the sixth New Zealand computer science research student
conference (NZCSRSC2008), Christchurch, New Zealand, 2008, pp. 49–56.
[22] HarratSalima, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz Saad, and Kamel Smaili. Cross-dialectal Arabic processing, in: Inter-
national Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2015, pp. 620–632.
[23] Bigi, Brigitte. Using Kullback-Leibler distance for text categorization, in: European Conference on Information Retrieval, Springer, 2003, pp.
305–319.
[24] Niwattanakul, Suphakit, Jatsada Singthongchai, Ekkachai Naenudorn, and Supachanun Wanapu. Using of Jaccard coecient for keywords
similarity, in: Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1, 2013.
[25] Sebastiani, Fabrizio. Machine Learning in automated text categorization, ACM computing surveys (CSUR) 34 (1) (2002) 1–47.
[26] Swadesh, Morris. Salish internal relationships, International Journal of American Linguistics 16 (4) (1950) 157–167.
[27] Baalbaki, Munir.
 _
 
  ¯
almwrd: q¯
amws ¯
anˇ
glyzy ֒rby.


   d¯
ar ¯
al֒lm llml¯
ayyn: bt
¯rwt , 1982.
[28] Elias, Elias Antoon and Ed E Elias. Elias’ modern dictionary, Arabic-English, (1983).
[29] Saad, Motaz. Fouille de documents et d’opinions multilingue, Ph.D. thesis, Université de Lorraine (2015).
[30] Habash, Nizar and Owen Rambow. a morphological analyzer and generator for the Arabic dialects, in: Proceedings of the 21st International
Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, 2006, pp. 681–688.
[31] Dasigi, Pradeep and Mona T Diab. Towards identifying orthographic variants in dialectal Arabic., in: IJCNLP, 2011, pp. 318–326.
[32] Qwaider, Chatrine, Motaz Saad, Stergios Chatzikyriakidis, and Simon Dobnik. Shami: A Corpus of Levantine Arabic Dialects, in: Proceedings
of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association
(ELRA), Miyazaki, Japan, 2018.
[33] Jarrar, Mustafa, Nizar Habash, Diyam Akra, and Nasser Zalmout. Building a corpus for Palestinian Arabic: A preliminary study, in: Proceedings
of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 18–27.
[34] Habash, Nizar, Mona T Diab, and Owen Rambow. Conventional orthography for dialectal Arabic, in: LREC, 2012, pp. 711–718.
[35] Meftouh, Karima, Salima Harrat, Salma Jamoussi, Mourad Abbas, and Kamel Smaili. Machine translation experiments on PADIC: A parallel
Arabic Dialect Corpus, in: The 29th Pacific Asia Conference on Language, Information and Computation, 2015.
[36] Darwish, Kareem, Hassan Sajjad, and Hamdy Mubarak. Verifiably effective Arabic dialect identification, in: EMNLP, 2014, pp. 1465–1468.
[37] Zbib, Rabih, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F Zaidan, and Chris
Callison-Burch. Machine Translation of Arabic Dialects, in: Proceedings of the 2012 conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human language technologies, Association for Computational Linguistics, 2012, pp. 49–59.
[38] Bouamor, Houda, Nizar Habash, and Kemal Oflazer. A Multidialectal Parallel Corpus of Arabic, in: LREC, 2014, pp. 1240–1245.
[39] Saad, Motaz and Basem O Alijla. WikiDocsAligner: An off-the-shelf Wikipedia documents alignment tool, in: Information and Communication
Technology (PICICT), 2017 Palestinian International Conference on, IEEE, 2017, pp. 34–39.
[40] Larson, Ray R. Introduction to Information Retrieval, Journal of the American Society for Information Science and Technology 61 (4) (2010)
852–853.
[41] Blei, David M, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (Jan) (2003)
993–1022.
... Some have focused on specific Arabic-speaking regions instead of considering all Arabic-speaking countries. For example, Abu Kwaik et al. (2018) developed a Levantine dialect corpus that consisted of data from four countries located in the Levantine region (i.e., the region including Jordan, Lebanon, Palestine, and Syria). The authors adopted a combination of manual and automatic methods to construct their corpus, including the use of data from Twitter as well as blogs written by individuals who were writing in the dialect of a specified region. ...
... The authors compared the similarities between texts collected from four countries. In the previously-described paper by Kwaik et al. (2018), the authors also compared dialects collected from the four countries in their corpus. They concluded that there is "great overlap between the dialects and dispersion of lexical items between categories," while also acknowledging that there was "a little similarity between the two dialects on the lexical level" when classifying only Jordanian and Lebanese texts. ...
Article
The automatic classification of Arabic dialects is an ongoing research challenge, which has been explored in recent work that defines dialects based on increasingly limited geographic areas like cities and provinces. This paper focuses on a related, yet relatively unexplored topic: the effects of the geographical proximity of cities located in Arab countries on their dialectal similarity. Our work is twofold, reliant on: (1) comparing the textual similarities between dialects using cosine similarity and (2) measuring the geographical distance between locations. We study MADAR and NADI, two established datasets with Arabic dialects from many cities and provinces. Our results indicate that cities located in different countries may in fact have more dialectal similarity than cities within the same country, depending on their geographical proximity. The correlation between dialectal similarity and city proximity suggests that cities that are closer together are more likely to share dialectal attributes, regardless of country borders. This nuance provides the potential for important advancements in Arabic dialect research because it indicates that a more granular approach to dialect classification is essential to understanding how to frame the problem of Arabic dialect identification.
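The first component this abstract describes, cosine similarity between dialect texts, reduces to comparing term-frequency vectors. A minimal sketch under that reading follows; the example texts are illustrative and are not drawn from the MADAR or NADI datasets.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts as bag-of-words term-frequency vectors."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Identical texts score close to 1.0 and texts with no shared tokens score 0.0; the papers above would compute this over full city-level corpora rather than single sentences.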
... Limited previous research has focused on studying the differences between Arabic dialects. In one study, the authors considered several corpora that consisted of countrylevel dialects and compared their commonalities [36]. The authors studied the different dialects' similarity to Modern Standard Arabic and found that "Levantine dialects are in general the closest to MSA, while the North African dialects are the farthest," despite the fact that none of the corpora used in their research included data from the Gulf region. ...
... A fairly large body of literature has been developing on the matter in the 2010s. In this regard, too, dialect corpora are crucial to natural language processing, particularly in dialect classification, automatic dialect identification, sentiment analysis, and opinion mining (for example, Abu Kwaik et al. 2018b; Lulu and Elnagar 2018; Alshutayri and Atwell 2019; Salameh et al. 2018; Boudad et al. 2018 and Duwairi et al. 2015). Abu Kwaik et al. (2019) ... It should be obvious from the dates of most works in this necessarily brief overview that the use of corpora for such applied linguistics applications is a new but rapidly developing research arena, with most of the work appearing in the 2010s, and even in the last few years of that decade. ...
Chapter
Full-text available
The authors examine the application of electronically searchable corpora, from their own experience, in addressing questions pertinent to linguistics as a whole and to matters internal to Arabic, the while lamenting that the field of Arabic linguistics, in its theoretical and applied orientations alike, has not made use of the rich data source that searchable electronic corpora represent. They show how corpora can be used easily to falsify common assumptions and assertions about the human language capacity in general just as they can be used efficiently to query assumptions and assertions about Arabic itself. So, too, do they hold implications for applied uses such as teaching Arabic as a foreign language and translation between Arabic and other languages. In any of these applications, the use of corpora in the analysis of all varieties of Arabic remains underdeveloped compared to their use in the analysis of other languages, especially English.
... People from Syria speak Levantine Arabic. This form differs from MSA in vocabulary, morphology, phonology, and even syntax (Kwaik et al., 2018;Saiegh-Haddad & Schiff, 2016). Upon entering school these children often have experiences learning MSA that are similar to learning an additional language. ...
Article
Word reading is a fundamental skill in reading and one of the building blocks of reading comprehension. Theories have posited that for second language (L2) learners, word reading skills are related if the children have sufficient experience in the L2 and are literate in the first language (L1). The L1 and L2 reading, phonological awareness skills, and morphological awareness skills of Syrian refugee children who speak Arabic and English were measured. These children were recent immigrants with limited L2 skills and varying levels of L1 education that was often not commensurate with their ages. Within- and across-language skills were examined in 96 children, ages 6 to 13 years. Results showed that phonological awareness and morphological awareness were strong within-language variables related to reading. Additionally, Arabic phonological awareness and morphological processing were strongly related to English word reading. Commonality analyses for variables within constructs (e.g., phonological awareness, morphological awareness) but across languages (Arabic and English) in relation to English word reading showed that in addition to unique variance contributed by the variables, there was a high degree of overlapping variance.
Chapter
Full-text available
Morphological analysis is a crucial component in natural language processing. For the Arabic language, many attempts have been made to build morphological analyzers. Despite the increasing attention paid to Arabic dialects, the number of morphological analyzers built for them remains small compared to Modern Standard Arabic. In addition, these tools often cover only a few Arabic dialects, such as Egyptian, Levantine, and Gulf, and do not currently support all of them. In this paper, we present a literature review of morphological analyzers supporting Arabic dialects. We classify their building approaches and propose some guidelines for adapting them to a specific Arabic dialect. Keywords: Morphological analyzer, Arabic dialect, Lexicon, Natural language processing, Standard Arabic, Corpus, Annotation
Thesis
Full-text available
The current study investigates and analyzes the language attitudes of Libyan university students towards two Arabic varieties that are used by Libyans in different linguistic contexts/domains: Modern Standard Arabic (henceforth MSA) and the Libyan Arabic Dialect (henceforth LAD). The purpose of this research is to determine what attitudes university students in Libya hold towards different varieties of their native language and whether there are any significant differences in the students' attitudes towards MSA and LAD based on their gender. The sample of the study is comprised of 108 participants divided equally into 54 male and 54 female university students. The participants were graduate and undergraduate students from various disciplines of study. The instrument used in the research is a language survey containing 43 questions divided into four sections: background information; multiple-choice items; attitude judgments on a 5-point-Likert-scale; and open-ended questions. The results revealed that Libyan students have complex attitudes and varied use of two Arabic varieties in Libya, LAD and MSA in many contexts. There are differences across the contexts about how men and women value and use these two varieties. The study concludes that Libyan students prefer to use MSA in academic and worship domains and LAD in social and media domains. Gender was an important factor in determining the preferences towards MSA and LAD; as a result, female students favor LAD over MSA in some contexts i.e. giving a presentation to classmates and listening to a song.
Article
Purpose This corpus-based study provides a descriptive account of the distribution of the polysemous noun nafs in two Arabic varieties, Modern Standard Arabic (MSA) and Classical Arabic (CA). The research objective is to survey the use of nafs as a reflexive marker in local binding domains and as a self-intensifier in NP-adjoined positions. Design/methodology/approach The consulted corpora are the Timestamped JSI Web corpus for MSA and the Quran corpus for CA. While attending to corpora size differences, MSA and CA exhibit a pattern of difference and similarity in nafs diffusion. Findings In the modern variety, nafs is pervasively used as a reflexive marker in canonical binding domains, along with a less frequent, yet notable, intensifier use, and these uses are partially and cautiously attributed to the specific genre in which they occur. In CA, nafs is mainly recurrent as a polysemous noun, along with extensive use as a reflexive marker in local binding settings. As an intensifier, nafs is totally non-existent in the CA corpus, in the same way as it is in absentia in VP-constituent extraction in MSA. Originality/value Examining whether nafs, as a reflexive marker, deviates from canonical binding in Arabic the way English reflexive pronouns do. Building a general account of this distribution is relevant in understanding the explicit (syntactic) and implicit (discourse-based) dimensions of reflexive marker and self-intensifier processing and interpretation in Arabic as a first and second language.
Article
Purpose Processing narrow focus (NF), the stressed word in the sentence, includes both the perceptual ability to identify the stressed word in the sentence and the pragmatic–semantic ability to comprehend the nonexplicit linguistic message. NF and its underlying meaning can be conveyed only via the auditory modality. Therefore, NF can be considered as a measure for assessing the efficacy of the hearing aid (HA) and cochlear implants (CIs) for acquiring nonexplicit language skills. The purpose of this study was to assess identification and comprehension of NF by HA and CI users who are native speakers of Arabic and to associate NF outcomes with speech perception and cognitive and linguistic abilities. Method A total of 46 adolescents (age range: 11;2–18;8) participated: 18 with moderate-to-severe hearing loss who used HAs, 10 with severe-to-profound hearing loss who used CIs, and 18 with typical hearing (TH). Test materials included the Arabic Narrow Focus Test (ANFT), which includes three subtests assessing identification (ANFT1), comprehension of NF in simple four-word sentences (ANFT2), and longer sentences with a construction list at the clause or noun phrase level (ANFT3). In addition, speech perception, vocabulary, and working memory were assessed. Results All the participants successfully identified the word carrying NF, with no significant difference between the groups. Comprehension of NF in ANFT2 and ANFT3 was reduced for HA and CI users compared with TH peers, and speech perception, hearing status, and memory for digits predicted the variability in the overall results of ANFT1, ANFT2, and ANFT3, respectively. Conclusions Arabic speakers who used HAs or CIs were able to identify NF successfully, suggesting that the acoustic cues were perceptually available to them. However, HA and CI users had considerable difficulty in understanding NF. 
Different factors may contribute to this difficulty, including the memory load during the task as well as pragmatic-linguistic knowledge on the possible meanings of NF.
Article
Arabic controlled vocabularies do not differ from other controlled vocabularies as far as basic features are concerned; however, they bear a number of shortcomings that have limited their effectiveness and dissemination. These include the lack of adaptation to Arab-specific applications and the failure of terms to connote the content of subject areas easily and consistently. Besides, differences and variability in terminology and syntax cause problems in cross-domain or cross-system interoperability. In addition, existing software is unequipped to serve Arabic-speaking libraries in a way that allows them to be technologically comparable with modern libraries around the world, thus limiting their integration into the international library community. This technological shortcoming also limits the ease with which they can make knowledge resources attainable to researchers and other library users. This article proposes a framework for a monolingual web-based terminology management system that operates in Arabic and supports the use of Arabic controlled vocabularies. This article is based on ISO 26162:2012, Systems to manage terminology, knowledge and content — Design, implementation and maintenance of terminology management systems.
Conference Paper
Full-text available
Presently, information retrieval can be accomplished simply and rapidly with the use of search engines. These allow users to specify search criteria as well as specific keywords to obtain the required results. Additionally, a search engine's index has to be updated with the most recent information, as it constantly changes over time. In particular, information retrieval results are often too extensive, which affects searchers' access to the required results. Consequently, a similarity measurement between keywords and index terms is essential to help searchers access the required results promptly. Thus, this paper proposes a similarity measurement method between words based on the Jaccard coefficient. Technically, we implemented the Jaccard similarity measure in the Prolog programming language to compare similarity between sets of data. Furthermore, the performance of the proposed method was evaluated using precision, recall, and F-measure. The test results demonstrated the advantages and disadvantages of the measure as adapted and applied to a search for meaning using the Jaccard similarity coefficient.
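The Jaccard coefficient this abstract builds on is the ratio of shared to total keywords between two sets. The paper implements it in Prolog; a minimal Python sketch of the same measure, with invented keyword sets, is:

```python
def jaccard(a, b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    # Define the similarity of two empty sets as 0.0 to avoid division by zero.
    return len(a & b) / len(union) if union else 0.0
```

For example, the keyword sets {search, engine} and {search, index} share one of three distinct terms, giving a similarity of 1/3.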
Conference Paper
Full-text available
Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the first necessary step for any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of written formats for these spoken languages based on pronunciation. The latter, commonly referred to as under-resourced languages, are not well represented on the Web, and the currently available ALI tools fail to recognize them properly. In this paper, we revisit the problem of ALI with a focus on Arabicized Berber and dialectal Arabic short texts. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.
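The character n-gram approach to ALI mentioned above can be illustrated with the classic ranked-profile method (Cavnar and Trenkle's out-of-place measure): each language is represented by its most frequent character n-grams, and a text is assigned to the language whose ranking it displaces least. The toy training strings below are invented for illustration and are far smaller than the corpora such systems train on.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3):
    """Character n-grams of a text, padded with spaces at the boundaries."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(text: str, n: int = 3, top: int = 300):
    """Ranked list of the most frequent character n-grams."""
    return [g for g, _ in Counter(char_ngrams(text, n)).most_common(top)]

def out_of_place(doc_prof, lang_prof) -> int:
    """Sum of rank displacements; n-grams absent from the language
    profile receive the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_prof)}
    penalty = len(lang_prof)
    return sum(abs(r - ranks[g]) if g in ranks else penalty
               for r, g in enumerate(doc_prof))

def identify(text: str, lang_profiles: dict) -> str:
    """Return the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

The paper's machine-learning-plus-lexicon models are more involved, but this ranking scheme captures why character n-grams separate varieties: even short texts reuse a language's most frequent letter sequences.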
Conference Paper
This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectome-try, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.
Conference Paper
Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD), the language of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts have mixed forms and many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with a considerable overall accuracy of 98%. This work is a first step towards the ultimate goal of a translation system from Arabic to English and French, within the ASMAT project.
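A character bi-gram Naive Bayes classifier of the kind this abstract reports can be sketched as a multinomial model with add-one smoothing. The toy training sentences and dialect labels below are invented for illustration; real systems train on far larger corpora to reach accuracies like the 98% reported.

```python
import math
from collections import Counter, defaultdict

def bigrams(text: str):
    """Character bi-grams of a text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

class CharBigramNB:
    """Multinomial Naive Bayes over character bi-grams, add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # per-class bi-gram counts
        self.priors = Counter(labels)       # class frequencies
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = bigrams(text)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text: str) -> str:
        def log_prob(label):
            c = self.counts[label]
            denom = sum(c.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.priors[label] / sum(self.priors.values()))
            for g in bigrams(text):
                score += math.log((c[g] + 1) / denom)
            return score
        return max(self.priors, key=log_prob)
```

Training amounts to counting bi-grams per class; prediction sums smoothed log-probabilities and picks the highest-scoring dialect, which is why the model stays fast even with 18 classes.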