A Unified Approach for Arabic Language Dialect Detection
Rania R. Ziedan1,2, Michael N. Micheal2, Abdulwahab K. Alsammak2,
Mona F. M. Mursi2 and Adel S. Elmaghraby1
1Computer Engineering and Computer Science, University of Louisville, KY, USA
2Communication and Computer Engineering, Benha University, Egypt
E-mail: rania.ziedan@louisville.edu / rania.ziedan@feng.bu.edu.eg
Abstract
The paralinguistic information in a speech signal includes clues to the ethnic and social background of the speaker. In this paper, we propose a hybrid approach to dialect and accent recognition from spoken Arabic, based on phonotactic and spectral systems separately and then combining both with a decision-fusion technique. We extract speech attribute features that represent acoustic cues of a speaker's dialect to obtain feature streams that are modeled with a Gaussian Mixture Model with a Universal Background Model (GMM-UBM) and with an identity-vector (i-vector) classifier. Moreover, this paper introduces our proposed dataset SARA, a Modern Colloquial Arabic (MCA) dataset that contains three different Arabic dialects and their common accents; this dataset serves as the master dataset for this work. We find that our proposed technique with acoustic features achieves a significant performance improvement over state-of-the-art systems on the Arabic dialects in the dataset.
Keywords: accent/dialect recognition, i-vector, GMM-UBM, SARA colloquial Arabic dataset.
1 Introduction
Recently, human-machine interaction has received increasing attention from various fields such as artificial intelligence, machine learning, and information retrieval. One of the most important challenges in human-machine interaction is the proper understanding of human speech by automated systems. Recognition of speech by a machine permits a deeper interaction between both parties.
Figure 1 shows the most popular speech processing research areas. The fundamental challenge for current research is understanding and modeling the individual variation in dialect and accent based on speech. A dialect refers to linguistic variation within a language, whereas an accent refers to the different ways of pronouncing a language within a community.
Figure 1: Automatic Speech Processing Research Areas
Nonnative speakers pose many problems for the performance of automatic speech recognition (ASR) systems. In addition, nonnative speakers adversely affect the performance of speaker verification systems because of systematic shifts in score distributions relative to native speakers. Therefore, knowing the nativeness of a speaker would enable adaptation techniques to mitigate the mismatch between the training and the test data. Moreover, identifying the nativeness of the speaker is useful in many intelligent applications.
After Chinese, Spanish, and English, Arabic is the fourth most commonly spoken language in the world, with more than 230 million native speakers [2], and it is the official language of more than 22 countries. In addition, the world's Muslims, around one billion people, also use Arabic as a religious language; however, it is not always spoken the way it is written. Modern Standard Arabic (MSA) is usually used as the written form of the official Arabic language; it is the language of literature, books, newspapers, and official documents, whereas the spoken form varies with the geographical region. Little research has been done on processing Arabic speech, especially the Arabic dialects. For the Arabic language, in addition to MSA there are a number of regional dialects. These dialects are mainly grouped by geographical region into the Maghrebi group, the Sudanese group, the Egyptian group (EGY), the Arabian Peninsula group (ARP), the Iraqi group (IRQ), and the Levantine group (LEV); each group includes several accents. For Arabic dialect recognition, two important challenges must be considered, which are 1) the
dialects have no standards, and 2) there is a large gap between MSA and the dialects, which has led some research studies to classify the Arabic dialects as separate languages. In dialect recognition systems, the weight of a dialect feature depends on its distance from the standard pronunciation and on the frequency of that feature in the speech. The differences between dialects fall into two parts: differences in phonetic transcription and differences in the acoustical intonation of the dialects. The differences in phonetic transcription can be categorized into two classes: differences in the number and identity of the phonemes, and differences in phonetic realization such as phoneme substitution, deletion, and insertion. Examples of phonetic transcription differences across Arabic accents are given in [7]. In addition to the differences in phonetic transcription, there are four classes of acoustic correlates of dialects: formant correlates, pitch and prosody correlates, timing correlates, and laryngeal (glottal) correlates. In this paper, we focus on recognizing the Arabic dialects based on acoustic feature extraction and classification techniques.
The rest of the paper is organized as follows. In Section 2, related work is described. In Section 3, we present the proposed dialect and accent recognition approach. The proposed dialect dataset and its most popular regional accents are given in Section 4. For the acoustic features, the experimental environment is presented in Section 5, followed by the results in Section 6. Finally, Section 7 presents the conclusion, and Section 8 outlines future work.
2 Related Work
This section introduces the challenges in speech and dialect recognition systems. The first challenge is the selection of suitable features that represent the dialect differences; various feature extraction algorithms are used in speech dialect analysis, differing in their contexts, meanings, and configurations:
1- Mel Frequency Cepstral Coefficients (MFCC): the most popular feature extraction technique in speech recognition. It operates in the frequency domain using the Mel scale, which models the frequency resolution of the human ear [10, 13]. These coefficients are robust and reliable under variations in speakers and recording conditions (a sketch of MFCC, SDC, and RASTA extraction follows this list).
2- Shifted Delta Cepstral coefficients (SDC): created by stacking delta cepstra computed across multiple speech frames, based on the idea that the feature vectors should include temporal information spanning a large number of frames [16]. Four parameters define the SDC features: the number of cepstral coefficients computed at each frame N, the time delay and advance for computing the deltas d, the number of delta blocks that are concatenated k, and the time shift between consecutive blocks P.
3- Perceptual Linear Prediction (PLP): a model developed by Hermansky [14]. PLP enhances the speech recognition rate by discarding irrelevant information in the speech.
4- Relative Spectra Filtering (RASTA): in addition to PLP, Hermansky developed RASTA, in which the conventional critical-band short-term spectrum of PLP is replaced by a spectral estimate whose frequency bands are band-pass filtered through a filter with a sharp spectral zero at zero frequency, in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel [15, 21].
5- RASTA-PLP [15]: a speech feature representation that is a hybrid of the RASTA and PLP steps.
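For concreteness, the following is a minimal Python sketch of the feature extraction techniques above, assuming the numpy, librosa, and scipy packages; the function names and the RASTA pole value (0.94, a commonly used choice) are our illustrative assumptions, not the paper's exact implementation.

import numpy as np
import librosa
from scipy.signal import lfilter

def mfcc_feats(y, sr, n_mfcc=12):
    # 12 static MFCCs plus log energy, as used in Section 5
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    log_e = np.log(librosa.feature.rms(y=y) + 1e-10)         # (1, T)
    feats = np.vstack([mfcc, log_e])
    delta = librosa.feature.delta(feats)                     # first derivative
    ddelta = librosa.feature.delta(feats, order=2)           # second derivative
    return feats, delta, ddelta

def cmvn(feats):
    # cepstral mean and variance normalization over the utterance
    mu = feats.mean(axis=1, keepdims=True)
    sd = feats.std(axis=1, keepdims=True) + 1e-10
    return (feats - mu) / sd

def sdc(cep, d=1, P=3, k=7):
    # Shifted Delta Cepstra: for each frame, concatenate k delta blocks
    # spaced P frames apart; each delta uses a time offset of +/- d.
    # cep: (N, T) matrix of N cepstral coefficients per frame.
    N, T = cep.shape
    out = np.zeros((N * k, T))
    for t in range(T):
        for i in range(k):
            t0 = t + i * P
            a = cep[:, min(t0 + d, T - 1)]                   # clamp at edges
            b = cep[:, max(t0 - d, 0)]
            out[i * N:(i + 1) * N, t] = a - b
    return out

def rasta_filter(log_spec):
    # RASTA band-pass filtering of each log spectral trajectory [15]:
    # H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.94 z^-1)
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.94])
    return lfilter(b, a, log_spec, axis=1)

With the (7,1,3,7) configuration used in Section 5, sdc would be applied to the first 7 cepstral coefficients, giving a 49-dimensional vector per frame.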
The second challenge is building a classifier model that is able to handle and efficiently combine the heterogeneous structure of the acoustic features:
1- GMM-UBM: The GMM is one of the most popular classifiers for speaker recognition because of its capability to represent a large class of sample distributions; its components can be considered a model of the underlying broad phonetic sounds that characterize a person's voice [19]. The GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm, which maximizes the likelihood of the GMM given the training data. In speech applications, adapting the acoustic models to new operating conditions is important because of data variability due to different speakers, environments, speaking styles, and so on [18]. The universal background model (UBM) is an M-component GMM parameterized by {w_m, μ_m, Σ_m}, m = 1, ..., M, where w_m, μ_m, and Σ_m are the mixture weight, mean vector, and covariance matrix of the m-th component; these parameters are adapted to a specific speaker using a maximum a posteriori (MAP) scheme. The basic idea of the adaptation approach is that, when a new speaker is enrolled in the system, the speaker's model is derived by updating the well-trained parameters of the UBM via adaptation [20]. This provides a tighter coupling between the speaker's model and the UBM, which produces better performance and allows a faster scoring technique than decoupled models (a sketch follows this list).
2- I-vector: Recently, Dehak [11] developed a new classifier that uses Joint Factor Analysis (JFA) as a feature extractor. The idea is to find a low-dimensional subspace of the GMM supervector space, named the total variability space, that represents speaker and channel variability. The vectors in this low-dimensional space are called i-vectors; they are classified with the PLDA classifier introduced by Kenny [17] for speaker verification systems. The i-vector representation has a small size, which reduces the execution time of recognition while keeping the recognition rates acceptable. The i-vector classifier has been tested with different accents in [4, 5, 12]; in [6] it was used with some Arabic speakers to test Finnish language proficiency. To our knowledge, the only work that used the i-vector classifier for an Arabic dialect is that of Boulkenafet et al. [9] on the Algerian Arabic dialect, one of the Maghrebi group dialects (a sketch of i-vector extraction and PLDA scoring also follows this list).
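To make the two classifiers concrete, the following are minimal sketches, assuming numpy, scipy, and scikit-learn. The first trains a UBM with EM and performs MAP adaptation of the component means only, a common simplification of the full scheme in [20]; the relevance factor r = 16 and the component count are illustrative assumptions, not the paper's configuration (which uses 1024 mixtures).

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_feats, n_components=64):
    # pooled_feats: (n_frames, dim) frames pooled over all training data
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=100)
    ubm.fit(pooled_feats)
    return ubm

def map_adapt_means(ubm, feats, r=16.0):
    # MAP adaptation of the UBM means to one dialect's data (means only);
    # r is the relevance factor controlling the strength of the UBM prior.
    post = ubm.predict_proba(feats)                      # (T, M) responsibilities
    n = post.sum(axis=0)                                 # soft counts per component
    ex = post.T @ feats / np.maximum(n[:, None], 1e-10)  # per-component data means
    alpha = (n / (n + r))[:, None]                       # adaptation coefficients
    return alpha * ex + (1 - alpha) * ubm.means_

def llr_score(ubm, adapted_means, feats):
    # fast scoring: log-likelihood ratio of the adapted model against the UBM
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    gmm.weights_, gmm.covariances_ = ubm.weights_, ubm.covariances_
    gmm.means_, gmm.precisions_cholesky_ = adapted_means, ubm.precisions_cholesky_
    return gmm.score(feats) - ubm.score(feats)

The second sketch, continuing from the imports above, extracts an i-vector as the posterior mean of the total-variability factor, given Baum-Welch statistics collected against the UBM and an already-trained total-variability matrix T (training T itself by EM is omitted), and scores a pair of i-vectors with a two-covariance Gaussian PLDA likelihood ratio; the i-vectors are assumed to be mean-centered.

from scipy.stats import multivariate_normal as mvn

def extract_ivector(T_mat, ubm_means, ubm_covs, post, feats):
    # T_mat: (M*D, R) total-variability matrix; post: (T, M) responsibilities;
    # feats: (T, D); ubm_means: (M, D); ubm_covs: (M, D) diagonal covariances.
    M, D = ubm_means.shape
    R = T_mat.shape[1]
    n = post.sum(axis=0)                                 # zero-order statistics
    f = post.T @ feats - n[:, None] * ubm_means          # centered first-order stats
    L = np.eye(R)
    b = np.zeros(R)
    for m in range(M):
        Tm = T_mat[m * D:(m + 1) * D, :]                 # (D, R) block, component m
        inv_cov = 1.0 / ubm_covs[m]
        L += n[m] * Tm.T @ (inv_cov[:, None] * Tm)
        b += Tm.T @ (inv_cov * f[m])
    return np.linalg.solve(L, b)                         # i-vector, dimension R

def plda_llr(w1, w2, Phi, Sigma):
    # Log-likelihood ratio that w1 and w2 share the same latent class under
    # Gaussian PLDA; Phi: between-class covariance, Sigma: within-class covariance.
    d = len(w1)
    tot = Phi + Sigma
    joint = np.block([[tot, Phi], [Phi, tot]])
    same = mvn.logpdf(np.concatenate([w1, w2]), mean=np.zeros(2 * d), cov=joint)
    diff = mvn.logpdf(w1, np.zeros(d), tot) + mvn.logpdf(w2, np.zeros(d), tot)
    return same - diff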
In addition, the proper selection of a speech corpus to evaluate a dialect/accent detection system's performance is itself a challenge. The Linguistic Data Consortium (LDC), an open consortium of universities, libraries, corporations, and government research laboratories, has been one of the most common speech/text database providers since 1996. In addition to the LDC, the European Language Resource Association (ELRA) has made a great effort in language resources for Human Language Technology (HLT) since 2007. Moreover, some research efforts have collected dialectal Arabic corpora, mainly for academic use in dialect/accent recognition systems. The survey in [3] discusses the recording conditions and specifications of the existing Arabic MSA and dialectal Arabic corpora.
3 Proposed Approach
In this research, we propose an approach for recognizing the Arabic dialects from speech. The approach is divided into two systems, as shown in Figure 2: one based on acoustic features and the other based on phonetic features. The first system is responsible for speech signal analysis to extract the acoustic features that discriminate the Arabic regional dialects. In the training stage, these extracted features are used to build a dialect/accent model in the Acoustic Accent Lexicon. In the test phase, the extracted features are used to determine the speaker's nativity using a mapping technique. The second system is based on a phonotactic representation of the speech: a phonetic pattern for the dialect/accent distinctive words is trained, and the phonetic model for each accent word is saved in a lexicon. Then, in the recognition phase, the system searches for the matching accent phonetic pattern and recognizes the accent. Finally, the decision is taken by score-level fusion of the acoustic and phonetic systems' decisions, as sketched below.
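As an illustration of this final step, the following is a minimal sketch of score-level fusion, assuming each subsystem outputs one score per dialect; the weight alpha is a hypothetical tuning parameter (e.g., selected on held-out data), not a value fixed by this paper.

import numpy as np

def fuse_scores(acoustic, phonetic, alpha=0.6):
    # acoustic, phonetic: (n_dialects,) per-dialect score vectors.
    # Normalize each stream so the two score ranges are comparable.
    a = (acoustic - acoustic.mean()) / (acoustic.std() + 1e-10)
    p = (phonetic - phonetic.mean()) / (phonetic.std() + 1e-10)
    fused = alpha * a + (1 - alpha) * p
    return int(np.argmax(fused)), fused   # predicted dialect index, fused scores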
Figure 2: Proposed Approach
To our knowledge, the only research that recognized the Arabic dialects based on the acoustic and phonotactic characteristics of speech separately was proposed by Boril et al. [8], in a study that included two parts. The first part identified the acoustic characteristics of the dialect based on the spoken part and the silent part of the speech. The second part focused on phonotactic dialect modeling that utilizes phone recognizers and support vector machines (PRSVM). However, Boril et al. used English and Hindi phoneme recognizers to identify the dialectal phonetic characteristics of the Arabic language, whereas in our proposed approach we intend to use an Arabic phone recognizer for recognizing the Arabic phonemes.
4 Proposed Dataset
A speech corpus is collected to be spontaneous, canonical, or both [1]. A spontaneous speech corpus is collected from the real world, from human-human or human-machine communication. For this kind of corpus, the speaker does not read prompts; rather, he or she speaks naturally to convey a message and/or obtain information, and the speech is expected to be natural and not affected by reading habits. In a canonical speech corpus, by contrast, the speakers must follow certain procedures, including the reading of prompts, in order to collect specific speech sounds. The speech content can be words or sentences, and the sentences may be phrases or phonetically rich sentences. A spontaneous speech corpus is suitable for language understanding and dialogue design; it tends to include unneeded, frequently repeated words and utterances, but does not necessarily include all the sounds of the language under investigation. A canonical speech corpus, however, tends to be phonetically rich: all the sounds of the language are presented in various phonotactic positions.
In this paper, we propose a database that consists of a set of spontaneous, not pre-specified colloquial phrases from everyday life situations, collected from media shows, episodes, and films published on YouTube and spoken by native speakers of three different Arabic dialects and their accents. We name the proposed dataset the
Spoken Arabic Regional Archive (SARA). In contrast to Voice over IP (VoIP), which is defined as the real-time delivery of voice packets through the network using internet protocols, the media for SARA are downloaded from YouTube and stored before the sampling process. Therefore, the live-transmission quality-of-service (QoS) problems that VoIP suffers from, such as delay, delay jitter, and unreliable packet delivery, do not affect our proposed dataset. The database can be used to train speech-processing systems such as automatic speech recognition, speaker verification/recognition, and dialect/accent recognition. The SARA dataset contains only adult speakers, to avoid the improper pronunciation of children, which could affect the detection process. The proposed database contains male and female speech for three Arabic regional dialects: the Egyptian dialect (EGY), the Arabian Peninsula dialect (ARP), and the Levantine dialect (LEV). In addition, each dialect group includes a number of different accents. Each utterance contains speech from a single speaker; the number of speakers within a dialect and the number of utterances per speaker are unknown. The categorization of the proposed dialects and their most popular accents in the dataset is shown in Figure 3. In addition, the three dialects are summarized with counts in Table 1.
Figure 3: Proposed corpus distribution
Table 1: Corpus Utterances’ Dialects Distribution
Dialect Male Female Total
EGY 877 488 1365
ARP 657 573 1230
LEV 583 662 1245
For EGY, the male and female speech samples are distributed across the accents as in Table 2. Tables 3, 4, and 5 show the distribution of the ARP dialect and its common accents in the proposed corpus. Table 6 shows the LEV dialect and accent distribution. In addition, Table 7 shows that the dataset samples vary in length, in order to determine the minimum time in which the speaker's dialect or accent can be identified when the speaker speaks freely.
Table 2: EGY Accents Distribution
Accent Male Female
Egyptian 623 221
Sa’idi 254 267
Table 3: ARP Accents Distribution
Accent Male Female
Saudi 180 207
Gulf 477 366
Table 4: Saudi Accents
Accent Male Female
Saudi 143 203
Hejazi 29 0
Najdi 8 4
Table 5: Gulf Accents
Accent Male Female
Bahrani 335 282
Gulf 64 46
Omani 78 38
Table 6: LEV Accents Distribution
Accent Male Female
Syrian 46 261
Lebanese 76 341
Jordanian 362 42
Palestinian 99 18
Table 7: Dialectal utterances' length distribution
Length EGY ARP LEV Total
3 sec 468 388 398 1254
4 sec 377 409 354 1140
5 sec 271 269 246 786
6 sec 166 117 158 441
7 sec 83 47 89 219
Total 1365 1230 1245 3840
5 Experiment Environment
In this paper, the proposed acoustic system shown
in figure 2 is used to identify the different dialects
of SARA. The samples used in this experiment are
divided into approximately 70% and 30% for training
and testing. These samples are gender-independent to
identify the dialect features regardless of the speaker
gender features. The dialect acoustic features are ex-
tracted using 12 MFCC coefficients with the log energy
168
component (a), delta MFCC (c), double delta MFCC
(e) and SDC with parameters (7,1,3,7) (g), and these
techniques are used with and without cepstral mean and
variance normalization. These features normalization
form are denoted as (b, d, f and h) respectively. In
addition, we also use PLP (w), double delta PLP
(x), RASTA (y) and Rasta with PLP coefficient (z)
techniques with PLP order =12. Moreover, these
features are mainly classified using the classification
technique GMM-UBM with 1024 mixtures. In addition,
the I-vector is used with total variability equals 60
and classified by the PLDA classifier. The result is
evaluated by computing EER, which is the rate at which
both acceptance and rejection errors are equal.
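For reference, the following is a minimal sketch of EER computation from detection scores, assuming numpy and scikit-learn; the label and score arrays are placeholders for the per-trial outputs of either classifier, not this paper's actual results.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for target-dialect trials, 0 otherwise;
    # scores: higher values mean more target-like.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # the EER is the operating point where false acceptance equals false rejection
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2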
6 Experimental Results
From the results, the best recognition rates are achieved when the utterance length is more than 4 seconds. For most feature techniques, the EER decreases as the sample length increases. In the 3-second case, the EER increases by up to 2% with GMM-UBM and up to 4% with the i-vector, depending on the feature extraction technique. As shown in Tables 8 and 9, when using GMM-UBM the best accuracy is found with normalized delta MFCC, whose EER varies from 14.2% to 5.6% depending on sample length. From the experiments, we noticed that for utterances longer than 6 seconds with MFCC features, the GMM classifier misclassified all the test samples as the Egyptian dialect; the GMM-UBM produced the same classification result. Table 8 shows that using the normalized features solves this problem for long utterances. Tables 10 and 11 show the EER of the i-vector classifier: the MFCC, delta MFCC, and double delta MFCC features give comparable EER values, varying from 20% to 11% depending on sample length. As with GMM-UBM, using RASTA and RASTA-PLP gives the worst results, with EER varying from 28% to 14% depending on sample length. It is notable that the RASTA and RASTA-PLP features gave worse results than the PLP features in this continuous dialectal-speech application, in contrast to speech recognition applications.
Table 8: GMM-UBM classifier EER %
a b c d e f
3sec 14.2 16.3 12.5 13.8 12.5 12.3
4sec 14.8 16.7 14.2 14.2 13.8 15.4
5sec 10 13.3 10.7 13 11.3 13
6sec 12.2 15 11.1 11.1 11.7 12.2
7sec 66.7 13.9 66.7 5.6 66.7 9.7
Table 9: GMM-UBM classifier EER %
g h w x y z
3sec 22.9 20 16.9 14.6 25 26.3
4sec 21.7 19.6 17.1 14.6 25 25
5sec 21 15.7 14.7 13.7 26 24.7
6sec 23.3 21.1 15.6 14.4 26.7 23.3
7sec 16.7 19.4 16.7 12.5 15.3 19.4
Table 10: PLDA classifier with I-vector EER %
a b c d e f
3sec 19.2 20.8 19.6 19.2 18.8 20
4sec 18.8 20 18.3 18.8 17.5 18.8
5sec 15 18 16 16.7 16 16.7
6sec 16.1 11.1 14.4 13.3 13.3 15
7sec 18.1 13.9 16.7 13.9 13.9 13.9
Table 11: PLDA classifier with I-vector EER %
g h w x y z
3sec 26.3 28.8 22.1 20.4 27.7 28
4sec 22.5 19.6 19.4 19.2 23.8 23.3
5sec 20 21.3 16.3 18 23 23.3
6sec 23.3 21.1 18.9 17.8 22.8 21.1
7sec 22.2 19.4 19.4 15.3 13.9 18.1
7 Conclusion
In this paper, we introduced a new approach to dialect/accent recognition based on the fusion of acoustic and phonetic features. We also described the specifications of our proposed dialectal Arabic dataset, SARA. In addition, the i-vector technique previously applied to English, British, and Finnish accent recognition was used to classify the Arabic dialects and compared with GMM-UBM.
8 Future Work
We believe that there is significant room for improvement by including factors that we did not explicitly model in this paper, such as gender-dependent recognition and recognizing the speaker's nationality rather than only their region. Additionally, we plan to apply fusion techniques at the feature level to improve the recognition rate. We will also make our SARA dataset available to researchers for comparative studies, and we plan to expand the archive with other dialects.
References
[1] Mansour Alghamdi, Fayez Alhargan, Mohammed Alkanhal, Ashraf Alkhairy, Munir Eldesouki, and Ammar Alenazi. Saudi accented arabic voice bank. Journal of King Saud University - Computer and Information Sciences, 20:45-64, 2008.
[2] Khalid Almeman and Mark Lee. A comparison of arabic speech recognition for multi-dialect vs. specific dialects. In Proceedings of the Seventh International Conference on Speech Technology and Human-Computer Dialogue (SpeD 2013), Cluj-Napoca, Romania, pages 16-19, 2013.
[3] Khalid Almeman, Mark Lee, and Ali Abdulrahman Almiman. Multi dialect arabic speech parallel corpora. In Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on, pages 1-6. IEEE, 2013.
[4] Mohamad Hasan Bahari, Rahim Saeidi, David
Van Leeuwen, et al. Accent recognition using i-
vector, gaussian mean supervector and gaussian
posterior probability supervector for spontaneous
telephone speech. In Acoustics, Speech and Signal
Processing (ICASSP), 2013 IEEE International
Conference on, pages 7344–7348. IEEE, 2013.
[5] Hamid Behravan, Ville Hautamäki, and Tomi Kinnunen. Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66:118-129, 2015.
[6] Hamid Behravan, Ville Hautamäki, Sabato Marco Siniscalchi, Tomi Kinnunen, and Chin-Hui Lee. Introducing attribute features to foreign accent recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 5332-5336. IEEE, 2014.
[7] Fadi Biadsy, Julia Hirschberg, and Nizar Habash. Spoken arabic dialect identification using phonotactic modeling. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 53-61. Association for Computational Linguistics, 2009.
[8] Hynek Boril, Abhijeet Sangwan, and John H. L. Hansen. Arabic dialect identification - 'Is the secret in the silence?' and other observations. In INTERSPEECH, pages 30-33, 2012.
[9] Z Boulkenafet, Messaoud Bengherabi, Farid
Harizi, Omar Nouali, and Cheriet Mohamed.
Forensic evidence reporting using gmm-ubm, jfa
and i-vector methods: Application to algerian
arabic dialect. In Image and Signal Processing
and Analysis (ISPA), 2013 8th International
Symposium on, pages 404–409. IEEE, 2013.
[10] Namrata Dave. Feature extraction methods lpc,
plp and mfcc in speech recognition. International
Journal for Advance Research in Engineering and
Technology, 1(6):1–4, 2013.
[11] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Brümmer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Interspeech, volume 9, pages 1559-1562, 2009.
[12] Andrea DeMarco and Stephen J Cox. Native
accent classification via i-vectors and speaker
compensation fusion. In INTERSPEECH, pages
1472–1476, 2013.
[13] Taabish Gulzar, Anand Singh, and Sandeep
Sharma. Comparative analysis of lpcc, mfcc
and bfcc for the recognition of hindi words using
artificial neural networks. International Journal of
Computer Applications, 101(12):22–27, 2014.
[14] Hynek Hermansky. Perceptual linear predictive
(plp) analysis of speech. the Journal of the
Acoustical Society of America, 87(4):1738–1752,
1990.
[15] Hynek Hermansky and Nelson Morgan. Rasta
processing of speech. Speech and Audio Processing,
IEEE Transactions on, 2(4):578–589, 1994.
[16] Herman Kamper and Thomas Niesler. A
literature review of language, dialect and accent
identification. Report, 2012.
[17] Patrick Kenny. Bayesian speaker verification with
heavy-tailed priors. In Odyssey, page 14, 2010.
[18] Tomi Kinnunen and Haizhou Li. An overview
of text-independent speaker recognition: From
features to supervectors. Speech communication,
52(1):12–40, 2010.
[19] Douglas Reynolds. Gaussian mixture models.
Encyclopedia of Biometrics, pages 827–832, 2015.
[20] Douglas A Reynolds, Thomas F Quatieri, and
Robert B Dunn. Speaker verification using adapted
gaussian mixture models. Digital signal processing,
10(1):19–41, 2000.
[21] Urmila Shrawankar and Vilas M Thakare. Techniques for feature extraction in speech recognition system: A comparative study. arXiv preprint arXiv:1305.1145, 2013.