Accent Identification by Combining Deep Neural Networks and Recurrent Neural Networks Trained on Long and Short Term Features

Yishan Jiao1, Ming Tu1, Visar Berisha1,2, Julie Liss1
1Department of Speech and Hearing Science
2School of Electrical, Computer, and Energy Engineering
Arizona State University
{yjiao16, mingtu, visar, julie.liss}
Abstract
Automatic identification of foreign accents is valuable for many
speech systems, such as speech recognition, speaker identifica-
tion, voice conversion, etc. The INTERSPEECH 2016 Native
Language Sub-Challenge is to identify the native languages of
non-native English speakers from eleven countries. Since dif-
ferences in accent are due to both prosodic and articulation char-
acteristics, a combination of long-term and short-term training
is proposed in this paper. Each speech sample is processed into
multiple speech segments with equal length. For each segment,
deep neural networks (DNNs) are used to train on long-term
statistical features, while recurrent neural networks (RNNs) are
used to train on short-term acoustic features. The result for each
speech sample is calculated by linearly fusing the results from
the two sets of networks on all segments. The performance
of the proposed system greatly surpasses the provided baseline
system. Moreover, by fusing the results with the baseline sys-
tem, the performance can be further improved.
Index Terms: accent identification, deep neural networks,
prosody, articulation
1. Introduction
Accent classification refers to the problem of inferring the
native language of a speaker from his or her foreign accented
speech. Identifying idiosyncratic differences in speech produc-
tion is important for improving the robustness of existing speech
analysis systems. For example, automatic speech recognition
(ASR) systems exhibit lower performance when evaluated on
foreign accented speech. By developing pre-processing algo-
rithms that identify the accent, these systems can be modified
to customize the recognition algorithm to the particular accent
[1] [2]. In addition to ASR applications, accent identification
is also useful for forensic speaker profiling, by identifying the
speaker’s regional origin and ethnicity, and in applications involving
targeted marketing [3] [4]. In this paper we propose a method
for classification of 11 accents directly from the speech acoustics.
A number of studies have analyzed how elemental compo-
nents of speech change with accent. Spectral features (e.g. for-
mant frequencies) and temporal features (e.g. intonation and
durations) have all been shown to vary with accent [5] [6].
These features have been combined in various statistical models
and machine learning methods to automate the accent classifi-
cation task. Gaussian Mixture Models (GMMs) and Hidden
Markov Models (HMMs) are commonly used approaches in
many earlier studies [7] [8] [9]. For example, Deshpande et al.
used GMMs based on formant frequency features to discrimi-
nate between standard American English and Indian accented
English [7]. Chen et al. explored the effect of the number of
components in GMMs on classification performance [10]. Tang
and Ghorbani compared the performance of HMMs with Sup-
port Vector Machine (SVM) for accent classification [11]. Oth-
ers have also considered linear models. Ghesquiere et al. used
both formant frequencies and duration features and proposed
an “eigenvoice” approach for Flemish accent identification [8].
Kumpf and King proposed to use linear discriminant analysis
(LDA) for identification of three accents in Australian English [12].
Artificial neural networks, especially Deep Neural Networks
(DNNs) and Recurrent Neural Networks (RNNs), have
been widely used in state-of-the-art speech systems [13] [14]
[15] [16]; however, in the area of accent identification, there are
only a few studies evaluating the performance of neural net-
works [17] [18]. Nonetheless, in a related area, language identi-
fication (LID), neural networks have been investigated exhaus-
tively [19] [20] [21]. A recent study in this area explored the
use of recurrent neural networks for automatic language iden-
tification [22]. Their study also suggests that the combination
of recurrent and deep networks can lead to significant improve-
ments in performance. Inspired by this work, in this paper, we
propose a system that combines DNNs and RNNs. In contrast
to the work in [22], we propose to take advantage of both long-
term and short-term features since previous work shows that
foreign accents depend on both long-term prosodic features and
short-term articulation features. The final prediction is obtained
by linearly fusing the results from the two neural networks.
The organization of this paper is as follows. Section 2
briefly describes the goal, the dataset and the baseline system
for the INTERSPEECH 16 Native Language Sub-Challenge.
Section 3 introduces the proposed system that combines long
and short term features using DNNs and RNNs. The corre-
sponding experimental setup is also described in this section.
The evaluation results are shown in Section 4. The discussion
and the conclusion are in Section 5.
2. Dataset and the Baseline System
The provided dataset for the INTERSPEECH 16 Native Lan-
guage Sub-Challenge contains a training, a development, and
a test set. The corpus contains one speech sample from each of 5132
speakers, labeled with one of the 11 native languages. The
training and development sets are assigned 3300 and 965
samples, respectively. The remaining 867 samples are assigned
to the test set. The length of each sample is 45 seconds. A
detailed description of the dataset can be found in the baseline
paper [23].
Table 1: Confusion matrix of the baseline system on the development
set. Rows are reference, and columns are hypothesis.
     ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR
ARA   31    3    6    7    5    5    6    5    5    6    7
CHI    4   38    5    4    5    2    5    9    7    4    1
FRE   11    7   29    9    0    5    3    1    9    0    6
GER    4    4    5   54    1    7    2    3    6    1    0
HIN    3    2    2    0   48    2    1    2    2   21    0
ITA    6    3    8    7    6   46    0    3   10    1    4
JPN    4   13    5    2    2    1   35   11   10    1    1
KOR    3   20    1    3    2    3   13   31    5    3    6
SPA    6   11   15    6    2    4    9    8   33    1    5
TEL    2    0    3    2   24    2    3    1    2   42    2
TUR    6    4    4    6    2    6    7    8    5    0   47
The goal of the Native Language Sub-Challenge is to
identify the corresponding native language from the accented
speech. The challenge is particularly difficult for two reasons.
First, all of the speech samples were recorded with babble
background noise using low-quality head-mounted microphones.
Second, in addition to accent differences, a large number of the
speakers were not perfectly fluent in English; therefore, there
were a number of pauses and linguistic fillers in the speech. In
our proposed system we try to address these challenges by using
a voice activity detector (VAD) to remove the pauses and
using a non-linear learning algorithm to model the relationship
between the features and the class label.
The baseline system against which we compare used 6373
long-term features extracted from each speech sample with
openSMILE [24]; these include prosodic features (range, maxi-
mum, minimum of F0, sub-band energies, peaks, etc.) and var-
ious statistics of traditional acoustic features (mean, standard
deviation, kurtosis of MFCC, RASTA, etc.). A support vector
machine (SVM) is constructed to model the data. More detail
about the baseline system can be found in [23]. The performance
of the baseline system on the development set is shown as a
confusion matrix in Table 1. The overall accuracy is 44.66%.
The recall for each class and the unweighted average recall
(UAR) are shown in the second column of Table 2.
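As a worked illustration of the two metrics used throughout the paper, the sketch below computes per-class recall and UAR from the baseline confusion matrix in Table 1. The function names are hypothetical and the matrix is copied from the table as printed, so the result should land close to, but not necessarily exactly on, the values reported in Table 2.

```python
import numpy as np

# Baseline confusion matrix on the development set, copied from Table 1.
# Rows are the reference language, columns the hypothesis, in the same order.
LANGS = ["ARA", "CHI", "FRE", "GER", "HIN", "ITA", "JPN", "KOR", "SPA", "TEL", "TUR"]
CONFUSION = np.array([
    [31,  3,  6,  7,  5,  5,  6,  5,  5,  6,  7],
    [ 4, 38,  5,  4,  5,  2,  5,  9,  7,  4,  1],
    [11,  7, 29,  9,  0,  5,  3,  1,  9,  0,  6],
    [ 4,  4,  5, 54,  1,  7,  2,  3,  6,  1,  0],
    [ 3,  2,  2,  0, 48,  2,  1,  2,  2, 21,  0],
    [ 6,  3,  8,  7,  6, 46,  0,  3, 10,  1,  4],
    [ 4, 13,  5,  2,  2,  1, 35, 11, 10,  1,  1],
    [ 3, 20,  1,  3,  2,  3, 13, 31,  5,  3,  6],
    [ 6, 11, 15,  6,  2,  4,  9,  8, 33,  1,  5],
    [ 2,  0,  3,  2, 24,  2,  3,  1,  2, 42,  2],
    [ 6,  4,  4,  6,  2,  6,  7,  8,  5,  0, 47],
])


def per_class_recall(confusion):
    """Recall of each class: correct predictions divided by the row total."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)


def unweighted_average_recall(confusion):
    """UAR: plain mean of the per-class recalls, ignoring class sizes."""
    return float(per_class_recall(confusion).mean())


recalls = per_class_recall(CONFUSION)
uar = unweighted_average_recall(CONFUSION)
for lang, recall in zip(LANGS, recalls):
    print(f"{lang}: {100 * recall:.1f}%")
print(f"UAR: {100 * uar:.1f}%")
```

Because UAR averages the recalls rather than pooling all samples, a class with few samples counts as much as a large one, which is why it is reported alongside plain accuracy.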
3. Proposed System Description and Experimental Setup
The proposed system is shown in Figure 1. It consists of a voice
activity detector, followed by two parallel neural networks (a
DNN and an RNN) analyzing the speech samples at different
scales, and a probabilistic fusion algorithm. Below we describe
each component of the model.
Voice Activity Detection: As mentioned previously, there are
a number of pauses and silences in the speech samples. These
were often due to the fact that some of the speakers did not
speak fluent English and paused to think of the proper expres-
sion. We first used voice activity detection (VAD) [25] to re-
move the silence periods. The VAD threshold was adjusted
to match the noise level of the speech samples using cross-
validation and we only removed the detected silence segments
with length longer than 300 milliseconds.
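The silence-removal rule above can be sketched as follows. This is only an illustration, not the VAD of [25]: it assumes some detector has already produced one speech/non-speech decision per frame, and the 10 ms frame hop and the function name are hypothetical choices.

```python
import numpy as np


def drop_long_silences(is_speech, frame_ms=10, min_silence_ms=300):
    """Return a boolean keep-mask that removes silence runs longer than min_silence_ms.

    is_speech: 1-D boolean sequence, one VAD decision per frame (True = speech).
    frame_ms:  assumed frame hop of the VAD in milliseconds (hypothetical value).
    """
    is_speech = np.asarray(is_speech, dtype=bool)
    keep = np.ones(len(is_speech), dtype=bool)
    max_frames = min_silence_ms // frame_ms  # runs up to this length are kept
    start = None
    # A trailing True acts as a sentinel so a silence run at the end is closed too.
    for idx, speech in enumerate(np.append(is_speech, True)):
        if not speech and start is None:
            start = idx                      # a silence run begins
        elif speech and start is not None:
            if idx - start > max_frames:     # only long pauses are removed
                keep[start:idx] = False
            start = None
    return keep


# 50 ms speech, 400 ms pause (removed), 50 ms speech, 100 ms pause (kept), 50 ms speech
mask = [True] * 5 + [False] * 40 + [True] * 5 + [False] * 10 + [True] * 5
keep = drop_long_silences(mask)
```

Keeping short pauses preserves natural gaps such as stop closures, while the long hesitations of non-fluent speakers are dropped.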
Figure 1: The proposed system of combining long and short
term features using DNNs and RNNs. (Diagram labels: one speech
sample (45 s) split into S segments of 4 s each; long-term features;
short-term features; hidden layers.)
Framing and Feature Extraction: The remaining speech samples
were then split into multiple segments with an equal
length of 4 seconds. Thus every 45-second speech sample was
segmented into approximately 10-11 parts. The long-term features
we used were the same as those in the baseline system (mean,
standard deviation, kurtosis of MFCC, RASTA, etc.). They
were extracted from each segment in each speech sample with
openSMILE scripts. Each 4-sec window was further split into
25ms windows with a 10ms overlap. Short-term features were
extracted from each 25ms signal. Specifically, we used 39th-order
mel-scale filterbank features with logarithmic compression [26].
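The two levels of framing described above can be sketched with NumPy alone. The 16 kHz sample rate, the function names and the choice to drop a trailing partial segment are assumptions made for illustration; the overlap is taken literally from the text (25 ms windows overlapping by 10 ms), and the filterbank computation itself is omitted.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_s=4.0):
    """Cut a 1-D signal into consecutive, non-overlapping segments of equal length.

    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    signal = np.asarray(signal)
    seg_len = int(round(segment_s * sample_rate))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def frame_segment(segment, sample_rate, window_ms=25.0, overlap_ms=10.0):
    """Slice one segment into short windows that overlap by overlap_ms."""
    win = int(round(window_ms * sample_rate / 1000))
    hop = win - int(round(overlap_ms * sample_rate / 1000))
    n_frames = 1 + (len(segment) - win) // hop
    # Row k of idx holds the sample indices of frame k.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return segment[idx]


sample_rate = 16000                                   # hypothetical sample rate
signal = np.arange(45 * sample_rate, dtype=float)     # stand-in for a 45 s recording
segments = split_into_segments(signal, sample_rate)   # long-term features: one vector per row
frames = frame_segment(segments[0], sample_rate)      # short-term features: one vector per row
```

In the described system, each row of `segments` would go to openSMILE for the long-term feature vector, and each row of `frames` would yield one 39-dimensional log filterbank vector.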
Table 2: Recall for each class and the unweighted average recall
(UAR) on the development set given by different systems (%)
     Baseline  DNN+RNN  DNN+RNN + Baseline
ARA    36.0     39.5          41.9
CHI    45.2     65.4          65.5
FRE    36.3     45.0          50.0
GER    62.1     62.4          68.2
HIN    57.8     79.5          77.1
ITA    48.9     64.9          68.1
JPN    41.2     43.5          44.7
KOR    34.4     42.2          47.8
SPA    33.0     26.0          35.0
TEL    50.6     43.4          49.4
TUR    49.5     62.8          66.0
UAR    45.1     52.2          55.8
Deep Neural Network: A DNN was constructed to predict the
accent type from the long-term features.
The structure of the DNN is as follows. There was an input
layer with 6373 nodes, one for each dimension of the
feature set, followed by three hidden layers with 256 nodes each.
Rectified linear units (“ReLU”) were used at the output
of each hidden layer and we used the dropout method to prevent
overfitting: each input unit to the next layer is dropped with
probability 0.5 [27]. The output layer contained 11 nodes
corresponding to the 11 accents with softmax activation functions.
Stochastic gradient descent with a batch size of 128 was used
for training. The learning rate and momentum were set to 0.001
and 0.9, respectively. All of the parameters were optimized on
the development set. We attempted to use principal component
analysis (PCA) to reduce the input feature dimension from
6373 to 800. Our hope was that this would reduce the size of
the model, making it easier to train and improving its robustness;
however, the cross-validation results on the development
set decreased slightly after PCA, so we kept the original
feature set.
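To make the size of the DNN concrete, the sketch below rebuilds the layer layout in plain NumPy with random weights and counts its parameters. It is not the authors' Keras model: the initialisation, the helper names and the use of inverted dropout on the hidden layers only are assumptions, and no training is performed.

```python
import numpy as np

LAYER_SIZES = [6373, 256, 256, 256, 11]  # input, three hidden layers, output


def init_params(sizes, rng):
    """Small random weights and zero biases for each fully connected layer."""
    return [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def count_params(params):
    """Total number of trainable weights and biases."""
    return sum(w.size + b.size for w, b in params)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, params, rng=None, drop_prob=0.5):
    """Forward pass; pass rng to apply dropout as in training, omit it for inference."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)                # ReLU on each hidden layer
        if rng is not None:                           # inverted dropout (assumed variant)
            h = h * (rng.random(h.shape) >= drop_prob) / (1.0 - drop_prob)
    w, b = params[-1]
    return softmax(h @ w + b)                         # 11 class probabilities per input


rng = np.random.default_rng(0)
params = init_params(LAYER_SIZES, rng)
n_params = count_params(params)
probs = forward(rng.normal(size=(4, 6373)), params)   # 4 stand-in feature vectors
```

Almost all of the parameters sit in the first layer (6373 x 256 weights), which is the part that a PCA projection to 800 dimensions would have shrunk.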
Recurrent Neural Network: The RNN was trained on the
short-term features extracted from 25ms frames of speech.
Categorical labels were assigned to each frame of the segment. The
results for each sample were calculated by averaging the
predictions on all frames in all segments. The structure of the RNN
is as follows. The input data is fed sequentially into the RNN
frame by frame. Each frame is of dimension 39. Two hidden
layers with 512 long short-term memory (LSTM) nodes were
used. In each LSTM node, there is a cell state regulated by a
forget gate, an input gate and an output gate. The activation
function for the gates was a ‘logistic sigmoid’ and for updating
the cell state we used a ‘tanh’. The accent label was
assigned to every 25ms speech frame; the LSTM layers allowed
the model to learn long-term dependencies by taking the output
of the previous hidden nodes as part of the inputs to the
current nodes. Our hypothesis was that with this kind of structure
the model could learn differences in articulation (e.g. formant
values) and differences in how articulation changes over time
(e.g. formant trajectories) for different accents. Specifically,
as shown in the RNN part of Figure 1, the input is a time
series of acoustic features $X = [x_1, \ldots, x_n, \ldots, x_N]$ with length
$N$. After training, the RNN computes the hidden sequence
$H = [h_1, \ldots, h_n, \ldots, h_N]$ and outputs the probability
predictions for each frame $Y = [y_1, \ldots, y_n, \ldots, y_N]$ by iterating from
$n = 1$ to $N$ as follows [28]:
$$h_n = \mathcal{H}(W_{xh} x_n + W_{hh} h_{n-1} + b_h), \qquad y_n = W_{hy} h_n + b_y \tag{1}$$
For training the model, we followed a similar approach to
the DNN. We used dropout, with each input unit to the
next layer dropped with probability 0.5 [29]. The RMSProp
algorithm was used for optimization [30] with a learning
rate of 0.001 and a batch size of 256 samples.
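The gated recurrence described in this section can be illustrated with a minimal NumPy forward pass. This is a sketch under several assumptions, not the trained model: it uses a single LSTM layer with 8 units and random weights instead of two layers of 512 trained nodes, it follows the standard LSTM formulation (including a tanh on the cell state before the output gate), and all names are hypothetical.

```python
import numpy as np

GATES = ("forget", "input", "output", "candidate")


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_lstm(n_in, n_hidden, rng):
    """One weight matrix and bias per gate, each acting on [x_n, h_{n-1}] concatenated."""
    shape = (n_in + n_hidden, n_hidden)
    return {g: (rng.normal(0.0, 0.1, size=shape), np.zeros(n_hidden)) for g in GATES}


def lstm_forward(x_seq, params):
    """Run one LSTM layer over an (N, n_in) sequence; returns (N, n_hidden) hidden states."""
    n_hidden = params["forget"][1].shape[0]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    hidden = []
    for x in x_seq:
        z = np.concatenate([x, h])                      # current frame plus previous output
        pre = {g: z @ w + b for g, (w, b) in params.items()}
        f = sigmoid(pre["forget"])                      # how much of the old cell state to keep
        i = sigmoid(pre["input"])                       # how much new content to write
        o = sigmoid(pre["output"])                      # how much of the cell state to expose
        c = f * c + i * np.tanh(pre["candidate"])       # cell state update
        h = o * np.tanh(c)
        hidden.append(h)
    return np.stack(hidden)


rng = np.random.default_rng(0)
x_seq = rng.normal(size=(20, 39))                       # 20 stand-in frames of 39 features
params = init_lstm(n_in=39, n_hidden=8, rng=rng)
hidden = lstm_forward(x_seq, params)
# Per-frame class probabilities over the 11 accents from a softmax output layer.
probs = softmax(hidden @ rng.normal(0.0, 0.1, size=(8, 11)) + np.zeros(11))
```

Each row of `probs` plays the role of one per-frame prediction, so a segment of N frames yields N probability vectors.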
Table 3: Accuracy and UAR for the variations of the system
on the development set
              DNN only  RNN only  Fusion on segments  DNN with RNN (on sequence)
Accuracy (%)    42.9      49.1          49.8                    50.2
UAR (%)         43.2      49.5          50.0                    50.4
Generating a final decision: We interpret the outputs of the
activation functions of both the DNN and the RNN as a posteriori
probabilities. The final decision was calculated by fusing
these two estimates. Suppose the complete speech sample
from a speaker (45 sec) is segmented into $S$ 4-sec parts. The
DNN outputs a probability vector that gives
the probability that the input segment belongs to each of the 11
classes. Thus, the probability predicted by the DNN for the
$i$th segment and the $j$th class is denoted by $P_{DNN}(i, j)$, where
$i = 1, 2, \ldots, S$ and $j = 1, 2, \ldots, 11$. The RNN also provides a
probability vector, but it is predicted on every 25ms frame
instead of on every segment. For the same 4-sec segment used
in the DNN, we can combine the results from the individual
frames into a single prediction for the segment, $P_{RNN}(i, j)$, as
$$P_{RNN}(i, j) = \frac{1}{N} \sum_{n=1}^{N} p_{RNN}(n, j) \tag{2}$$
where $p_{RNN}(n, j)$ is the prediction of the RNN on the $n$th
frame for the $j$th class, with $n = 1, 2, \ldots, N$ and
$j = 1, 2, \ldots, 11$. $N$ is the total number of frames in speech segment
$i$, $i = 1, 2, \ldots, S$.
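Equation 2 is a plain average over frames, as the short sketch below shows. The array values are toy numbers with 3 classes rather than 11, and the function name is hypothetical.

```python
import numpy as np


def segment_posterior(frame_probs):
    """Average per-frame class probabilities (N, n_classes) into one vector (n_classes,)."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    return frame_probs.mean(axis=0)


# Toy RNN output for one segment: 3 frames, 3 classes (the paper has 11 classes).
frames = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.4, 0.4, 0.2]])
p_segment = segment_posterior(frames)
```

Since every row sums to one, the averaged vector also sums to one and can be treated as a segment-level probability vector alongside the DNN output.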
After combining the individual probabilities for each frame
into a single probability for the segment, we can combine the
DNN and RNN probabilities using a weighted average. The
final probability score $P(j)$ on the complete sample for the $j$th
class is calculated as in Equation 3.
$$P(j) = \frac{1}{S} \left[ w_{DNN} \sum_{i=1}^{S} P_{DNN}(i, j) + w_{RNN} \sum_{i=1}^{S} P_{RNN}(i, j) \right] \tag{3}$$
where $i = 1, 2, \ldots, S$ and $j = 1, 2, \ldots, 11$. $w_{DNN}$ and $w_{RNN}$ are the
weights for the DNN and RNN predictions. They are determined
by the accuracy of the DNN and RNN on the development set as
$$w_{RNN} = 1 - w_{DNN} \tag{4}$$
where $Acc$ is the accuracy of the model, which is the proportion
of correct predictions. A final decision is made by selecting
the class with the highest probability.
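The sample-level fusion of Equation 3 and the final arg-max decision can be sketched as follows. The rule used here for w_DNN, each model's share of the two development-set accuracies, is an assumption made for illustration; the text only states that the weights are determined by accuracy and that they sum to one. All names and numbers are hypothetical toy values.

```python
import numpy as np


def accuracy_weights(acc_dnn, acc_rnn):
    """ASSUMED rule: w_DNN is the DNN's share of the two accuracies; w_RNN = 1 - w_DNN."""
    w_dnn = acc_dnn / (acc_dnn + acc_rnn)
    return w_dnn, 1.0 - w_dnn


def fuse_sample(p_dnn, p_rnn, w_dnn, w_rnn):
    """Equation 3: weight the per-model sums over segments, then divide by S.

    p_dnn, p_rnn: (S, n_classes) segment-level probabilities from each network.
    Returns the fused score vector and the index of the winning class.
    """
    p_dnn = np.asarray(p_dnn, dtype=float)
    p_rnn = np.asarray(p_rnn, dtype=float)
    n_segments = p_dnn.shape[0]
    scores = (w_dnn * p_dnn.sum(axis=0) + w_rnn * p_rnn.sum(axis=0)) / n_segments
    return scores, int(np.argmax(scores))


# Toy example: 2 segments, 3 classes, made-up accuracies of 0.4 (DNN) and 0.6 (RNN).
w_dnn, w_rnn = accuracy_weights(0.4, 0.6)
p_dnn = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
p_rnn = [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]
scores, label = fuse_sample(p_dnn, p_rnn, w_dnn, w_rnn)
```

In this toy case the DNN favours the first class and the RNN the second; the larger RNN weight tips the fused decision to the second class.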
4. Evaluation and Results
Both the DNN and RNN were trained with the Python neural
networks library, Keras [31], running on top of Theano on a
CUDA GPU. The data was normalized to zero mean and unit
standard deviation, using the mean and standard deviations from
the training set. The results are shown as recall for each class
in the third column of Table 2. The overall accuracy is 51.92%,
and the UAR is 52.24%.
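The normalization step mentioned above, where the mean and standard deviation come from the training set only, can be sketched as below. The function names, the toy data and the small floor on the standard deviation are assumptions for illustration.

```python
import numpy as np


def fit_standardizer(train_features, eps=1e-8):
    """Per-dimension mean and standard deviation estimated on the training set only."""
    train_features = np.asarray(train_features, dtype=float)
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    return mean, np.maximum(std, eps)  # floor guards against constant features (assumption)


def standardize(features, mean, std):
    """Apply training-set statistics to any split (training, development or test)."""
    return (np.asarray(features, dtype=float) - mean) / std


rng = np.random.default_rng(0)
train = rng.normal(5.0, 3.0, size=(200, 4))   # stand-in training features
dev = rng.normal(5.0, 3.0, size=(50, 4))      # stand-in development features
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
dev_z = standardize(dev, mean, std)
```

Reusing the training statistics on the development and test data keeps the evaluation free of information from those splits; as a result, `dev_z` is only approximately zero mean and unit variance.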
We also made a number of variations of the system and tested
their performance on the development set. The first two
variations (DNN only and RNN only) use the DNN and RNN alone
without any fusion. The third variation (Fusion on segments)
uses both the DNN and RNN, but the prediction is obtained by
fusing the results on segments (see Equation 5) instead of fusing
on speakers (see Equation 3).
$$P(j) = \frac{1}{S} \sum_{i=1}^{S} \left[ w_{DNN} P_{DNN}(i, j) + w_{RNN} P_{RNN}(i, j) \right]. \tag{5}$$
Comparing the equation above with Equation (3), we see that
the weights are inside the summation whereas they are outside
the summation in (3).
Table 4: Confusion matrix of the proposed system fused with
baseline on development set. Rows are reference, and columns
are hypothesis.
     ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR
ARA   36    3    3    8    5    6    3    0    4    6   11
CHI    1   55    3    3    4    4    1    7    2    2    2
FRE    9    1   40    3    2    9    1    2    8    1    4
GER    3    7    4   58    1    5    0    0    1    1    5
HIN    0    1    0    0   64    1    2    0    0   14    1
ITA    7    1    5    3    4   64    1    0    5    0    4
JPN    3   15    2    0    2    4   38   12    8    1    0
KOR    2   21    1    2    2    2    9   43    4    1    3
SPA    6    8    8    2    5    9    7    8   35    3    9
TEL    2    1    0    2   34    1    0    0    2   41    0
TUR    9    2    0    5    3    6    2    1    3    1   62
Figure 2: Many-to-one RNN structure used in the method of
DNN with RNN (on sequence).
For the fourth variation (DNN with RNN (on sequence)), the
structure is the same as that of the proposed system. The
difference is in the way we train the RNN. In this method, we train the
RNN on the segment level instead of on the frame level. In other
words, the accent label was assigned to the segment instead of
to every frame. This can be interpreted as the many-to-one
model shown in Figure 2. The fusion between the DNN and the
RNN was done in the same way as in the proposed system. The
accuracy and UAR for these variations of the system are shown
in Table 3. The results show that none of the variations
performs better than the proposed system.
Fusing with baseline. Comparing the results between the baseline
and the proposed systems, we can see that the proposed
system outperforms the baseline system overall and for most of
the accents. However, for some of the accents, such as Spanish
(SPA) and Telugu (TEL), the baseline system seems to work
better than the proposed system. It seems that the neural
networks and the SVM learned complementary representations of
the data for the task. Therefore, we tried to fuse the predictions
of the SVM-based baseline system and the proposed
DNN/RNN-based system. The weights of the fusion algorithm
were tuned on the development set (set to 0.9 for the proposed
system and 0.1 for the baseline system). The accuracy after
fusing increased to 55.54%. The recalls are shown in the last
column of Table 2. From the table, we can see the performance
improved further after fusing with the baseline system. The
confusion matrix is shown in Table 4.
The best performance we achieved on the test set is
shown as a confusion matrix in Table 5 and as recalls in Table 6.
The overall accuracy is 52.48% and the UAR is 52.48%. This
is better than the performance of the baseline system reported
in [23].
Table 5: Confusion matrix on test set. Rows are reference, and
columns are hypothesis.
     ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR
ARA   28    2    3    2    3    9   10    6    4    3   10
CHI    1   45    1    2    0    2   13    4    2    2    2
FRE    5    4   38    5    1    7    5    4    4    4    1
GER    0    7    6   45    1    4    1    0    3    1    7
HIN    5    3    1    2   41    0    0    0    2   27    1
ITA    5    3    7    2    2   37    0    0    6    4    2
JPN    5    5    0    2    1    1   49   10    0    0    2
KOR    2   13    1    1    1    1   12   41    4    0    4
SPA    7    5   10    4    4    8    5    4   26    1    3
TEL    1    1    0    0   29    0    0    2    0   54    1
TUR   14    4    5    2    1    2    4    2    4    1   51
Table 6: Recall for each class and the UAR on the test set (%)
ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   UAR
 35   61   49   60   50   54   65   51   34   61   57  52.5
5. Discussion and Conclusion
In this paper, we present an accent identification system that
combines DNNs and RNNs trained on long-term and short-term
features, respectively. We split each original speech sample
into multiple segments, use neural networks to generate a
prediction of the accent type from each segment, and then fuse the
predictions across all segments from a single speaker. Moreover, by
fusing the results of the DNNs and RNNs, we take advantage
of both long-term prosodic features and short-term articulation
features. We have evaluated the proposed system on the
development set and the test set. The results show that the proposed
system surpasses the performance of the provided SVM-based
baseline system. By fusing the results of the proposed system
with those of the baseline system, performance can be further
improved. However, looking through the confusion matrix
in Table 4, we see that the system makes more mistakes among
languages that are geographically close, such as between Hindi
and Telugu, and among Japanese, Korean and Chinese. As
future work it makes sense to develop a hierarchical classifier
that initially considers groups of languages and then makes more
fine-grained decisions. Moreover, it is also worthwhile to
investigate the individual benefits of DNNs and RNNs, since for
some languages like Hindi, the prosody is more distinct, while
for others like German, articulation is more important.
6. Acknowledgement
This work was partially supported by an NIH 1R21DC013812
grant. The authors gratefully acknowledge a hardware donation
from NVIDIA.
7. References
[1] L. Kat and P. Fung, “Fast accent identification and accented
speech recognition,” in Acoustics, Speech, and Signal Processing
(ICASSP), IEEE International Conference on, vol. 1. Phoenix,
AZ, USA: IEEE, 1999, pp. 221–224.
[2] C. Huang, T. Chen, and E. Chang, “Accent issues in large vo-
cabulary continuous speech recognition,” International Journal of
Speech Technology, vol. 7, no. 2-3, pp. 141–153, 2004.
[3] D. C. Tanner and M. E. Tanner, Forensic aspects of speech pat-
terns: voice prints, speaker profiling, lie and intoxication detec-
tion. Lawyers & Judges Publishing Company, 2004.
[4] F. Biadsy, J. B. Hirschberg, and D. P. Ellis, “Dialect and accent
recognition using phonetic-segmentation supervectors,” 2011.
[5] L. M. Arslan and J. H. Hansen, “Frequency characteristics of for-
eign accented speech,” in Acoustics, Speech, and Signal Process-
ing (ICASSP), IEEE International Conference on, vol. 2. Mu-
nich, Germany: IEEE, 1997, pp. 1123–1126.
[6] E. Ferragne and F. Pellegrino, “Formant frequencies of vowels in
13 accents of the british isles,” Journal of the International Pho-
netic Association, vol. 40, no. 01, pp. 1–34, 2010.
[7] S. Deshpande, S. Chikkerur, and V. Govindaraju, “Accent classifi-
cation in speech,” in Automatic Identification Advanced Technolo-
gies, Fourth IEEE Workshop on. Buffalo, NY, USA: IEEE, 2005,
pp. 139–143.
[8] P.-J. Ghesquiere and D. Van Compernolle, “Flemish accent
identification based on formant and duration features,” in Acoustics,
Speech, and Signal Processing (ICASSP), IEEE International
Conference on, vol. 1. Orlando, FL, USA: IEEE, 2002, pp. I–
[9] Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Juraf-
sky, R. Starr, and S.-Y. Yoon, “Accent detection and speech recog-
nition for shanghai-accented mandarin.” in Interspeech. Lisbon,
Portugal: Citeseer, 2005, pp. 217–220.
[10] T. Chen, C. Huang, E. Chang, and J. Wang, “Automatic ac-
cent identification using gaussian mixture models,” in Automat-
ic Speech Recognition and Understanding, IEEE Workshop on.
Madonna di Campiglio, Italy: IEEE, 2001, pp. 343–346.
[11] H. Tang and A. A. Ghorbani, “Accent classification using support
vector machine and hidden markov model,” in Advances in Artifi-
cial Intelligence. Springer, 2003, pp. 629–631.
[12] K. Kumpf and R. W. King, “Foreign speaker accent classifica-
tion using phoneme-dependent accent discrimination models and
comparisons with human perception benchmarks,” in Proc. Eu-
roSpeech, vol. 4, pp. 2323–2326, 1997.
[13] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep
neural networks for acoustic modeling in speech recognition: The
shared views of four research groups,” Signal Processing Maga-
zine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[14] H. Zen and H. Sak, “Unidirectional long short-term memory
recurrent neural network with recurrent output layer for low-latency
speech synthesis,” in Acoustics, Speech and Signal Processing
(ICASSP), IEEE International Conference on. Brisbane, Australia:
IEEE, 2015, pp. 4470–4474.
[15] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study
on speech enhancement based on deep neural networks,” Signal
Processing Letters, IEEE, vol. 21, no. 1, pp. 65–68, 2014.
[16] Y. Jiao, M. Tu, V. Berisha, and J. Liss, “Online speaking rate
estimation using recurrent neural networks,” in Acoustics, Speech and
Signal Processing, IEEE International Conference on. Shanghai,
China: IEEE, 2016.
[17] M. V. Chan, X. Feng, J. A. Heinen, and R. J. Niederjohn, “Clas-
sification of speech accents with neural networks,” in Neural
Networks, IEEE World Congress on Computational Intelligence.,
IEEE International Conference on, vol. 7. IEEE, 1994, pp. 4483–
[18] A. Rabiee and S. Setayeshi, “Persian accents identification using
an adaptive neural network,” in Second International Workshop
on Education Technology and Computer Science. Wuhan, China:
IEEE, 2010, pp. 7–10.
[19] G. Montavon, “Deep learning for spoken language identification,”
in NIPS Workshop on deep learning for speech recognition and
related applications, Whistler, BC, Canada, 2009, pp. 1–4.
[20] R. A. Cole, J. W. Inouye, Y. K. Muthusamy, and M. Gopalakrish-
nan, “Language identification with neural networks: a feasibili-
ty study,” in Communications, Computers and Signal Processing,
IEEE Pacific Rim Conference on. IEEE, 1989, pp. 525–529.
[21] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez,
J. Gonzalez-Rodriguez, and P. Moreno, “Automatic language
identification using deep neural networks,” in Acoustics,
Speech and Signal Processing (ICASSP), IEEE International
Conference on. Florence, Italy: IEEE, 2014, pp. 5337–5341.
[22] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-
Rodriguez, and P. J. Moreno, “Automatic language identification
using long short-term memory recurrent neural networks.” in In-
terspeech, Singapore, 2014, pp. 2155–2159.
[23] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon,
A. Baird, A. Elkins, Y. Zhang, E. Coutinho, and K. Evanini, “The
interspeech 2016 computational paralinguistics challenge: Decep-
tion, sincerity & native language.” in Interspeech. San Francisco,
CA, USA: ISCA, 2016.
[24] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich
versatile and fast open-source audio feature extractor,” in
Proceedings of the 18th ACM international conference on
Multimedia. ACM, 2010, pp. 1459–1462.
[25] M. Tu, X. Xie, and Y. Jiao, “Towards improving statistical model
based voice activity detection.” in Interspeech, Singapore, 2014,
pp. 1549–1552.
[26] T. Virtanen, R. Singh, and B. Raj, Techniques for noise robustness
in automatic speech recognition. John Wiley & Sons, 2012.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: A simple way to prevent neural
networks from overfitting,” The Journal of Machine Learning
Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[28] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition
with deep recurrent neural networks,” in Acoustics, Speech and
Signal Processing (ICASSP), IEEE International Conference on.
Vancouver, BC, Canada: IEEE, 2013, pp. 6645–6649.
[29] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural
network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[30] T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the
gradient by a running average of its recent magnitude,” COURS-
ERA: Neural Networks for Machine Learning, 2012.
[31] F. Chollet, “Keras,” 2015.
... Purely acoustic models have reported encouraging results from the perspective of the speakers' native language identification from speech in their L2 English. Jiao et al. 23 have proposed a fusion of Recurrent Neural Network (RNN) and Deep Neural Network (DNN) on the Native Language Speech Corpus (NLSC) in the INTERSPEECH 16 Native Language Sub-Challenge, consisting of TOEFL recordings for English speech by speakers from 11 different countries. The model has reported a 51.92% accuracy. ...
... Most of the acoustic models in the literature have relied on variants of RNN for modelling time series of short-term features. 19,23 Moreover, most fusion techniques have merely merged the output layers by simple averaging or weighted averaging by using empirically adjusted weights. 22,23 On the other hand, this paper has proposed and analysed the fusion of multiple convolutional networks and found it to give a better performance than standalone RNN or CNN based architectures. ...
... 19,23 Moreover, most fusion techniques have merely merged the output layers by simple averaging or weighted averaging by using empirically adjusted weights. 22,23 On the other hand, this paper has proposed and analysed the fusion of multiple convolutional networks and found it to give a better performance than standalone RNN or CNN based architectures. The proposed model uses a meta-SVM classifier to fuse the individual classifiers by concatenating the bottleneck layers from the trained end-to-end CNNs and using them as inputs to the SVM classifier. ...
Full-text available
Social background profiling of speakers refers to estimating the geographical origin of speakers by their speech features. Methods for accent profiling that use linguistic features, require phoneme alignment and transcription of the speech samples. This paper proposes a purely acoustic accent profiling model, composed of multiple convolutional networks with global average-pooling layers, to classify the temporal sequence of acoustic features. The bottleneck representations of the convolutional networks, trained with the original signals and their low-pass filtered copies, are fed to a Support Vector Machine classifier for final prediction. The model has been analysed for a speech dataset of Indian speakers from social backgrounds spread across India. It has been shown that up to 85% accuracy is achievable for classifying the geographic origin of speakers corresponding to regional Indian languages; 17% higher than the benchmark deep learning model using the same features. Results have also indicated that classification of accents is easier using the second language of the speakers, as compared to their native language.
... In the last decade, although artificial neural networks, such as recurrent neural networks (RNNs), and deep neural networks (DNNs), have been used in several areas of machine learning research for a long time with excellent results, there are only a few studies that have evaluated their applicability in the field of accent classification. One of them is Jiao et al. [7], who developed an accent detection system using a combination of RNNs and DNNs trained on short-term and long-term characteristics. They attempted to identify the original languages of non-native English speakers from eleven nations, and advocated a mix of long-term and short-term training based on the findings that accent differences are mostly attributable to prosodic and articulation disparities. ...
... In table 3, we compared our approach with Jiao [7], VFnet [8], and Resnet [10] models of automatic dialect comparison. It has become clear that our approach of dialects identification works increasingly better than previous models based on comparable approaches. ...
... Studies by Yishan Jiao et al 11 . and Keven Chionh et al 12 . ...
This paper presents three innovative deep learning models for English accent classification: Multi-DenseNet, PSA-DenseNet, and MPSE-DenseNet, that combine multi-task learning and the PSA module attention mechanism with DenseNet. We applied these models to data collected from six dialects of English across native English speaking regions (Britain, the United States, Scotland) and nonnative English speaking regions (China, Germany, India). Our experimental results show a significant improvement in classification accuracy, particularly with MPSA-DenseNet, which outperforms all other models, including DenseNet and EPSA models previously used for accent identification. Our findings indicate that MPSA-DenseNet is a highly promising model for accurately identifying English accents.
... With the development of i-vector, x-vector, and neural networks, language identification has achieved significant success [1][2][3]. In recent years, accent and dialect identification have received increasing attention from speech researchers [4][5][6][7][8][9]. However, DID is often more challenging than the language identification task, as similar dialects often share similar feature spaces [10]. ...
Full-text available
The time-delay neural network (TDNN) can consider multiple frames of information simultaneously, making it particularly suitable for dialect identification. However, previous TDNN architectures have focused on only one aspect of either the temporal or channel information, lacking a unified optimization for both domains. We believe that extracting appropriate contextual information and enhancing channels are critical for dialect identification. Therefore, in this paper, we propose a novel approach that uses the ECAPA-TDNN from the speaker recognition domain as the backbone network and introduce a new multi-scale channel adaptive module (MSCA-Res2Block) to construct a multi-scale channel adaptive time-delay neural network (MSCA-TDNN). The MSCA-Res2Block is capable of extracting multi-scale features, thus further enlarging the receptive field of convolutional operations. We evaluated our proposed method on the ADI17 Arabic dialect dataset and employed a balanced fine-tuning strategy to address the issue of imbalanced dialect datasets, as well as Z-Score normalization to eliminate score distribution differences among different dialects. After experimental validation, our system achieved an average cost performance (Cavg) of 4.19% and a 94.28% accuracy rate. Compared to ECAPA-TDNN, our model showed a 22% relative improvement in Cavg. Furthermore, our model outperformed the state-of-the-art single-network model reported in the ADI17 competition. In comparison to the best-performing multi-network model hybrid system in the competition, our Cavg also exhibited an advantage.
... The ELMs method achieved 77.88% accuracy on the specified dataset. Further, Jiao et al. [9] proposed an accent classification system using a hybrid DNN and recurrent neural network (RNN) algorithm. The DNN and RNN models were trained on long-term features and short-term features, respectively. ...
Accent similarity evaluation and accent identification are complex and challenging tasks for various applications because of the many native and non-native language varieties in the world. The absence of existing studies on non-native and native English accent similarity evaluation, and the limitations of individual feature extraction techniques for accent classification, led us to propose a new model termed the intra-native accent feature sharing based native accent identification (NAI) framework, using an English accent archive speech dataset. The NAI network was employed for non-native English accent classification, native English accent classification, and identification of native and non-native English accents. Finally, the accent similarity of native and non-native English accents was evaluated based on a delicate NAI pre-trained model. Moreover, the proposed approach plays an important role in training data augmentation, helping to overcome deep learning's demand for large amounts of training data. Ordinary individual voice feature extraction with data augmentation and regularization techniques was the baseline for our work. The proposed approach improved on the accuracy of the baseline method by an average of 3.7% to 7.5% across different deep learning algorithms. The Quade test for performance comparison gave a p-value of 0.01, indicating that the proposed approach performed significantly better than the baseline. The model ranks non-native English accents by their similarity to native English accents; the proximity ranking is Mandarin, Italian, German, French, Amharic, and Hindi.
... However, this approach critically degrades ASR performance when the system fails to identify the input speech's dialect accurately. In such contexts, language and dialect identification still remain a challenging problem and are very difficult especially when the input utterances are short [5][19][20][21][22]. In addition, it is hard to apply such systems to utterances with multiple speaker changes, for example, the processing of single channel call center conversations with multiple speakers conversing in different dialects. ...
Automatic accent classification is an active research field concerning speech processing. It can be useful to identify a speaker's region of origin, which can be applied in police investigations carried out by Law Enforcement Agencies, as well as for the improvement of current speech recognition systems. This paper presents a novel descriptor called Grad-Transfer, extracted using the Gradient-weighted Class Activation Mapping (Grad-CAM) method based on convolutional neural network (CNN) interpretability. Additionally, we propose a methodology for accent classification that implements Grad-Transfer, which is based on transferring the knowledge acquired by a CNN to a classical machine learning algorithm. The paper works on two hypotheses: the coarse localization maps produced by Grad-CAM on spectrograms are able to highlight the regions of the spectrograms that are important for predicting accents, and Grad-Transfer descriptors computed from audios represent distinctive descriptions of the target accents. These hypotheses were demonstrated experimentally, clustering the generated Grad-Transfer descriptors according to the original accent of the audios using Birch and $k$-means algorithms. We carried out experiments on the Voice Cloning Toolkit dataset, seeing an increase in macro average accuracy and unweighted average recall in the results obtained by a Gaussian Naive Bayes classifier of up to $23.00\%$ and $23.58\%$, respectively, compared to a model trained with spectrograms. This demonstrates that Grad-Transfer is able to improve the performance of accent classification models and opens the door to new implementations in similar tasks.
Representation of one-dimensional (1D) signals as surfaces and higher-dimensional manifolds reveals geometric structures that can enhance assessment of signal similarity and classification of large sets of signals. Motivated by this observation, we propose a novel robust algorithm for extraction of geometric features, by mapping the obtained geometric objects into a reference domain. This yields a set of highly descriptive features that are instrumental in feature engineering and in analysis of 1D signals. Two examples illustrate applications of our approach to well-structured audio signals: Lung sounds were chosen because of the interest in respiratory pathologies caused by the coronavirus and environmental conditions; accent detection was selected as a challenging speech analysis problem. Our approach outperformed baseline models under all measured metrics. It can be further extended by considering higher-dimensional distortion measures. We provide access to the code for those who are interested in other applications and different setups. Supplementary information: The online version contains supplementary material available at 10.1186/s13634-022-00933-9.
Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state-of-the-art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences. Key features: Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech. Acts as a timely exposition of the topic in light of more widespread use in the future of ASR technology in challenging environments. Addresses robustness issues and signal degradation which are both key requirements for practitioners of ASR. Includes contributions from top ASR researchers from leading research units in the field
Errors in Bayes Classification; Bayes Classification and ASR; External Influences on Speech Recordings; The Effect of External Influences on Recognition; Improving Recognition under Adverse Conditions; References
Statistical model based voice activity detection (VAD) is commonly used in various speech-related research and applications. In this paper, we try to improve the performance of statistical model based VAD via a new feature extraction method. Our main innovation is to apply Mel-frequency subband coefficients with power-law nonlinearity as the feature for statistical model based VAD, instead of Discrete Fourier Transform (DFT) coefficients. The proposed feature is then modeled by a Gaussian distribution. The performance of this method is comprehensively compared with existing methods. We also test power-law nonlinearity on existing methods. Experimental results show that the proposed subband coefficients substantially improve the performance of statistical model based VAD. Power-law nonlinearity on DFT coefficients also brings some improvement.
This work explores the use of Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) for automatic language identification (LID). The use of RNNs is motivated by their better ability in modeling sequences with respect to feed forward networks used in previous works. We show that LSTM RNNs can effectively exploit temporal dependencies in acoustic data, learning relevant features for language discrimination purposes. The proposed approach is compared to baseline i-vector and feed forward Deep Neural Network (DNN) systems in the NIST Language Recognition Evaluation 2009 dataset. We show LSTM RNNs achieve better performance than our best DNN system with an order of magnitude fewer parameters. Further, the combination of the different systems leads to significant performance improvements (up to 28%).
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
Empirical results have shown that many spoken language identification systems based on hand-coded features perform poorly on small speech samples where a human would be successful. A hypothesis for this low performance is that the set of extracted features is insufficient. A deep architecture that learns features automatically is implemented and evaluated on several datasets.
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, and machine translation.
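As an illustration of the basic dropout idea summarized in the abstracts above (randomly dropping units during training, then using a single unthinned network with scaled-down weights at test time), the following is a minimal sketch in plain Python. The function names `dropout_train` and `dropout_test` and the parameter `retain_prob` are hypothetical and chosen only for this example; it operates on a plain list of activations rather than a real network, and it does not cover the LSTM-specific variant.

```python
import random


def dropout_train(activations, retain_prob, rng):
    """Training-time pass (illustrative): keep each unit with probability
    retain_prob and set it to zero otherwise."""
    return [a if rng.random() < retain_prob else 0.0 for a in activations]


def dropout_test(activations, retain_prob):
    """Test-time pass (illustrative): keep every unit but scale it by
    retain_prob. Scaling a unit's output is equivalent to scaling its
    outgoing weights, which is the 'smaller weights' approximation."""
    return [a * retain_prob for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    acts = [1.0, 2.0, -0.5, 4.0]    # made-up activations for the demo
    p = 0.5                         # assumed retain probability
    n = 20000                       # number of sampled "thinned" passes

    # Average many random thinned passes and compare with the scaled pass.
    sums = [0.0] * len(acts)
    for _ in range(n):
        for i, v in enumerate(dropout_train(acts, p, rng)):
            sums[i] += v
    averaged = [s / n for s in sums]

    print("average of thinned passes:", averaged)
    print("single scaled pass:       ", dropout_test(acts, p))
```

For a single layer of activations like this, the average over many thinned passes should come out close to the single scaled pass, which is the intuition behind the test-time approximation.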