
Accent Identification by Combining Deep Neural Networks and Recurrent Neural Networks Trained on Long and Short Term Features

Yishan Jiao1, Ming Tu1, Visar Berisha1,2, Julie Liss1
1Department of Speech and Hearing Science
2School of Electrical, Computer, and Energy Engineering
Arizona State University
{yjiao16, mingtu, visar, julie.liss}@asu.edu
Abstract
Automatic identification of foreign accents is valuable for many
speech systems, such as speech recognition, speaker identifica-
tion, voice conversion, etc. The task of the INTERSPEECH
2016 Native Language Sub-Challenge is to identify the native
languages of non-native English speakers from eleven countries. Since dif-
ferences in accent are due to both prosodic and articulation char-
acteristics, a combination of long-term and short-term training
is proposed in this paper. Each speech sample is processed into
multiple speech segments with equal length. For each segment,
deep neural networks (DNNs) are used to train on long-term
statistical features, while recurrent neural networks (RNNs) are
used to train on short-term acoustic features. The result for each
speech sample is calculated by linearly fusing the results from
the two sets of networks on all segments. The performance
of the proposed system greatly surpasses the provided baseline
system. Moreover, by fusing the results with the baseline sys-
tem, the performance can be further improved.
Index Terms: accent identification, deep neural networks,
prosody, articulation
1. Introduction
Accent classification refers to the problem of inferring the
native language of a speaker from his or her foreign accented
speech. Identifying idiosyncratic differences in speech produc-
tion is important for improving the robustness of existing speech
analysis systems. For example, automatic speech recognition
(ASR) systems exhibit lower performance when evaluated on
foreign accented speech. By developing pre-processing algo-
rithms that identify the accent, these systems can be modified
to customize the recognition algorithm to the particular accent
[1] [2]. In addition to ASR applications, accent identification
is also useful for forensic speaker profiling, by identifying the
speaker’s regional origin and ethnicity, and in applications in-
volving targeted marketing [3] [4]. In this paper we propose a method
for classification of 11 accents directly from the speech acous-
tics.
A number of studies have analyzed how elemental compo-
nents of speech change with accent. Spectral features (e.g. for-
mant frequencies) and temporal features (e.g. intonation and
durations) have all been shown to vary with accent [5] [6].
These features have been combined in various statistical models
and machine learning methods to automate the accent classifi-
cation task. Gaussian Mixture Models (GMMs) and Hidden
Markov Models (HMMs) are commonly used approaches in
many earlier studies [7] [8] [9]. For example, Deshpande et al.
used GMMs based on formant frequency features to discrimi-
nate between standard American English and Indian accented
English [7]. Chen et al. explored the effect of the number of
components in GMMs on classification performance [10]. Tang
and Ghorbani compared the performance of HMMs with Sup-
port Vector Machine (SVM) for accent classification [11]. Oth-
ers have also considered linear models. Ghesquiere et al. used
both formant frequencies and duration features and proposed
an “eigenvoice” approach for Flemish accent identification [8].
Kumpf and King proposed to use linear discriminant analysis
(LDA) for identification of three accents in Australian English
[12].
Artificial neural networks, especially Deep Neural Net-
works (DNNs) and Recurrent Neural Networks (RNNs) have
been widely used in state-of-the-art speech systems [13] [14]
[15] [16]; however in the area of accent identification, there are
only a few studies evaluating the performance of neural net-
works [17] [18]. Nonetheless, in a related area, language identi-
fication (LID), neural networks have been investigated exhaus-
tively [19] [20] [21]. A recent study in this area explored the
use of recurrent neural networks for automatic language iden-
tification [22]. Their study also suggests that the combination
of recurrent and deep networks can lead to significant improve-
ments in performance. Inspired by this work, in this paper, we
propose a system that combines DNNs and RNNs. In contrast
to the work in [22], we propose to take advantage of both long-
term and short-term features since previous work shows that
foreign accents depend on both long-term prosodic features and
short-term articulation features. The final prediction is obtained
by linearly fusing the results from the two neural networks.
The organization of this paper is as follows. Section 2
briefly describes the goal, the dataset and the baseline system
for the INTERSPEECH 16 Native Language Sub-Challenge.
Section 3 introduces the proposed system that combines long
and short term features using DNNs and RNNs. The corre-
sponding experimental setup is also described in this section.
The evaluation results are shown in Section 4. The discussion
and the conclusion are in Section 5.
2. Dataset and the Baseline System
The provided dataset for the INTERSPEECH 16 Native Lan-
guage Sub-Challenge contains a training, a development, and
a test set. The corpus contains one speech sample from each of
the 5132 speakers, labeled with one of the 11 native languages. The
training and development sets are assigned 3300 and 965
samples, respectively. The remaining 867 samples are assigned
to the test set. The length of each sample is 45 seconds. A
detailed description of the dataset can be found in the baseline
Table 1: Confusion matrix of baseline system on development
set. Rows are reference, and columns are hypothesis.
ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR
ARA 31 3 6 7 5 5 6 5 5 6 7
CHI 4 38 5 4 5 2 5 9 7 4 1
FRE 11 7 29 9 0 5 3 1 9 0 6
GER 4 4 5 54 1 7 2 3 6 1 0
HIN 3 2 2 0 48 2 1 2 2 21 0
ITA 6 3 8 7 6 46 0 3 10 1 4
JPN 4 13 5 2 2 1 35 11 10 1 1
KOR 3 20 1 3 2 3 13 31 5 3 6
SPA 6 11 15 6 2 4 9 8 33 1 5
TEL 2 0 3 2 24 2 3 1 2 42 2
TUR 6 4 4 6 2 6 7 8 5 0 47
paper [23].
The goal of the Native Language Sub-Challenge is to
identify the corresponding native language from the accented
speech. The challenge is particularly difficult for two reasons:
first, all of the speech samples were recorded with babble back-
ground noise using low-quality head-mounted microphones.
Second, in addition to accent differences, a large number of the
speakers were not perfectly fluent in English; therefore there
were a number of pauses and linguistic fillers in the speech. In
our proposed system we try to address these challenges by us-
ing voice activity detection (VAD) to remove the pauses and
using a non-linear learning algorithm to model the relationship
between the features and the class label.
The baseline system against which we compare used 6373
long-term features extracted from each speech sample with
openSMILE [24]; these include prosodic features (range, maxi-
mum, minimum of F0, sub-band energies, peaks, etc.) and var-
ious statistics of traditional acoustic features (mean, standard
deviation, kurtosis of MFCC, RASTA, etc.). A support vector
machine (SVM) is constructed to model the data. More detail
about the baseline system can be found in [23]. The perfor-
mance of the baseline system on the development set is shown as a
confusion matrix in Table 1. The overall accuracy is 44.66%.
The recall for each class and the unweighted average recall
(UAR) are shown in the second column of Table 2.
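For readers who want a comparable reference point, the sketch below shows how an SVM over precomputed openSMILE functionals might be trained with scikit-learn. It is an illustration under stated assumptions (precomputed 6373-dimensional feature matrices, integer labels 0-10, and a placeholder complexity constant), not the official challenge recipe.

```python
# Hypothetical scikit-learn re-creation of an openSMILE + linear SVM baseline.
# Assumptions: X_* are (num_samples, 6373) arrays of openSMILE functionals and
# y_* are integer labels in 0..10; the complexity constant C is a placeholder.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def train_svm_baseline(X_train, y_train, X_dev, y_dev, C=1e-4):
    scaler = StandardScaler().fit(X_train)              # statistics from training data only
    clf = LinearSVC(C=C).fit(scaler.transform(X_train), y_train)
    y_pred = clf.predict(scaler.transform(X_dev))
    uar = recall_score(y_dev, y_pred, average="macro")  # unweighted average recall
    return clf, scaler, uar
```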
3. Proposed System Description and
Experimental Setup
The proposed system is shown in Figure 1. It consists of a voice
activity detector, followed by two parallel neural networks (a
DNN and an RNN) analyzing the speech samples at different
scales, and a probabilistic fusion algorithm. Below we describe
each component of the model.
Voice Activity Detection: As mentioned previously, there are
a number of pauses and silences in the speech samples. These
were often due to the fact that some of the speakers did not
speak fluent English and paused to think of the proper expres-
sion. We first used voice activity detection (VAD) [25] to re-
move the silence periods. The VAD threshold was adjusted
to match the noise level of the speech samples using cross-
validation and we only removed the detected silence segments
with length longer than 300 milliseconds.
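The paper relies on the statistical-model-based VAD of [25]; as a rough stand-in, the sketch below drops only those low-energy stretches that last longer than 300 ms, using a simple energy threshold (the threshold value and frame sizes here are illustrative assumptions, not the authors' settings).

```python
# Simplified energy-based stand-in for the VAD step (not the method of [25]):
# only silent stretches longer than 300 ms are removed from the waveform x.
import numpy as np

def remove_long_silences(x, sr, frame_ms=25, hop_ms=10, thresh_db=-40.0, min_sil_ms=300):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    keep = np.ones(len(x), dtype=bool)
    sil_start = None
    for start in range(0, len(x) - frame, hop):
        db = 10 * np.log10(np.mean(x[start:start + frame] ** 2) + 1e-12)
        if db < thresh_db:
            sil_start = start if sil_start is None else sil_start
        else:
            if sil_start is not None and start - sil_start >= sr * min_sil_ms / 1000:
                keep[sil_start:start] = False        # drop silences of 300 ms or more
            sil_start = None
    if sil_start is not None and len(x) - sil_start >= sr * min_sil_ms / 1000:
        keep[sil_start:] = False
    return x[keep]
```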
[Figure 1: The proposed system of combining long and short term features using DNNs and RNNs — one 45 s speech sample is passed through VAD and split into S 4-second segments; 6373-dimensional long-term features are fed to the DNN and frame-level short-term features to the RNN, and the two outputs are fused.]
Framing and Feature Extraction: The remaining speech samples were then trimmed into multiple segments with equal length of 4 seconds. Thus every 45-second speech sample was
segmented into approximately 10-11 parts. The long-term fea-
tures we used were the same as those in the baseline system (mean,
standard deviation, kurtosis of MFCC, RASTA, etc.). They
were extracted from each segment in each speech sample with
openSMILE scripts. Each 4-sec window was further split into
25ms windows with a 10ms overlap. Short-term features were
extracted from each 25ms signal. Specifically, we used 39th-
order mel-scale filterbank features with logarithmic compres-
sion [26].
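As a concrete illustration of the framing step, the sketch below cuts a waveform into 4 s segments and computes 39 log-compressed mel filterbank energies per 25 ms frame with a 10 ms hop. The use of librosa is an assumption; the paper does not name the short-term extraction tool beyond [26].

```python
# Illustrative framing + log-mel feature extraction (librosa is an assumption;
# the paper only states 39 mel filterbank features with log compression).
import numpy as np
import librosa

def segment_and_logmel(x, sr, seg_s=4.0, n_mels=39):
    seg_len = int(seg_s * sr)
    segments = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    feats = []
    for seg in segments:
        mel = librosa.feature.melspectrogram(
            y=seg, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
            n_mels=n_mels)
        feats.append(np.log(mel + 1e-10).T)          # (frames, 39) per 4 s segment
    return feats
```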
Deep Neural Network: A DNN was constructed to make a pre-
diction regarding the accent type from the long-term features.
The structure of the DNN is as follows: There was an input
layer with 6373 nodes corresponding to each dimension in the
feature set. Three hidden layers with 256 nodes each fol-
lowed. Rectified linear units (“ReLU”) were used at the output
of each layer, and we used the dropout method to prevent over-
fitting: each input unit to the next layer can be dropped with
0.5 probability [27]. The output layer contained 11 nodes cor-
responding to the 11 accents with softmax activation functions.
Stochastic gradient descent with a batch size of 128 was used
for training. The learning rate and momentum were set to 0.001
and 0.9 respectively. All of the parameters were optimized on
the development set. We attempted to use principal component
analysis (PCA) to reduce the input feature dimension from
Table 2: Recall for each class and the unweighted average recall
(UAR) on the development set given by different systems (%)
        Baseline   DNN+RNN   Baseline+DNN+RNN
ARA 36.0 39.5 41.9
CHI 45.2 65.4 65.5
FRE 36.3 45.0 50.0
GER 62.1 62.4 68.2
HIN 57.8 79.5 77.1
ITA 48.9 64.9 68.1
JPN 41.2 43.5 44.7
KOR 34.4 42.2 47.8
SPA 33.0 26.0 35.0
TEL 50.6 43.4 49.4
TUR 49.5 62.8 66.0
UAR 45.1 52.2 55.8
6373 to 800. Our hope was that this would reduce the size of
the model, making it easier to train and improving its robust-
ness; however, the cross-validation results on the development
set decreased slightly after PCA, therefore we kept the original
feature set.
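A minimal sketch of the DNN described above is given below in modern tf.keras (the authors used Keras on Theano; this is a re-statement of the stated hyperparameters, not their code, and the number of epochs is an assumption).

```python
# tf.keras sketch of the long-term-feature DNN: 6373-dim input, three 256-unit
# ReLU hidden layers with 0.5 dropout, 11-way softmax, SGD (lr 0.001, momentum 0.9).
import tensorflow as tf
from tensorflow.keras import layers

def build_dnn(input_dim=6373, n_classes=11):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"), layers.Dropout(0.5),
        layers.Dense(256, activation="relu"), layers.Dropout(0.5),
        layers.Dense(256, activation="relu"), layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# dnn = build_dnn()
# dnn.fit(X_train, y_train, batch_size=128, epochs=30, validation_data=(X_dev, y_dev))
```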
Recurrent Neural Network: The RNN was trained on the
short-term features extracted from 25ms frames of speech. Cat-
egorical labels were assigned to each frame of the segment. The
results for each sample were calculated by averaging the pre-
dictions on all frames in all segments. The structure of the RNN
is as follows: the input data is sequentially fed into the RNN
frame-by-frame. Each frame is of dimension 39. Two hidden
layers with 512 long short term memory (LSTM) nodes were
used. In each LSTM node, there is a cell state regulated by a
forget gate, an input gate and an output gate. The activation
function for the gates was a ‘logistic sigmoid’ and for updat-
ing the cell state we used a ‘tanh’. The accent label was as-
signed to every 25ms speech frame - the LSTM layers allowed
the model to learn long-term dependencies by taking the output
of the previous hidden nodes as part of the inputs to the cur-
rent nodes. Our hypothesis was that with this kind of structure
the model could learn differences in articulation (e.g. formant
values) and differences in how articulation changes over time
(e.g. formant trajectories) for different accents. Specifically,
as shown in the RNN part of Figure 1, the input is a time series
of acoustic features X = [x_1, ..., x_n, ..., x_N] with length N.
After training, the RNN computes the hidden sequences H =
[h_1, ..., h_n, ..., h_N] and outputs the probability predictions
for each frame Y = [y_1, ..., y_n, ..., y_N] by iterating from
n = 1 to N as follows [28]:

    h_n = f_\theta(W_{xh} x_n + W_{hh} h_{n-1} + b_h),
    y_n = W_{hy} h_n + b_y.                                    (1)
For training the model, we followed a similar approach to
the DNN. We used dropout, with each of the input units to the
next layer dropped with 0.5 probability [29]. The RMSProp al-
gorithm was used for optimization [30] with a learning rate of
0.001 and a training batch size of 256 samples.
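The frame-level LSTM can be sketched in the same way; the version below emits one 11-way softmax per 25 ms frame (again a modern tf.keras approximation of the stated configuration, with the frame-label replication shown only as a comment).

```python
# tf.keras sketch of the frame-level RNN: 39-dim frames, two 512-unit LSTM
# layers with 0.5 dropout, per-frame 11-way softmax, RMSprop (lr 0.001).
import tensorflow as tf
from tensorflow.keras import layers

def build_rnn(feat_dim=39, n_classes=11):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, feat_dim)),          # variable-length frame sequences
        layers.LSTM(512, return_sequences=True), layers.Dropout(0.5),
        layers.LSTM(512, return_sequences=True), layers.Dropout(0.5),
        layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Frame-level labels: repeat each segment's accent label over its frames, e.g.
# y_frames = np.repeat(y_segment[:, None], n_frames, axis=1), then
# rnn.fit(X, y_frames, batch_size=256, ...)
```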
Generating a final decision: We interpret the output of the
activation functions of both the DNN and the RNN as a poste-
riori probabilities. The final decision was calculated by fusing
these two estimations. Suppose the complete speech sample
Table 3: Accuracy and UAR for the variations of the systems
on the development set
                RNN only   DNN only   Fusion on segments   DNN with RNN (on sequence)
Accuracy (%)    42.9       49.1       49.8                 50.2
UAR (%)         43.2       49.5       50.0                 50.4
from a speaker (45 sec) is segmented into S 4-sec parts. The
DNN provides as an output a probability vector that describes
the probability that the input segment belongs to any of the 11
classes. Thus, the probability prediction given by the DNN for
the ith segment in the jth class is denoted by P_{DNN}(i, j), where
i = 1, 2, ..., S and j = 1, 2, ..., 11. The RNN also provides a
probability vector, but it is predicted on every 25ms frame in-
stead of on every segment. For the same 4-sec segment used
in the DNN, we can combine the results from the individual
frames into a single prediction for the segment, P_{RNN}(i, j), as
follows:

    P_{RNN}(i, j) = (1/N) \sum_{n=1}^{N} p_{RNN}(n, j)            (2)

where p_{RNN}(n, j) is the prediction of the RNN on the nth
frame for the jth class dimension, with n = 1, 2, ..., N and
j = 1, 2, ..., 11. N is the total number of frames in speech
segment i, i = 1, 2, ..., S.
After combining the individual probabilities for each frame
into a single probability for the segment, we can combine the
DNN and RNN probabilities using a weighted average. The
final probability score P(j) on the complete sample in the jth
class is calculated as in Equation (3):

    P(j) = (1/S) [ w_{DNN} \sum_{i=1}^{S} P_{DNN}(i, j) + w_{RNN} \sum_{i=1}^{S} P_{RNN}(i, j) ]    (3)

where i = 1, 2, ..., S and j = 1, 2, ..., 11. w_{DNN} and w_{RNN}
are the weights for the DNN and RNN predictions. They are de-
termined by the accuracy of the DNN and RNN on the develop-
ment set as follows:

    w_{DNN} = Acc_{DNN} / (Acc_{DNN} + Acc_{RNN}),
    w_{RNN} = 1 - w_{DNN}                                          (4)

where Acc is the accuracy of the model, which is the propor-
tion of correct predictions. A final decision is made by selecting
the class with the highest probability.
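The fusion rules of Equations (2)-(4) amount to a few lines of NumPy; the sketch below assumes the DNN posteriors are stored per segment and the RNN posteriors per frame.

```python
# Fusion sketch: Eq. (2) averages RNN frame posteriors within a segment,
# Eq. (4) derives the weights from development-set accuracies, and Eq. (3)
# averages the weighted segment posteriors over all S segments of a speaker.
import numpy as np

def fuse_speaker(p_dnn, p_rnn_frames, acc_dnn, acc_rnn):
    """p_dnn: (S, 11) DNN posteriors; p_rnn_frames: list of S (N_i, 11) arrays."""
    p_rnn = np.stack([f.mean(axis=0) for f in p_rnn_frames])          # Eq. (2)
    w_dnn = acc_dnn / (acc_dnn + acc_rnn)                             # Eq. (4)
    w_rnn = 1.0 - w_dnn
    S = p_dnn.shape[0]
    p = (w_dnn * p_dnn.sum(axis=0) + w_rnn * p_rnn.sum(axis=0)) / S   # Eq. (3)
    return int(np.argmax(p)), p                                       # predicted class + scores
```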
4. Evaluation and Results
Both the DNN and RNN were trained with the Python neural
networks library, Keras [31], running on top of Theano on a
CUDA GPU. The data was normalized to zero mean and unit
standard deviation, using the mean and standard deviation from
the training set. The results are shown as recall for each class
in the third column of Table 2. The overall accuracy is 51.92%,
and the UAR is 52.24%.
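The normalization step is the usual z-scoring with training-set statistics only; a small sketch:

```python
# Zero-mean, unit-variance normalization using statistics of the training set,
# applied unchanged to the development and test sets.
import numpy as np

def zscore(X_train, *other_sets):
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    return [(X - mu) / sigma for X in (X_train,) + other_sets]

# X_train_n, X_dev_n, X_test_n = zscore(X_train, X_dev, X_test)
```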
We also made a number of variations of the system and tested
the performance on the development set. The first two varia-
tions (DNN only and RNN only) use the DNN and RNN alone
without any fusion. The third variation (Fusion on segments)
uses both the DNN and RNN, but the prediction is obtained by
Table 4: Confusion matrix of the proposed system fused with
baseline on development set. Rows are reference, and columns
are hypothesis.
ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR
ARA 36 3 3 8 5 6 3 0 4 6 11
CHI 1 55 3 3 4 4 1 7 2 2 2
FRE 9 1 40 3 2 9 1 2 8 1 4
GER 3 7 4 58 1 5 0 0 1 1 5
HIN 0 1 0 0 64 1 2 0 0 14 1
ITA 7 1 5 3 4 64 1 0 5 0 4
JPN 3 15 2 0 2 4 38 12 8 1 0
KOR 2 21 1 2 2 2 9 43 4 1 3
SPA 6 8 8 2 5 9 7 8 35 3 9
TEL 2 1 0 2 34 1 0 0 2 41 0
TUR 9 2 0 5 3 6 2 1 3 1 62
[Figure 2: Many-to-one RNN structure used in the method of DNN with RNN (on sequence) — inputs x_1, ..., x_N feed hidden states h_1, ..., h_N, and a single output y is produced for the whole sequence.]
fusing the results on segments (see Equation 5) instead of fusing
on speakers (see Equation 3).
    P(j) = (1/S) \sum_{i=1}^{S} [ w_{DNN} P_{DNN}(i, j) + w_{RNN} P_{RNN}(i, j) ].    (5)
Comparing the equation above with Equation (3), we see that
the weights are inside the summation whereas they are outside
the summation in (3).
For the fourth variation (DNN with RNN (on sequence)), the
structure is the same as that of the proposed system. The differ-
ence is in the way we train the RNN. In this method, we train the
RNN on the segment level instead of on the frame level. In other
words, the accent label was assigned to the segment instead of
assigning it to every frame. This can be interpreted as a many-
to-one model in Figure 2. The fusion between the DNN and the
RNN was done in the same way as in the proposed system. The
accuracy and UAR for these variations of the system are shown
in Table 3. The results show that none of the variations of the
system performs better than the current system.
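For the sequence-level variation, the only structural change relative to the frame-level RNN sketch above is that the final LSTM layer returns a single output per 4 s segment (many-to-one), so one label per segment is used; a hedged tf.keras illustration:

```python
# Many-to-one variant of the RNN used in the "DNN with RNN (on sequence)"
# system: the last LSTM layer emits one vector per segment instead of per frame.
import tensorflow as tf
from tensorflow.keras import layers

def build_rnn_many_to_one(feat_dim=39, n_classes=11):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, feat_dim)),
        layers.LSTM(512, return_sequences=True), layers.Dropout(0.5),
        layers.LSTM(512, return_sequences=False), layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),   # one prediction per 4 s segment
    ])
```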
Fusing with the baseline: Comparing the results between the base-
line and the proposed systems, we can see that the proposed
system outperforms the baseline system overall and for most of
the accents. However, for some of the accents, such as Spanish
(SPA) and Telugu (TEL), the baseline system seems to work
better than the proposed system. It seems that the neural net-
works and the SVM learned complementary representations of
the data for the task. Therefore, we tried to fuse the predic-
tions of the SVM-based baseline system and the proposed
DNN/RNN based system. The weights of the fusion algorithm
were tuned on the development set (set to 0.9 for the proposed
Table 5: Confusion matrix on test set. Rows are reference, and
columns are hypothesis.
ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR
ARA 28 2 3 2 3 9 10 6 4 3 10
CHI 1 45 1 2 0 2 13 4 2 2 2
FRE 5 4 38 5 1 7 5 4 4 4 1
GER 0 7 6 45 1 4 1 0 3 1 7
HIN 5 3 1 2 41 0 0 0 2 27 1
ITA 5 3 7 2 2 37 0 0 6 4 2
JPN 5 5 0 2 1 1 49 10 0 0 2
KOR 2 13 1 1 1 1 12 41 4 0 4
SPA 7 5 10 4 4 8 5 4 26 1 3
TEL 1 1 0 0 29 0 0 2 0 54 1
TUR 14 4 5 2 1 2 4 2 4 1 51
Table 6: Recall for each class and the UAR on the test set (%)
ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR UAR
35 61 49 60 50 54 65 51 34 61 57 52.5
system and 0.1 for the baseline system). The accuracy after fus-
ing increased to 55.54%. The recalls are shown in the last col-
umn of Table 2. From the table, we can see the performance
improved further after fusing with the baseline system. The
confusion matrix is shown in Table 4.
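Assuming both systems' outputs are available as (num_samples, 11) score matrices (the SVM scores would need to be mapped to comparable posteriors first), the final fusion with the baseline is a single weighted sum:

```python
# Weighted fusion of the proposed system with the baseline, using the
# development-set weights quoted above (0.9 proposed, 0.1 baseline).
import numpy as np

def fuse_with_baseline(p_proposed, p_baseline, w_proposed=0.9):
    p = w_proposed * p_proposed + (1.0 - w_proposed) * p_baseline
    return p.argmax(axis=1), p
```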
The best performance we achieved on the test set is
shown as a confusion matrix in Table 5 and as recalls in Table 6.
The overall accuracy is 52.48% and the UAR is 52.48%. This
is better than the performance of the baseline system reported
in [23].
5. Discussion and Conclusion
In this paper, we present an accent identification system by
combining DNNs and RNNs trained on long-term and short-
term features, respectively. We process the original speech sam-
ples into multiple segments, generate predictions of the accent
type from each segment using the neural networks, and then fuse
them across all segments of a single speaker's sample. Moreover, by fus-
ing the results between DNNs and RNNs, we take advantage
of both long-term prosodic features and short-term articulation
features. We have evaluated the proposed system on the devel-
opment set and the test set. The results show that the proposed
system surpasses the performance of the provided SVM-based
baseline system. By fusing the results of the proposed system
with that of the baseline system, performance can be further
improved. However, by looking through the confusion matrix
in Table 4, we see that the system makes more mistakes among
languages which are geographically close, such as between Hin-
di and Telugu; and among Japanese, Korean and Chinese. As
future work it makes sense to develop a hierarchical classifier
that initially considers groups of languages then makes more
fine-grained decisions. Moreover, it is also worthwhile to in-
vestigate the individual benefits of DNNs and RNNs, since for
some languages like Hindi, the prosody is more distinct; while
for others like German, articulation is more important.
6. Acknowledgement
This work was partially supported by an NIH 1R21DC013812
grant. The authors gratefully acknowledge a hardware dona-
tion from NVIDIA.
7. References
[1] L. Kat and P. Fung, “Fast accent identification and accented
speech recognition,” in Acoustics, Speech, and Signal Processing
(ICASSP), IEEE International Conference on, vol. 1. Phoenix,
AZ, USA: IEEE, 1999, pp. 221–224.
[2] C. Huang, T. Chen, and E. Chang, “Accent issues in large vo-
cabulary continuous speech recognition,” International Journal of
Speech Technology, vol. 7, no. 2-3, pp. 141–153, 2004.
[3] D. C. Tanner and M. E. Tanner, Forensic aspects of speech pat-
terns: voice prints, speaker profiling, lie and intoxication detec-
tion. Lawyers & Judges Publishing Company, 2004.
[4] F. Biadsy, J. B. Hirschberg, and D. P. Ellis, “Dialect and accent
recognition using phonetic-segmentation supervectors,” 2011.
[5] L. M. Arslan and J. H. Hansen, “Frequency characteristics of for-
eign accented speech,” in Acoustics, Speech, and Signal Process-
ing (ICASSP), IEEE International Conference on, vol. 2. Mu-
nich, Germany: IEEE, 1997, pp. 1123–1126.
[6] E. Ferragne and F. Pellegrino, “Formant frequencies of vowels in
13 accents of the british isles,” Journal of the International Pho-
netic Association, vol. 40, no. 01, pp. 1–34, 2010.
[7] S. Deshpande, S. Chikkerur, and V. Govindaraju, “Accent classifi-
cation in speech,” in Automatic Identification Advanced Technolo-
gies, Fourth IEEE Workshop on. Buffalo, NY, USA: IEEE, 2005,
pp. 139–143.
[8] P.-J. Ghesquiere and D. Van Compernolle, “Flemish accent iden-
tification based on formant and duration features,” in Acoustic-
s, Speech, and Signal Processing (ICASSP), IEEE International
Conference on, vol. 1. Orlando, FL, USA: IEEE, 2002, pp. I–
749.
[9] Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Juraf-
sky, R. Starr, and S.-Y. Yoon, “Accent detection and speech recog-
nition for shanghai-accented mandarin.” in Interspeech. Lisbon,
Portugal: Citeseer, 2005, pp. 217–220.
[10] T. Chen, C. Huang, E. Chang, and J. Wang, “Automatic ac-
cent identification using gaussian mixture models,” in Automat-
ic Speech Recognition and Understanding, IEEE Workshop on.
Madonna di Campiglio, Italy: IEEE, 2001, pp. 343–346.
[11] H. Tang and A. A. Ghorbani, “Accent classification using support
vector machine and hidden markov model,” in Advances in Artifi-
cial Intelligence. Springer, 2003, pp. 629–631.
[12] K. Kumpf and R. W. King, “Foreign speaker accent classifica-
tion using phoneme-dependent accent discrimination models and
comparisons with human perception benchmarks,” in Proc. Eu-
roSpeech, vol. 4, pp. 2323–2326, 1997.
[13] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep
neural networks for acoustic modeling in speech recognition: The
shared views of four research groups,” Signal Processing Maga-
zine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[14] H. Zen and H. Sak, “Unidirectional long short-term memory re-
current neural network with recurrent output layer for low-latency
speech synthesis,” in Acoustics, Speech and Signal Processing (I-
CASSP), IEEE International Conference on. Brisbane, Australia:
IEEE, 2015, pp. 4470–4474.
[15] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study
on speech enhancement based on deep neural networks,” Signal
Processing Letters, IEEE, vol. 21, no. 1, pp. 65–68, 2014.
[16] Y. Jiao, M. Tu, V. Berisha, and J. Liss, “Online speaking rate esti-
mation using recurrent neural networks,” in Acoustics, Speech and
Signal Processing, IEEE International Conference on. Shanghai,
China: IEEE, 2016.
[17] M. V. Chan, X. Feng, J. A. Heinen, and R. J. Niederjohn, “Clas-
sification of speech accents with neural networks,” in Neural
Networks, IEEE World Congress on Computational Intelligence.,
IEEE International Conference on, vol. 7. IEEE, 1994, pp. 4483–
4486.
[18] A. Rabiee and S. Setayeshi, “Persian accents identification using
an adaptive neural network,” in Second International Workshop
on Education Technology and Computer Science. Wuhan, China:
IEEE, 2010, pp. 7–10.
[19] G. Montavon, “Deep learning for spoken language identification,”
in NIPS Workshop on deep learning for speech recognition and
related applications, Whistler, BC, Canada, 2009, pp. 1–4.
[20] R. A. Cole, J. W. Inouye, Y. K. Muthusamy, and M. Gopalakrish-
nan, “Language identification with neural networks: a feasibili-
ty study,” in Communications, Computers and Signal Processing,
IEEE Pacific Rim Conference on. IEEE, 1989, pp. 525–529.
[21] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Mar-
tinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic lan-
guage identification using deep neural networks,” in Acoustic-
s, Speech and Signal Processing (ICASSP), IEEE International
Conference on. Florence, Italy: IEEE, 2014, pp. 5337–5341.
[22] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-
Rodriguez, and P. J. Moreno, “Automatic language identification
using long short-term memory recurrent neural networks.” in In-
terspeech, Singapore, 2014, pp. 2155–2159.
[23] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon,
A. Baird, A. Elkins, Y. Zhang, E. Coutinho, and K. Evanini, “The
interspeech 2016 computational paralinguistics challenge: Decep-
tion, sincerity & native language.” in Interspeech. San Francisco,
CA, USA: ISCA, 2016.
[24] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the mu-
nich versatile and fast open-source audio feature extractor,” in
Proceedings of the 18th ACM international conference on Mul-
timedia. ACM, 2010, pp. 1459–1462.
[25] M. Tu, X. Xie, and Y. Jiao, “Towards improving statistical model
based voice activity detection.” in Interspeech, Singapore, 2014,
pp. 1549–1552.
[26] T. Virtanen, R. Singh, and B. Raj, Techniques for noise robustness
in automatic speech recognition. John Wiley & Sons, 2012.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: A simple way to prevent neural net-
works from overfitting,” The Journal of Machine Learning Re-
search, vol. 15, no. 1, pp. 1929–1958, 2014.
[28] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition
with deep recurrent neural networks,” in Acoustics, Speech and
Signal Processing (ICASSP), IEEE International Conference on.
Vancouver, BC, Canada: IEEE, 2013, pp. 6645–6649.
[29] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural net-
work regularization,” arXiv preprint arXiv:1409.2329, 2014.
[30] T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the
gradient by a running average of its recent magnitude,” COURS-
ERA: Neural Networks for Machine Learning, 2012.
[31] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.