Accent Classification in Human Speech Biometrics
for Native and Non-native English Speakers
Jordan J. Bird
School of Engineering and Applied Science, Aston
Diego R. Faria
School of Engineering and Applied Science, Aston
Accent classification provides a biometric path to high
resolution speech recognition. This preliminary study explores
various methods of human accent recognition through
classification of locale. Classical, ensemble, timeseries and
deep learning techniques are all explored and compared. A set
of diphthong vowel sounds is recorded from participants from
the United Kingdom and Mexico, and then formed into a large
static dataset of statistical descriptions by way of their
Mel-frequency Cepstral Coefficients (MFCC) at a sample window
length of 0.02 seconds. Using both flat and timeseries data,
various machine learning models are trained and compared
to the scientific standard Hidden Markov Model (HMM).
Results through 10-fold cross validation show that a vote of
average probabilities between a Random Forest and a Long
Short-term Memory Neural Network results in a classification
accuracy of 94.74%, outperforming the speech classification
standard Hidden Markov Model by roughly 5 percentage points.
• Computing methodologies → Speech recognition; Machine
learning approaches; Learning settings.
Computational Linguistics, Speech Recognition, Accent Recog-
nition, Machine Learning, Biometrics, Voice Assistants
Speech recognition in the home is quickly becoming a more
viable and affordable technology through systems such as
Apple Siri, Amazon Alexa and Google Home. Home assistants
perform many tasks such as purchasing products, remotely
controlling home appliances, and making phone calls, among
countless other skills. Despite the growing abilities
and availability of Smart Homes and their respective devices,
several issues at the level of the scientific state of the art
hamper their usage. Specifically, non-native
English speakers often encounter issues when attempting
to converse with automated assistants [
], and thus
measures are required to correctly recognise the
accent or locale of a speaker, which can then be
acted on accordingly.
In this work, a dataset of spoken sounds from the English
phonetic dictionary is grouped based on the locale of the
speaker. Speakers are both native (West Midlands, UK;
London, UK) and non-native (Mexico City, MX; Chihuahua, MX)
English speakers, producing a four-class problem. Various
single, ensemble and deep learning models are trained and
compared in terms of their classification ability for accent
recognition. A flat dataset of 26 Mel-frequency Cepstral
Coefficients per 200ms window forms the data objects for
classification, except for the Hidden Markov Model, for which
a timeseries of the same datapoints is generated for training
and prediction.
The main contributions of this work are as follows:
• A benchmark of the most common model used for
contemporary voice recognition, the Hidden Markov
Model, when training from a uniform spoken audio
dataset and producing predictions of speaker accent/locale.
• Single and ensemble models are presented for the
classification of accent of two Mexican locales and two
British locales.
• The final comparison of the eleven machine learning
models, in which a vote of average probabilities of
Random Forest and LSTM is suggested as the best model
with a classification accuracy of 94.74%.
Hidden Markov Models (HMM), since their inception in 1966,
remain a modern approach for speech recognition because
they stay effective as more computational resources become
available. An earlier work from 1996 found that, using 5
hidden Markov states due to the computational resources
available at the time, four spoken accents could be classified
at an observed accuracy of 74.5%[
]. It must be noted that far
deeper exploration into optimal HMM topology and structure
is now possible due to the larger degree of processing
power available to researchers in the modern day. A more
modern work found that Support Vector Machines (SVM)
and HMM could classify between three different national
locales (Chinese, Indian, Canadian) at an average accuracy
of 93.8% for both models [
], though this study only classified
data from male speakers due to the statistical frequency
differences between gender and voice.
Long Short Term Memory neural networks are often
successfully applied to accent recognition.
A related experiment found accuracies of around 50% when
classifying 12 different locales [
]. Of the dataset gathered
from multiple locations across the globe, it was observed
that the highest recall rates were those of Japanese, Chinese,
and German, scoring 65%, 61% and 60% respectively.
Subjects were recorded in their native language. An alternative
network-based approach, Convolutional Neural Networks,
was used to classify speech from English, Korean and
Chinese dialects at an average accuracy of 88%[
]. A proposed
approach to the accent problem in speech recognition, also
using a CNN, offered a preliminary study into deriving a
conversion matrix to be applied to Mel-frequency Cepstral
Coefficients which would act to translate the user's accent
into a specified second accent before speech recognition is
The effectiveness of voting between multiple models to
create a simple classifier ensemble has been observed to
be extremely effective in many human-machine interaction
domains such as Sentiment Analysis [
] and EEG brainwave
It is worth noting, in terms of criticism, that many accent
recognition experiments rarely define the spoken language
itself, often resulting in the classification of a subject
speaking their native language in their natural locale. Through
this, it is very possible that the classifiers in question would
not only learn from accent, but from natural language
patterns as a form of audible Natural Language Processing, since
such effects are also represented within MFCC data. For the
goal of improving voice recognition for non-native English
speakers who are speaking English, the previous models
would be of limited use or inaccurate. The originality
of this experiment is to classify data retrieved from native
and non-native English speakers (who are all requested to
pronounce sounds from the English phonetic dictionary, as
if they were speaking in English), with the ultimate goal of
providing a path to improving voice recognition on English
language devices and services for non-native speakers.
Mel-frequency Cepstral Coefficients
Soundwaves are complex, random, and non-stationary, and
thus classification of raw sound is very difficult. For example,
real-time monitoring of sound data would simply give a
measured frequency at a single point in time, which would be
impossible to classify since the behaviour of the wave is not
described. A sliding time windowing technique is introduced
and a statistical extraction is performed based on the section
of the wave appearing within the observed time window.
This results in a set of temporal mathematical descriptions of
wave sections. Mel-frequency Cepstral Coefficient (MFCC)
extraction is often cited as the most effective statistical
modelling method of sound waves[
]. MFCC datasets
are produced from a sliding time window as follows:
(1) The Fourier Transform (FT) of the time window data
is taken.
(2) The powers from the FT are mapped to the mel-scale,
which is the psychological scale of audible pitch.
(3) The Mel-frequency Cepstrum (MFC), or power
spectrum of sound, is considered and logs of each of the
powers are taken.
(4) The derived mel-log powers are treated as a signal,
and a discrete cosine transform (DCT) is measured:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad k = 0, \ldots, N-1. \quad (2)$$
The MFCCs are finally considered as the resultant amplitudes
of the spectrum generated through the above process. A
mathematical description of a short wave section has thus been
generated, and provides attributes for the mapped class.
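As an illustration of the final step only, the discrete cosine transform of the log mel powers can be sketched in a few lines of Python. This is a minimal sketch and not the extraction code used in the study: the function name `dct_ii` and the example values are hypothetical, and the earlier steps (windowing, Fourier Transform, mel mapping) are assumed to have been done already.

```python
import math

def dct_ii(log_mel_powers):
    """DCT-II of a sequence x_0..x_{N-1}:
    X_k = sum_n x_n * cos[(pi / N) * (n + 1/2) * k], for k = 0..N-1."""
    n_points = len(log_mel_powers)
    return [
        sum(
            x_n * math.cos(math.pi / n_points * (n + 0.5) * k)
            for n, x_n in enumerate(log_mel_powers)
        )
        for k in range(n_points)
    ]

# Hypothetical log mel-band powers for one time window (illustrative only).
example_log_powers = [2.0, 2.0, 2.0, 2.0]
coefficients = dct_ii(example_log_powers)
```

For a perfectly flat input such as this one, all of the energy lands in the first coefficient and the remaining coefficients are zero up to rounding error, which is why the higher coefficients describe the shape of the spectrum rather than its overall level.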
A dataset was gathered for concurrent experiments and
made freely available online for future experiments
the accent locales observed in Fig. 1. The voice recognition
dataset contained seven individual phonetic sounds spoken
ten times each by subjects from the United Kingdom and
Mexico. Those from the UK were native English speakers,
whereas those from Mexico were native Spanish and fluent
English speakers who were asked to pronounce the phonetic
sounds as if they were speaking English. 26 logs of MFCC
data were extracted from each dataset at a sliding time
window of 200ms; each data object comprised the 26 MFCC features
Figure 1: Accent Locale of Experimental Subjects (Not to
Scale). To the left, Chihuahua (topmost) and Mexico City
(bottom-most) are located, in Mexico; to the right the West
Midlands (topmost) and London (bottom-most) are located,
in the United Kingdom.
Figure 2: Information Gain of Each MFCC Log Attribute
(horizontal axis ticks 2 to 26)
mapped to the accent of the speaker. Accents were sourced
from the West Midlands and London in the UK, whereas
accents from Mexico were sourced from Mexico City and
Chihuahua. Weights of all four classes were balanced (since
the clips differed in length) to simulate an equally distributed
dataset. The dataset was formatted into a timeseries
(relational attributes) for HMM training and prediction. The
Information Gain classification ability of each of the individual
attributes is shown in Fig. 2. Models were all trained
on an AMD FX8320 8-core processor with a clock speed of
An extremely complex dense neural network was trained
for purposes of results and comparison, but would be unreal-
istically complex for actual real world use. A neural network
of two hidden layers (256, 128) was trained on the CUDA
cores of an NVidia GTX680 for 1000 epochs per fold, with
a batch size of 100. This model was trained using the Keras
] running via the TensorFlow platform. This is
labelled as "DENSE NN" in the results.
Due to the large degree of computational resources re-
quired for training several of the selected models, 10-fold
cross validation was chosen for model averaging.
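The fold structure of k-fold cross validation can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `k_fold_indices` is hypothetical, folds are contiguous and unshuffled, and no stratification by class is applied, none of which is specified in the text.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Assumption: no shuffling or stratification; the first n_samples % k
    folds receive one extra sample so that every sample is tested once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Split 25 hypothetical data objects into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

Each model is trained ten times, once per fold, and the reported accuracy is the average over the ten held-out test sets.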
Chosen Machine Learning Methods
Various methods of machine learning were selected based on
their ranging statistical[
] and methodological differences,
as to aptly benchmark a selection of methods for spoken
accent classification. This section briefly details the scientific
method followed by each of the methods to derive knowledge
through learning for classification.
Random Trees and Forests. A Decision Tree is a data structure
of conditional control statements based on attribute values,
which are then mapped to a tree. Classification is performed
by cascading a data point down the tree through each
conditional check until it reaches a leaf node (a node with no
remaining branches), which is mapped to a class, i.e. the
prediction of the model. The growth of the tree is based on the
entropy of its end node, that is, the level of disorder in classes
found on that node. Entropy of a node is considered as:
$$E(S) = -\sum_{j=1}^{c} P_j \log(P_j)$$
where the entropies of each class prediction are measured at
a leaf node, P_j being the proportion of data objects of class j
among the c classes found on that node.
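To make the notion of node entropy concrete, the following minimal sketch computes it from the class labels that reach a node. The function name `node_entropy` is hypothetical and a base-2 logarithm is assumed, since the text does not state the base.

```python
import math
from collections import Counter

def node_entropy(class_labels):
    """Shannon entropy (base 2, an assumption) of the class labels on a node."""
    counts = Counter(class_labels)
    total = len(class_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A pure node has no disorder; an evenly mixed node has maximal disorder.
pure_node = node_entropy(["London"] * 4)
mixed_node = node_entropy(["London", "London", "Chihuahua", "Chihuahua"])
```

A split is attractive when it sends the data to child nodes whose entropy is lower than that of the parent.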
A Random Tree (RT) is a method of generating a random
decision tree in which k random attributes
are selected at each control statement as well as a best-fit
]. An overfitted tree is generated for the input set
and therefore cross-validation or a test-set is required for
proper measurement of prediction ability. J48 is an
implementation of the C4.5 decision tree algorithm[
]. Rather than
randomness, information entropy is used to calculate a best
split at each node, i.e. the most optimal split at that given step.
C4.5 requires far more processing power than RT due to the
requirement of this calculation.
A Random Forest is an ensemble of many Random Trees
through Bootstrap Aggregation (bagging) and Voting[
Numerous RTs are generated by a random selection of data.
During classification, all of the trees vote on their
prediction, and the majority vote is selected as the overall
prediction. Random Forests tend to outperform Random Trees
due to their decreasing of variance without increasing of the
Bayesian Classifiers. Bayes' Theorem[
] describes the
probability that data point d will match Class C. The
Figure 3: Diagram of a Long Short-term Memory Network
theorem is given as follows:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where the probability of hypothesis A being true given evidence B,
P(A|B), is related to the probability of that evidence under the
hypothesis, P(B|A), and to the prior probabilities P(A) and P(B).
In terms of this work, this would take inputs of MFCC measurements
and attempt to classify the spoken accent via selecting that of
the highest probability based on previous evidence.
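Naive Bayes classification is given as follows:
$$y = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
where class label y is given to the data object with attribute
values x_1, ..., x_n, and k indexes the K classes. The naivety in
Bayesian algorithms concerns the assumed independence of
attribute values (or existence), whether or not the assumption
holds true for the data.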
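A minimal sketch of Naive Bayes classification on numeric attributes is given below. It assumes a Gaussian likelihood for each attribute, which the text does not specify, and the function names and toy values are hypothetical rather than taken from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-attribute mean and variance for every class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps each variance strictly positive (an assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class maximising log p(C_k) + sum_i log p(x_i | C_k)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy two-attribute data standing in for MFCC features (illustrative only).
toy_rows = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]]
toy_labels = ["UK", "UK", "MX", "MX"]
nb_model = fit_gaussian_nb(toy_rows, toy_labels)
```

Working in log space turns the product of per-attribute likelihoods into a sum, which avoids numerical underflow when there are many attributes, such as the 26 used here.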
Long Short-term Memory. Long Short Term Memory (LSTM)
is a form of Artificial Neural Network in which multiple
Recurrent Neural Networks (RNN) predict based on the current
state and previous states. As seen in Fig. 3, the data structure
of a neuron within a layer is an 'LSTM Block'. The general
idea is as follows:
The first step requires the LSTM block to select stored
data to delete:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
where W_f are the weights of the block, h_{t-1} is the previous
output of the block (at time t-1), x_t are the inputs received by
the block, and a bias is applied via b_f.
The block must then select which data to store/remember.
Based on cell input i, C_t are the values generated.
The block will then update parameters through a convolu-
Finally, output Ot is produced, and the hidden state is
Due to the observed consideration of time sequences, i.e.
previously seen data, it is often found that time-dependent
data are very effectively classified due to the memory state
of the block. LSTM ANN is thus a particularly powerful
technique in terms of speech recognition [
] and brainwave
] - since both are temporal, wave-like
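The steps of an LSTM block described in this section can be sketched for a single unit with scalar input and state. This is the commonly used formulation of the gates and may differ in detail from the exact equations intended in the text; the function names and the parameter layout are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM block.

    params maps each gate name to (w_h, w_x, b): the weight on the previous
    output, the weight on the current input, and a bias (hypothetical layout)."""
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("forget"))                # stored data to delete
    i_t = sigmoid(pre_activation("input"))                 # data to store
    c_candidate = math.tanh(pre_activation("candidate"))   # proposed new values
    c_t = f_t * c_prev + i_t * c_candidate                 # updated cell state
    o_t = sigmoid(pre_activation("output"))                # output gate
    h_t = o_t * math.tanh(c_t)                             # new hidden state
    return h_t, c_t

# With all weights and biases at zero, every gate evaluates to 0.5.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("forget", "input", "candidate", "output")}
h_next, c_next = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
```

Feeding a sequence of MFCC windows through repeated calls, passing each returned `h_t` and `c_t` into the next call, is what lets the block carry information forward across time.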
Nearest Neighbour Classification. K-nearest Neighbour (KNN)
is a method of classification based on measured distance from
k training data points [
]. KNN is considered a lazy learning
technique since all computation is deferred and only required
during the classification stage. KNN is performed as follows:
(1) Convert nominal attributes to integers mapped to the
(2) Normalise attributes
(3) Map all training data to n-dimensional space, where n
is the number of attributes
(4) Lazy Computation Stage - For each data point:
    (a) Plot the data point to the previously generated
    n-dimensional space
    (b) Have K-nearest points all vote on the point based on
    (c) Predict the class of the point with that which has
    received the highest number of votes
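The lazy computation stage can be sketched as below. This is a minimal illustration assuming Euclidean distance and attributes that are already numeric and normalised; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    Assumption: Euclidean distance on already-normalised numeric attributes."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters standing in for two accent classes.
knn_rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
knn_labels = ["UK", "UK", "UK", "MX", "MX", "MX"]
```

No work is done until a query arrives, which is why training is effectively free while prediction cost grows with the size of the stored training set.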
Logistic Regression. Logistic Regression is a process of
symmetric statistics where a numerical value is linked to the
probability of an event occurring, e.g. the number of driving
lessons used to predict pass or fail.
In a two-class problem within a dataset containing i number
of attributes and
model parameters, the log odds l is
and the odds of an outcome
are shown through
which can be used to
predict an outcome based on previous observation.
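The relationship between log odds and predicted probability can be illustrated with the driving-lesson example. This sketch assumes the usual linear form of the log odds; the function names and the coefficient values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def log_odds(attributes, intercept, coefficients):
    """Linear predictor: intercept plus the weighted sum of the attributes."""
    return intercept + sum(b * x for b, x in zip(coefficients, attributes))

def probability(attributes, intercept, coefficients):
    """Logistic function applied to the log odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(attributes, intercept, coefficients)))

# Hypothetical model: each driving lesson adds 0.2 to the log odds of passing.
pass_after_20 = probability([20], intercept=-4.0, coefficients=[0.2])
pass_after_30 = probability([30], intercept=-4.0, coefficients=[0.2])
```

With these made-up parameters, 20 lessons give log odds of zero, which corresponds to even odds of passing, and further lessons push the probability above one half.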
Support Vector Machines. Support Vector Machines (SVM)
classify data points by optimising a data-dimensional
hyperplane to most aptly separate them, and then classifying based
on the distance vector measured from the hyperplane[
Optimisation aims to make the average margin between
the points and the separator as large as possible.
Generation of an SVM is performed through
Sequential Minimal Optimisation (SMO), a high-performing
algorithm to generate and implement an SVM classifier[
To perform this, the large optimisation problem is broken
down into smaller sub-problems, which can then be solved
linearly. For multipliers a, reduced constraints are given as:
where there are data classes y and k is the negative of the
sum over the remaining terms of the equality constraint.
Hidden Markov Models. A Markov Chain is a probabilistic
model that describes a sequence of events and the probability
of each event occurring based on those which have been observed
]. Each previously observed event is repre-
sented as a Hidden Unit and therefore the most optimal
number of hidden states required is largely data dependent.
The general idea of the HMM process is as follows:
denotes the probability of event Y occurring based on
the sequence of length L. Secondly,
describes the probability of Y, where the sum runs over all of
the generated hidden node sequences, given as X:
$$X = x(0), x(1), \ldots, x(L-1). \quad (15)$$
Classification is finally chosen based on the highest
probability on previously studied data sequences within the hidden
model through the Bayesian process.
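The sum over hidden sequences can be sketched directly and compared against the forward algorithm, which computes the same quantity far more cheaply. This is a minimal illustration assuming discrete observation symbols and a two-state model with made-up parameters; the study itself works with continuous MFCC data, and the function names are hypothetical.

```python
from itertools import product

def sequence_probability_brute_force(observations, start, transition, emission):
    """P(Y) as a sum over every hidden state sequence X of P(Y | X) P(X)."""
    n_states = len(start)
    total = 0.0
    for hidden in product(range(n_states), repeat=len(observations)):
        p = start[hidden[0]] * emission[hidden[0]][observations[0]]
        for prev, cur, obs in zip(hidden, hidden[1:], observations[1:]):
            p *= transition[prev][cur] * emission[cur][obs]
        total += p
    return total

def sequence_probability_forward(observations, start, transition, emission):
    """The same quantity via the forward algorithm, linear in sequence length."""
    n_states = len(start)
    alpha = [start[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs] * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

# Made-up two-state model with two observation symbols (illustrative only).
start = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
observed = [0, 1, 0]
brute = sequence_probability_brute_force(observed, start, transition, emission)
forward = sequence_probability_forward(observed, start, transition, emission)
```

One common way to classify with such models, offered here as an assumption rather than as the procedure used in the study, is to fit one model per class and assign a sequence to the class whose model gives it the highest probability.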
Voting. Voting is a simple method of fusing the decisions
of multiple classifiers and calculating an output prediction.
For example, an ensemble of two classifiers with differing
classification abilities may possibly produce a better result
when working together, i.e. selected for their strengths.
Voting is performed by various metrics, in which all classifiers
will vote for a class, and then a prediction is produced by the
highest vote. Methods of voting include:
• Vote weighted by probability to classify an individual
• Give 1 vote to predicted Class - Majority
• Vote based on overall ability to classify
• Vote weighted by confidence
• Vote based on Min and Max Probabilities
The class is predicted based purely on the maximum vote
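Two of the voting rules used later in the results, the average and the product of probabilities, can be sketched as follows. The function names and the per-class probabilities are hypothetical and serve only to show the mechanics.

```python
def vote_average(probabilities_per_model):
    """Average each class probability across models and pick the largest."""
    classes = list(probabilities_per_model[0])
    averaged = {
        c: sum(p[c] for p in probabilities_per_model) / len(probabilities_per_model)
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged

def vote_product(probabilities_per_model):
    """Multiply each class probability across models and pick the largest."""
    classes = list(probabilities_per_model[0])
    combined = {c: 1.0 for c in classes}
    for p in probabilities_per_model:
        for c in classes:
            combined[c] *= p[c]
    return max(combined, key=combined.get), combined

# Made-up class probabilities from two models for one 200ms data object.
rf_probs = {"London": 0.5, "West Midlands": 0.3, "Mexico City": 0.1, "Chihuahua": 0.1}
lstm_probs = {"London": 0.2, "West Midlands": 0.6, "Mexico City": 0.1, "Chihuahua": 0.1}
avg_winner, avg_scores = vote_average([rf_probs, lstm_probs])
prod_winner, prod_scores = vote_product([rf_probs, lstm_probs])
```

In this toy case the two models disagree on their top class, and both rules settle on the class that the more confident model prefers.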
Hidden Markov Units as well as hidden LSTM units were
linearly explored. Preliminary experimentation found that a
single layer of LSTM units persistently outperformed deeper
networks, and thus only one layer was linearly searched. The
number of HMM hidden units was selected as 200,
since it had the highest classification accuracy of 89.65%, as
observed in Fig. 4. The number of hidden units for the LSTM
was selected as 75, since it too had the highest classification
accuracy, 92.01%, as observed in Fig. 5.
Figure 4: Exploration of HMM Hidden Unit Selection
(horizontal axis: HMM Hidden Units, 25 to 200; vertical axis:
Classification Accuracy (%))
Figure 5: Exploration of LSTM Hidden Unit Selection
(horizontal axis: LSTM Hidden Units, 25 to 100; vertical axis:
Classification Accuracy (%))
Table 1 displays the overall classification accuracy of the
selected single models when predicting the locale of the
speaker at each 200ms audio interval. The best single model
was an LSTM with 92.01% accuracy, closely followed by the
extremely complex dense neural network for benchmark
Table 1: Single Classifier Results for Accent Classification (Sorted Lowest to Highest)
Model      Accuracy (%)
NB         58.29
BN         70.62
J48        85.2
LR         85.8
RT         85.94
SVM        86.19
HMM        89.65
RF         89.72
KNN(10)    90.76
DENSE NN   91.55
LSTM       92.01
Table 2: Democratic Voting Processes for Ensemble Classi-
Democracy        Model Accuracy (%)
                 RF, LSTM    KNN, LSTM    KNN, RF
Avg. Prob.       94.74       94.63        92.62
Product Prob.    94.73       94.62        92.62
purposes, and then the K-Nearest Neighbours, and Hidden
The best ensemble, and overall best, was a vote of average
probabilities between the Random Forest and LSTM, achieving
94.74% accuracy. This can be seen in the exploration of
democratic voting processes with the best models, in Table 2.
5 CONCLUSION AND NEXT STEPS
This study has shown the effectiveness of various machine
learning techniques in terms of classifying the accent of
the subject based on recorded audio data. The diphthong
phoneme sounds were successfully classified into four
different accents from the UK and Mexico with an accuracy
of 94.74% when a manually tuned LSTM of 75 units and
a Random Forest are ensembled through a vote of average
probabilities.
Leave-one-out (LOO) cross validation has been observed
to be superior to test-set and k-fold cross-validation
techniques but requires far more processing time[
]. For this study,
LOO would have required around 3,000 times as many training
runs as 10-fold cross validation, since there are 30,000
classifiable data objects. It is likely that better
results would be attained through this approach, but this was
not possible with the resources available. Furthermore, more
intense searching of the problem spaces of HMM and LSTM
hidden unit selection should be performed, since relatively large
differences were observed with minute topological changes.
Evolutionary algorithms have been observed to be a strong
method of topology selection and tuning [6, 8, 10, 22].
[1] Neal Alewine, Eric Janke, Paul Sharp, and Roberto Sicconi. 2008.
Systems and methods for building a native language phoneme lexicon
having native pronunciations of non-native words derived from
non-native pronunciations. US Patent 7,472,061.
[2] Naomi S Altman. 1992. An introduction to kernel and nearest-neighbor
nonparametric regression. The American Statistician 46, 3 (1992), 175–
[3] Levent M Arslan and John HL Hansen. 1996. Language accent
classification in American English. Speech Communication 18, 4 (1996),
[4] Thomas Bayes, Richard Price, and John Canton. 1763. An essay towards
solving a problem in the doctrine of chances. (1763).
[5] Amy Bearman, Kelsey Josund, and Gawan Fiore. [n. d.]. Accent
Conversion Using Artificial Neural Networks. ([n. d.]).
[6] Jordan J. Bird, Elizabeth Wanner, Aniko Ekart, and Diego R. Faria.
2019. Phoneme Aware Speech Recognition through Evolutionary
Optimisation. In The Genetic and Evolutionary Computation Conference.
[7] Jordan J. Bird, Aniko Ekart, Christopher D. Buckingham, and Diego R.
Faria. 2019. Mental Emotional Sentiment Classification with an
EEG-based Brain-Machine Interface. In The International Conference on
Digital Image and Signal Processing (DISP’19). Springer.
[8] Jordan J. Bird, Aniko Ekart, and Diego R. Faria. 2019. Evolutionary
Optimisation of Fully Connected Artificial Neural Network Topology.
In SAI Computing Conference 2019. SAI.
[9] Jordan J. Bird, Aniko Ekart, and Diego R. Faria. 2019. High Resolution
Sentiment Analysis by Ensemble Classification. In SAI Computing
Conference 2019. SAI.
[10] Jordan J. Bird, Diego R. Faria, Luis J. Manso, Aniko Ekart, and
Christopher D. Buckingham. 2019. A Deep Evolutionary Approach to
Bioinspired Classifier Optimisation for Brain-Machine Interaction.
Complexity 2019 (2019). https://doi.org/10.1155/2019/4316548
[11] Jordan J. Bird, Luis J. Manso, Eduardo P. Ribiero, Aniko Ekart, and
Diego R. Faria. 2018. A Study on Mental State Classification using
EEG-based Brain-Machine Interface. In 9th International Conference
on Intelligent Systems. IEEE.
[12] William Byrne, Eva Knodt, Sanjeev Khudanpur, and Jared Bernstein.
1998. Is automatic speech recognition ready for non-native speech?
A data collection effort and initial experiments in modeling
conversational Hispanic English. Proc. Speech Technology in Language
Learning (STiLL) 1, 99 (1998), 8.
[13] François Chollet et al. 2015. Keras. https://keras.io.
[14] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks.
Machine learning 20, 3 (1995), 273–297.
[15] PR Davidson, RD Jones, and MTR Peiris. 2006. Detecting behavioral
microsleeps using EEG and LSTM recurrent neural networks. In 2005
IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE,
[16] Paul A Gagniuc. 2017. Markov Chains: From Theory to Implementation
and Experimentation. John Wiley & Sons.
[17] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013.
Hybrid speech recognition with deep bidirectional LSTM. In Automatic
Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on.
[18] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink,
and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE
transactions on neural networks and learning systems 28, 10 (2017),
[19] Tin Kam Ho. 1995. Random decision forests. In Document analysis and
recognition, 1995., proceedings of the third international conference on,
Vol. 1. IEEE, 278–282.
[20] Yishan Jiao, Ming Tu, Visar Berisha, and Julie M Liss. 2016. Accent
Identification by Combining Deep Neural Networks and Recurrent Neural
Networks Trained on Long and Short Term Features. In Interspeech.
[21] Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for
accuracy estimation and model selection. In Ijcai, Vol. 14. Montreal,
[22] Alejandro Martín, Raúl Lara-Cabrera, Félix Fuentes-Hurtado, Valery
Naranjo, and David Camacho. 2018. EvoDeep: A new evolutionary
approach for automatic Deep Neural Networks parametrisation. J.
Parallel and Distrib. Comput. 117 (2018), 180–191.
[23] Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010.
Voice recognition algorithms using mel frequency cepstral coefficient
(MFCC) and dynamic time warping (DTW) techniques. arXiv preprint
[24] John Platt. 1998. Sequential minimal optimization: A fast algorithm
for training support vector machines. (1998).
[25] Anantha M Prasad, Louis R Iverson, and Andy Liaw. 2006. Newer
classification and regression tree techniques: bagging and random
forests for ecological prediction. Ecosystems 9, 2 (2006), 181–199.
[26] J Ross Quinlan. 2014. C4.5: programs for machine learning. Elsevier.
[27] Md Sahidullah and Goutam Saha. 2012. Design, analysis and
experimental evaluation of block based transformation in MFCC computation
for speaker recognition. Speech Communication 54, 4 (2012), 543–565.
[28] Corey Shih. [n. d.]. Speech Accent Classification. ([n. d.]).
[29] Stanley Smith Stevens, John Volkmann, and Edwin B Newman. 1937.
A scale for the measurement of the psychological magnitude pitch.
The Journal of the Acoustical Society of America 8, 3 (1937), 185–190.
[30] Hong Tang and Ali A Ghorbani. 2003. Accent classification using
support vector machine and hidden markov model. In Conference of
the Canadian Society for Computational Studies of Intelligence. Springer,
[31] Laura Mayfield Tomokiyo and Alex Waibel. 2003. Adaptation methods
for non-native speech. Multilingual Speech and Language Processing 6
[32] Jessica PM Vital, Diego R Faria, Gonçalo Dias, Micael S Couceiro,
Fernanda Coutinho, and Nuno MF Ferreira. 2017. Combining
discriminative spatiotemporal features for daily life activity recognition
using wearable motion sensing suit. Pattern Analysis and Applications 20, 4
[33] Strother H Walker and David B Duncan. 1967. Estimation of the
probability of an event as a function of several independent variables.
Biometrika 54, 1-2 (1967), 167–179.