
Accent Classification in Human Speech Biometrics for Native and Non-native English Speakers


Abstract

Accent classification provides a biometric path to high resolution speech recognition. This preliminary study explores various methods of human accent recognition through classification of locale. Classical, ensemble, timeseries and deep learning techniques are all explored and compared. A set of diphthong vowel sounds is recorded from participants from the United Kingdom and Mexico, and then formed into a large static dataset of statistical descriptions by way of their Mel-frequency Cepstral Coefficients (MFCC) at a sample window length of 0.02 seconds. Using both flat and timeseries data, various machine learning models are trained and compared to the scientific standard Hidden Markov Model (HMM). Results through 10-fold cross validation show that a vote of average probabilities between a Random Forest and a Long Short-term Memory Neural Network results in a classification accuracy of 94.74%, outperforming the speech classification standard Hidden Markov Model by approximately 5 percentage points.
Jordan J. Bird
School of Engineering and Applied Science, Aston
Birmingham, UK
Elizabeth Wanner
School of Engineering and Applied Science, Aston
Birmingham, UK
Anikó Ekárt
School of Engineering and Applied Science, Aston
Birmingham, UK
Diego R. Faria
School of Engineering and Applied Science, Aston
Birmingham, UK
Accent classication provides a biometric path to high reso-
lution speech recognition. This preliminary study explores
various methods of human accent recognition through clas-
sication of locale. Classical, ensemble, timeseries and deep
learning techniques are all explored and compared. A set of
diphthong vowel sounds are recorded from participants from
the United Kingdom and Mexico, and then formed into a large
static dataset of statistical descriptions by way of their Mel-
frequency Cepstral Coecients (MFCC) at a sample window
length of 0.02 seconds. Using both at and timeseries data,
various machine learning models are trained and compared
to the scientic standard Hidden Markov Model (HMM). Re-
sults through 10 fold cross validation show that a vote of
average probabilities between a Random Forest and Long
Short-term Memory Neural Network result in a classication
accuracy of 94.74%, outperforming the speech classication
standard Hidden Markov Model by a 5% increase in accuracy.
CCS Concepts: Computing methodologies → Speech recognition; Machine learning approaches; Learning settings.
Keywords: Computational Linguistics, Speech Recognition, Accent Recognition, Machine Learning, Biometrics, Voice Assistants
Introduction
Speech recognition in the home is quickly becoming a more viable and
affordable technology through systems such as Apple Siri, Amazon Alexa and
Google Home. Home assistants perform many tasks such as purchasing products,
remotely controlling home appliances, and making phone calls, among countless
other skills. Despite the growing abilities and availability of Smart Homes
and their respective devices, several issues hamper their usage at the
current scientific state of the art. Specifically, non-native English
speakers often encounter issues when attempting to converse with automated
assistants [ ], and thus measures are required to correctly recognise the
accent or locale of a speaker, which can then be acted on accordingly.
In this work, a dataset of spoken sounds from the English phonetic dictionary
is grouped based on the locale of the speaker. Speakers are both native (West
Midlands, UK; London, UK) and non-native (Mexico City, MX; Chihuahua, MX)
English speakers, producing a four-class problem. Various single, ensemble
and deep learning models are trained and compared in terms of their
classification ability for accent recognition. A flat dataset, in which each
data object holds the 26 Mel-frequency Cepstral Coefficients of a 200 ms
window, is used for classification, except for the Hidden Markov Model, for
which a timeseries of the same datapoints is generated for training and
prediction.
The main contributions of this work are as follows:
• A benchmark of the most common model used for contemporary voice
  recognition, the Hidden Markov Model, when training from a uniform spoken
  audio dataset and producing predictions of speaker accent/locale.
• Single and ensemble models are presented for the classification of accent
  of two Mexican locales and two British locales.
• The final comparison of the eleven machine learning models, in which a vote
  of average probabilities of Random Forest and LSTM is suggested as the best
  model with a very high classification accuracy of 94.74%.
Related Works
Hidden Markov Models (HMM), since their inception in 1966, remain a modern
approach for speech recognition because they retain their effectiveness given
more computational resources. An earlier work from 1996 found that, using 5
hidden Markov states due to the computational resources available at the
time, four spoken accents could be classified at an observed accuracy of
74.5% [ ]. It must be noted that far deeper exploration into optimal HMM
topology and structure is now possible due to the larger degree of processing
power available to researchers in the modern day. A more modern work found
that Support Vector Machines (SVM) and HMM could classify between three
different national locales (Chinese, Indian, Canadian) at an average accuracy
of 93.8% for both models [ ], though this study only classified data from
male speakers due to the statistical frequency differences between gender and
voice.
Long Short Term Memory neural networks are often successfully experimented
with in terms of accent recognition. A related experiment found accuracies at
around 50% when classifying 12 different locales [ ]. Of the dataset gathered
from multiple locations across the globe, it was observed that the highest
recall rates were those of Japanese, Chinese, and German, scoring 65%, 61%
and 60% respectively. Subjects were recorded in their native language. An
alternative network-based approach, Convolutional Neural Networks (CNN), was
used to classify speech from English, Korean and Chinese dialects at an
average accuracy of 88% [ ]. A proposed approach to the accent problem in
speech recognition, also using a CNN, offered a preliminary study into
deriving a conversion matrix to be applied to Mel-frequency Cepstral
Coefficients, which would act to translate the user's accent into a specified
second accent before speech recognition is performed.
The eectiveness of voting between multiple models to
create a simple classier ensemble has been observed to
be extremely eective in many human-machine interaction
domains such as Sentiment Analysis [
] and EEG brainwave
classication [7].
It is worth noting, in terms of criticism, that many accent recognition
experiments rarely define the spoken language itself, often resulting in the
classification of a subject speaking their native language in their natural
locale. Through this, it is very possible that the classifiers in question
would not only learn from accent, but from natural language patterns as a
form of audible Natural Language Processing, since such effects are also
represented within MFCC data. For the goal of improving voice recognition for
non-native English speakers who are speaking English, the previous models
would be somewhat worthless or inaccurate. The originality of this experiment
is to classify data retrieved from native and non-native English speakers
(who are all requested to pronounce sounds from the English phonetic
dictionary, as if they were speaking in English), with the ultimate goal of
providing a path to improving voice recognition on English language devices
and services for non-native speakers.
Mel-frequency Cepstral Coeicients
Soundwaves are complex, random, and non-stationary and
thus classication of raw sound is very dicult. For example,
real-time monitoring of sound data would simply give a
measured frequency at a single point in time, which would be
impossible to classify since the behaviour of the wave is not
described. A sliding time windowing technique is introduced
and a statistical extraction is performed based on the section
of the wave appearing within the observed time window.
This results in a set of temporal mathematical descriptions of
wave sections. Mel-frequency Cepstral Coecient (MFCC)
of the sound is often cited as the most eective statistical
modelling method of sound waves[
]. MFCC datasets
are produced from a sliding time window as follows:
• The Fourier Transform (FT) of the time window data is derived:
  $$X(j\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt. \qquad (1)$$
• The powers from the FT are mapped to the mel-scale, which is the
  psychological scale of audible pitch [29].
• The Mel-frequency Cepstrum (MFC), or power spectrum of sound, is considered
  and logs of each of their powers are taken.
• The derived mel-log powers are treated as a signal, and a discrete cosine
  transform (DCT) is measured:
  $$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \qquad k = 0, \ldots, N-1. \qquad (2)$$
The MFCCs are nally considered as the resultant amplitudes
of the spectrum generated through the above process. A
mathematical description of a short wave section has been
generated, and provide attributes for the mapped class.
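The windowed pipeline described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' extraction code: the function names, the triangular filter shape, the `2595 * log10(1 + f / 700)` mel formula, and the sample rate used when calling it are all assumptions, and the naive O(n²) DFT is only practical for short windows.

```python
import cmath
import math

def hz_to_mel(f):
    # Common mel-scale approximation (an assumption; the paper does not state its formula).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(window):
    """Squared magnitude of the discrete Fourier transform of one time window."""
    n = len(window)
    return [abs(sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_energies(power, sample_rate, n_filters):
    """Pool the power spectrum with triangular filters spaced evenly on the mel scale."""
    n_bins = len(power)
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    pos = [e / (sample_rate / 2.0) * (n_bins - 1) for e in edges_hz]  # fractional bin positions
    energies = []
    for m in range(1, n_filters + 1):
        left, centre, right = pos[m - 1], pos[m], pos[m + 1]
        total = 0.0
        for k in range(n_bins):
            if left < k <= centre:
                total += power[k] * (k - left) / (centre - left)
            elif centre < k < right:
                total += power[k] * (right - k) / (right - centre)
        energies.append(total)
    return energies

def dct_ii(x):
    """Discrete cosine transform (type II) of the mel-log powers."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def mfcc_window(window, sample_rate, n_filters=26):
    """Return n_filters cepstral coefficients describing one window of audio samples."""
    power = power_spectrum(window)
    # A small constant avoids log(0) for filters that capture no energy.
    log_mel = [math.log(e + 1e-10) for e in mel_energies(power, sample_rate, n_filters)]
    return dct_ii(log_mel)
```

Calling `mfcc_window(samples, sample_rate)` on each successive window of a recording would yield one 26-value data object per window, which is the shape of attribute vector the text describes.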
A dataset was gathered for concurrent experiments and made freely available
online for future experiments, from the accent locales shown in Fig. 1. The
voice recognition dataset contained seven individual phonetic sounds spoken
ten times each by subjects from the United Kingdom and Mexico. Those from the
UK were native English speakers, whereas those from Mexico were native
Spanish and fluent English speakers who were asked to pronounce the phonetic
sounds as if they were speaking English. 26 logs of MFCC data were extracted
from each dataset at a sliding time window of 200 ms; each data object was
the 26 MFCC features mapped to the accent of the speaker. Accents from the UK
were sourced from the West Midlands and London, whereas accents from Mexico
were sourced from Mexico City and Chihuahua. Weights of all four classes were
balanced (since the clips differed in length) to simulate an equally
distributed dataset. The dataset was formatted into a timeseries (relational
attributes) for HMM training and prediction. The Information Gain
classification ability of each of the individual attributes is shown in
Fig. 2. Models were all trained on an AMD FX8320 8-core processor with a
clock speed of

Figure 1: Accent Locale of Experimental Subjects (Not to Scale). To the left,
Chihuahua (topmost) and Mexico City (bottom-most) are located, in Mexico; to
the right the West Midlands (topmost) and London (bottom-most) are located,
in the United Kingdom.

Figure 2: Information Gain of Each MFCC Log Attribute in the Dataset (x-axis:
MFCC log attribute, ticks 2 to 26; y-axis: Information Gain).

An extremely complex dense neural network was trained for purposes of results
and comparison, but it would be unrealistically complex for actual real-world
use. A neural network of two hidden layers (256, 128) was trained on the CUDA
cores of an NVidia GTX680 for 1000 epochs per fold, with a batch size of 100.
This model was trained using the Keras library [ ] running via the TensorFlow
platform. This is labelled as "DENSE NN" in the results.
Due to the large degree of computational resources required for training
several of the selected models, 10-fold cross validation was chosen for model
averaging.
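As a rough illustration of the 10-fold procedure, the sketch below splits item indices into ten held-out folds and averages a score over them. The helper names and the `evaluate` callback are hypothetical, and the interleaved (unshuffled, unstratified) fold assignment is a simplifying assumption rather than the scheme used in the experiments.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) pairs, each item being held out exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out

def cross_validate(n_items, evaluate, k=10):
    """Average the score returned by evaluate(train, test) across the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)
```

Here `evaluate` would train a fresh model on the `train` indices and return its accuracy on the `test` indices, so ten models are fitted in total.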
Chosen Machine Learning Methods
Various methods of machine learning were selected based on their ranging
statistical [ ] and methodological differences, so as to aptly benchmark a
selection of methods for spoken accent classification. This section briefly
details the scientific method followed by each of the methods to derive
knowledge through learning for classification.
Random Trees and Forests. A Decision Tree is a data structure of conditional
control statements based on attribute values, which are then mapped to a
tree. Classification is performed by cascading a data point down the tree
through each conditional check until it reaches a leaf node (a node with no
remaining branches), which is mapped to a Class, i.e. the prediction of the
model. The growth of the tree is based on the entropy of its end node, that
is, the level of disorder in classes found on that node. Entropy of a node is
considered as:
$$E(S) = -\sum_{j=1}^{c} P_j \log_2 P_j, \qquad (3)$$
where $P_j$ is the proportion of data points of class $j$ among the $c$
classes at the node, and the entropies of each class prediction are measured
at a leaf node.
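To make the entropy measure concrete, the sketch below computes the entropy of a node's class labels and the information gain of a single threshold split on one attribute. The function names and the binary threshold split are illustrative assumptions; the base-2 logarithm is the usual convention and gives entropy in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels found at a node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder
```

A node holding the four accent classes in equal proportion has an entropy of 2 bits, and a split that separates two classes perfectly recovers the full entropy of the parent node as gain.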
A Random Tree (RT) is a method of generating a random decision tree in which
k random attributes are selected at each control statement as well as a
best-fit [ ]. An overfitted tree is generated for the input set, and
therefore cross-validation or a test set is required for proper measurement
of prediction ability. J48 is an algorithm to generate a decision tree based
on C4.5 [ ]. Rather than randomness, information entropy is used to calculate
a best split at each node, i.e. the most optimal split at that given step.
C4.5 requires far more processing power than RT due to the requirement of
this calculation.
A Random Forest is an ensemble of many Random Trees through Bootstrap
Aggregation (bagging) and Voting [ ]. Numerous RTs are generated from random
selections of the data. During classification, all of the trees vote on their
prediction, and the majority vote is selected as the overall prediction.
Random Forests tend to outperform Random Trees since they decrease variance
without increasing the model bias.
Bayesian Classifiers. Bayes' Theorem [ ] concerns the probability that data
point d will match to Class C. The theorem is given as follows:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad (4)$$
where the probability of A being true given evidence B, P(A|B), is related to
the prior probability P(A) and the likelihood of the evidence P(B|A). In
terms of this work, this would take inputs of MFCC measurements and attempt
to classify the spoken accent by selecting the class of highest probability
based on previous evidence.

Figure 3: Diagram of a Long Short-term Memory Network Unit [18]

Naive Bayes classification is given as follows:
$$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k), \qquad (5)$$
where class label $\hat{y}$ is given to a data object with attribute values
$x_1, \ldots, x_n$. The naivety in Bayesian algorithms concerns the assumed
independence of attribute values (or existence), whether or not the
assumption holds true for the data.
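A small sketch of the Naive Bayes decision rule follows, assuming a Gaussian likelihood for each continuous attribute (a common choice for features such as MFCCs, but an assumption here, as the paper does not state which likelihood was used). Function names and the variance floor of `1e-9` are illustrative.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-attribute mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        means = [sum(col) / len(members) for col in zip(*members)]
        variances = [sum((v - m) ** 2 for v in col) / len(members) + 1e-9
                     for col, m in zip(zip(*members), means)]
        model[label] = (prior, means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class maximising log p(C_k) + sum_i log p(x_i | C_k)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)
```

Working in log space turns the product over attributes into a sum, which avoids numerical underflow when there are many attributes.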
Long Short-term Memory. Long Short Term Memory (LSTM) is a form of Artificial
Neural Network in which multiple Recurrent Neural Networks (RNN) will predict
based on state and previous states. As seen in Fig. 3, the data structure of
a neuron within a layer is an 'LSTM Block'. The general idea is as follows:
The first step requires the LSTM block to select stored data to delete:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \qquad (6)$$
where $W_f$ are the weights of the block, $h_{t-1}$ is the previous output of
the block, $x_t$ are the inputs received by the block, and a bias is applied
via $b_f$.
The block must then select which data to store/remember. Based on cell input
$i_t$, candidate values $\tilde{C}_t$ are generated:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad (7)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right). \qquad (8)$$
The block will then update the cell state through an element-wise operation:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t. \qquad (9)$$
Finally, output $o_t$ is produced, and the hidden state is updated:
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad (10)$$
$$h_t = o_t * \tanh(C_t). \qquad (11)$$
Due to the observed consideration of time sequences, i.e. previously seen
data, it is often found that time-dependent data are very effectively
classified due to the memory state of the block. LSTM ANN is thus a
particularly powerful technique in terms of speech recognition [ ] and
brainwave classification [ ], since both are temporal, wave-like data.
Nearest Neighbour Classification. K-nearest Neighbour (KNN) is a method of
classification based on measured distance from k training data points [ ].
KNN is considered a lazy learning technique since all computation is deferred
and only required during the classification stage. KNN is performed as
follows:
(1) Convert nominal attributes to integers mapped to the attribute label
(2) Normalise attributes
(3) Map all training data to n-dimensional space, where n is the number of
    attributes
(4) Lazy Computation Stage - For each data point:
    (a) Plot the data point to the previously generated n-dimensional space
    (b) Have the K-nearest points all vote on the point based on their value
    (c) Predict the class of the point as that which has received the highest
        number of votes
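The steps above can be sketched as follows for numeric attributes, assuming min-max normalisation and Euclidean distance (both are assumptions, since the distance metric and scaling are not stated in the text). The function names are illustrative.

```python
import math
from collections import Counter

def fit_min_max(rows):
    """Learn per-attribute bounds and return a function that scales a row to [0, 1]."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, (lo, hi) in zip(row, bounds)]

    return scale

def knn_predict(train_rows, train_labels, query, k=3):
    """Lazy stage: rank training points by distance, then majority-vote the k nearest."""
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

No model is built ahead of time: the only work before prediction is learning the scaling bounds, which is why the technique is described as lazy.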
Logistic Regression. Logistic Regression is a process of symmetric statistics
where a numerical value is linked to a probability of an event occurring,
e.g. the number of driving lessons to predict pass or fail [33].
In a two-class problem within a dataset containing $i$ attributes $x$ and
model parameters $\beta$, the log odds $l$ is derived via
$l = \beta_0 + \sum_{j=1}^{i} \beta_j x_j$ and the odds of an outcome are
shown through $o = e^{\beta_0 + \sum_{j=1}^{i} \beta_j x_j}$, which can be
used to predict an outcome based on previous observation.
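As a worked illustration of the log-odds relationship, the sketch below converts a linear combination of attributes into odds and a probability. The coefficients in the usage note are invented purely for the driving-lesson example and carry no empirical meaning.

```python
import math

def log_odds(intercept, coefficients, attributes):
    """Linear predictor: intercept plus the weighted sum of the attribute values."""
    return intercept + sum(b * x for b, x in zip(coefficients, attributes))

def odds(intercept, coefficients, attributes):
    """Odds of the positive outcome, i.e. p / (1 - p)."""
    return math.exp(log_odds(intercept, coefficients, attributes))

def probability(intercept, coefficients, attributes):
    """Probability of the positive outcome via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(intercept, coefficients, attributes)))
```

With a hypothetical intercept of -4 and a coefficient of 0.2 per lesson, 20 lessons give a log odds of 0, meaning even odds of passing, and every further lesson raises the predicted probability.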
Support Vector Machines. Support Vector Machines (SVM) classify data points
by optimising a hyperplane in the data's dimensional space to most aptly
separate them, and then classifying based on the distance vector measured
from the hyperplane [ ]. Optimisation follows the goal of making the average
margin between the points and the separator as large as possible. Generation
of an SVM is performed through Sequential Minimal Optimisation (SMO), a
high-performing algorithm to generate and implement an SVM classifier [ ]. To
perform this, the large optimisation problem is broken down into smaller
sub-problems, which can then be solved analytically. For multipliers a,
reduced constraints are given as:
$$0 \le a_1, a_2 \le C, \qquad y_1 a_1 + y_2 a_2 = k, \qquad (12)$$
where there are data classes y, C bounds the multipliers, and k is the
negative of the sum over the remaining terms of the equality constraint.
Hidden Markov Models. Markov Chains are probabilistic models that describe a
sequence of events and the probability of an event occurring based on those
which have been observed [ ]. Each previously observed event is represented
as a Hidden Unit, and therefore the optimal number of hidden states required
is largely data dependent. The general idea of the HMM process is as follows.
Firstly,
$$Y = y(0), y(1), \ldots, y(L-1) \qquad (13)$$
denotes the event Y, an observed sequence of length L, whose probability of
occurring is sought. Secondly,
$$P(Y) = \sum_{X} P(Y \mid X)\, P(X) \qquad (14)$$
describes the probability of Y, where the sum runs over all of the generated
hidden node sequences, given as X:
$$X = x(0), x(1), \ldots, x(L-1). \qquad (15)$$
Classification is finally chosen based on highest probability on previously
studied data sequences within the hidden model through the Bayesian process
[4].
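The sum over all hidden sequences grows exponentially with sequence length, so in practice it is usually evaluated with the forward recursion. The sketch below is a simplified illustration that assumes discrete observation symbols (MFCC features are continuous, so a real system would use continuous emission densities) and one hypothetical HMM per accent class; all names and parameters are illustrative.

```python
def sequence_likelihood(observations, start, trans, emit):
    """P(Y) summed over every hidden state sequence X, via the forward recursion.

    start[s]    : probability of beginning in hidden state s
    trans[r][s] : probability of moving from hidden state r to s
    emit[s][o]  : probability of hidden state s emitting observation symbol o
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [emit[s][obs] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states]
    return sum(alpha)

def classify(observations, models):
    """Pick the class whose HMM assigns the observed sequence the highest likelihood."""
    return max(models, key=lambda name: sequence_likelihood(observations, *models[name]))
```

Here `models` maps a class name to a `(start, trans, emit)` tuple, so classification amounts to scoring the sequence under each class model and taking the maximum.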
Voting. Voting is a simple method of fusing the decisions of multiple
classifiers and calculating an output prediction. For example, an ensemble of
two classifiers with differing classification abilities may possibly produce
a better result when working together, i.e. selected for their strengths.
Voting is performed by various metrics, in which all classifiers will vote
for a class, and then a prediction is produced by the highest vote. Methods
of voting include:
• Vote weighted by probability to classify an individual
• Give 1 vote to predicted Class - Majority
• Vote based on overall ability to classify
• Vote weighted by confidence
• Vote based on Min and Max Probabilities
The class is predicted based purely on the maximum vote.
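The two probability-based schemes compared later in the results, average of probabilities and product of probabilities, can be sketched as below. Each classifier is assumed to output a dictionary mapping class name to probability; the function names and the example probabilities in the usage note are invented for illustration.

```python
import math

def average_of_probabilities(distributions):
    """Fuse classifiers by averaging their per-class probabilities."""
    classes = distributions[0].keys()
    return {c: sum(d[c] for d in distributions) / len(distributions) for c in classes}

def product_of_probabilities(distributions):
    """Fuse classifiers by multiplying per-class probabilities, then renormalising."""
    classes = distributions[0].keys()
    scores = {c: math.prod(d[c] for d in distributions) for c in classes}
    total = sum(scores.values()) or 1.0  # guard against every product being zero
    return {c: s / total for c, s in scores.items()}

def predict(fused):
    """The class with the maximum fused vote is the ensemble prediction."""
    return max(fused, key=fused.get)
```

For instance, if a forest gives London 0.5 and West Midlands 0.3 while an LSTM gives London 0.2 and West Midlands 0.6, the averaged votes are 0.35 and 0.45, so the ensemble predicts West Midlands even though the forest alone would have chosen London.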
Manual Tuning
Hidden Markov Units as well as hidden LSTM units were linearly explored.
Preliminary experimentation found that a single layer of LSTM units
persistently outperformed deeper networks, and thus only one layer was
linearly searched. The number of HMM hidden units was selected as 200, since
it had the highest classification accuracy of 89.65%, as observed in Fig. 4.
The number of hidden units for the LSTM was selected as 75, since it likewise
had the highest classification accuracy of 92.01%, as observed in Fig. 5.

Figure 4: Exploration of HMM Hidden Unit Selection (x-axis: HMM Hidden Units,
25 to 200; y-axis: Classification Accuracy (%)).

Figure 5: Exploration of LSTM Hidden Unit Selection (x-axis: LSTM Hidden
Units, 25 to 100; y-axis: Classification Accuracy (%); surviving data labels:
92, 92.01, 92.01).

Overall Results
Table 1 displays the overall classification accuracy of the selected single
models when predicting the locale of the speaker at each 200 ms audio
interval. The best single model was an LSTM with 92.01% accuracy, closely
followed by the extremely complex dense neural network for benchmark
purposes, and then the K-Nearest Neighbours and Hidden Markov Models.

Table 1: Single Classifier Results for Accent Classification (Sorted Lowest to Highest)
Accuracy (%)   58.29  70.62  85.2  85.8  85.94  86.19  89.65  89.72  90.76  91.55  92.01

Table 2: Democratic Voting Processes for Ensemble Classifiers
Model          Accuracy (%)
Avg. Prob.     94.74  94.63  92.62
Product Prob.  94.73  94.62  92.62

The best ensemble, and overall best model, was a vote of average
probabilities between the Random Forest and LSTM, achieving 94.74% accuracy.
This can be seen in the exploration of democratic voting processes with the
best models in Table 2.
This study has shown the effectiveness of various machine learning techniques
in terms of classifying the accent of the subject based on recorded audio
data. The diphthong phoneme sounds were successfully classified into four
different accents from the UK and Mexico with an accuracy of 94.74% when a
manually tuned LSTM of 75 units and a Random Forest are ensembled through a
vote of average probabilities.
Leave-one-out (LOO) cross validation has been observed to be superior to
test-set and k-fold cross-validation techniques but requires far more
processing time [ ]; this study would have been around 3000 times more
complex due to there being 30,000 classifiable data objects. It is likely
that better results would be attained through this approach, but it was not
possible with the resources available. Furthermore, more intense searching of
the problem spaces of HMM and LSTM hidden unit selection should be performed,
since relatively large differences were observed in minute topological
changes. Evolutionary algorithms have been observed to be a strong method of
topology selection and tuning [6, 8, 10, 22].
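As a rough check of the factor quoted above, assume that the cost of validation is proportional to the number of models fitted and that each fit costs about the same (an approximation, since each leave-one-out fit trains on slightly more data than a 10-fold fit). Leave-one-out fits one model per data object, whereas 10-fold cross validation fits ten:

```latex
\frac{\text{LOO model fits}}{\text{10-fold model fits}} = \frac{30{,}000}{10} = 3000
```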
References
[1] Neal Alewine, Eric Janke, Paul Sharp, and Roberto Sicconi. 2008. Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations. US Patent 7,472,061.
[2] Naomi S Altman. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46, 3 (1992), 175–
[3] Levent M Arslan and John HL Hansen. 1996. Language accent classification in American English. Speech Communication 18, 4 (1996),
[4] Thomas Bayes, Richard Price, and John Canton. 1763. An essay towards solving a problem in the doctrine of chances. (1763).
[5] Amy Bearman, Kelsey Josund, and Gawan Fiore. [n. d.]. Accent Conversion Using Artificial Neural Networks. ([n. d.]).
[6] Jordan J. Bird, Elizabeth Wanner, Aniko Ekart, and Diego R. Faria. 2019. Phoneme Aware Speech Recognition through Evolutionary Optimisation. In The Genetic and Evolutionary Computation Conference.
[7] Jordan J. Bird, Aniko Ekart, Christopher D. Buckingham, and Diego R. Faria. 2019. Mental Emotional Sentiment Classification with an EEG-based Brain-Machine Interface. In The International Conference on Digital Image and Signal Processing (DISP'19). Springer.
[8] Jordan J. Bird, Aniko Ekart, and Diego R. Faria. 2019. Evolutionary Optimisation of Fully Connected Artificial Neural Network Topology. In SAI Computing Conference 2019. SAI.
[9] Jordan J. Bird, Aniko Ekart, and Diego R. Faria. 2019. High Resolution Sentiment Analysis by Ensemble Classification. In SAI Computing Conference 2019. SAI.
[10] Jordan J. Bird, Diego R. Faria, Luis J. Manso, Aniko Ekart, and Christopher D. Buckingham. 2019. A Deep Evolutionary Approach to Bioinspired Classifier Optimisation for Brain-Machine Interaction. Complexity 2019 (2019).
[11] Jordan J. Bird, Luis J. Manso, Eduardo P. Ribiero, Aniko Ekart, and Diego R. Faria. 2018. A Study on Mental State Classification using EEG-based Brain-Machine Interface. In 9th International Conference on Intelligent Systems. IEEE.
[12] William Byrne, Eva Knodt, Sanjeev Khudanpur, and Jared Bernstein. 1998. Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic English. Proc. Speech Technology in Language Learning (STiLL) 1, 99 (1998), 8.
[13] François Chollet et al. 2015. Keras.
[14] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
[15] PR Davidson, RD Jones, and MTR Peiris. 2006. Detecting behavioral microsleeps using EEG and LSTM recurrent neural networks. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE,
[16] Paul A Gagniuc. 2017. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons.
[17] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 273–278.
[18] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28, 10 (2017),
[19] Tin Kam Ho. 1995. Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on, Vol. 1. IEEE, 278–282.
[20] Yishan Jiao, Ming Tu, Visar Berisha, and Julie M Liss. 2016. Accent Identification by Combining Deep Neural Networks and Recurrent Neural Networks Trained on Long and Short Term Features. In Interspeech.
[21] Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, Vol. 14. Montreal, Canada, 1137–1145.
[22] Alejandro Martín, Raúl Lara-Cabrera, Félix Fuentes-Hurtado, Valery Naranjo, and David Camacho. 2018. EvoDeep: A new evolutionary approach for automatic Deep Neural Networks parametrisation. J. Parallel and Distrib. Comput. 117 (2018), 180–191.
[23] Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083 (2010).
[24] John Platt. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. (1998).
[25] Anantha M Prasad, Louis R Iverson, and Andy Liaw. 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9, 2 (2006), 181–199.
[26] J Ross Quinlan. 2014. C4.5: programs for machine learning. Elsevier.
[27] Md Sahidullah and Goutam Saha. 2012. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Communication 54, 4 (2012), 543–565.
[28] Corey Shih. [n. d.]. Speech Accent Classification. ([n. d.]).
[29] Stanley Smith Stevens, John Volkmann, and Edwin B Newman. 1937. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8, 3 (1937), 185–190.
[30] Hong Tang and Ali A Ghorbani. 2003. Accent classification using support vector machine and hidden markov model. In Conference of the Canadian Society for Computational Studies of Intelligence. Springer,
[31] Laura Mayfield Tomokiyo and Alex Waibel. 2003. Adaptation methods for non-native speech. Multilingual Speech and Language Processing 6
[32] Jessica PM Vital, Diego R Faria, Gonçalo Dias, Micael S Couceiro, Fernanda Coutinho, and Nuno MF Ferreira. 2017. Combining discriminative spatiotemporal features for daily life activity recognition using wearable motion sensing suit. Pattern Analysis and Applications 20, 4 (2017), 1179–1194.
[33] Strother H Walker and David B Duncan. 1967. Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 1-2 (1967), 167–179.
... In contrast, accent detection improves the robustness of automatic speech recognition systems (ASR), since it helps to overcome this unwanted variability [6,7,1,8,9,10,11,12]. Being a sub-task of speech and language recognition, in terms of classification models, accent detection is based on the same machine learning architectures [13,14,5,15,10,1,16], e.g. CNN [17,1,18,14,13,19], FFNN [10], HMM [20], KNN [21], Logistic Regression [15,20], GMM [22,23], LSTM and bLSTM [3,8], Random Forest, SVM [24,25,20,23,21]. Accent classification accuracy depends upon the input feature set. ...
... In contrast, accent detection improves the robustness of automatic speech recognition systems (ASR), since it helps to overcome this unwanted variability [6,7,1,8,9,10,11,12]. Being a sub-task of speech and language recognition, in terms of classification models, accent detection is based on the same machine learning architectures [13,14,5,15,10,1,16], e.g. CNN [17,1,18,14,13,19], FFNN [10], HMM [20], KNN [21], Logistic Regression [15,20], GMM [22,23], LSTM and bLSTM [3,8], Random Forest, SVM [24,25,20,23,21]. Accent classification accuracy depends upon the input feature set. ...
... In contrast, accent detection improves the robustness of automatic speech recognition systems (ASR), since it helps to overcome this unwanted variability [6,7,1,8,9,10,11,12]. Being a sub-task of speech and language recognition, in terms of classification models, accent detection is based on the same machine learning architectures [13,14,5,15,10,1,16], e.g. CNN [17,1,18,14,13,19], FFNN [10], HMM [20], KNN [21], Logistic Regression [15,20], GMM [22,23], LSTM and bLSTM [3,8], Random Forest, SVM [24,25,20,23,21]. Accent classification accuracy depends upon the input feature set. ...
... • Analysis and modeling of speakers' variability in frame of speech recognition [9]; • Development of user interaction scenarios in video-games [11]; • Analysis of phonetic particularities and related personal behavior [12]; • Using accent-related information as components of biometric data [13]; • Mitigating accent influence in voice-control systems [14]; • Improving personalization of exercises and feedback in CAPT systems [2]. ...
... Table 1 lists the papers which are the closest to the scope of our study. As a sub-task of speech and language recognition, accent detection algorithms are built using the standard classification models and machine learning architectures including convolutional neural networks (CNN) [5,11,16,21], feedforward neural networks (FFNN) [10], hidden Markov model (HMM) [13], k-nearest neighbor (KNN) model [22], Gaussian mixture model (GMM) [23,24], long short-term memory (LSTM) and bidirectional LSTM (bLSTM) [25,26], random forest, and support vector machine (SVM) [13,22,24,27,28]. ...
... Table 1 lists the papers which are the closest to the scope of our study. As a sub-task of speech and language recognition, accent detection algorithms are built using the standard classification models and machine learning architectures including convolutional neural networks (CNN) [5,11,16,21], feedforward neural networks (FFNN) [10], hidden Markov model (HMM) [13], k-nearest neighbor (KNN) model [22], Gaussian mixture model (GMM) [23,24], long short-term memory (LSTM) and bidirectional LSTM (bLSTM) [25,26], random forest, and support vector machine (SVM) [13,22,24,27,28]. ...
Full-text available
The problem of accent recognition has received a lot of attention with the development of Automatic Speech Recognition (ASR) systems. The crux of the problem is that conventional acoustic language models adapted to fit standard language corpora are unable to satisfy the recognition requirements for accented speech. In this research, we contribute to the accent recognition task for a group of up to nine European accents in English and try to provide some evidence in favor of specific hyperparameter choices for neural network models together with the search for the best input speech signal parameters to ameliorate the baseline accent recognition accuracy. Specifically, we used a CNN-based model trained on the audio features extracted from the Speech Accent Archive dataset, which is a crowd-sourced collection of accented speech recordings. We show that harnessing time–frequency and energy features (such as spectrogram, chromogram, spectral centroid, spectral rolloff, and fundamental frequency) to the Mel-frequency cepstral coefficients (MFCC) may increase the accuracy of the accent classification compared to the conventional feature sets of MFCC and/or raw spectrograms. Our experiments demonstrate that the most impact is brought about by amplitude mel-spectrograms on a linear scale fed into the model. Amplitude mel-spectrograms on a linear scale, which are the correlates of the audio signal energy, allow to produce state-of-the-art classification results and brings the recognition accuracy for English with Germanic, Romance and Slavic accents ranged from 0.964 to 0.987; thus, outperforming existing models of classifying accents which use the Speech Accent Archive. We also investigated how the speech rhythm affects the recognition accuracy. Based on our preliminary experiments, we used the audio recordings in their original form (i.e., with all the pauses preserved) for other accent classification experiments.
... The creators should not simply assume that a certain definition is broadly agreed upon by the linguistic community; rather, it is helpful to provide detailed definitions of such subpopulations. This is especially true for datasets created for dialect identification or classification, or non-native speech assessment [7,20,28,60,250,282]. ...
... Your answer here. 20. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? ...
Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.
... The ensemble of shallow and deep models like random forest, KNN, LSTM models have been explored for the accent classification [151]. Speech signals have also been explored in biomedical fields like Parkinson's speech recognition [61], autism spectrum disorders recognition [29] and diagnosis of mild dementia [73]. ...
Machine learning methods are extensively used for processing and analysing speech signals by virtue of their performance gains over multiple domains. Deep learning and ensemble learning are the two most commonly used techniques, which yield benchmark performance across different downstream tasks. Ensemble deep learning is a recent development which combines these two techniques to produce a robust architecture with substantial performance gains, as well as better generalization performance than the individual techniques. In this paper, we extensively review the use of ensemble deep learning methods for different speech signal related tasks, ranging from general objectives such as automatic speech recognition and voice activity detection, to more specific areas such as biomedical applications involving the detection of pathological speech or music genre detection. We provide a discussion on the use of different ensemble strategies such as bagging, boosting and stacking in the context of speech signals, and identify the various salient features and advantages from a broader perspective when coupled with deep learning architectures. The main objective of this study is to comprehensively evaluate existing works in the area of ensemble deep learning, and highlight the future directions that may be explored to further develop it as a tool for several speech related tasks. To the best of our knowledge, this is the first review study which primarily focuses on ensemble deep learning for speech applications. This study aims to serve as a valuable resource for researchers in academia and in industry working with speech signals, supporting advanced novel applications of ensemble deep learning models towards solving challenges in existing speech processing systems.
... Duduka et al. [17]Kashif et al.[18] Bird et al.[19] ...
Presentation of the article that was accepted in 13th ICCCNT 2022
... Mainly, these systems can be grouped into three different categories. The first category deals with the classification of native and nonnative accents of speakers [25]. The basic purpose of such systems is to overcome the performance loss caused in VPSs due to the accent or dialect mismatches of native and nonnative speakers [26]. ...
The present research is an effort to enhance the performance of voice processing systems, in our case the speaker identification system (SIS), by addressing the variability caused by the dialectal variations of a language. We present an effective solution to reduce dialect-related variability in voice processing systems. The proposed method minimizes the system’s complexity by reducing the search space during the testing process of speaker identification. The speaker is searched for among the speakers of the identified dialect instead of among all the speakers present in system training. The study is conducted on the Pashto language, and the voice data samples are collected from native Pashto speakers of specific regions of Pakistan and Afghanistan where Pashto is spoken with different dialectal variations. The task of speaker identification is achieved with the help of a novel hierarchical framework that works in two steps. In the first step, the speaker’s dialect is identified. For automated dialect identification, the spectral and prosodic features have been used in conjunction with a Gaussian mixture model (GMM). In the second step, the speaker is identified using a multilayer perceptron (MLP)-based speaker identification system, which gets aggregated input from the first step, i.e., dialect identification along with prosodic and spectral features. The robustness of the proposed SIS is compared with traditional state-of-the-art methods in the literature. The results show that the proposed framework is better in terms of average speaker recognition accuracy (84.5% identification accuracy) and takes 39% less time to identify the speaker.
In modern Human-Robot Interaction, much thought has been given to accessibility regarding robotic locomotion, specifically the enhancement of awareness and lowering of cognitive load. On the other hand, with social Human-Robot Interaction considered, published research is far sparser given that the problem is less explored than pathfinding and locomotion. This thesis studies how one can endow a robot with affective perception for social awareness in verbal and non-verbal communication. This is possible by the creation of a Human-Robot Interaction framework which abstracts machine learning and artificial intelligence technologies which allow for further accessibility to non-technical users compared to the current State-of-the-Art in the field. These studies thus initially focus on individual robotic abilities in the verbal, non-verbal and multimodality domains. Multimodality studies show that late data fusion of image and sound can improve environment recognition, and similarly that late fusion of Leap Motion Controller and image data can improve sign language recognition ability. To alleviate several of the open issues currently faced by researchers in the field, guidelines are reviewed from the relevant literature and met by the design and structure of the framework that this thesis ultimately presents. The framework recognises a user's request for a task through a chatbot-like architecture. Through research in this thesis that recognises human data augmentation (paraphrasing) and subsequent classification via language transformers, the robot's more advanced Natural Language Processing abilities allow for a wider range of recognised inputs. That is, as examples show, phrases that could be expected to be uttered during a natural human-human interaction are easily recognised by the robot. 
This allows for accessibility to robotics without the need to physically interact with a computer or write any code, with only the ability of natural interaction (an ability which most humans have) required for access to all the modular machine learning and artificial intelligence technologies embedded within the architecture. Following the research on individual abilities, this thesis then unifies all of the technologies into a deliberative interaction framework, wherein abilities are accessed from long-term memory modules and short-term memory information such as the user's tasks, sensor data, retrieved models, and finally output information. In addition, algorithms for model improvement are also explored, such as through transfer learning and synthetic data augmentation and so the framework performs autonomous learning to these extents to constantly improve its learning abilities. It is found that transfer learning between electroencephalographic and electromyographic biological signals improves the classification of one another given their slight physical similarities. Transfer learning also aids in environment recognition, when transferring knowledge from virtual environments to the real world. In another example of non-verbal communication, it is found that learning from a scarce dataset of American Sign Language for recognition can be improved by multi-modality transfer learning from hand features and images taken from a larger British Sign Language dataset. Data augmentation is shown to aid in electroencephalographic signal classification by learning from synthetic signals generated by a GPT-2 transformer model, and, in addition, augmenting training with synthetic data also shows improvements when performing speaker recognition from human speech. 
Given the importance of platform independence due to the growing range of available consumer robots, four use cases are detailed, and examples of behaviour are given by the Pepper, Nao, and Romeo robots as well as a computer terminal. The use cases involve a user requesting their electroencephalographic brainwave data to be classified by simply asking the robot whether or not they are concentrating. In a subsequent use case, the user asks if a given text is positive or negative, to which the robot correctly recognises the task of natural language processing at hand and then classifies the text, this is output and the physical robots react accordingly by showing emotion. The third use case has a request for sign language recognition, to which the robot recognises and thus switches from listening to watching the user communicate with them. The final use case focuses on a request for environment recognition, which has the robot perform multimodality recognition of its surroundings and note them accordingly. The results presented by this thesis show that several of the open issues in the field are alleviated through the technologies within, structuring of, and examples of interaction with the framework. The results also show the achievement of the three main goals set out by the research questions; the endowment of a robot with affective perception and social awareness for verbal and non-verbal communication, whether we can create a Human-Robot Interaction framework to abstract machine learning and artificial intelligence technologies which allow for the accessibility of non-technical users, and, as previously noted, which current issues in the field can be alleviated by the framework presented and to what extent.
Phoneme awareness provides the path to high resolution speech recognition to overcome the difficulties of classical word recognition. Here we present the results of a preliminary study on Artificial Neural Network (ANN) and Hidden Markov Model (HMM) methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, with a specific focus on evolutionary optimisation of bio-inspired classification methods. A set of audio clips are recorded by subjects from the United Kingdom and Mexico. For each recording, the data were pre-processed, using Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window of 200ms per data object, as well as a further MFCC timeseries format for forecast-based models, to produce the dataset. We found that an evolutionary optimised deep neural network achieves 90.77% phoneme classification accuracy as opposed to the best HMM of 150 hidden units achieving 86.23% accuracy. Many of the evolutionary solutions take substantially longer to train than the HMM, however one solution scoring 87.5% (+1.27%) requires fewer resources than the HMM.
This study suggests a new approach to EEG data classification by exploring the idea of using evolutionary computation to both select useful discriminative EEG features and optimise the topology of Artificial Neural Networks. An evolutionary algorithm is applied to select the most informative features from an initial set of 2550 EEG statistical features. Optimisation of a Multilayer Perceptron (MLP) is performed with an evolutionary approach before classification to estimate the best hyperparameters of the network. Deep learning and tuning with Long Short-Term Memory (LSTM) are also explored, and Adaptive Boosting of the two types of models is tested for each problem. Three experiments are provided for comparison using different classifiers: one for attention state classification, one for emotional sentiment classification, and a third experiment in which the goal is to guess the number a subject is thinking of. The obtained results show that an Adaptive Boosted LSTM can achieve an accuracy of 84.44%, 97.06%, and 9.94% on the attentional, emotional, and number datasets, respectively. An evolutionary-optimised MLP achieves results close to the Adaptive Boosted LSTM for the first two experiments and significantly higher for the number-guessing experiment with an Adaptive Boosted DEvo MLP reaching 31.35%, while being significantly quicker to train and classify. In particular, the accuracy of the nonboosted DEvo MLP was 79.81%, 96.11%, and 27.07% in the same benchmarks. Two datasets for the experiments were gathered using a Muse EEG headband with four electrodes corresponding to TP9, AF7, AF8, and TP10 locations of the international EEG placement standard. The EEG MindBigData digits dataset was gathered from the TP9, FP1, FP2, and TP10 locations.
This study proposes an approach to ensemble sentiment classification of a text to a score in the range of 1-5 of negative-positive scoring. A high-performing model is produced from TripAdvisor restaurant reviews via a generated dataset of 684 word-stems selected by their information gain ranking. Analysis shows that the few misclassified instances are almost entirely close to their real class. The best-performing classifier was an ensemble of RandomForest, Naive Bayes Multinomial and Multilayer Perceptron (Neural Network) methods combined via a Vote on Average Probabilities approach. The best ensemble produced a classification accuracy of 91.02%, which was higher than that of the best single classifier, a Random Tree model with an accuracy of 78.6%. Ensemble through Adaptive Boosting, Random Forests and Voting is explored. All ensemble methods far outperformed the best single classifier methods.
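A vote on average probabilities, as named in the abstract above, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `vote_average_probabilities`, the reduced two-class label set and all probability values are hypothetical. Each model supplies its per-class probabilities, these are averaged per class, and the class with the highest mean is selected.

```python
from typing import Dict, List


def vote_average_probabilities(model_probs: List[Dict[str, float]]) -> str:
    """Return the class whose mean probability across all models is highest."""
    if not model_probs:
        raise ValueError("at least one model's probabilities are required")
    classes = list(model_probs[0])
    averaged = {
        label: sum(probs[label] for probs in model_probs) / len(model_probs)
        for label in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of three models for one review (scores "4" and "5" only).
random_forest = {"4": 0.55, "5": 0.45}
naive_bayes = {"4": 0.60, "5": 0.40}
perceptron = {"4": 0.10, "5": 0.90}

prediction = vote_average_probabilities([random_forest, naive_bayes, perceptron])
print(prediction)
```

In this made-up case two of the three models lean towards "4", so a simple majority vote would pick "4". The averaged probabilities (about 0.42 for "4" against 0.58 for "5") favour "5" because the third model is far more confident, which is the behaviour that distinguishes probability averaging from majority voting.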
This paper explores single and ensemble methods to classify emotional experiences based on EEG brainwave data. A commercial MUSE EEG headband is used with a resolution of four (TP9, AF7, AF8, TP10) electrodes. Positive and negative emotional states are invoked using film clips with an obvious valence, and neutral resting data is also recorded with no stimuli involved, all for one minute per session. Statistical extraction of the alpha, beta, theta, delta and gamma brainwaves is performed to generate a large dataset that is then reduced to smaller datasets by feature selection using scores from OneR, Bayes Network, Information Gain, and Symmetrical Uncertainty. Of the set of 2548 features, a subset of 63 selected by their Information Gain values were found to be best when used with ensemble classifiers such as Random Forest. They attained an overall accuracy of around 97.89%, outperforming the current state of the art by 2.99 percentage points. The best single classifier was a deep neural network with an accuracy of 94.89%.
This paper proposes an approach to selecting the number of layers and neurons contained within Multilayer Perceptron hidden layers through a single-objective evolutionary approach with the goal of model accuracy. At each generation, a population of Neural Network architectures is created and ranked by accuracy. The generated solutions are combined in a breeding process to create a larger population, and at each generation the weakest solutions are removed to retain the population size, inspired by a Darwinian 'survival of the fittest'. Multiple datasets are tested, and results show that architectures can be successfully improved and derived through a hyper-heuristic evolutionary approach, in less than 10% of the exhaustive search time. The evolutionary approach was further optimised through population density increase as well as gradual solution max complexity increase throughout the simulation.
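The breed, rank and cull loop described in the abstract above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: `toy_fitness` is a hypothetical stand-in for model accuracy (a real run would train an MLP per topology), and the `breed` and `mutate` operators, the population size and the limits on layers and neurons are all invented for the example.

```python
import random
from typing import Callable, List, Tuple

Topology = Tuple[int, ...]  # number of neurons in each hidden layer


def toy_fitness(topology: Topology) -> float:
    """Stand-in for accuracy (assumption): a real run would train an MLP
    with this topology and return its validation accuracy."""
    return -abs(sum(topology) - 60) - 5 * len(topology)


def breed(a: Topology, b: Topology, rng: random.Random) -> Topology:
    """Child copies the depth of one parent and takes each layer size from
    whichever parents have a layer at that position."""
    depth = rng.choice([len(a), len(b)])
    return tuple(
        rng.choice([p[i] for p in (a, b) if i < len(p)]) for i in range(depth)
    )


def mutate(t: Topology, rng: random.Random, max_neurons: int) -> Topology:
    """Nudge one layer's size, keeping it within [1, max_neurons]."""
    layers = list(t)
    i = rng.randrange(len(layers))
    layers[i] = min(max_neurons, max(1, layers[i] + rng.randint(-8, 8)))
    return tuple(layers)


def evolve(
    fitness: Callable[[Topology], float],
    pop_size: int = 10,
    generations: int = 20,
    max_layers: int = 3,
    max_neurons: int = 64,
    seed: int = 0,
) -> Tuple[Topology, List[float]]:
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(1, max_neurons) for _ in range(rng.randint(1, max_layers)))
        for _ in range(pop_size)
    ]
    history: List[float] = []
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            children.append(mutate(breed(a, b, rng), rng, max_neurons))
        # Rank parents and children together, then drop the weakest so the
        # population returns to its original size.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:pop_size]
        history.append(fitness(population[0]))
    return population[0], history


best, history = evolve(toy_fitness)
print(best, history[-1])
```

Because parents compete with their children for survival, the best fitness in `history` never decreases from one generation to the next. Swapping `toy_fitness` for a function that trains and scores a network is the only change needed to point the loop at a real dataset, at the cost of one training run per candidate.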
This work aims to find discriminative EEG-based features and appropriate classification methods that can categorise brainwave patterns based on their level of activity or frequency for mental state recognition useful for human-machine interaction. By using the Muse headband with four EEG sensors (TP9, AF7, AF8, TP10), we categorised three possible states (relaxing, neutral and concentrating) based on a few states of mind defined by cognitive behavioural studies. We have created a dataset with five individuals and sessions lasting one minute for each class of mental state in order to train and test different methods. Given the proposed set of features extracted from the headband's five EEG signals (alpha, beta, theta, delta, gamma), we have tested a combination of different feature selection algorithms and classifier models to compare their performance in terms of recognition accuracy and number of features needed. Different tests such as 10-fold cross validation were performed. Results show that only 44 features from a set of over 2100 features are necessary when used with classical classifiers such as Bayesian Networks, Support Vector Machines and Random Forests, attaining an overall accuracy over 87%.
Deep Neural Networks (DNN) have become a powerful and extremely popular mechanism, which has been widely used to solve problems of varied complexity, due to their ability to fit models to complex non-linear problems. Despite their well-known benefits, DNNs are complex learning models whose parametrization and architecture are usually set by hand. This paper proposes a new Evolutionary Algorithm, named EvoDeep, devoted to evolving the parameters and the architecture of a DNN in order to maximize its classification accuracy while maintaining a valid sequence of layers. This model is tested against a widely used dataset of handwritten digit images. The experiments performed using this dataset show that the Evolutionary Algorithm is able to select the parameters and the DNN architecture appropriately, achieving a 98.93% accuracy in the best run.
A fascinating and instructive guide to Markov chains for experienced users and newcomers alike. This unique guide to Markov chains approaches the subject along the four convergent lines of mathematics, implementation, simulation, and experimentation. It introduces readers to the art of stochastic modeling, shows how to design computer implementations, and provides extensive worked examples with case studies. Markov Chains: From Theory to Implementation and Experimentation begins with a general introduction to the history of probability theory in which the author uses quantifiable examples to illustrate how probability theory arrived at the concept of discrete-time and the Markov model from experiments involving independent variables. An introduction to simple stochastic matrices and transition probabilities is followed by a simulation of a two-state Markov chain. The notion of steady state is explored in connection with the long-run distribution behavior of the Markov chain. Predictions based on Markov chains with more than two states are examined, followed by a discussion of the notion of absorbing Markov chains. Also covered in detail are topics relating to the average time spent in a state, various chain configurations, and n-state Markov chain simulations used for verifying experiments involving various diagram configurations.
• Fascinating historical notes shed light on the key ideas that led to the development of the Markov model and its variants
• Various configurations of Markov Chains and their limitations are explored at length
• Numerous examples, from basic to complex, are presented in a comparative manner using a variety of color graphics
• All algorithms presented can be analyzed in either Visual Basic, JavaScript, or PHP
• Designed to be useful to professional statisticians as well as readers without extensive knowledge of probability theory
Covering both the theory underlying the Markov model and an array of Markov chain implementations, within a common conceptual framework, Markov Chains: From Theory to Implementation and Experimentation is a stimulating introduction to and a valuable reference for those wishing to deepen their understanding of this extremely valuable statistical tool.
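The two-state chain simulation and the notion of steady state mentioned in the description above can be illustrated with a short sketch. It is not taken from the book: the transition matrix `P` and the helper names `steady_state` and `simulate` are hypothetical, and Python is used here rather than the languages the book lists. The first function applies the transition matrix repeatedly to a starting distribution, and the second estimates the same long-run distribution from a simulated trajectory.

```python
import random
from typing import List


def steady_state(P: List[List[float]], steps: int = 200) -> List[float]:
    """Approximate the long-run distribution by repeatedly applying P."""
    dist = [1.0, 0.0]  # start in state 0 with certainty
    for _ in range(steps):
        dist = [
            dist[0] * P[0][0] + dist[1] * P[1][0],
            dist[0] * P[0][1] + dist[1] * P[1][1],
        ]
    return dist


def simulate(P: List[List[float]], steps: int, seed: int = 0) -> List[float]:
    """Simulate the chain and return the fraction of time spent in each state."""
    rng = random.Random(seed)
    state, visits = 0, [0, 0]
    for _ in range(steps):
        visits[state] += 1
        # Move to state 0 with probability P[state][0], otherwise to state 1.
        state = 0 if rng.random() < P[state][0] else 1
    return [v / steps for v in visits]


# Hypothetical transition matrix: row i holds the probabilities of moving
# from state i to states 0 and 1.
P = [[0.9, 0.1], [0.5, 0.5]]
pi = steady_state(P)
freq = simulate(P, 100_000)
print(pi, freq)
```

For this matrix the balance condition 0.1·π₀ = 0.5·π₁ together with π₀ + π₁ = 1 gives the stationary distribution (5/6, 1/6), which the iteration converges to and which the simulated visit frequencies approximate.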
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.