March 2024 Volume 14
RESEARCH (Research Manuscript) Open Access
Human-centric Computing and Information Sciences (2024) 14:15
DOI: https://doi.org/10.22967/HCIS.2024.14.015
Received: December 12, 2022; Accepted: September 24, 2023; Published: March 15, 2024
Speech Recognition Utilizing Deep Learning:
A Systematic Review of the Latest Developments
Dimah Al-Fraihat1,*, Yousef Sharrab2, Faisal Alzyoud3, Ayman Qahmash4, Monther Tarawneh3, and Adi Maaita5
*Corresponding Author: Dimah Al-Fraihat (d.fraihat@iu.edu.jo)
1Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
2Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
3Department of Computer Science, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
4Department of Information Systems, College of Computer Science, King Khalid University, Abha, Saudi Arabia
5Department of Software Engineering, Faculty of Information Technology, Middle East University, Amman, Jordan
※ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Speech recognition is a natural language processing task that involves the computerized transcription of spoken
language in real time. Numerous studies have been conducted on the utilization of deep learning (DL) models
for speech recognition. However, this field is advancing rapidly. This systematic review provides an in-depth
and comprehensive examination of studies published from 2019 to 2022 on speech recognition utilizing DL
techniques. Initially, 575 studies were retrieved and examined. After filtration and application of the inclusion
and exclusion criteria, 94 were retained for further analysis. A literature survey revealed that 17% of the studies
used stand-alone models, whereas 52% used hybrid models. This indicates a shift towards the adoption of
hybrid models, which were proven to achieve better results. Furthermore, most of the studies employed public
datasets (56%) and used the English language (46%), whereas their environments were neutral (81%). The
word error rate was the most frequently used method of evaluation, while Mel-frequency cepstral coefficients
were the most frequently employed method of feature extraction. Another observation was the lack of studies
utilizing transformers, which were demonstrated to be powerful models that can facilitate fast learning speeds,
allow parallelization and improve the performance of low-resource languages. The results also revealed
potential and interesting areas of future research that had received scant attention in earlier studies.
Keywords
Speech Recognition, Deep Learning (DL), Deep Neural Networks (DNNs), Natural Language Processing
(NLP), Systematic Review
1. Introduction
Speech is a ubiquitous and essential mode of communication among humans, facilitating the expression
of ideas, thoughts and emotions and enabling engagement in meaningful conversations [1]. As our lives
become increasingly intertwined with machines and smart devices, new communication techniques that
align with our digitalized world have emerged. Speech recognition has played a pivotal role in enabling
us to adapt to these novel modes of communication: it not only empowers individuals with disabilities to
interact, share knowledge and engage in open conversations, but also holds promise for revolutionizing
communication between machines using natural languages.
The field of speech recognition has drawn considerable attention in the past decade, driven by the
availability of powerful computing resources and vast amounts of training data. The utilization of deep
learning (DL) models and algorithms has further bolstered the recognition rates of speech recognition
systems. Positioned within the broader domain of natural language processing (NLP), speech recognition
involves the perception and recognition of spoken words, often coupled with transcription and translation
capabilities [2]. It begins by extracting relevant features from the speech signal by employing such
techniques as signal processing, acoustic modelling, and linguistic analysis. Segmentation algorithms
play a vital role in delineating word boundaries, thereby enhancing recognition precision. Pattern
recognition techniques also enable the mapping of audio features to linguistic units, such as phonemes,
words or sentences, exploiting statistical consistencies within the speech signal. Machine learning (ML)
algorithms, including deep neural networks (DNNs), have proven instrumental in learning and
recognizing these patterns, leading to accurate and efficient speech recognition. The future of speech
recognition technology holds promising prospects. It is poised to surpass our expectations in meeting the
evolving needs of the community and has the potential to facilitate seamless communication between
machines using natural languages [3]. By continuously advancing the boundaries of research and
development, speech recognition systems will continue to enhance our ability to communicate effectively
and effortlessly in an increasingly interconnected world.
The use of speech recognition has proven to be efficient; it is commonly used in the domains of
language identification [3], phone banking, robotics [4], attendance systems [5], spoken commands,
security [6], education [7], smart healthcare [8], and smart cities [9]. However, speech recognition is a
challenging task, and there are several open problems that require ongoing attention and innovative
solutions. One significant challenge is the existence of dialects and multilingualism. Speech recognition
systems struggle to interpret and recognize spoken words in different dialects and languages accurately.
The existence of multiple dialects within a language, along with code-switching, in which individuals
mix languages during conversations, poses considerable difficulties. Another challenge is that of
variability among speakers. Different individuals have unique speech characteristics, such as accents,
voice quality, and speaking styles [10]. This speaker variability presents challenges in training speech
recognition systems to recognize and adapt to different speakers accurately. Achieving speaker-
independent recognition and handling speaker variability effectively remain open problems in the field.
In addition, speech recognition systems need to address issues related to noise and adverse acoustic
conditions, out-of-vocabulary words and contextual understanding, ambiguity in speech and low-
resource languages and domains [11]. Ethical and privacy concerns are also critical considerations [12].
The field of speech recognition is undergoing rapid development, with major corporations such as
Google and Microsoft developing impressive tools to that end. For instance, Microsoft has launched its
own speech recognition system, called the Microsoft Audio Video Indexing Service. The main
components of speech recognition include an input device for capturing speech, a digital signal processor
for filtering out surrounding noise, an acoustic model for identifying speech patterns, and a language
model for decoding words [13]. Scholars have explored various research trends to improve the
performance and capabilities of speech recognition systems. One prominent trend is the utilization of
deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks
(RNNs), which have shown promising results in speech recognition tasks. These DL models can
automatically learn complex patterns and features from raw audio data, enabling more accurate and robust
speech recognition. DL models offer both high flexibility and the ability to learn complex patterns, but
they require large amounts of labelled training data and computationally intensive training procedures.
Another research trend focuses on the integration of contextual information and language modelling in
speech recognition systems. By incorporating contextual cues and linguistic knowledge, such as grammar
rules and semantic information, into the recognition process, the accuracy and understanding of spoken
words can be enhanced. This trend involves exploring advanced language modelling techniques, such as
RNN language models and transformer-based models, to improve contextual understanding and handle
variations in vocabulary [13].
With the initial retrieval and analysis of 575 studies, this study highlights the sheer volume of research
conducted utilizing DL algorithms in the field of speech recognition. Hence, there is a need to appraise
and summarize the collective findings and identify the latest developments. While acknowledging the
valuable review studies conducted by previous researchers, it should be noted that this study follows a
systematic review methodology and employs a rigorous selection and analysis process. The motivation
for this research stems from the dynamic and swiftly evolving landscape of DL and its potential to
enhance the accuracy and efficiency of speech recognition systems to a significant degree. This study
conducts a systematic review of speech recognition, focusing specifically on the years 2019 to 2022. The
timeframe chosen for the review reflects the intention to capture the most recent developments in the
field, acknowledging the fast-paced nature of research advances during this period. The objective of this
study is to complement previous research by examining studies conducted in subsequent years and
extending the analysis to include the period 2019–2022, thus highlighting recent advances and emerging
trends in DL techniques for speech recognition. To the best of the authors’ knowledge, this is the first
systematic review paper on speech recognition using DL to cover the years 2019–2022.
Furthermore, the study is underpinned by the dual impetus of undertaking a methodical and exhaustive
analysis of recent advances in speech recognition via DL and of providing an overarching background
that may guide future research and fill gaps in existing research. In this pursuit, the present study
contributes to a richer, deeper understanding of the contemporary landscape and helps to identify trends
and potential pathways for the evolution of speech recognition technology. A particularly salient
contribution of this study is its identification of underexplored avenues and nascent prospects for future
research endeavors. Specifically, we highlight the underutilization of transformer models, a subset of DL
techniques that have exhibited considerable potency in diverse NLP tasks. This insight lends the study
an anticipatory dimension, serving to elucidate not only the present trajectory of the field but also a
roadmap for forthcoming explorations and enquiries.
The paper is organized into the following seven sections: Section 1 contains the introduction; Section
2 introduces previous review studies and related work; Section 3 presents a comprehensive background
to the topic; Section 4 presents the research questions, selection criteria and research methodology;
Section 5 discusses the results and answers to the formulated questions; Section 6 details the conclusions;
and, finally, Section 7 discusses future research directions.
2. Related Work
A few systematic reviews have been conducted in the field of speech recognition. For example, one
review in this field examined research published from 2006 to 2018 [14]. The review focused on DNNs
and identified 174 studies from which to extract specific information and develop a statistical analysis
thereof. The main purpose of the survey was to present papers using multilevel processing prior to
Markov model-based coding of word sequences. The study also identified the models that had been used
in previous research, such as deep belief network (DBN), CNN and deep convex network, and classified
the evaluation techniques used. Overall, the study found that more focus should be directed towards RNN
models.
Another study provided an overview of DL for low-resource languages in the field of speech
recognition [15]. The authors reviewed the history and research status of two models: RNN and CNN.
Some techniques were also introduced to enhance data performance and model training, such as making
improvements to end-to-end (E2E) systems by integrating more language knowledge, studying complex
and noisy environments, and strengthening the acoustic and language model. Another technique involved
expanding modal information to mine speech structure knowledge from multiple modalities, such as speech,
images, and videos. This review paper concluded that further improvements should be made to speech
recognition systems by developing more sophisticated models that could work for several accents and
noisy environments.
Another review study surveyed mouth interface devices based on sensors for the recognition of speech
using DL techniques [16]. It also focused on communication difficulties for disabled people, visual
speech recognition, and a silent speech interface. Another study conducted a brief review of 17 papers
on Arabic speech recognition using neural networks [17]. For the same language, another survey
highlighted the models found in 35 studies and the evaluation techniques and metrics used in dialectal
Arabic speech [18]. The study provided details on the progress of dialectal Arabic and summarized the
challenges and issues raised in previous research.
Another review of speech recognition using DL concluded that using both probabilistic neural
networks and RNNs can correctly recognize 98% of phonemes, followed by hidden Markov models
(HMMs) [19]. In a related study conducted by [20], the basic principles of voice recognition technology,
related theories and problems were surveyed. The study presented optimization methods of artificial
intelligence (AI) and simulation training for speech recognition based on DL. It highlighted several issues
that affect the quality of this technology, such as its capability to handle noisy environments and low
endpoint detection levels. To address these challenges, the study introduced methods of optimization and
analysis aimed at improving speech recognition performance by processing information more effectively
[20].
The review study undertaken by [21] examined several selected studies on CNN-based speech
recognition [21, 22]. From their comparison of the selected studies, the researchers identified some of the
weaknesses of CNN, but stated that CNN significantly reduced both model complexity and the word
error rate (WER).
Overall, previous literature surveys show that since 2019, no systematic literature review has been conducted in the field of speech recognition using DL. To the best of the authors' knowledge, this review is one of the most comprehensive yet undertaken in this field, as it addresses recent advances as well as the latest technologies that have neither been used nor covered in earlier research. Furthermore, this review not only adopts the rigour and guidelines of previous systematic reviews, but is also systematic, transparent, and unbiased in its selection of studies, avoiding the distortions that may result from a non-systematic strategy.
The retrieved studies were independently screened for eligibility using predetermined inclusion and
exclusion criteria. This study presents a solid theoretical background and provides an overview of ML,
DL, DNNs, NLP and speech recognition so as to enable a full understanding of the topic. A total of 575
studies were initially retrieved and analyzed. Among these studies, which covered the period of 2019
until the end of 2022, 94 were kept for further analysis after data filtration and application of the inclusion
and exclusion criteria. The 94 selected studies were carefully and thoroughly examined to extract
information that would help to identify patterns in the use of DL in speech recognition. The extracted
information was used to produce statistics that highlighted research gaps and future research directions
in the domain.
The following section provides a comprehensive research background and presents the key concepts
relevant to the research topic, the historical evolution of and recent advances in speech recognition based
on DL, and an analysis of various methods of evaluating performance.
3. Background
ML is the study of computer algorithms that can improve automatically with training, as well as a
branch of AI that is based on statistics. The purpose of ML is to develop a mathematical model that can
make predictions about the future without explicitly programming it to do so. Some of its engineering
applications include robotics, vision, speech recognition, and voice recognition. As shown in Fig. 1, DL
is a branch of ML that relies on DNN models. Both ML and DL include supervised learning (SL),
unsupervised learning (UL), reinforcement learning (RL), and self-supervised learning (SSL).
Fig. 1. The ancestors of machine learning and its children and grandchildren.
3.1 Deep Learning and Deep Neural Networks
DL is distinct from other ML and AI techniques in that it requires comparatively little human
interference. To address most ML issues without the need for domain-specific feature engineering, DL
employs DNNs with multi-hidden layers. The idea of neural networks was first introduced in 1943 [23]
and later developed in 1969 [24]. The backpropagation technique, revised and published in 1986, made neural network training considerably easier [25], and additional techniques were subsequently integrated with neural networks [26]. DL involves DNN algorithms and is very useful for applications
that require big data. In the past decade, DNN has become an essential component of data science [27].
A DNN is a network data structure that contains input and output layers, as well as numerous hidden
layers in between. Each layer contains nodes that are connected to all the nodes in the layer next to it
[28, 29]. By predicting output variables from input features, DNNs provide solutions to many engineering,
scientific and business problems [29, 30]. The importance of DNNs has led to their widespread
adoption in various fields. Fig. 2 illustrates the wide-ranging applications of DNNs, which include speech
recognition, such as text-to-speech and speech-to-text applications; NLP for tasks, such as classification,
machine translation, and question answering; and vision applications, such as image recognition and
machine vision.
Fig. 2. DNN applications.
DL models, inspired by the structure of the human brain, offer an advanced approach to ML
[31]. Once trained with large amounts of data, these systems can generate results autonomously without
human interaction. The inputs and outputs of DNN models need to be identified, the data must be pre-processed for training, and appropriate performance measures must be defined [31].
In artificial neural network (ANN) models, features serve as independent parameters that assist in
predicting labels [32]. The neurons in ANNs function similarly to biological neurons, whereby their
outputs are determined by their inputs and weights. Activation functions, such as sigmoid functions,
ensure that the network output is accurate [33]. Neurons are connected to form a neural network in which
each layer is linked to the next layer. The final layer produces the network output, and a cost function
computes the error between the predicted and actual values [34]. ANN models are trained by adjusting
the weights that link the neurons of one layer to the next [35]. The model is trained by feeding it training
samples (datasets), each of which contains the proper output response (label) and a particular input
feature. The training continually iterates through the data for a specific number of epochs until the model
is learned (the weights of the model are updated to optimize the total error) [36, 37].
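To make this training procedure concrete, the following is a minimal sketch of such a loop. It assumes PyTorch (not a framework prescribed by the reviewed studies), and the toy XOR-style dataset, layer sizes, learning rate, and epoch count are illustrative choices only:

```python
import torch
import torch.nn as nn

# Toy dataset: each sample pairs an input feature vector with the proper output label.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# A small feedforward ANN: layers of neurons linked by trainable weights,
# with sigmoid activations shaping each layer's output.
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.Sigmoid(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

loss_fn = nn.BCELoss()  # cost function: error between predicted and actual values
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Iterate through the data for a fixed number of epochs, updating the
# weights to reduce the total error (backpropagation plus gradient descent).
for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()   # backpropagate the error through the layers
    optimizer.step()  # adjust the weights linking the neurons

print(loss.item())    # final training error
```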
A computer can learn using statistical computational theories, and the fields of DL and ML rely on
these theories and algorithms. The algorithms used in such architectures come in a variety of forms,
including SL, UL, and RL, and some variants of one or more of these types.
3.1.1 Supervised learning
SL, a subcategory of DL, is an approach in which the model is taught how to respond to data during training [36]. This supervision helps the model become more accurate over time and capable of handling fresh datasets that follow the learned patterns, as in spam detection, predictive analytics, image and object recognition, and customer sentiment analysis. Fig. 3 illustrates the SL process.
Fig. 3. Supervised learning.
There are numerous models of SL, some of the most widely used of which are described below.
1) Convolutional neural networks: These algorithms were created specifically to operate with images.
The technique described as convolution involves applying a weighted filter to each component of
an image, which enables the computer to recognize and respond to picture elements. The past ten
years have seen significant developments in the field of computer vision, which investigates how
computers can understand and interpret images and videos [30].
2) Recurrent neural networks: These networks remember the output from the previous step and
consider it to be the input to the current step. The most important feature of RNNs is the existence
of hidden layers that can remember certain information about a sequence. RNNs have a memory
that recalls everything that has been computed. RNNs employ the same parameters for each input, performing the same operation on all inputs and hidden states to generate the result. Unlike other neural networks, this reduces the number of parameters and hence the model complexity. Modelling time-dependent and
sequential data problems—such as speech recognition, machine translation, face detection, and text
synthesis—is possible with RNNs [30].
Various techniques can be employed to accelerate modelling and decrease the number of model
parameters in speech recognition. These include the utilization of CNNs for feature extraction and the
incorporation of attention mechanisms into RNNs. Other approaches involve knowledge distillation,
compact neural network architectures, parameter sharing, pruning techniques, quantization, transfer
learning, and language models. The selection of a specific technique depends on such factors as
architecture, dataset size, and task requirements. Experimentation is essential to strike a sound balance
between complexity, resource usage, and performance [37].
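A minimal sketch of one such combination is given below: a CNN front end extracting local features that feed an RNN, implemented here in PyTorch. The architecture, layer sizes, and the random spectrogram input are hypothetical illustrations, not a design drawn from the reviewed studies:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front end for local feature extraction, RNN for sequence modelling."""
    def __init__(self, n_mels=40, hidden=128, n_classes=30):
        super().__init__()
        # CNN over the (time, frequency) plane extracts compact local features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool along frequency only, keep time resolution
        )
        self.rnn = nn.GRU(16 * (n_mels // 2), hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, time, n_mels)
        x = self.conv(spec.unsqueeze(1))      # -> (batch, 16, time, n_mels//2)
        x = x.permute(0, 2, 1, 3).flatten(2)  # -> (batch, time, features)
        x, _ = self.rnn(x)                    # sequence modelling over frames
        return self.out(x)                    # per-frame class scores

logits = CRNN()(torch.randn(2, 100, 40))      # e.g., 100 frames of 40 mel bands
print(logits.shape)                           # torch.Size([2, 100, 30])
```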
3.1.2 Unsupervised learning
This branch of ML uses algorithms to analyze and cluster unlabelled datasets. Unlike SL, unsupervised
machines use unlabelled data (Fig. 4). The machine is free to identify relationships and patterns as it sees
fit, and frequently produces findings that a human data analyst might not have noticed. Compared with
other algorithms, UL performs more sophisticated processing tasks for the discovery of hidden patterns
with no human intervention [37]. Although its outcomes are typically more difficult to predict, UL uses techniques
such as neural networks, clustering, anomaly detection, and other technologies [38].
Fig. 4. Unsupervised learning.
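As a small illustration of this idea, the sketch below clusters unlabelled feature vectors with k-means. It assumes scikit-learn and NumPy, and the synthetic two-group data are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled "feature vectors" (e.g., pooled acoustic features); purely synthetic here.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (50, 13)),   # one hidden group
                      rng.normal(5, 1, (50, 13))])  # another hidden group

# The algorithm groups the data with no human-provided labels,
# discovering the hidden pattern (two clusters) on its own.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))  # roughly [50, 50]
```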
3.1.3 Self-supervised learning
In SSL, the labels for the training dataset are generated by UL algorithms rather than by humans.
Unlabelled data are far more abundant than labelled data. Additionally, this strategy employs UL
techniques to identify common patterns in the data, which will then be used to improve the supervised
model. Some researchers claim that SSL is simply a UL variation, i.e., it is a two-step procedure with the
ultimate objective of creating an SL model [38]. Building on self-supervised training, a novel idea in the field, Wav2Vec 2.0 has been developed as one of the most recent models of automatic speech recognition (ASR). In this training technique, a model is pre-trained using readily available unlabelled data and then fine-tuned on a particular dataset for a particular objective. This training strategy
is quite effective, as evidenced by earlier work. The difference between SSL and SL and UL is shown in
Fig. 5.
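A minimal inference sketch of this pre-train-then-fine-tune recipe is shown below. It assumes the Hugging Face transformers library and the publicly released facebook/wav2vec2-base-960h checkpoint (pre-trained on unlabelled audio, fine-tuned on labelled LibriSpeech); the one-second silent waveform merely stands in for real 16-kHz speech:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load a checkpoint pre-trained via self-supervision and fine-tuned for ASR.
name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

# One second of silence stands in for real 16-kHz speech.
waveform = torch.zeros(16000)
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids))              # greedy CTC transcription
```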
3.1.4 Reinforcement learning
In both SL and UL, there are no consequences for the computer when data cannot be comprehended or classified correctly. Both RL and deep reinforcement learning (DRL) are essential for assisting machines in learning complicated tasks that involve handling vast, highly flexible, and unpredictable datasets. This enables computers to accomplish tasks such as driving a car, performing surgery, and checking bags
for harmful items. The main components of a typical scenario of RL and DRL are illustrated in Fig. 6
[39]. RL includes the interaction of an agent with its surroundings to achieve its goals by making
decisions. RL depends on elements such as states, actions, rewards, policies, and value functions. As
shown in Fig. 6(b), DRL extends RL further by incorporating deep neural networks to approximate the policy and value functions, allowing it to handle large state spaces. Furthermore, DRL utilizes techniques such as exploration strategies, experience replay, and target networks to improve learning efficiency and stability. By combining RL with deep neural networks, the agent becomes capable of tackling intricate tasks in different domains.
Fig. 5. Self-supervised learning versus supervised and unsupervised learning
(Source from https://medium.com/).
(a)
(b)
Fig. 6. Essential components of (a) reinforcement learning and (b) deep reinforcement learning.
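To ground these components, the following is a minimal tabular Q-learning sketch on a toy corridor environment. Everything here (the environment, rewards, and hyperparameters) is a hypothetical illustration of the elements named above, not DRL at production scale:

```python
import numpy as np

# A toy 5-state corridor: the agent starts at state 0 and is rewarded at state 4.
# States, actions (0=left, 1=right), rewards, a policy, and a learned value
# table are exactly the RL elements listed above.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))    # value function (state-action values)
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:
        # Exploration strategy: act at random when exploring or when values are tied
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())     # greedy policy otherwise
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0   # reward from the environment
        # Q-learning update: move the value estimate towards the observed return
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # the right-moving action dominates in every state
```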
3.2 Speech Recognition and Natural Language Processing
Speech recognition is a technique that enables a machine to perceive spoken language. ASR, also known as speech-to-text or simply speech recognition, encompasses techniques and approaches that enable the perception and transcription of spoken language into text by computers.
NLP refers to the combination of linguistics and ML. Using millions of sample datasets, machines
learn to understand natural language in NLP, an ML application. Computational linguistics has a subfield
called speech recognition, which deals with technology that allows people to speak to computers. Speech
recognition incorporates knowledge and research in linguistics, computer science and electrical
engineering. Fig. 7 depicts the speech recognition process, which comprises the following steps:
1) Analog-to-digital conversion, which converts analogue voice to digital by utilizing sampling and
quantization techniques. A vector of voice integer samples is used to represent speech in digital form.
2) Speech pre-processing, in which background noise and long periods of silence are identified and removed. The speech is then divided into 20-millisecond frames for the subsequent step.
3) Feature extraction is the conversion of speech frames into a feature vector that specifies which
phoneme is being spoken.
4) In word selection, the sequence of phonemes/features is translated into the spoken word using a
language model.
Fig. 7. Speech recognition process (Source from https://medium.com/).
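A minimal sketch of steps 1 and 2 is given below, assuming NumPy, a 16-kHz sampling rate, and a simple per-frame energy threshold for silence removal; the threshold value and the synthetic tone signal are illustrative only:

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=20, energy_thresh=1e-4):
    """Steps 1-2 above: assume the signal is already digitized; drop quiet
    frames and split the remainder into short fixed-length frames."""
    frame_len = int(sr * frame_ms / 1000)             # e.g., 320 samples at 16 kHz
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)               # per-frame energy
    return frames[energy > energy_thresh]             # keep only voiced frames

# Synthetic example: half a second of silence followed by a second of tone.
sr = 16000
t = np.arange(sr) / sr
sig = np.concatenate([np.zeros(sr // 2), 0.1 * np.sin(2 * np.pi * 440 * t)])
print(preprocess(sig, sr).shape)                      # (50, 320): tone frames only
```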
From a technological standpoint, speech recognition has a lengthy history and has undergone numerous
significant technological advances. Recent developments in DL and big data have improved the field.
Not only has there been an increase in academic papers published on the topic, but, more significantly,
the global industry, including major corporations such as Google, Facebook, Microsoft, Amazon, and Apple, has also adopted several DL techniques for creating and implementing voice recognition systems.
As illustrated in Fig. 8, an acoustic model and a language model are two conceptually distinct
categories of models used in speech recognition. The challenges of converting sound signals into some
sort of phonetic representation are solved by the acoustic model. The language model is where the words,
grammar and sentence structure domain information of a language are kept. ML methods can be used to
realize these conceptual models as probabilistic models. Over the past few decades, advances in speech recognition have steadily refined HMMs, which are currently regarded as the standard speech recognition solution. Meanwhile, E2E DNNs are the cutting-edge models of speech recognition.
Fig. 8. An acoustic model and a language model in a speech recognition system.
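The division of labour between the two models can be illustrated with a simple shallow-fusion scoring step of the kind used in many decoders: an acoustic score and a weighted language-model score are combined per hypothesis. The candidate transcriptions, probabilities, and weight below are invented for illustration:

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
# The acoustic model rates how well each matches the sound; the language
# model rates how plausible each is as a sentence.
candidates = {
    "recognize speech": {"p_acoustic": 0.40, "p_lm": 0.30},
    "wreck a nice beach": {"p_acoustic": 0.45, "p_lm": 0.01},
}

lam = 0.8  # language-model weight (tuned on held-out data in practice)
for text, s in candidates.items():
    score = math.log(s["p_acoustic"]) + lam * math.log(s["p_lm"])
    print(f"{text!r}: combined log-score {score:.2f}")
# The acoustically similar but implausible hypothesis loses once the
# language model's grammar and domain knowledge are taken into account.
```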
3.3 History of Speech Recognition and Natural Language Processing
The speech recognition field has witnessed significant advancements over the years. Earlier speech
recognition systems relied on traditional approaches, such as HMMs and Gaussian mixture models
(GMMs). These systems involve building statistical models based on acoustic features and language
models to recognize speech. However, these approaches face challenges in accurately handling variations
in speech patterns and are limited in their ability to handle complex linguistic structures. The introduction
of ML techniques revolutionized speech recognition. ML algorithms, such as support vector machines
and Gaussian Naive Bayes, allow for better classification and pattern recognition. These techniques have
improved the accuracy of speech recognition systems by leveraging statistical modelling and probabilistic
approaches. Meanwhile, the advent of DL has further revolutionized speech recognition tasks. DL,
particularly with DNNs, has surpassed traditional approaches, such as HMMs and GMMs. DNNs excel
at capturing complex patterns and hierarchies in speech data by learning intricate representations. They
can handle large-scale datasets and automatically extract relevant features, resulting in significant
improvements in speech recognition accuracy. The combination of ML and DL approaches has transformed
speech recognition, making it more accurate, robust, and adaptable to various contexts and languages.
Speech recognition has evolved from traditional approaches to ML techniques and, more recently, to
the transformative impact of DL. Feature extraction methods and accuracy-enhancing techniques
continue to be areas of active research and development in the field. The integration of advanced
algorithms and the exploration of innovative approaches have the potential to further improve speech
recognition systems. A timeline of some of the major developments in speech recognition from 2012 to
2022 is presented in the following paragraph.
Recent years have seen the introduction of voice-activated devices and voice assistants, as well as
popular, open-source speech recognition software, such as in [31, 40]. Additionally, we have seen
improvements in voice recognition models starting from hybrid neural network designs [41] to more E2E
models, such as Deep Speech [42], Deep Speech 2 [43], encoder-decoder models with attention [44], and
transducer-based speech recognition [45, 46]. Recently, speech recognition and related technologies have
benefitted from significant advances. Fig. 9 presents a chronological overview of key, remarkable
advancements in speech recognition spanning the years 2010 to 2020. This era marked the introduction
of voice-centric gadgets and digital assistants, as well as the emergence of popular open-source speech
recognition tools, such as Kaldi, and significant benchmarks, such as LibriSpeech. Voice assistants for
mobile devices, such as Apple’s Siri and Amazon’s Alexa, were introduced in 2011 and became widely
used during this period. These technologies were made possible in part through DL’s significant
improvements in speech recognition WER. Moreover, progress in speech recognition techniques has
evolved from initial hybrid neural networks to comprehensive E2E models, such as Deep Speech, Deep
Speech 2, encoder-decoder architectures featuring attention mechanisms, and speech recognition based
on transducers in 2017 [47].
Fig. 9. History of speech recognition.
Currently there are several methods capable of enhancing the accuracy of speech recognition systems.
One such approach is to employ data augmentation techniques, such as adding background noise or
varying the pitch and speed of speech data. Another method involves using ensemble models or
combining multiple speech recognition systems to improve overall performance. Additionally,
incorporating language models and context information can enhance accuracy by considering broader
linguistic contexts during the recognition process [29]. Using a variety of algorithms, tools and
techniques, NLP aims to include the interpretation, analysis, and manipulation of natural language data
for the intended use. However, numerous difficulties may arise depending on the natural language data
being used, making it impossible to accomplish all the goals using a single strategy. As a result, numerous
scholars have recently focused on the creation of various tools and approaches in the field of NLP and
pertinent areas of study. These changes are represented in Fig. 10 [48].
Fig. 10. History of natural language processing.
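As a concrete illustration of the data augmentation techniques mentioned at the start of this subsection (adding background noise and varying the pitch and speed of speech data), the following NumPy sketch perturbs a synthetic waveform; the SNR, speed factor, and test tone are illustrative assumptions:

```python
import numpy as np

def add_noise(signal, snr_db=20, rng=None):
    """Mix in white noise at a chosen signal-to-noise ratio (in dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(0, 1, len(signal))
    scale = np.sqrt((signal ** 2).mean() /
                    ((noise ** 2).mean() * 10 ** (snr_db / 10)))
    return signal + scale * noise

def change_speed(signal, factor=1.1):
    """Vary speed (and pitch) by resampling the waveform with interpolation."""
    idx = np.arange(0, len(signal), factor)
    return np.interp(idx, np.arange(len(signal)), signal)

sr = 16000
clean = 0.1 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s of tone
noisy = add_noise(clean, snr_db=15)
faster = change_speed(clean, factor=1.1)
print(len(clean), len(faster))  # 16000 vs ~14546 samples: 10% faster
```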
3.4 Latest Developments in Speech Recognition based on Deep Learning
End-to-end modelling for speech recognition has recently become a key trend in the speech
community, replacing DNN-based hybrid modelling. Although E2E models consistently outperform
hybrid models on most benchmarks of speech recognition accuracy, hybrid models remain prevalent in many commercial ASR systems today. The decision about which model to deploy in production is influenced by a variety of practical factors.
been developed for decades. It is challenging for E2E models to achieve widespread commercialization
without offering good solutions to all these issues [49].
Before SSL became part of computer vision research, it had already made significant contributions to
NLP. Language models were applied everywhere, including in-text suggestion, sentence completion and
document processing applications. Although wav2vec, which revolutionized the NLP field, was released
in 2013, these models have now improved their learning capabilities. The concept behind word
embedding approaches was straightforward—rather than asking a model to anticipate the next word, one
could ask it to do so based on the previous context.
These developments have allowed the achievement of meaningful representation through the
distribution of word embedding, which may be applied to a variety of tasks, including sentence
completion and word prediction. Bidirectional encoder representations from transformers are currently
among the most widely utilized SSL techniques in NLP. The discipline of NLP has seen an incredible
influx of research and development over the last ten years. Fig. 11 illustrates wav2vec unsupervised,
which trains speech recognition models without labelled data.
Fig. 11. Wav2vec unsupervised learning (Source from https://arxiv.org/abs/2204.02492/).
3.5 Performance Evaluation
The datasets utilized in the training and testing of DL-based ASR systems have changed over time,
from clean speech to intentionally introduced environmental noise. The rapid system development and
performance evaluation of various ASR systems require automatic ASR metrics. To evaluate the
performance of a speech recognition system, an appropriate evaluation metric must be selected. This
section discusses several ASR metrics found in the reviewed papers.
The sentence error rate (SER), also known as the string error rate, is a straightforward evaluation metric
determined by comparing the hypothesis string produced by the decoder with the reference string and
marking the entire sentence as incorrect if the two differ in any way. This measure is imprecise because it counts any
difference between the reference and the hypothesis as one mistake, regardless of how similar the two
strings are.
The WER is a metric that is frequently used by researchers to assess the effectiveness of ASR systems.
It is the accepted metric for assessing the performance of an ASR system.
The number of words that differ between the reference and the hypothesis is measured by the WER.
Four distinct scenarios can occur when the two are aligned:
1) Correct: the predicted word and the reference word are equivalent according to certain rules.
2) Substitution: the predicted word is aligned with a different reference word.
3) Insertion: an extra word in the predicted sentence cannot be aligned with any reference word.
4) Deletion: a reference word is missing from the predicted sentence completely.
The WER is calculated based on the following equation:

$$\mathrm{WER} = \frac{S + D + I}{N}, \tag{1}$$

where N is the number of words pronounced in the speech input to the ASR system, S is the number of false word substitutions, D is the number of word deletions, and I is the number of false word insertions in the ASR output.
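Equation (1) can be computed with a standard edit-distance alignment between the reference and hypothesis word sequences; a self-contained sketch follows (the example sentences are invented):

```python
def wer(reference, hypothesis):
    """Word error rate via edit distance: the minimum number of substitutions,
    deletions, and insertions turning the reference into the hypothesis,
    divided by the number of reference words N (Equation (1))."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # correct vs. substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words
```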
3.6 Speech Feature Extraction
Speech is a sophisticated human skill. Its features are extracted by converting the speech waveform to
a parametric representation at a comparatively low data rate for further processing and analysis. Speech
feature extraction approaches include discrete wavelet transform, line spectral frequencies, Mel-
frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), linear prediction cepstral
coefficients, and perceptual linear prediction. MFCCs are commonly used features in speech recognition.
The processing steps of MFCCs are described by the block diagram in Fig. 12. The computation starts
by applying pre-emphasis to the speech signal in order to boost the higher frequencies. The signal is then
split into short frames and multiplied by a window function to separate the stationary segments. Next,
fast Fourier transform is employed to obtain the power spectrum of each frame. The power spectrum is
then passed through a Mel filterbank, which approximates the human auditory system’s frequency
resolution. The resulting filterbank energies are transformed into a logarithmic scale and further
decorrelated using a discrete cosine transform. Finally, cepstral mean normalization is applied to
normalize the coefficients across frames. These steps collectively generate MFCCs that capture essential
speech information for use in speech recognition systems.
Fig. 12. Block diagram for the Mel-frequency cepstral coefficient feature extraction.
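The pipeline in Fig. 12 can be condensed into a short NumPy/SciPy sketch. The specific parameter values below (25-ms frames with a 10-ms hop at 16 kHz, 26 filters, 13 coefficients, a 512-point FFT) are common conventions rather than settings taken from the reviewed studies:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13, nfft=512):
    # Pre-emphasis: boost the higher frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing + Hamming window: isolate short quasi-stationary segments.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum of each frame via the fast Fourier transform.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Mel filterbank: triangular filters spaced to mimic auditory resolution.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filt + 2)
    bins = np.floor((nfft + 1) * 700 * (10 ** (mel_pts / 2595) - 1) / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    # Log filterbank energies, decorrelated by a discrete cosine transform.
    ceps = dct(np.log(power @ fbank.T + 1e-10), axis=1, norm='ortho')[:, :n_ceps]
    return ceps - ceps.mean(axis=0)  # cepstral mean normalization across frames

coeffs = mfcc(np.random.default_rng(0).normal(size=16000))  # 1 s of noise
print(coeffs.shape)                                         # (98, 13)
```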
4. Research Methodology
The objective of this systematic review research is to conduct a fair and comprehensive evaluation and
interpretation of available research published from 2019 to 2022 in the field of speech recognition
utilizing DL. The guidelines suggested by [50] were followed in undertaking this systematic review. The
first phase of the research involved planning the review, which was further divided into determining the
necessity for a review, developing the research questions, and describing the search approach to find
relevant research papers. The second phase involved conducting the review, which was subdivided into
defining the appropriate research selection criteria, including the inclusion/exclusion criteria, developing
the quality evaluation rules to filter research publications, constructing the data extraction strategy to
address the study objectives, and then synthesizing the data taken from the publications. The third phase
of the research was the reporting phase.
4.1 Planning Phase
DL is an ML technique that is widely used in the speech recognition field. This area of research is
experiencing rapid growth. However, the use of DL in speech recognition applications and its related
concepts remain poorly understood, as is the case with many contemporary technologies. Hence, this
research has been undertaken to provide an overview of published research in the field of speech
recognition applications that use DL and to extract information from research publications over the last
4 years. The information obtained will help in the identification of research patterns and the development
of statistics that will in turn shed light on limitations and gaps in the literature, in addition to current and
future directions of research. It is hoped that this will provide a background and framework for
researchers to appropriately establish new research in this area. Based on this, the outcomes of this review
will ultimately provide answers to the following research questions:
1. Which categories of publications are included in this review research?
2. What are the various types of datasets used to test the algorithm for each included publication?
3. What are the languages identified for each included publication?
4. What are the various types of environments utilized for each included publication?
5. What are the different types of feature extraction methods used to conduct the study for each
included publication?
6. What types of DL models are used in each included publication?
The search strategy that was followed to retrieve papers and identify the relevant ones was thoroughly
discussed among the authors in regular meetings. This research used the search terms “speech
recognition” AND “deep” in the following digital resources to include publications that employed both
DL and DNNs in the field of speech recognition: Google Scholar, Wiley Online Library, IEEE Xplore, and ScienceDirect.
Based on the search terms mentioned above and using the specified period (2019–2022), 575
publications were initially retrieved. Ninety-one were identified as duplicates among the different
libraries and removed. Hence, 484 publications were retained for the next phase.
4.2 Review Phase
The first task in this phase consisted of determining both the filtration rules and the inclusion/exclusion criteria, according to the following steps:
1. Remove publications that are off-topic based on the title.
2. Remove publications that are irrelevant based on the abstract and keywords.
3. Remove publications that are review papers (i.e. those used in the related work section or for the
purpose of comparison).
4. Filtering: Remove articles based on the inclusion/exclusion criteria (10 criteria were applied, and
only those studies with scores of 6 or higher out of 10 were kept). The inclusion criteria were as
follows:
CR1: Organization of the paper.
CR2: Relevance of the research questions.
CR3: Clear identification of the research objectives.
CR4: Existence of an adequate background.
CR5: Existence of practical experiments.
CR6: Appropriateness of the conducted experiment.
CR7: Suitability of the methods of analysis.
CR8: Clear reporting of the results.
CR9: Clear identification of the dataset.
CR10: Overall usefulness of the research.
The exclusion criteria were as follows:
1) Publications that are books, chapters, or theses. However, these are mentioned in the literature
review section.
2) Publications which, upon careful reading, do not answer the research questions.
3) Publications that do not utilize DL or DNNs.
4) Publications that are not written in English.
5) Publications that are reports or workshops with no publication information.
4.3 Analysis
The finalized list of articles for full review was thoroughly assessed to extract data that answered the
research questions listed earlier. These are reported in the results section. Fig. 13 depicts the steps
followed in this phase.
Fig. 13. Flow chart of the steps followed to select the publications.
5. Results and Discussion
The third phase of this systematic review research was the reporting of the results. A total of 94
publications were included in the final list of research articles [51–144]. These papers were thoroughly
studied and used to extract information that answered the research questions. The information extracted
was quantitatively described and used to determine patterns in the studies carried out from 2019 to 2022.
It also revealed the commonalities and discrepancies between the studies, which helped to identify future
research directions.
5.1 RQ1: What are the categories of publications included in this review research?
The 94 publications included in this study belonged to two main types: journal articles and conference papers. A third type, "others," included workshops, which accounted for only 1% of the papers.
Fig. 14 illustrates the distribution of publications between the two types, while Fig. 15 presents the
distribution of the publications over the years examined.
Fig. 14. Distribution of the publications per type.
Fig. 15. Distribution of the publications over the years studied.
Conference papers accounted for the majority (59%) of the publications in this study, whereas journal
articles accounted for about 40%. They were distributed across different conferences and journals, of
which the International Journal of Speech Technology and the International Conference on Acoustics,
Speech and Signal Processing were the most common journal and conference, respectively. The
distribution of the studies over the publication years indicates a notable trend, with the majority
concentrated in recent years. In particular, 2020 saw the highest share at 32%, followed by 2019 at 29%,
while 2021 and 2022 accounted for 25% and 14% of the publications, respectively.
5.2 RQ2: What are the various types of datasets used to test the algorithm for
each included publication?
For the purposes of testing and training the algorithms, a variety of datasets were identified in the
studies. Most datasets were reported as public and accessible on the internet (56%), while 37% were private. Fig. 16 depicts the distribution of datasets by type. It is worth mentioning here that
7% of the papers did not specify the type of dataset used in their experiments. The public datasets included
the Multi-Channel Articulatory database, Switchboard, CallHome, the Crowdsourced high-quality multi-
speaker speech dataset, the si284 dataset, dev93, eval92, TIMIT, Tibetan Corpus, SpinalNet, the Amharic
reading speech corpus, the Google OpenSLR dataset, and the Kaggle dataset.
5.3 RQ3: Which languages were identified for each included publication?
Different languages were used in the speech recognition publications examined. Fig. 17 presents the
languages thus identified for which the DL approach was applied in order to train and test the algorithms.
As shown in Fig. 17, the most dominant language is English, followed by Arabic, Chinese, Indian, and
Indonesian. The English language was also used alongside other languages, such as Tibetan and Indian.
Fig. 16. Types of datasets.
Fig. 17. Languages identified in the publications: English, Arabic, Chinese, Indian, Indonesian, Ethiopian, Urdu-Punjabi, Tibetan, Korean, Taiwanese, Turkish, Albanian, Uzbek, Bengali, Sri Lankan Sinhala, Lithuanian, Italian-Libri, and Persian.
5.4 RQ4: What are the various types of environments utilized for each included
publication?
The types of environments used in the publications were either neutral or noisy. It is worth mentioning that some papers did not state the type of environment; these were assumed to be neutral. Overall, 81% of the publications used a neutral environment, while 19% used a noisy one, as shown in Fig. 18.
5.5 RQ5: What are the different types of feature extraction methods used to
conduct the study for each included publication?
To prepare and train the data, different approaches were utilized to extract features from speech. Most
(about 60%) of these approaches were based on MFCCs; on the other hand, 6% of the studies did not
specify the type of feature extraction method used, while 10% did not perform any extraction.
Furthermore, 9% of the studies used one of the following hybrid methods: combining MFCCs with
GMM and HMM, gammatone frequency cepstral coefficients (FCC-GFCC), acoustic feature extraction
(AF-MFCC), feature space maximum likelihood linear regression (fMLLR), amplitude modulation
spectrogram, cycle generative adversarial networks, perceptual linear prediction (PLPC), minimum
variance distortionless response, GMM-HMM, and MFCCs with linear discriminant analysis, maximum
likelihood linear transform and fMLLR. Fig. 19 depicts the feature extraction methods used, and Table 1
provides further details about the distribution of feature extraction methods across the publications.
Fig. 18. Environment conditions.
Fig. 19. Feature extraction methods: MFCCs, hybrid, other, GFCCs, HMM, no extraction, LPC, and CNN.
5.6 RQ6: What types of deep learning models are used in each included
publication?
Regarding the types of DL models used in speech recognition, 51% of the publications used hybrid
models, whereas 17% used the DNN model. Transformer models were also utilized (5.3%). The rest of
the publications used different types of models, such as RNNs, CNNs and GMMs. Fig. 20 presents the
identified stand-alone models, while Table 1 shows the distribution of these models.
Table 1. Distribution of models and feature extraction methods used in speech recognition research
Feature extraction methods (columns): CNN, HLDA, HMM, MFCCs, No Ext., N/A, Other, Hybrid.
Models (columns): Enc., RNN, DNN, LSTM, BLSTM, CNN, DBN, DTN, DRN, GMM, Tran., Hybrid.
Rows: studies [51]–[144], each marked with the feature extraction method(s) and model(s) it employed.
No Ext.=no extraction, N/A=not available, Enc.=autoencoder, Tran.=transformer.
Fig. 20. Types of models used in speech recognition research.
Fig. 21. Evaluation techniques.
The evaluation techniques that were identified in the included publications for the purpose of
evaluating the overall performance of the model were mainly the WER at 49%, followed by accuracy at
19%. As for label error rate (LER), it was found in only one study (Fig. 21). The category of “others”
included the use of different evaluation techniques, which accounted for 15% of all publications. These
included WER + WordAcc, accuracy + confusion matrix, WER + character error rate (CER) + loss,
phone error rate (PER) + WER, accuracy + CER, PER, sentence error rate (SER), syllable error rate, SER
+CER + WER, WER + loss + mean edit distance, WER + PER + frame error rate, CER + LER, WER +
monophones + triphones, and CER + WER.
6. Conclusion
This study presents a comprehensive analysis of the application of DL techniques in the field of
speech recognition based on an examination of 94 studies published from 2019 to 2022. The findings of
this study have revealed several key insights and trends in the field. The distribution of the included
studies was observed across various journals (40%) and conferences (59%). Most of the studies utilized
public datasets (56%) and focused on English language processing (46%) in neutral environments (81%).
The evaluation techniques used were predominantly based on the WER (49%). Furthermore, the present
study’s analysis highlighted the widespread use of MFCCs as a feature extraction method. However, it
would be interesting for future researchers to explore alternative approaches, such as fMLLR, GFCC,
and LPC.
A significant finding from the literature survey is the increasing adoption of hybrid DNN models
exhibiting improved performance compared with stand-alone models. Approximately 52% of the
publications examined utilized hybrid models, whereas 17% relied on DNN stand-alone models. Also
identified was a research gap in the exploration of speech recognition using transformer models. This
area presents promising opportunities for future investigations, as transformers have demonstrated
powerful capabilities, including faster learning speed, parallelization, and enhanced performance for low-
resource languages.
In conclusion, our study provides valuable insights into the utilization of DL in speech recognition.
The trends and research gaps identified are expected to open new avenues for future research, which will
be further discussed in the subsequent section, emphasizing the importance of exploring alternative
methods of feature extraction and leveraging transformer models to advance the field.
7. Future Work
Despite the extensive analysis conducted in this review, there are several avenues for future research
in the field of speech recognition. First, investigating alternative methods of feature extraction is crucial
to expanding the repertoire beyond the commonly used MFCCs. As techniques such as fMLLR, GFCC
and LPC offer alternative representations of speech signals, they should be evaluated for their
effectiveness in speech recognition tasks. Second, the exploration of hybrid model architectures holds
promise for improving speech recognition performance. While the current study indicated a shift towards
hybrid models, further investigation is needed to explore different combinations of architectures. For
example, combining CNNs with RNNs, or incorporating transformer models, could yield valuable
insights into optimal model composition and its impact on speech recognition accuracy. Third, the
Third, the application of transformer-based models in speech recognition has been relatively unexplored in the studies reviewed for this paper. Future research should delve into the use of transformers, as they offer various advantages, such as faster learning speeds, parallelization capabilities, and potential performance enhancements for low-resource languages. Comparative studies of transformers and traditional architectures could also help to elucidate their suitability and effectiveness in speech recognition tasks.
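For comparison, a minimal transformer-based counterpart is sketched below; unlike the recurrent model above, the encoder attends over all frames in parallel, which is the source of the training-speed advantage noted earlier. The sizes are again illustrative, and a practical system would add positional encodings, padding masks, and input subsampling.

import torch
import torch.nn as nn

class TransformerAsr(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=32):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)  # project features to model width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats):  # feats: (batch, time, n_mels)
        # Self-attention processes every frame in parallel (no recurrence).
        return self.out(self.encoder(self.proj(feats)))

print(TransformerAsr()(torch.randn(4, 200, 80)).shape)  # torch.Size([4, 200, 32])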
Moreover, the focus on low-resource languages is an important direction for future research. While the
majority of studies in the current analysis were focused on English language processing, addressing the
challenges specific to low-resource languages, including limited training data and linguistic variations,
is imperative. Investigating and developing robust models that can handle these challenges effectively
will contribute to closing the gap in speech recognition technology for underrepresented languages.

Lastly, the evaluation of performance metrics beyond the WER is essential for a comprehensive
assessment of speech recognition systems. Metrics such as phoneme error rate, precision, recall, and F1-
score can provide additional insights into the strengths and weaknesses of different models. Incorporating
a diverse range of metrics will facilitate a more nuanced understanding of system performance and enable
researchers to make informed decisions when designing and optimizing speech recognition models. By
addressing these future research directions, researchers could advance speech recognition technology,
enhance its applicability to diverse languages and contexts, and ultimately improve the user experience
in various speech-related applications.
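As a small illustration of such complementary metrics, once an evaluation is framed as a detection task, for example per-utterance keyword spotting, precision, recall, and F1-score follow directly from scikit-learn; the labels below are purely illustrative.

from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground truth: keyword present (1) or absent (0)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # recognizer's detections

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75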
Author’s Contributions
Conceptualization, DAF; Funding acquisition, AQ; Investigation and methodology, DAF; Project
administration, DAF; Resources, FA; Supervision, DAF; Software, YS; Validation, YS; Formal analysis,
DAF; Data curation, FA; Writing of the original draft, DAF; Writing of the review and editing, DAF,
AQ, MT, AM, YS.
Funding
This research has been funded by the Deanship of Scientific Research at King Khalid University (No.
RGP.1/209/43).
Competing Interests
The authors declare that they have no competing interests.
References
[1]D. Yu and L. Deng, Automatic Speech Recognition. London, UK: Springer, 2016.
https://doi.org/10.1007/978-1-4471-5779-3
[2]L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic speech recognition for under-resourced
languages: a survey," Speech Communication, vol. 56, pp. 85-100, 2014.
https://doi.org/10.1016/j.specom.2013.07.008
[3]A. Mathur and R. Sultana, "A study of machine learning algorithms in speech recognition and language
identification system," in Innovations in Computer Science and Engineering. Singapore: Springer, 2021,
pp. 503-513. https://doi.org/10.1007/978-981-33-4543-0_54
[4]C. Deuerlein, M. Langer, J. Seßner, P. Heß, and J. Franke, "Human-robot-interaction using cloud-based
speech recognition systems," Procedia CIRP, vol. 97, pp. 130-135, 2021.
https://doi.org/10.1016/j.procir.2020.05.214
[5]N. Sandhya, R. V. Saraswathi, P. Preethi, K. A. Chowdary, M. Rishitha, and V. S. Vaishnavi, "Smart
attendance system using speech recognition," in Proceedings of 2022 4th International Conference on
Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2022, pp. 144-149.
https://doi.org/10.1109/ICSSIT53264.2022.9716261
[6]Y. Chen, J. Zhang, X. Yuan, S. Zhang, K. Chen, X. Wang, and S. Guo, "SoK: a modularized approach to
study the security of automatic speech recognition systems," ACM Transactions on Privacy and Security,
vol. 25, article no. 17, 2022. https://doi.org/10.1145/3510582
[7]K. Xia, X. Xie, H. Fan, and H. Liu, "An intelligent hybrid–integrated system using speech recognition and
a 3D display for early childhood education," Electronics, vol. 10, no. 15, article no. 1862, 2021.
https://doi.org/10.3390/electronics10151862
[8]A. Ahmad, P. Mozelius, and K. Ahlin, "Speech and language relearning for stroke patients-understanding
user needs for technology enhancement," in Proceedings of the 13th international Conference on eHealth,
Telemedicine, and Social Medicine (eTELEMED), Nice, France, 2021, pp. 31-38.
[9]K. Avazov, M. Mukhiddinov, F. Makhmudov, and Y. I. Cho, "Fire detection method in smart city
environments using a deep-learning-based approach," Electronics, vol. 11, no. 1, article no. 73, 2021.
https://doi.org/10.3390/electronics11010073
[10] L. V. Kremin, J. Alves, A. J. Orena, L. Polka, and K. Byers-Heinlein, "Code-switching in parents’ everyday
speech to bilingual infants," Journal of Child Language, vol. 49, no. 4, pp. 714-740, 2022.
https://doi.org/10.1017/S0305000921000118
[11] D. O’Shaughnessy, "Automatic speech recognition: history, methods and challenges," Pattern Recognition,
vol. 41, no. 10, pp. 2965-2979, 2008. https://doi.org/10.1016/j.patcog.2008.05.008
[12]M. H. Ali, M. M. Jaber, S. K. Abd, A. Rehman, M. J. Awan, D. Vitkute-Adzgauskiene, R. Damasevicius,
and S. A. Bahaj, "Harris Hawks sparse auto-encoder networks for automatic speech recognition system,"
Applied Sciences, vol. 12, no. 3, article no. 1091, 2022. https://doi.org/10.3390/app12031091
[13]N. Arjangi, "Applications of speech recognition using machine learning and computer vision,"
International Journal of Research Publication and Reviews, vol. 3, no. 11, pp. 998-1002, 2022.
[14]A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural
networks: a systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019.
https://doi.org/10.1109/ACCESS.2019.2896880
[15]C. Yu, M. Kang, Y. Chen, J. Wu, and X. Zhao, "Acoustic modeling based on deep learning for low-
resource speech recognition: an overview," IEEE Access, vol. 8, pp. 163829-163843, 2020.
https://doi.org/10.1109/ACCESS.2020.3020421
[16]W. Lee, J. J. Seong, B. Ozlu, B. S. Shim, A. Marakhimov, and S. Lee, "Biosignal sensors and deep
learning-based speech recognition: a review," Sensors, vol. 21, no. 4, article no. 1399, 2021.
https://doi.org/10.3390/s21041399
[17] W. Algihab, N. Alawwad, A. Aldawish, and S. AlHumoud, "Arabic speech recognition with deep learning:
a review," in Social Computing and Social Media: Design, Human Behavior and Analytics. Cham,
Switzerland: Springer, 2019, pp. 15-31. https://doi.org/10.1007/978-3-030-21902-4_2
[18]H. A. Alsayadi, I. Hegazy, Z. T. Fayed, B. Alotaibi, and A. A. Abdelhamid, "Deep investigation of the
recent advances in dialectal Arabic speech recognition," IEEE Access, vol. 10, pp. 57063-57079, 2022.
https://doi.org/10.1109/ACCESS.2022.3177191
[19]D. Dayal, F. Alam, H. Varun, and N. Singh, "Review on speech recognition using deep learning,"
International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 8, no. 5,
pp. 1-5, 2020.
[20]L. Zhang and X. Sun, "Study on speech recognition method of artificial intelligence deep learning,"
Journal of Physics: Conference Series, vol. 1754, no. 1, article no. 012183, 2021.
https://doi.org/10.1088/1742-6596/1754/1/012183
[21]K. I. Taher and A. M. Abdulazeez, "A deep learning convolutional neural network for speech recognition:
a review," International Journal of Science and Business, vol. 5, no. 3, pp. 1-14, 2021.
https://doi.org/10.5281/zenodo.4475361
[22]M. El-Shebli, Y. Sharrab, and D. Al-Fraihat, "Prediction and modeling of water quality using deep neural
networks," Environment, Development and Sustainability, 2023. https://doi.org/10.1007/s10668-023-
03335-5
[23] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin
of Mathematical Biophysics, vol. 5, pp. 115-133, 1943. https://doi.org/10.1007/BF02478259
[24]M. L. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge,
MA: MIT Press, 1988.
[25]D. E. Rumelhart, G. E. Hinton, and J. L. McClelland, "A general framework for parallel distributed
processing," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
Cambridge, MA: MIT Press, 1986, pp. 45-76. https://doi.org/10.7551/mitpress/5236.003.0005
[26]L. See and S. Openshaw, "Applying soft computing approaches to river level forecasting," Hydrological
Sciences Journal, vol. 44, no. 5, pp. 763-778, 1999. https://doi.org/10.1080/02626669909492272
[27]I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[28]Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no 7553, pp. 436-444, 2015.
https://doi.org/10.1038/nature14539
[29]K. Ali, M. Alzaidi, D. Al-Fraihat, and A. M. Elamir, "Artificial intelligence: benefits, application, ethical
issues, and organizational responses," in Intelligent Sustainable Systems. Singapore: Springer, 2023, pp.
685-702. https://doi.org/10.1007/978-981-19-7660-5_62
[30]C. Dawson and R. Wilby, "Hydrological modelling using artificial neural networks," Progress in Physical
Geography, vol. 25, no. 1, pp. 80-108, 2001. https://doi.org/10.1177/030913330102500104
[31]D. Al-Fraihat, M. Alzaidi, and M. Joy, "Why do consumers adopt smart voice assistants for shopping
purposes? A perspective from complexity theory," Intelligent Systems with Applications, vol. 18, article no.
200230, 2023. https://doi.org/10.1016/j.iswa.2023.200230
[32]K. Zeinalzadeh and E. Rezaei, "Determining spatial and temporal changes of surface water quality using
principal component analysis," Journal of Hydrology: Regional Studies, vol. 13, pp. 1-10, 2017.
https://doi.org/10.1016/j.ejrh.2017.07.002
[33]N. Buduma, N. Buduma, and J. Papa, Fundamentals of Deep Learning. Sebastopol, CA: O'Reilly Media
Inc., 2022.
[34] S. Bhanja and A. Das, "Impact of data normalization on a deep neural network for time series forecasting,"
2018 [Online]. Available: https://arxiv.org/abs/1812.05519.
[35] B. Simpson, F. Dutil, Y. Bengio, and J. P. Cohen, "GradMask: reduce overfitting by regularizing saliency,"
2019 [Online]. Available: https://arxiv.org/abs/1904.07478.
[36]H. Pishro-Nik, Introduction to Probability, Statistics, and Random Processes. Blue Bell, PA: Kappa
Research, 2014.
[37]N. Zhang, S. L. Shen, A. Zhou, and Y. S. Xu, "Investigation on performance of neural networks using
quadratic relative error cost function," IEEE Access, vol. 7, pp. 106642-106652, 2019.
https://doi.org/10.1109/ACCESS.2019.2930520
[38]S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller,
"Audio self-supervised learning: a survey," 2022 [Online]. Available: https://arxiv.org/abs/2203.01205.
[39]S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence
Review, vol. 55, pp. 895-943, 2022. https://doi.org/10.1007/s10462-021-09996-w
[40] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, et al., "The Kaldi speech recognition
toolkit," in Proceedings of IEEE 2011 Workshop on Automatic Speech Recognition and Understanding,
Big Island, HI, USA, 2011.
[41]G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, et al., "Deep neural networks for
acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal
Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012. https://doi.org/10.1109/MSP.2012.2205597
[42]A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, et al., "Deep speech: scaling up end-
to-end speech recognition," 2014 [Online]. Available: https://arxiv.org/abs/1412.5567.
[43]D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, et al., "Deep speech 2: end-
to-end speech recognition in English and Mandarin," Proceedings of Machine Learning Research, vol. 48,
pp. 173-182, 2016.
[44]J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech
recognition," Advances in Neural Information Processing Systems, vol. 28, pp. 577-585, 2015.
[45]Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, et al., "Streaming end-to-end
speech recognition for mobile devices," in Proceedings of 2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pp. 6381-6385.
https://doi.org/10.1109/ICASSP.2019.8682336
[46]O. Tarawneh, M. Tarawneh, Y. Sharrab, and M. Altarawneh, "Mushroom classification using machine-
learning techniques," AIP Conference Proceedings, vol. 2979, article no. 030003, 2023.
https://doi.org/10.1063/5.0174721
[47]A. Hannun, "The history of speech recognition to the year 2030," 2021 [Online]. Available:
https://arxiv.org/abs/2108.00084.
[48] D. Khurana, A. Koli, K. Khatter, and S. Singh, "Natural language processing: state of the art, current trends
and challenges," Multimedia Tools and Applications, vol. 82, pp. 3713-3744, 2023.
https://doi.org/10.1007/s11042-022-13428-4
[49]J. Li, "Recent advances in end-to-end automatic speech recognition," APSIPA Transactions on Signal and
Information Processing, vol. 11, no. 1, article no. e8, 2022. https://doi.org/10.1561/116.00000050
[50]S. Keele, "Guidelines for performing systematic literature reviews in software engineering," Keele
University, Staffs, UK, Technical Report No. EBSE-2007-01, 2007.
[51]B. Dendani, H. Bahi, and T. Sari, "Speech enhancement based on a deep AutoEncoder for remote Arabic
speech recognition," in Image and Signal Processing. Cham, Switzerland: Springer, 2020, pp. 221-229,
2020. https://doi.org/10.1007/978-3-030-51935-3_24
[52]R. Amari, Z. Noubigh, S. Zrigui, D. Berchech, H. Nicolas, and M. Zrigui, "Deep convolutional neural network for Arabic speech recognition," in Computational Collective Intelligence. Cham, Switzerland: Springer, 2022, pp. 120-134. https://doi.org/10.1007/978-3-031-16014-1_11
[53]K. Nugroho, E. Noersasongko, and H. A. Santoso, "Javanese gender speech recognition using deep
learning and singular value decomposition," in Proceedings of 2019 International Seminar on Applications
for Information and Communication Technology (iSemantic), Semarang, Indonesia, 2019, pp. 251-254.
https://doi.org/10.1109/ISEMANTIC.2019.8884267
[54]T. F. Abidin, A. Misbullah, R. Ferdhiana, M. Z. Aksana, and L. Farsiah, "Deep neural network for
automatic speech recognition from Indonesian audio using several lexicon types," in Proceedings of 2020
International Conference on Electrical Engineering and Informatics (ICELTICs), Aceh, Indonesia, 2020,
pp. 1-5. https://doi.org/10.1109/ICELTICs50595.2020.9315538
[55]Z. Ling, "An acoustic model for English speech recognition based on deep learning," in Proceedings of
2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA),
Qiqihar, China, 2019, pp. 610-614. https://doi.org/10.1109/ICMTMA.2019.00140
[56]D. A. Rahman and D. P. Lestari, "Indonesian spontaneous speech recognition system using deep neural
networks," in Proceedings of 2020 7th International Conference on Advance Informatics: Concepts, Theory
and Applications (ICAICTA), Tokoname, Japan, 2020, pp. 1-3.
https://doi.org/10.1109/ICAICTA49861.2020.9429070
[57]X. Chu, "Speech recognition method based on deep learning and its application," in Proceedings of 2021
International Conference of Social Computing and Digital Economy (ICSCDE), Chongqing, China, 2021,
pp. 299-302. https://doi.org/10.1109/ICSCDE54196.2021.00075
[58]S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang,
"Quartznet: deep automatic speech recognition with 1D time-channel separable convolutions," in
Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Barcelona, Spain, 2020, pp. 6124-6128. https://doi.org/10.1109/ICASSP40776.2020.9053889
[59]S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, "Deep contextualized acoustic representations for semi-
supervised speech recognition," in Proceedings of 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6429-6433.
https://doi.org/10.1109/ICASSP40776.2020.9053176
[60] W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung, and M. Picheny, "Distributed deep learning
strategies for automatic speech recognition," in Proceedings of 2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 5706-5710.
https://doi.org/10.1109/ICASSP.2019.8682888
[61]M. A. Al Amin, M. T. Islam, S. Kibria, and M. S. Rahman, "Continuous Bengali speech recognition based
on deep neural network," in Proceedings of 2019 International Conference on Electrical, Computer and
Communication Engineering (ECCE), Cox's Bazar, Bangladesh, 2019, pp. 1-6.
https://doi.org/10.1109/ECACE.2019.8679341
[62]V. Bhardwaj and V. Kadyan, "Deep neural network trained Punjabi children's speech recognition system
using Kaldi toolkit," in Proceedings of 2020 IEEE 5th International Conference on Computing
Communication and Automation (ICCCA), Greater Noida, India, 2020, pp. 374-378.
https://doi.org/10.1109/ICCCA49541.2020.9250780
[63]Z. Chen and H. Yang, "Yi language speech recognition using deep learning methods," in Proceedings of
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference
(ITNEC), Chongqing, China, 2020, pp. 1064-1068. https://doi.org/10.1109/ITNEC48623.2020.9084771
[64]W. Wang, X. Yang, and H. Yang, "End-to-End low-resource speech recognition with a deep CNN-LSTM
encoder," in Proceedings of 2020 IEEE 3rd International Conference on Information Communication and
Signal Processing (ICICSP), Shanghai, China, 2020, pp. 158-162.
https://doi.org/10.1109/ICICSP50920.2020.9232119
[65]Y. Shan, M. Liu, Q. Zhan, S. Du, J. Wang, and X. Xie, "Speech recognition based on a deep tensor neural
network and multifactor feature," in Proceedings of 2019 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 2019, pp. 650-654.
https://doi.org/10.1109/APSIPAASC47483.2019.9023251
[66]S. T. Abate, M. Y. Tachbelie, and T. Schultz, "Deep neural networks-based automatic speech recognition
for four Ethiopian languages," in Proceedings of 2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 8274-8278.
https://doi.org/10.1109/ICASSP40776.2020.9053883
[67]A. Rista and A. Kadriu, "End-to-end speech recognition model based on deep learning for Albanian," in
Proceedings of 2021 44th International Convention on Information, Communication and Electronic
Technology (MIPRO), Opatija, Croatia, 2021, pp. 442-446.
https://doi.org/10.23919/MIPRO52101.2021.9596713
[68]Q. An, K. Bai, M. Zhang, Y. Yi, and Y. Liu, "Deep neural network based speech recognition systems
under noise perturbations," in Proceedings of 2020 21st International Symposium on Quality Electronic
Design (ISQED), Santa Clara, CA, USA, 2020, pp. 377-382.
https://doi.org/10.1109/ISQED48828.2020.9136978
[69]H. Xu, H. Yang, and Y. You, "Donggan speech recognition based on a deep neural network," in
Proceedings of 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence
Conference (ITAIC), Chongqing, China, 2019, pp. 354-358. https://doi.org/10.1109/ITAIC.2019.8785451
[70]A. Sirwan, K. A. Thama, and S. Suyanto, "Indonesian automatic speech recognition based on an end-to-
end deep learning model," in Proceedings of 2022 IEEE International Conference on Cybernetics and
Computational Intelligence (CyberneticsCom), Malang, Indonesia, 2022, pp. 410-415.
https://doi.org/10.1109/CyberneticsCom55287.2022.9865253
[71]R. G. Kodali, D. P. Manukonda, and R. Sundararajan, "Bilingual speech recognition based on deep neural
networks and directed acyclic word graphs," in Proceedings of 2019 International Conference on Data
Mining Workshops (ICDMW), Beijing, China, 2019, pp. 1-6.
https://doi.org/10.1109/ICDMW48858.2019.9024758
[72]H. Karunathilaka, V. Welgama, T. Nadungodage, and R. Weerasinghe, "Low-resource Sinhala speech
recognition using deep learning," in Proceedings of 2020 20th International Conference on Advances in
ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 2020, pp. 196-201.
https://doi.org/10.1109/ICTer51097.2020.9325468
[73]J. Jorge, A. Giménez, J. A. Silvestre-Cerda, J. Civera, A. Sanchis, and A. Juan, "Live streaming speech
recognition using deep bidirectional LSTM acoustic models and interpolated language models," IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 148-161, 2021.
https://doi.org/10.1109/TASLP.2021.3133216
[74]C. W. Chen, Y. F. Yeh, C. L. Lin, S. P. Tseng, and J. F. Wang, "Hybrid deep neural network acoustic
model for Taiwanese speech recognition," in Proceedings of 2020 8th International Conference on Orange
Technology (ICOT), Daegu, South Korea, 2020, pp. 1-5.
https://doi.org/10.1109/ICOT51877.2020.9468762
[75]P. Wu and M. Wang, "Large vocabulary continuous speech recognition with deep recurrent network," in
Proceedings of 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing,
China, 2020, pp. 794-798. https://doi.org/10.1109/ICSIP49896.2020.9339455
[76]M. S. Chauhan, R. Mishra, and M. I. Patel, "A speech recognition and separation system using deep
learning," in Proceedings of 2021 International Conference on Innovative Computing, Intelligent
Communication and Smart Electrical Systems (ICSES), Chennai, India, 2021, pp. 1-5.
https://doi.org/10.1109/ICSES52305.2021.9633779
[77]R. Masumura, M. Ihori, A. Takashima, T. Tanaka, and T. Ashihara, "End-to-end automatic speech
recognition with deep mutual learning," in Proceedings of 2020 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 2020, pp.
632-637.
[78]R. Yang, J. Yang, and Y. Lu, "Indonesian speech recognition based on deep neural network," in
Proceedings of 2021 International Conference on Asian Language Processing (IALP), Singapore, 2021, pp.
36-41. https://doi.org/10.1109/IALP54817.2021.9675280
[79]W. Zhang, M. Zhai, Z. Huang, C. Liu, W. Li, and Y. Cao, "Towards end-to-end speech recognition with
deep multipath convolutional neural networks," in Intelligent Robotics and Applications. Cham,
Switzerland: Springer, 2019, pp. 332-341. https://doi.org/10.1007/978-3-030-27529-7_29
[80]H. Teng, S. Wang, X. Liu, and X. G. Yue, "Speech recognition model based on deep learning and
application to a pronunciation quality evaluation system," in Proceedings of the 2019 International
Conference on Data Mining and Machine Learning, Hong Kong, China, 2019, pp. 1-5.
https://doi.org/10.1145/3335656.3335657
[81]S. Zhao, C. Ni, R. Tong, and B. Ma, "Multi-task multi-network joint-learning of deep residual networks
and cycle-consistency generative adversarial networks for robust speech recognition," in Proceedings
of 20th Annual Conference of the International Speech Communication Association (INTERSPEECH),
Graz, Austria, 2019, pp. 1238-1242. https://doi.org/10.21437/interspeech.2019-2078
[82] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition,"
in Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), New Orleans, LA, USA, 2017, pp. 4845-4849. https://doi.org/10.1109/ICASSP.2017.7953077
[83]B. Gong, R. Cai, Z. Cai, Y. Ding, and M. Peng, "Selection of an acoustic modeling unit for Tibetan speech
recognition based on deep learning," MATEC Web of Conferences, vol. 336, article no. 06014, 2021.
https://doi.org/10.1051/matecconf/202133606014
[84]T. Watzel, L. Li, L. Kurzinger, and G. Rigoll, "Deep neural network quantizers outperforming continuous
speech recognition systems," in Speech and Computer. Cham, Switzerland: Springer, 2019, pp. 530-539.
https://doi.org/10.1007/978-3-030-26061-3_54
[85]E. D. Emiru, Y. Li, S. Xiong, and A. Fesseha, "Speech recognition system based on deep neural network
acoustic modeling for a poorly resourced language-Amharic," in Proceedings of the 3rd International
Conference on Telecommunications and Communication Engineering, Tokyo, Japan, 2019, pp. 141-145.
https://doi.org/10.1145/3369555.3369564
[86] Y. F. Yeh, B. H. Su, Y. Y. Ou, J. F. Wang, and A. C. Tsai, "Taiwanese speech recognition based on hybrid
deep neural network architecture," in Proceedings of the 32nd Conference on Computational Linguistics
and Speech Processing (ROCLING), Taipei, Taiwan, 2020, pp. 102-113.
[87]L. Toth and G. Gosztolya, "Adversarial multi-task learning of speaker-invariant deep neural network
acoustic models for speech recognition," in Proceedings of the 1st International Conference on Advances
in Signal Processing and Artificial Intelligence (ASPAI), Barcelona, Spain, 2019, pp. 82-86.
[88]H. A. Alsayadi, A. A. Abdelhamid, I. Hegazy, and Z. T. Fayed, "Arabic speech recognition using end-to-
end deep learning," IET Signal Processing, vol. 15, no. 8, pp. 521-534, 2021.
https://doi.org/10.1049/sil2.12057
[89]N. Zerari, S. Abdelhamid, H. Bouzgou, and C. Raymond, "Bidirectional deep architecture for Arabic
speech recognition," Open Computer Science, vol. 9, no. 1, pp. 92-102, 2019. https://doi.org/10.1515/comp-
2019-0004
[90]H. Alsayadi, A. Abdelhamid, I. Hegazy, and Z. Taha, "Data augmentation for Arabic speech recognition
based on end-to-end deep learning," International Journal of Intelligent Computing and Information
Sciences, vol. 21, no. 2, pp. 50-64, 2021.
[91]G. Cheng, X. Li, and Y. Yan, "Highway connections to enable deep small-footprint LSTM-RNNs for speech recognition," Chinese Journal of Electronics, vol. 28, no. 1, pp. 107-112, 2019. https://doi.org/10.1049/cje.2018.11.008
[92]Y. R. Oh, K. Park, H. B. Jeon, and J. G. Park, "Automatic proficiency assessment of Korean speech read
aloud by non-natives using bidirectional LSTM-based speech recognition," ETRI Journal, vol. 42, no. 5,
pp. 761-772, 2020. https://doi.org/10.4218/etrij.2019-0400
[93]R. Marimuthu, "Speech recognition using a Taylor-gradient descent political optimization-based deep
residual network," Computer Speech & Language, vol. 78, article no. 101442, 2023.
https://doi.org/10.1016/j.csl.2022.101442
[94]J. R. Koya and S. V. M. Rao, "Deep bidirectional neural networks for robust speech recognition under
heavy background noise," Materials Today: Proceedings, vol. 46, no. 6, pp. 4117-4121, 2021.
https://doi.org/10.1016/j.matpr.2021.02.640
[95]Y. Dokuz and Z. Tufekci, "Mini-batch sample selection strategies for deep learning-based speech
recognition," Applied Acoustics, vol. 171, article no. 107573, 2021.
https://doi.org/10.1016/j.apacoust.2020.107573
[96]T. Zoughi, M. M. Homayounpour, and M. Deypir, "Adaptive windows multiple deep residual networks
for speech recognition," Expert Systems with Applications, vol. 139, article no. 112840, 2020.
https://doi.org/10.1016/j.eswa.2019.112840
[97] Z. Song, "English speech recognition based on deep learning with multiple features," Computing, vol. 102,
pp. 663-682, 2020. https://doi.org/10.1007/s00607-019-00753-0
[98] A. Shewalkar, D. Nyavanandi, and S. A. Ludwig, "Performance evaluation of deep neural networks applied
to speech recognition: RNN, LSTM and GRU," Journal of Artificial Intelligence and Soft Computing
Research, vol. 9, no. 4, pp. 235-245, 2019. https://doi.org/10.2478/jaiscr-2019-0006
[99]H. Veisi and A. Haji Mani, "Persian speech recognition using deep learning," International Journal of
Speech Technology, vol. 23, pp. 893-905, 2020. https://doi.org/10.1007/s10772-020-09768-x
[100]A. Santhanavijayan, D. Naresh Kumar, and G. Deepak, "A semantic-aware strategy for automatic speech
recognition incorporating deep learning models," in Intelligent System Design. Singapore: Springer, 2021,
pp. 247-254. https://doi.org/10.1007/978-981-15-5400-1_25
[101]N. Q. Pham, T. S. Nguyen, J. Niehues, M. Muller, S. Stuker, and A. Waibel, "Very deep self-attention
networks for end-to-end speech recognition," in Proceedings of the 20th Annual Conference of the
International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 66-70.
https://doi.org/10.21437/interspeech.2019-2702
[102]V. Passricha and R. K. Aggarwal, "A hybrid of deep CNN and bidirectional LSTM for automatic speech
recognition," Journal of Intelligent Systems, vol. 29, no. 1, pp. 1261-1274, 2020.
https://doi.org/10.1515/jisys-2018-0372
[103]M. D. Hassan, A. N. Nasret, M. R. Baker, and Z. S. Mahmood, "Enhancement automatic speech
recognition by deep neural networks," Periodicals of Engineering and Natural Sciences, vol. 9, no. 4, pp.
921-927, 2021. https://doi.org/10.21533/pen.v9i4.2450
[104]W. Ying, L. Zhang, and H. Deng, "Sichuan dialect speech recognition with a deep LSTM network,"
Frontiers of Computer Science, vol. 14, pp. 378-387, 2020. https://doi.org/10.1007/s11704-018-8030-z
[105]W. Zhang, X. Cui, U. Finkler, G. Saon, A. Kayi, A. Buyuktosunoglu, B. Kingsbury, D. Kung, and M.
Picheny, "A highly efficient distributed deep learning system for automatic speech recognition," 2019
[Online]. Available: https://arxiv.org/abs/1907.05701.
[106]L. Pipiras, R. Maskeliunas, and R. Damasevicius, "Lithuanian speech recognition using purely phonetic
deep learning," Computers, vol. 8, no. 4, article no. 76, 2019. https://doi.org/10.3390/computers8040076
[107]V. Kadyan, M. Dua, and P. Dhiman, "Enhancing the accuracy of long contextual dependencies for a
Punjabi speech recognition system using deep LSTM," International Journal of Speech Technology, vol.
24, pp. 517-527, 2021. https://doi.org/10.1007/s10772-021-09814-2
[108]D. Raval, V. Pathak, M. Patel, and B. Bhatt, "Improving deep learning-based automatic speech
recognition for Gujarati," Transactions on Asian and Low-Resource Language Information Processing, vol.
21, no. 3, article no. 47, 2021. https://doi.org/10.1145/3483446
[109]Y. Xu, "English speech recognition and evaluation of pronunciation quality using deep learning," Mobile
Information Systems, vol. 2022, article no. 7186375, 2022. https://doi.org/10.1155/2022/7186375
[110]A. Mukhamadiyev, I. Khujayarov, O. Djuraev, and J. Cho, "An automatic speech recognition method
based on deep learning approaches to the Uzbek language," Sensors, vol. 22, no. 10, article no. 3683, 2022.
https://doi.org/10.3390/s22103683
[111]M. Ali Humayun, I. A. Hameed, S. Muslim Shah, S. Hassan Khan, I. Zafar, S. Bin Ahmed, and J. Shuja,
"Regularized Urdu speech recognition with semi-supervised deep learning," Applied Sciences, vol. 9, no. 9,
article no. 1956, 2019. https://doi.org/10.3390/app9091956
[112]A. Alsobhani, H. M. A. Alabboodi, and H. Mahdi, "Speech recognition using convolution deep neural
networks," Journal of Physics: Conference Series, vol. 1973, article no. 012166, 2021.
https://doi.org/10.1088/1742-6596/1973/1/012166
[113]T. Zoughi and M. M. Homayounpour, "A gender-aware deep neural network structure for speech
recognition," Iranian Journal of Science and Technology, Transactions of Electrical Engineering, vol. 43,
pp. 635-644, 2019. https://doi.org/10.1007/s40998-019-00177-8
[114]T. Rajapakshe, R. K. Rana, S. Latif, S. Khalifa, and B. W. Schuller, "Pre-training in deep reinforcement
learning for automatic speech recognition," 2019 [Online]. Available: https://arxiv.org/abs/1910.11256.
[115] M. U. Farooq, F. Adeeba, S. Rauf, and S. Hussain, "Improving large vocabulary Urdu speech recognition
system using deep neural networks," in Proceedings of the 20th Annual Conference of the International
Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 2978-2982.
https://doi.org/10.21437/interspeech.2019-2629
[116]S. Isobe, S. Tamura, and S. Hayamizu, "Speech recognition using deep canonical correlation analysis in
noisy environments," in Proceedings of the 10th International Conference on Pattern Recognition
Applications and Methods (ICPRAM), Virtual Event, 2021, pp. 63-70.
https://doi.org/10.5220/0010268200630070
[117]H. Lei, Y. Xiao, Y. Liang, D. Li, and H. Lee, "DLD: an optimized Chinese speech recognition model
based on deep learning," Complexity, vol. 2022, article no. 6927400, 2022.
https://doi.org/10.1155/2022/6927400
[118]T. Rajapakshe, S. Latif, R. Rana, S. Khalifa, and B. Schuller, "Deep reinforcement learning with pre-
training for time-efficient training of automatic speech recognition," 2020 [Online]. Available:
https://arxiv.org/abs/2005.11172.
[119]S. Shukla and M. Jain, "A novel stochastic deep resilient network for effective speech recognition,"
International Journal of Speech Technology, vol. 24, pp. 797-806, 2021. https://doi.org/10.1007/s10772-
021-09851-x
[120]I. K. Tantawi, M. A. M. Abushariah, and B. H. Hammo, "A deep learning approach to automatic speech
recognition of The Holy Qur’ān recitations," International Journal of Speech Technology, vol. 24, pp. 1017-
1032, 2021. https://doi.org/10.1007/s10772-021-09853-9
[121]B. Tombaloglu and H. Erdem, "Deep learning-based automatic speech recognition for Turkish," Sakarya
University Journal of Science, vol. 24, no. 4, pp. 725-739, 2020.
https://doi.org/10.16984/saufenbilder.711888
[122]Y. R. Oh, K. Park, and J. G. Park, "Online speech recognition using multichannel parallel acoustic score
computation and deep neural network (DNN)-based voice-activity detector," Applied Sciences, vol. 10, no.
12, article no. 4091, 2020. https://doi.org/10.3390/app10124091
[123]T. G. Fantaye, J. Yu, and T. T. Hailu, "Investigation of automatic speech recognition systems via the
multilingual deep neural network modeling methods for a very low-resource language, Chaha," Journal of
Signal and Information Processing, vol. 11, no. 1, pp. 1-21, 2020. https://doi.org/10.4236/jsip.2020.111001
[124]H. Hugeng and E. Hansel, "Implementation of Android-based speech recognition for an Indonesian
geography dictionary," Ultima Computing: Jurnal Sistem Komputer, vol. 7, no. 2, pp. 76-82, 2015.
https://doi.org/10.31937/sk.v7i2.296
[125]G. Kaur, M. Srivastava, and A. Kumar, "Speech recognition using enhanced features with deep belief
network for real-time applications," Wireless Personal Communications, vol. 120, pp. 3225-3242, 2021.
https://doi.org/10.1007/s11277-021-08610-0
[126] Z. Niu, "Voice detection and deep learning algorithms application in remote English translation classroom
monitoring," Mobile Information Systems, vol. 2022, article no. 3340999, 2022.
https://doi.org/10.1155/2022/3340999
[127] J. Oruh, S. Viriri and A. Adegun, "Long short-term memory recurrent neural network for automatic
speech recognition," IEEE Access, vol. 10, pp. 30069-30079, 2022.
https://doi.org/10.1109/ACCESS.2022.3159339
[128]A. Dehghani and S. Seyyedsalehi, "Time-frequency localization using deep convolutional maxout neural
network in Persian speech recognition," Neural Processing Letters, vol. 55, pp. 3205-3224, 2023.
https://doi.org/10.1007/s11063-022-11006-1
[129] X. Xie, X. Sui, X. Liu, and L. Wang, "Investigation of deep neural network acoustic modelling approaches
for low-resource accented Mandarin speech recognition," 2021 [Online]. Available:
https://arxiv.org/abs/2201.09432.
[130] A. K. Singh, P. Singh, and K. Nathwani, "Using deep learning techniques and inferential speech statistics
for AI synthesised speech recognition," 2021 [Online]. Available: https://arxiv.org/abs/2107.11412.
[131]J. Yu, N. Ye, X. Du, and L. Han, "Automated English speech recognition using dimensionality reduction
with deep learning approach," Wireless Communications and Mobile Computing, vol. 2022, article no.
3597347, 2022. https://doi.org/10.1155/2022/3597347
[132]S. Girirajan and A. A. Pandian, "An acoustic model with a hybrid deep bidirectional single gated unit
(DBSGU) for low resource speech recognition," Multimedia Tools and Applications, vol. 81, pp. 17169-
17184, 2022. https://doi.org/10.1007/s11042-022-12723-4
[133] H. Dridi and K. Ouni, "Towards a robust combined deep architecture for speech recognition: experiments
on TIMIT," International Journal of Advanced Computer Science and Applications, vol. 11, no. 4, pp. 525-
534, 2020. https://doi.org/10.14569/ijacsa.2020.0110469
[134]T. J. Park and J. H. Chang, "Deep Q-network-based noise suppression for robust speech recognition,"
Turkish Journal of Electrical Engineering and Computer Sciences, vol. 29, no. 5, pp. 2362-2373, 2021.
https://doi.org/10.3906/elk-2011-144
[135] S. Lee, S. Han, S. Park, K. Lee, and J. Lee, "Korean speech recognition using deep learning," The Korean
Journal of Applied Statistics, vol. 32, no. 2, pp. 213-227, 2019.
https://doi.org/10.5351/KJAS.2019.32.2.213
[136]P. Dubey and B. Shah, "Deep speech based end-to-end automated speech recognition (ASR) for Indian-English accents," 2022 [Online]. Available: https://arxiv.org/abs/2204.00977.
[137]H. Seki, K. Yamamoto, T. Akiba, and S. Nakagawa, "Discriminative learning of filterbank layer within
deep neural network-based speech recognition for speaker adaptation," IEICE Transactions on Information
and Systems, vol. 102D, no. 2, pp. 364-374, 2019. https://doi.org/10.1587/transinf.2018EDP7252
[138]C. Bai, X. Cui, and A. Li, "Robust speech recognition model using multi-source federal learning after
distillation and deep edge intelligence," Journal of Physics: Conference Series, vol. 2033, no. 1, article no.
012158, 2021. https://doi.org/10.1088/1742-6596/2033/1/012158
[139]P. Shao, "Chinese speech recognition system based on deep learning," Journal of Physics: Conference
Series, vol. 1549, no. 2, article no. 022012, 2020. https://doi.org/10.1088/1742-6596/1549/2/022012
[140]A. M. Samin, M. H. Kobir, S. Kibria, and M. S. Rahman, "Deep learning-based large vocabulary
continuous speech recognition of an under-resourced language Bangladeshi Bangla," Acoustical Science
and Technology, vol. 42, no. 5, pp. 252-260, 2021. https://doi.org/10.1250/ast.42.252
[141]A. Rista and A. Kadriu, "A model for Albanian speech recognition using end-to-end deep learning
techniques," Interdisciplinary Journal of Research and Development, vol. 9, no. 3, article no. 1, 2022.
https://doi.org/10.56345/ijrdv9n301
[142] G. Savitha, B. N. Shankar, and S. Shahi, "Deep recurrent neural network based audio speech recognition
system," Information Technology in Industry, vol. 9, no. 2, 2021, pp. 941-949, 2021.
https://doi.org/10.17762/itii.v9i2.434
[143]H. Abera and S. H. Mariam, "Speech recognition for Tigrinya language using deep neural network
approach," in Proceedings of the 2019 Workshop on Widening NLP (WNLP), Florence, Italy, 2019, pp. 7-
9. https://aclanthology.org/W19-3603
[144]M. G. Al-Obeidallah, D. G. Al-Fraihat, A. M. Khasawneh, A. M. Saleh, and H. Addous, "Empirical investigation of the impact of the adapter design pattern on software maintainability," in Proceedings of 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 2021, pp. 206-211. https://doi.org/10.1109/ICIT52682.2021.9491719
https://doi.org/10.1109/ICIT52682.2021.9491719
[145]L. Huang, Z. Xiang, J. Yun, Y. Sun, Y. Liu, D. Jiang, H. Ma, and H. Yu, "Target detection based on two-
stream convolution neural network with self-powered sensors information," IEEE Sensors Journal, vol. 23,
no. 18, pp. 20681-20690, 2023. https://doi.org/10.1109/JSEN.2022.3220341