RESEARCH (Research Manuscript)  Open Access
Human-centric Computing and Information Sciences (2024) 14:15
DOI: https://doi.org/10.22967/HCIS.2024.14.015
Received: December 12, 2022; Accepted: September 24, 2023; Published: March 15, 2024
Speech Recognition Utilizing Deep Learning:
A Systematic Review of the Latest Developments
Dimah Al-Fraihat1,*, Yousef Sharrab2, Faisal Alzyoud3, Ayman Qahmash4, Monther Tarawneh3, and Adi Maaita5
Abstract
Speech recognition is a natural language processing task that involves the computerized transcription of spoken
language in real time. Numerous studies have been conducted on the utilization of deep learning (DL) models
for speech recognition. However, this field is advancing rapidly. This systematic review provides an in-depth
and comprehensive examination of studies published from 2019 to 2022 on speech recognition utilizing DL
techniques. Initially, 575 studies were retrieved and examined. After filtration and application of the inclusion
and exclusion criteria, 94 were retained for further analysis. A literature survey revealed that 17% of the studies
used stand-alone models, whereas 52% used hybrid models. This indicates a shift towards the adoption of
hybrid models, which were proven to achieve better results. Furthermore, most of the studies employed public
datasets (56%) and used the English language (46%), whereas their environments were neutral (81%). The
word error rate was the most frequently used method of evaluation, while Mel-frequency cepstral coefficients
were the most frequently employed method of feature extraction. Another observation was the lack of studies
utilizing transformers, which were demonstrated to be powerful models that can facilitate fast learning speeds,
allow parallelization and improve the performance of low-resource languages. The results also revealed
potential and interesting areas of future research that had received scant attention in earlier studies.
Keywords
Speech Recognition, Deep Learning (DL), Deep Neural Networks (DNNs), Natural Language Processing
(NLP), Systematic Review
1. Introduction
Speech is a ubiquitous and essential mode of communication among humans, facilitating the expression
of ideas, thoughts and emotions and enabling engagement in meaningful conversations [1]. As our lives
become increasingly intertwined with machines and smart devices, new communication techniques that
align with our digitalized world have emerged. Speech recognition has played a pivotal role in enabling
us to adapt to these novel modes of communication: it not only empowers individuals with disabilities to
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits
unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
*Corresponding Author: Dimah Al-Fraihat (d.fraihat@iu.edu.jo)
1Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
2Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
3Department of Computer Science, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
4Department of Information Systems, College of Computer Science, King Khalid University, Abha, Saudi Arabia
5Department of Software Engineering, Faculty of Information Technology, Middle East University, Amman, Jordan
interact, share knowledge and engage in open conversations, but also holds promise for revolutionizing
communication between machines using natural languages.
The field of speech recognition has drawn considerable attention in the past decade, driven by the
availability of powerful computing resources and vast amounts of training data. The utilization of deep
learning (DL) models and algorithms has further bolstered the recognition rates of speech recognition
systems. Positioned within the broader domain of natural language processing (NLP), speech recognition
involves the perception and recognition of spoken words, often coupled with transcription and translation
capabilities [2]. It begins by extracting relevant features from the speech signal by employing such
techniques as signal processing, acoustic modelling, and linguistic analysis. Segmentation algorithms
play a vital role in delineating word boundaries, thereby enhancing recognition precision. Pattern
recognition techniques also enable the mapping of audio features to linguistic units, such as phonemes,
words or sentences, exploiting statistical consistencies within the speech signal. Machine learning (ML)
algorithms, including deep neural networks (DNNs), have proven instrumental in learning and
recognizing these patterns, leading to accurate and efficient speech recognition. The future of speech
recognition technology holds promising prospects. It is poised to surpass our expectations in meeting the
evolving needs of the community and has the potential to facilitate seamless communication between
machines using natural languages [3]. By continuously advancing the boundaries of research and
development, speech recognition systems will continue to enhance our ability to communicate effectively
and effortlessly in an increasingly interconnected world.
The use of speech recognition has proven to be efficient; it is commonly used in the domains of
language identification [3], phone banking, robotics [4], attendance systems [5], spoken commands,
security [6], education [7], smart healthcare [8], and smart cities [9]. However, speech recognition is a
challenging task, and there are several open problems that require ongoing attention and innovative
solutions. One significant challenge is the existence of dialects and multilingualism. Speech recognition
systems struggle to interpret and recognize spoken words in different dialects and languages accurately.
The existence of multiple dialects within a language, along with code-switching, in which individuals
mix languages during conversations, poses considerable difficulties. Another challenge is that of
variability among speakers. Different individuals have unique speech characteristics, such as accents,
voice quality, and speaking styles [10]. This speaker variability presents challenges in training speech
recognition systems to recognize and adapt to different speakers accurately. Achieving speaker-
independent recognition and handling speaker variability effectively remain open problems in the field.
In addition, speech recognition systems need to address issues related to noise and adverse acoustic
conditions, out-of-vocabulary words and contextual understanding, ambiguity in speech and low-
resource languages and domains [11]. Ethical and privacy concerns are also critical considerations [12].
The field of speech recognition is undergoing rapid development, with major corporations such as
Google and Microsoft developing impressive tools to that end. For instance, Microsoft has launched its
own speech recognition system, called the Microsoft Audio Video Indexing Service. The main
components of speech recognition include an input device for capturing speech, a digital signal processor
for filtering out surrounding noise, an acoustic model for identifying speech patterns, and a language
model for decoding words [13]. Scholars have explored various research trends to improve the
performance and capabilities of speech recognition systems. One prominent trend is the utilization of
deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks
(RNNs), which have shown promising results in speech recognition tasks. These DL models can
automatically learn complex patterns and features from raw audio data, enabling more accurate and robust
speech recognition. DL models offer both high flexibility and the ability to learn complex patterns, but
they require large amounts of labelled training data and computationally intensive training procedures.
Another research trend focuses on the integration of contextual information and language modelling in
speech recognition systems. By incorporating contextual cues and linguistic knowledge, such as grammar
rules and semantic information, into the recognition process, the accuracy and understanding of spoken
words can be enhanced. This trend involves exploring advanced language modelling techniques, such as
RNN language models and transformer-based models, to improve contextual understanding and handle
variations in vocabulary [13].
With the initial retrieval and analysis of 575 studies, this study highlights the sheer volume of research
conducted utilizing DL algorithms in the field of speech recognition. Hence, there is a need to appraise
and summarize the collective findings and identify the latest developments. While acknowledging the
valuable review studies conducted by previous researchers, it should be noted that this study follows a
systematic review methodology and employs a rigorous selection and analysis process. The motivation
for this research stems from the dynamic and swiftly evolving landscape of DL and its potential to
enhance the accuracy and efficiency of speech recognition systems to a significant degree. This study
conducts a systematic review of speech recognition, focusing specifically on the years 2019 to 2022. The
timeframe chosen for the review reflects the intention to capture the most recent developments in the
field, acknowledging the fast-paced nature of research advances during this period. The objective of this
study is to complement previous research by examining studies conducted in subsequent years and
extending the analysis to include the period 2019–2022, thus highlighting recent advances and emerging
trends in DL techniques for speech recognition. To the best of the authors’ knowledge, this is the first
systematic review paper on speech recognition using DL to cover the years 2019–2022.
Furthermore, the study is underpinned by the dual impetus of undertaking a methodical and exhaustive
analysis of recent advances in speech recognition via DL and of providing an overarching background
that may guide future research and fill gaps in existing research. In this pursuit, the present study
contributes to a richer, deeper understanding of the contemporary landscape and helps to identify trends
and potential pathways for the evolution of speech recognition technology. A particularly salient
contribution of this study is its identification of underexplored avenues and nascent prospects for future
research endeavors. Specifically, we highlight the underutilization of transformer models, a subset of DL
techniques that have exhibited considerable potency in diverse NLP tasks. This insight lends the study
an anticipatory dimension, serving to elucidate not only the present trajectory of the field but also a
roadmap for forthcoming explorations and enquiries.
The paper is organized into the following seven sections: Section 1 contains the introduction; Section
2 introduces previous review studies and related work; Section 3 presents a comprehensive background
to the topic; Section 4 presents the research questions, selection criteria and research methodology;
Section 5 discusses the results and answers to the formulated questions; Section 6 details the conclusions;
and, finally, Section 7 discusses future research directions.
2. Related Work
A few systematic reviews have been conducted in the field of speech recognition. For example, one
review in this field examined research published from 2006 to 2018 [14]. The review focused on DNNs
and identified 174 studies from which to extract specific information and develop a statistical analysis
thereof. The main purpose of the survey was to present papers using multilevel processing prior to
Markov model-based coding of word sequences. The study also identified the models that had been used
in previous research, such as deep belief network (DBN), CNN and deep convex network, and classified
the evaluation techniques used. Overall, the study found that more focus should be directed towards RNN
models.
Another study provided an overview of DL for low-resource languages in the field of speech
recognition [15]. The authors reviewed the history and research status of two models: RNN and CNN.
Some techniques were also introduced to enhance data performance and model training, such as making
improvements to end-to-end (E2E) systems by integrating more language knowledge, studying complex
and noisy environments, and strengthening the acoustic and language model. Another technique involved
expanding modal information to mine speech structure knowledge across multiple modalities, such as speech,
images, and videos. This review paper concluded that further improvements should be made to speech
recognition systems by developing more sophisticated models that could work for several accents and
noisy environments.
Another review study surveyed mouth interface devices based on sensors for the recognition of speech
using DL techniques [16]. It also focused on communication difficulties for disabled people, visual
speech recognition, and a silent speech interface. Another study conducted a brief review of 17 papers
on Arabic speech recognition using neural networks [17]. For the same language, another survey
highlighted the models found in 35 studies and the evaluation techniques and metrics used in dialectal
Arabic speech [18]. The study provided details on the progress of dialectal Arabic and summarized the
challenges and issues raised in previous research.
Another review of speech recognition using DL concluded that using both probabilistic neural
networks and RNNs can correctly recognize 98% of phonemes, followed by hidden Markov models
(HMMs) [19]. In a related study conducted by [20], the basic principles of voice recognition technology,
related theories and problems were surveyed. The study presented optimization methods of artificial
intelligence (AI) and simulation training for speech recognition based on DL. It highlighted several issues
that affect the quality of this technology, such as its capability to handle noisy environments and low
endpoint detection levels. To address these challenges, the study introduced methods of optimization and
analysis aimed at improving speech recognition performance by processing information more effectively
[20].
The review study undertaken by [21] examined several selected studies on CNN-based speech
recognition [21, 22]. From their comparison of the selected studies, the researchers identified some of the
weaknesses of CNN, but stated that CNN significantly reduced both model complexity and the word
error rate (WER).
Overall, previous literature surveys show that since 2019, no systematic literature review has been
conducted in the field of speech recognition using DL. To the best of the authors’ knowledge, this review
research is one of the most comprehensive yet undertaken in this field, as it addresses
recent advances as well as the latest technologies that have not been covered in earlier
research. Furthermore, this review not only adopts the rigour and guidelines of previous systematic
reviews, but is also systematic, transparent, and unbiased in its selection of studies, avoiding the bias
that can result from a non-systematic strategy.
The retrieved studies were independently screened for eligibility using predetermined inclusion and
exclusion criteria. This study presents a solid theoretical background and provides an overview of ML,
DL, DNNs, NLP and speech recognition so as to enable a full understanding of the topic. A total of 575
studies were initially retrieved and analyzed. Among these studies, which covered the period of 2019
until the end of 2022, 94 were kept for further analysis after data filtration and application of the inclusion
and exclusion criteria. The 94 selected studies were carefully and thoroughly examined to extract
information that would help to identify patterns in the use of DL in speech recognition. The extracted
information was used to produce statistics that highlighted research gaps and future research directions
in the domain.
The following section provides a comprehensive research background and presents the key concepts
relevant to the research topic, the historical evolution of and recent advances in speech recognition based
on DL, and an analysis of various methods of evaluating performance.
3. Background
ML is the study of computer algorithms that can improve automatically with training, as well as a
branch of AI that is based on statistics. The purpose of ML is to develop a mathematical model that can
make predictions about the future without explicitly programming it to do so. Some of its engineering
applications include robotics, vision, speech recognition, and voice recognition. As shown in Fig. 1, DL
is a branch of ML that relies on DNN models. Both ML and DL include supervised learning (SL),
unsupervised learning (UL), reinforcement learning (RL), and self-supervised learning (SSL).
Fig. 1. The ancestors of machine learning and its children and grandchildren.
3.1 Deep Learning and Deep Neural Networks
DL is distinct from other ML and AI techniques in that it requires comparatively little human
intervention. To address most ML issues without the need for domain-specific feature engineering, DL
employs DNNs with multi-hidden layers. The idea of neural networks was first introduced in 1943 [23]
and later developed in 1969 [24]. The backpropagation technique was revised and published in 1986,
making neural network training considerably easier [25], and additional techniques were subsequently
integrated with neural networks [26]. DL involves DNN algorithms and is very useful for applications
that require big data. In the past decade, DNN has become an essential component of data science [27].
A DNN is a network data structure that contains input and output layers, as well as numerous hidden
layers in between. Each layer contains nodes that are connected to all the nodes in the layer next to it
[28, 29]. By predicting output variables from input features, DNNs provide solutions to many engineering,
scientific and business problems [29, 30]. The importance of DNNs has led to their widespread
adoption in various fields. Fig. 2 illustrates the wide-ranging applications of DNNs, which include speech
recognition, such as text-to-speech and speech-to-text applications; NLP for tasks, such as classification,
machine translation, and question answering; and vision applications, such as image recognition and
machine vision.
Fig. 2. DNN applications.
DL models, loosely modelled on the structure of the human brain, offer an advanced approach to ML
[31]. Once trained with large amounts of data, these systems can generate results autonomously without
human interaction. The inputs and outputs of DNN models need to be identified, and the data must be
pre-processed for training, and appropriate performance measures must be defined [31].
In artificial neural network (ANN) models, features serve as independent parameters that assist in
predicting labels [32]. The neurons in ANNs function similarly to biological neurons, whereby their
outputs are determined by their inputs and weights. Activation functions, such as sigmoid functions,
ensure that the network output is accurate [33]. Neurons are connected to form a neural network in which
each layer is linked to the next layer. The final layer produces the network output, and a cost function
computes the error between the predicted and actual values [34]. ANN models are trained by adjusting
the weights that link the neurons of one layer to the next [35]. The model is trained by feeding it training
samples (datasets), each of which contains the proper output response (label) and a particular input
feature. The training continually iterates through the data for a specific number of epochs until the model
is learned (the weights of the model are updated to optimize the total error) [36, 37].
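As a minimal illustration of this training procedure (a toy example with hypothetical data, not a model from any reviewed study), the following NumPy sketch trains a two-layer network by iterating over labelled samples for a fixed number of epochs and updating the weights to reduce the error between predicted and actual labels.

```python
import numpy as np

# Toy labelled dataset: 4 samples, 3 input features, binary label (hypothetical values).
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))   # input layer -> hidden layer weights
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden layer -> output layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(5000):                 # iterate over the data for a fixed number of epochs
    # Forward pass: each layer feeds the next one.
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Cost: error between the predicted and actual labels.
    error = y - out

    # Backward pass: propagate the error and update the weights.
    d_out = error * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    W2 += lr * h.T @ d_out
    W1 += lr * X.T @ d_h

print(np.round(out, 2))                   # predictions approach the labels after training
```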
A computer can learn using statistical computational theories, and the fields of DL and ML rely on
these theories and algorithms. The algorithms used in such architectures come in a variety of forms,
including SL, UL, and RL, and some variants of one or more of these types.
3.1.1 Supervised learning
SL, a subcategory of ML and DL, is a learning paradigm in which the computer is taught, using labelled
examples, how to respond to data during training [36]. This supervision helps the model to become more accurate and capable of handling
fresh datasets that comply with the learned patterns over time, such as spam detection, predictive
analytics, image and object recognition, and customer and sentiment analysis. Fig. 3 illustrates the SL
process.
Fig. 3. Supervised learning.
There are numerous models of SL, some of the most widely used of which are described below.
1) Convolutional neural networks: These algorithms were created specifically to operate with images.
The technique described as convolution involves applying a weighted filter to each component of
an image, which enables the computer to recognize and respond to picture elements. The past ten
years have seen significant developments in the field of computer vision, which investigates how
computers can understand and interpret images and movies [30].
2) Recurrent neural networks: These networks remember the output from the previous step and
consider it to be the input to the current step. The most important feature of RNNs is the existence
of hidden layers that can remember certain information about a sequence. RNNs have a memory
that recalls everything that has been computed. RNNs employ the same parameters for each input,
performing the same operation on every input or hidden state to generate the result, which, unlike other
neural networks, reduces the number of parameters. Modelling time-dependent and sequential data
problems, such as speech recognition, machine translation, face detection, and text synthesis, is possible
with RNNs [30].
Various techniques can be employed to accelerate modelling and decrease the number of model
parameters in speech recognition. These include the utilization of CNNs for feature extraction and the
incorporation of attention mechanisms into RNNs. Other approaches involve knowledge distillation,
compact neural network architectures, parameter sharing, pruning techniques, quantization, transfer
learning, and language models. The selection of a specific technique depends on such factors as
architecture, dataset size, and task requirements. Experimentation is essential to strike a sound balance
between complexity, resource usage, and performance [37].
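As a hedged sketch of one such design, the following PyTorch module combines a CNN front-end for local feature extraction with a bidirectional LSTM over MFCC frames; the layer sizes, the 13-dimensional input, and the 40 phoneme classes are illustrative assumptions rather than settings taken from any reviewed study.

```python
import torch
import torch.nn as nn

class CnnRnnAcousticModel(nn.Module):
    """Small CNN + BiLSTM acoustic model over MFCC frames (illustrative sizes)."""
    def __init__(self, n_mfcc=13, n_phonemes=40):
        super().__init__()
        # 1-D convolutions over time extract local spectral-temporal features.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # A bidirectional LSTM models longer-range temporal dependencies.
        self.rnn = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        # Per-frame phoneme scores (e.g., for CTC-style training).
        self.out = nn.Linear(2 * 128, n_phonemes)

    def forward(self, mfcc):               # mfcc: (batch, time, n_mfcc)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        return self.out(x)                 # (batch, time, n_phonemes)

# Example forward pass on a dummy batch of 2 utterances, 100 frames each.
model = CnnRnnAcousticModel()
scores = model(torch.randn(2, 100, 13))
print(scores.shape)                        # torch.Size([2, 100, 40])
```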
3.1.2 Unsupervised learning
This branch of ML uses algorithms to analyze and cluster unlabelled datasets. Unlike SL, unsupervised
machines use unlabelled data (Fig. 4). The machine is free to identify relationships and patterns as it sees
fit, and frequently produces findings that a human data analyst might not have noticed. Compared with
other algorithms, UL performs more sophisticated processing tasks for the discovery of hidden patterns
with no human intervention [37]. Although the outcomes of UL are typically harder to predict, it employs
techniques such as neural networks, clustering, anomaly detection, and other methods [38].
Fig. 4. Unsupervised learning.
3.1.3 Self-supervised learning
In SSL, the labels for the training dataset are generated by UL algorithms rather than by humans.
Unlabelled data are far more abundant than labelled data. Additionally, this strategy employs UL
techniques to identify common patterns in the data, which will then be used to improve the supervised
model. Some researchers claim that SSL is simply a UL variation, i.e., it is a two-step procedure with the
ultimate objective of creating an SL model [38]. Because of self-supervised training, which is a novel
idea in the field, Wav2Vec 2.0 has been developed as one of the most recent models of automatic speech
recognition (ASR). In this training technique, a model can be pre-trained using always-available
unlabelled data and then adjusted to a particular dataset for a particular objective. This training strategy
is quite effective, as evidenced by earlier work. The difference between SSL and SL and UL is shown in
Fig. 5.
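To illustrate the pre-train-then-fine-tune workflow described above, the sketch below transcribes a 16-kHz waveform with a publicly released Wav2Vec 2.0 checkpoint; it assumes the HuggingFace transformers library is installed, and the checkpoint name and placeholder input are illustrative choices rather than details from the reviewed studies.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load a Wav2Vec 2.0 checkpoint that was pre-trained on unlabelled audio with SSL
# and then fine-tuned on labelled LibriSpeech data (checkpoint name is an assumption).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# A 16-kHz mono waveform is assumed; one second of silence stands in for real audio here.
waveform = np.zeros(16000, dtype=np.float32)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits    # per-frame character logits

predicted_ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
print(processor.batch_decode(predicted_ids))      # e.g., [''] for the silent placeholder
```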
3.1.4 Reinforcement learning
In both SL and UL, there are no consequences for the computer when data cannot be comprehended or
classified correctly. Both RL and deep reinforcement learning (DRL) are essential for assisting machines
in learning complicated tasks that involve handling vast, highly variable, and unpredictable data. This
enables computers to accomplish tasks such as driving a car, performing surgery, and screening bags
for harmful items. The main components of a typical scenario of RL and DRL are illustrated in Fig. 6
[39]. RL includes the interaction of an agent with its surroundings to achieve its goals by making
decisions. RL depends on elements such as states, actions, rewards, policies, and value functions. As
shown in Fig. 6(b), DRL extends RL further by incorporating deep neural networks to approximate policy
and value functions, allowing it to handle large and high-dimensional state spaces. Furthermore, DRL
utilizes techniques such as exploration strategies, experience replay, and target networks to improve
learning efficiency and stability. By combining RL with deep neural networks, the agent becomes capable
of tackling intricate tasks in different domains.
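As a generic, minimal sketch of the value-based learning loop behind RL (hypothetical environment and parameter values chosen for illustration, unrelated to the speech systems reviewed here), the tabular Q-learning update below shows how an agent refines its estimates from states, actions, and rewards; DRL replaces the table with a deep neural network approximator and adds mechanisms such as experience replay.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))       # value estimates for each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2     # learning rate, discount factor, exploration rate

def step(state, action):
    """Hypothetical environment: reaching the last state yields a reward of 1."""
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy policy: mostly exploit current value estimates, sometimes explore.
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) towards reward + discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 2))                     # learned value estimates after training
```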
Fig. 5. Self-supervised learning versus supervised and unsupervised learning
(Source from https://medium.com/).
(a)
(b)
Fig. 6. Essential components of (a) reinforcement learning and (b) deep reinforcement learning.
3.2 Speech Recognition and Natural Language Processing
Speech recognition is a technique that enables a machine to perceive spoken language. Automatic speech
recognition (ASR), also known as speech-to-text or simply speech recognition, develops techniques and
approaches that enable the perception and transcription of spoken language into text by computers.
NLP refers to the combination of linguistics and ML. In NLP, an application of ML, machines learn to
understand natural language from millions of sample datasets. Computational linguistics has a subfield
called speech recognition, which deals with technology that allows people to speak to computers. Speech
recognition incorporates knowledge and research in linguistics, computer science and electrical
engineering. Fig. 7 depicts the speech recognition process, which comprises the following steps:
1) Analog-to-digital conversion, which converts analogue voice to digital by utilizing sampling and
quantization techniques. A vector of voice integer samples is used to represent speech in digital form.
2) Speech pre-processing, in which background noise and long periods of quiet are identified and
removed. The speech is then divided into short (typically 20-millisecond) frames for the subsequent step (see the sketch following this list).
3) Feature extraction is the conversion of speech frames into a feature vector that specifies which
phoneme is being spoken.
4) In word selection, the sequence of phonemes/features is translated into the spoken word using a
language model.
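A small sketch of steps 1–3 is given below; it assumes the librosa library, a 16-kHz recording at a hypothetical path, and typical frame parameters, and is an illustration rather than code from any reviewed study.

```python
import librosa

# Step 1: analogue-to-digital conversion has already produced a file; load it and
# resample to 16 kHz (the path and rate are assumptions for illustration).
signal, sr = librosa.load("utterance.wav", sr=16000)

# Step 2: pre-processing, trimming long leading/trailing silence below 30 dB of the peak.
signal, _ = librosa.effects.trim(signal, top_db=30)

# Split the signal into 20 ms frames with a 10 ms hop (320 and 160 samples at 16 kHz).
frames = librosa.util.frame(signal, frame_length=320, hop_length=160)
print(frames.shape)   # (320, number_of_frames); each column is one frame for feature extraction
```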
Fig. 7. Speech recognition process (Source from https://medium.com/).
From a technological standpoint, speech recognition has a lengthy history and has undergone numerous
significant technological advances. Recent developments in DL and big data have improved the field.
Not only has there been an increase in academic papers published on the topic, but, more significantly,
the global industry, including major corporations, has also adopted several DL techniques for creating
and implementing voice recognition systems. The global companies include Google, Facebook,
Microsoft, Amazon, and Apple.
As illustrated in Fig. 8, an acoustic model and a language model are two conceptually distinct
categories of models used in speech recognition. The challenges of converting sound signals into some
sort of phonetic representation are solved by the acoustic model. The language model is where the words,
grammar and sentence structure domain information of a language are kept. ML methods can be used to
realize these conceptual models in probabilistic models. Over the past few decades, advances in
speech recognition have refined HMM-based systems, which are still regarded as standard speech
recognition solutions, while E2E DNNs are the cutting-edge models of speech recognition.
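In probabilistic terms, the two models divide the work according to the standard decoding rule below (a textbook formulation added here for clarity, not an equation quoted from the reviewed studies): the recognizer searches for the word sequence W that best explains the observed acoustic features X, with the acoustic model supplying P(X | W) and the language model supplying P(W):

W* = argmax_W P(W | X) = argmax_W P(X | W) · P(W).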
Fig. 8. An acoustic model and a language model in a speech recognition system.
3.3 History of Speech Recognition and Natural Language Processing
The speech recognition field has witnessed significant advancements over the years. Earlier speech
recognition systems relied on traditional approaches, such as HMMs and Gaussian mixture models
(GMMs). These systems involve building statistical models based on acoustic features and language
models to recognize speech. However, these approaches face challenges in accurately handling variations
in speech patterns and are limited in their ability to handle complex linguistic structures. The introduction
of ML techniques revolutionized speech recognition. ML algorithms, such as support vector machines
and Gaussian Naive Bayes, allow for better classification and pattern recognition. These techniques have
improved the accuracy of speech recognition systems by leveraging statistical modelling and probabilistic
approaches. Meanwhile, the advent of DL has further revolutionized speech recognition tasks. DL,
particularly with DNNs, has surpassed traditional approaches, such as HMMs and GMMs. DNNs excel
at capturing complex patterns and hierarchies in speech data by learning intricate representations. They
can handle large-scale datasets and automatically extract relevant features, resulting in significant
improvements in speech recognition accuracy. The combination of ML and DL approaches has transformed
speech recognition, making it more accurate, robust, and adaptable to various contexts and languages.
Speech recognition has evolved from traditional approaches to ML techniques and, more recently, to
the transformative impact of DL. Feature extraction methods and accuracy-enhancing techniques
continue to be areas of active research and development in the field. The integration of advanced
algorithms and the exploration of innovative approaches have the potential to further improve speech
recognition systems. A timeline of some of the major developments in speech recognition from 2012 to
2022 is presented in the following paragraph.
Recent years have seen the introduction of voice-activated devices and voice assistants, as well as
popular, open-source speech recognition software, such as in [31, 40]. Additionally, we have seen
improvements in voice recognition models starting from hybrid neural network designs [41] to more E2E
models, such as Deep Speech [42], Deep Speech 2 [43], encoder-decoder models with attention [44], and
transducer-based speech recognition [45, 46]. Recently, speech recognition and related technologies have
benefitted from significant advances. Fig. 9 presents a chronological overview of key, remarkable
advancements in speech recognition spanning the years 2010 to 2020. This era marked the introduction
of voice-centric gadgets and digital assistants, as well as the emergence of popular open-source speech
recognition tools, such as Kaldi, and significant benchmarks, such as LibriSpeech. Voice assistants for
mobile devices, such as Apple’s Siri (introduced in 2011) and Amazon’s Alexa (introduced in 2014),
became widely used during this period. These technologies were made possible in part through DL’s significant
improvements in speech recognition WER. Moreover, progress in speech recognition techniques has
evolved from initial hybrid neural networks to comprehensive E2E models, such as Deep Speech, Deep
Speech 2, encoder-decoder architectures featuring attention mechanisms, and speech recognition based
on transducers in 2017 [47].
Fig. 9. History of speech recognition.
Currently there are several methods capable of enhancing the accuracy of speech recognition systems.
One such approach is to employ data augmentation techniques, such as adding background noise or
varying the pitch and speed of speech data. Another method involves using ensemble models or
combining multiple speech recognition systems to improve overall performance. Additionally,
incorporating language models and context information can enhance accuracy by considering broader
linguistic contexts during the recognition process [29]. Using a variety of algorithms, tools and
techniques, NLP encompasses the interpretation, analysis, and manipulation of natural language data
for the intended use. However, numerous difficulties may arise depending on the natural language data
being used, making it impossible to accomplish all the goals using a single strategy. As a result, numerous
scholars have recently focused on the creation of various tools and approaches in the field of NLP and
pertinent areas of study. These changes are represented in Fig. 10 [48].
Fig. 10. History of natural language processing.
3.4 Latest Developments in Speech Recognition based on Deep Learning
End-to-end modelling for speech recognition has recently become a key trend in the speech
community, replacing DNN-based hybrid modelling. Although E2E models consistently outperform
hybrid models in most standards for speech recognition accuracy, hybrid models are prevalent in many
commercial ASR systems today. The decision to deploy the production model is influenced by a variety
of practical reasons. Traditional hybrid models typically perform well in these areas because they have
been developed for decades. It is challenging for E2E models to achieve widespread commercialization
without offering good solutions to all these issues [49].
Before SSL became part of computer vision research, it had already made significant contributions to
NLP. Language models were applied everywhere, including in-text suggestion, sentence completion, and
document processing applications. Although word2vec, which revolutionized the NLP field, was released
in 2013, such models have since improved their learning capabilities considerably. The concept behind word
embedding approaches was straightforward: rather than asking a model to anticipate the next word in
isolation, one could ask it to do so based on the previous context.
These developments have allowed the achievement of meaningful representation through the
distribution of word embedding, which may be applied to a variety of tasks, including sentence
completion and word prediction. Bidirectional encoder representations from transformers are currently
among the most widely utilized SSL techniques in NLP. The discipline of NLP has seen an incredible
influx of research and development over the last ten years. Fig. 11 illustrates wav2vec unsupervised,
which trains speech recognition models without labelled data.
Fig. 11. A wav2vec unsupervised learning (source from https://arxiv.org/abs/2204.02492/).
3.5 Performance Evaluation
The datasets utilized in the training and testing of DL-based ASR systems have changed over time,
from clean speech to intentionally introduced environmental noise. The rapid system development and
performance evaluation of various ASR systems require automatic ASR metrics. To evaluate the
performance of a speech recognition system, an appropriate evaluation metric must be selected. This
section discusses several ASR metrics found in the reviewed papers.
The sentence error rate (SER), also known as the string error rate, is a straightforward evaluation metric
determined by comparing the hypothesis string produced by the decoder with the reference string and
marking the entire sentence as incorrect if it varies. This measure is imprecise because it counts any
difference between the reference and the hypothesis as one mistake, regardless of how similar the two
strings are.
The WER is the metric most frequently used by researchers to assess the effectiveness of ASR systems
and is the accepted standard for evaluating their performance. It measures the number of words that
differ between the reference and the hypothesis. Four distinct scenarios can occur:
1) Correct: the predicted word matches the reference word according to certain rules.
2) Substitution: the predicted word is aligned with a different reference word.
3) Insertion: an extra word in the predicted sentence cannot be aligned with any reference word.
4) Deletion: a word from the reference is missing entirely from the predicted sentence.
The WER is calculated based on the following equation:

WER = (S + D + I) / N,   (1)

where N is the number of words in the reference (the speech input to the ASR system), S is the number of
word substitutions, D is the number of word deletions, and I is the number of word insertions in the ASR
output.
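The following small reference implementation (written for illustration here, not taken from any included study) computes the WER by aligning the reference and hypothesis word sequences with the standard edit-distance recursion, so that S, D, and I are counted jointly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of substitutions, deletions and insertions needed
    # to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # correct word or substitution
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + cost)          # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution and one deletion against a four-word reference -> WER = 0.5.
print(word_error_rate("turn on the light", "turn of light"))
```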
3.6 Speech Feature Extraction
Speech is a sophisticated human skill. Its features are extracted by converting the speech waveform to
a parametric representation at a comparatively low data rate for further processing and analysis. Speech
feature extraction approaches include discrete wavelet transform, line spectral frequencies, Mel-
frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), linear prediction cepstral
coefficients, and perceptual linear prediction. MFCCs are commonly used features in speech recognition.
The processing steps of MFCCs are described by the block diagram in Fig. 12. The computation starts
by applying pre-emphasis to the speech signal in order to boost its higher frequencies. The signal is then
split into short frames and multiplied by a window function to separate the stationary segments. Next,
fast Fourier transform is employed to obtain the power spectrum of each frame. The power spectrum is
then passed through a Mel filterbank, which approximates the human auditory system’s frequency
resolution. The resulting filterbank energies are transformed into a logarithmic scale and further
decorrelated using a discrete cosine transform. Finally, cepstral mean normalization is applied to
normalize the coefficients across frames. These steps collectively generate MFCCs that capture essential
speech information for use in speech recognition systems.
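In practice, the whole pipeline of Fig. 12 is available in common audio libraries; the sketch below assumes the librosa library, a 16-kHz recording at a hypothetical path, and typical frame settings, and applies cepstral mean normalization as the final step.

```python
import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)    # assumed input file

# Pre-emphasis, framing, windowing, FFT, Mel filterbank, log and DCT are bundled
# inside librosa.feature.mfcc; 13 coefficients per 25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Cepstral mean normalization: subtract the per-coefficient mean across frames.
mfcc_cmn = mfcc - mfcc.mean(axis=1, keepdims=True)
print(mfcc_cmn.shape)    # (13, number_of_frames)
```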
Fig. 12. Block diagram for the Mel-frequency cepstral coefficient feature extraction.
4. Research Methodology
The objective of this systematic review research is to conduct a fair and comprehensive evaluation and
interpretation of available research published from 2019 to 2022 in the field of speech recognition
utilizing DL. The guidelines suggested by [50] were followed in undertaking this systematic review. The
first phase of the research involved planning the review, which was further divided into determining the
necessity for a review, developing the research questions, and describing the search approach to find
relevant research papers. The second phase involved conducting the review, which was subdivided into
defining the appropriate research selection criteria, including the inclusion/exclusion criteria, developing
the quality evaluation rules to filter research publications, constructing the data extraction strategy to
address the study objectives, and then synthesizing the data taken from the publications. The third phase
of the research was the reporting phase.
4.1 Planning Phase
DL is an ML technique that is widely used in the speech recognition field. This area of research is
experiencing rapid growth. However, the use of DL in speech recognition applications and its related
concepts remain poorly understood, as is the case with many contemporary technologies. Hence, this
research has been undertaken to provide an overview of published research in the field of speech
recognition applications that use DL and to extract information from research publications over the last
4 years. The information obtained will help in the identification of research patterns and the development
of statistics that will in turn shed light on limitations and gaps in the literature, in addition to current and
future directions of research. It is hoped that this will provide a background and framework for
researchers to appropriately establish new research in this area. Based on this, the outcomes of this review
will ultimately provide answers to the following research questions:
1. Which categories of publications are included in this review research?
2. What are the various types of datasets used to test the algorithm for each included publication?
3. What are the languages identified for each included publication?
4. What are the various types of environments utilized for each included publication?
5. What are the different types of feature extraction methods used to conduct the study for each
included publication?
6. What types of DL models are used in each included publication?
The search strategy that was followed to retrieve papers and identify the relevant ones was thoroughly
discussed among the authors in regular meetings. This research used the search terms "speech
recognition" AND "deep" in the following digital resources to include publications that employed both
DL and DNNs in the field of speech recognition: Google Scholar, Wiley Online Library, IEEE Xplore,
and ScienceDirect.
Based on the search terms mentioned above and using the specified period (2019–2022), 575
publications were initially retrieved. Ninety-one were identified as duplicates among the different
libraries and removed. Hence, 484 publications were retained for the next phase.
4.2 Review Phase
The first task to carry out in this phase consisted in determining both the filtration rules and the
inclusion/exclusion criteria according to the following steps:
1. Remove publications that are off-topic based on their titles.
2. Remove publications that are irrelevant based on the abstract and keywords.
3. Remove publications that are review papers (i.e. those used in the related work section or for the
purpose of comparison).
4. Filtering: Remove articles based on the inclusion/exclusion criteria (10 criteria were applied, and
only those studies with scores of 6 or higher out of 10 were kept). The inclusion criteria were as
follows:
CR1: Organization of the paper.
CR2: Relevance of the research questions.
CR3: Clear identification of the research objectives.
CR4: Existence of an adequate background.
CR5: Existence of practical experiments.
CR6: Appropriateness of the conducted experiment.
CR7: Suitability of the methods of analysis.
CR8: Clear reporting of the results.
CR9: Clear identification of the dataset.
CR10: Overall usefulness of the research.
The exclusion criteria were as follows:
1) Publications that are books, chapters, or theses. However, these are mentioned in the literature
review section.
2) Publications which, upon careful reading, do not answer the research questions.
3) Publications that do not utilize DL or DNNs.
4) Publications that are not written in English.
5) Publications that are reports or workshops with no publication information.
4.3 Analysis
The finalized list of articles for full review was thoroughly assessed to extract data that answered the
research questions listed earlier. These are reported in the results section. Fig. 13 depicts the steps
followed in this phase.
Fig. 13. Flow chart of the steps followed to select the publications.
5. Results and Discussion
The third phase of this systematic review research was the reporting of the results. A total of 94
publications were included in the final list of research articles [51–144]. These papers were thoroughly
studied and used to extract information that answered the research questions. The information extracted
was quantitatively described and used to determine patterns in the studies carried out from 2019 to 2022.
It also revealed the commonalities and discrepancies between the studies, which helped to identify future
research directions.
5.1 RQ1: What are the categories of publications included in this review research?
The 94 publications included in this study belonged mainly to two types: journal articles and
conference papers. A third category, "others", comprised workshops, which accounted for only 1% of the papers.
Fig. 14 illustrates the distribution of publications between the two types, while Fig. 15 presents the
distribution of the publications over the years examined.
Fig. 14. Distribution of the publications per type.
Fig. 15. Distribution of the publications over the years studied.
Conference papers accounted for the majority (59%) of the publications in this study, whereas journal
articles accounted for about 40%. They were distributed across different conferences and journals, of
which the International Journal of Speech Technology and the International Conference on Acoustics,
Speech and Signal Processing were the most common journal and conference, respectively. The
distribution of the studies over the publication years indicates a notable trend, with the majority
concentrated in recent years. In particular, 2020 saw the highest share at 32%, followed by 2019 at 29%,
while 2021 and 2022 accounted for 25% and 14% of the publications, respectively.
5.2 RQ2: What are the various types of datasets used to test the algorithm for
each included publication?
For the purposes of testing and training the algorithms, a variety of datasets were identified in the
studies. Datasets reported as public and accessible on the internet accounted for 37%, while the majority
(56%) were private. Fig. 16 depicts the distribution of datasets by type. It is worth mentioning here that
7% of the papers did not specify the type of dataset used in their experiments. The public datasets included
the Multi-Channel Articulatory database, Switchboard, CallHome, the Crowdsourced high-quality multi-
speaker speech dataset, the si284 dataset, dev93, eval92, TIMIT, Tibetan Corpus, SpinalNet, the Amharic
reading speech corpus, the Google OpenSLR dataset, and the Kaggle dataset.
5.3 RQ3: Which languages were identified for each included publication?
Different languages were used in the speech recognition publications examined. Fig. 17 presents the
languages thus identified for which the DL approach was applied in order to train and test the algorithms.
As shown in Fig. 17, the most dominant language is English, followed by Arabic, Chinese, Indian, and
Indonesian. The English language was also used alongside other languages, such as Tibetan and Indian.
Fig. 16. Types of datasets.
Fig. 17. Languages identified in the publications.
5.4 RQ4: What are the various types of environments utilized for each included
publication?
The types of environments used in the publications were either neutral or noisy. It is worth mentioning
that some papers did not mention the type of environment and thus they were assumed to be neutral.
Thus, 81% of the publications showed a neutral environment, while 19% were noisy, as shown in Fig. 18.
[Fig. 17 bar chart labels: Arabic, Chinese, Indian, Indonesian, Ethiopian, Urdu-Punjabi, Tibetan, Korean, Taiwanese, Turkish, Albanian, Uzbek, Bengali, Sri Lanka-Sinhala, Lithuanian, Italian-Libri, and Persian.]
5.5 RQ5: What are the different types of feature extraction methods used to
conduct the study for each included publication?
To prepare and train the data, different approaches were utilized to extract features from speech. Most
(about 60%) of these approaches were based on MFCCs; on the other hand, 6% of the studies did not
specify the type of feature extraction method used, while 10% did not perform any extraction.
Furthermore, 9% of the research used one of the following hybrid methods: MFCCs combined with
GMM and HMM, gammatone frequency cepstral coefficients (MFCC-GFCC), acoustic feature extraction
(AF-MFCC), feature space maximum likelihood linear regression (fMLLR), amplitude modulation
spectrogram, cycle generative adversarial networks, perceptual linear prediction (PLPC), minimum
variance distortionless response, GMM-HMM, and MFCCs with linear discriminant analysis, maximum
likelihood linear transform, and fMLLR. Fig. 19 depicts the feature extraction methods used, and Table 1
provides further details about the distribution of feature extraction methods across the publications.
Fig. 18. Environment conditions.
Fig. 19. Feature extraction methods.
[Fig. 19 bar chart categories: MFCCs, Hybrid, Other, GFCCs, HMM, No Extraction, LPC, and CNN.]
5.6 RQ6: What types of deep learning models are used in each included
publication?
Regarding the types of DL models used in speech recognition, 51% of the publications used hybrid
models, whereas 17% used the DNN model. Transformer models were also utilized (5.3%). The rest of
the publications used different types of models, such as RNNs, CNNs and GMMs. Fig. 20 presents the
identified stand-alone models, while Table 1 shows the distribution of these models.
Table 1. Distribution of models and feature extraction methods used in speech recognition research
Columns: Study | Feature extraction method (CNN, HLDA, HMM, MFCCs, No Ext., N/A, Other, Hybrid) | Model (Enc., RNN, DNN, LSTM, BLSTM, CNN, DBN, DTN, DRN, GMM, Tran., Hybrid)
[51]
P
P
[52]
P
P
[53]
P
P
[54]
P
P
[55]
P
P
[56]
P
P
[57]
P
P
[58]
P
P
[59]
P
P
[60]
P
P
[61]
P
P
P
[62]
P
P
[63]
P
P
P
P
[64]
P
P
[65]
P
P
[66]
P
P
[67]
P
P
P
[68]
P
P
[69]
P
P
[70]
P
P
P
[71]
P
P
[72]
P
P
[73]
P
P
[74]
P
P
[75]
P
P
[76]
P
P
[77]
P
P
[78]
P
P
[79]
P
P
[80]
P
P
[81]
P
P
P
[82]
P
P
[83]
P
P
[84]
P
[85]
P
[86]
P
P
[87]
P
P
[88]
P
P
[89]
P
[90]
P
P
[91]
P
P
[92]
P
P
[93]
P
P
[94]
P
P
[95]
P
P
[96]
P
P
[97]
P
P
[98]
P
P
[99]
P
P
[100]
P
P
[101]
P
P
[102]
P
P
[103]
P
P
[104]
P
P
[105]
P
P
[106]
P
P
[107]
P
P
[108]
P
[109]
P
P
[110]
P
P
[111]
P
P
[112]
P
P
[113]
P
P
P
[114]
P
P
P
[115]
P
P
P
[116]
P
P
[117]
P
P
[118]
P
P
[119]
P
P
[120]
P
P
[121]
P
[122]
P
P
[123]
P
P
[124]
P
P
[125]
P
P
[126]
P
P
[127]
P
P
[128]
P
P
[129]
P
P
[130]
P
P
[131]
P
[132]
P
P
[133]
P
P
[134]
P
P
[135]
P
P
[136]
P
P
[137]
P
P
[138]
P
P
[139]
P
P
[140]
P
P
[141]
P
P
[142]
P
[143]
P
P
[144]
P
P
No Ext.=no extraction, N/A=not available, Enc.=autoencoder, Tran.=transformer.
Fig. 20. Types of models used in speech recognition research.
Fig. 21. Evaluation techniques.
The evaluation techniques that were identified in the included publications for the purpose of
evaluating the overall performance of the model were mainly the WER at 49%, followed by accuracy at
19%. As for the label error rate (LER), it was found in only one study (Fig. 21). The category of "others"
included the use of different evaluation techniques, which accounted for 15% of all publications. These
included WER + WordAcc, accuracy + confusion matrix, WER + character error rate (CER) + loss,
phone error rate (PER) + WER, accuracy + CER, PER, sentence error rate (SER), syllable error rate, SER
+CER + WER, WER + loss + mean edit distance, WER + PER + frame error rate, CER + LER, WER +
monophones + triphones, and CER + WER.
6. Conclusion
This study consists of a comprehensive analysis of the application of DL techniques in the field of
speech recognition based on an examination of 94 studies published from 2019 to 2022. The findings of
this study have revealed several key insights and trends in the field. The distribution of the included
studies was observed across various journals (40%) and conferences (59%). Most of the studies utilized
public datasets (56%) and focused on English language processing (46%) in neutral environments (81%).
The evaluation techniques used were predominantly based on the WER (49%). Furthermore, the present
study’s analysis highlighted the widespread use of MFCCs as a feature extraction method. However, it
would be interesting for future researchers to explore alternative approaches, such as fMLLR, GFCC,
and LPC.
A significant finding from the literature survey is the increasing adoption of hybrid DNN models
exhibiting improved performance compared with stand-alone models. Approximately 52% of the
publications examined utilized hybrid models, whereas 17% relied on DNN stand-alone models. Also
identified was a research gap in the exploration of speech recognition using transformer models. This
area presents promising opportunities for future investigations, as transformers have demonstrated
powerful capabilities, including faster learning speed, parallelization, and enhanced performance for low-
resource languages.
In conclusion, our study provides valuable insights into the utilization of DL in speech recognition.
The trends and research gaps identified are expected to open new avenues for future research, which will
be further discussed in the subsequent section, emphasizing the importance of exploring alternative
methods of feature extraction and leveraging transformer models to advance the field.
7. Future Work
Despite the extensive analysis conducted in this review, there are several avenues for future research
in the field of speech recognition. First, investigating alternative methods of feature extraction is crucial
to expanding the repertoire beyond the commonly used MFCCs. As techniques such as fMLLR, GFCC
and LPC offer alternative representations of speech signals, they should be evaluated for their
effectiveness in speech recognition tasks. Second, the exploration of hybrid model architectures holds
promise for improving speech recognition performance. While the current study indicated a shift towards
hybrid models, further investigation is needed to explore different combinations of architectures. For example, combining CNNs with RNNs, or incorporating transformer models, could yield valuable insights into optimal model composition and its impact on speech recognition accuracy (an illustrative sketch of such a combination is given at the end of this section). Third, the
application of transformer-based models in speech recognition has been relatively unexplored in the
studies reviewed for this paper. Future research should delve into the use of transformers, as they offer
various advantages, such as faster learning speed, parallelization capabilities, and potential performance
enhancements for low-resource languages. Comparative studies of transformers and traditional
architectures could also help to elucidate their suitability and effectiveness in speech recognition tasks.
Moreover, the focus on low-resource languages is an important direction for future research. While the
majority of studies in the current analysis were focused on English language processing, addressing the
challenges specific to low-resource languages, including limited training data and linguistic variations,
is imperative. Investigating and developing robust models that can handle these challenges effectively
will contribute to closing the gap in speech recognition technology for underrepresented languages.
Lastly, the evaluation of performance metrics beyond the WER is essential for a comprehensive
assessment of speech recognition systems. Metrics such as phoneme error rate, precision, recall, and F1-
score can provide additional insights into the strengths and weaknesses of different models. Incorporating
a diverse range of metrics will facilitate a more nuanced understanding of system performance and enable
researchers to make informed decisions when designing and optimizing speech recognition models. By
addressing these future research directions, researchers could advance speech recognition technology,
enhance its applicability to diverse languages and contexts, and ultimately improve the user experience
in various speech-related applications.
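As an illustrative, non-authoritative sketch of the kind of hybrid architecture discussed above (assuming PyTorch and MFCC inputs; the layer sizes and output class inventory are arbitrary and do not reproduce any particular reviewed study), a convolutional front end can be combined with a bidirectional LSTM to produce per-frame character scores suitable for CTC-style training:

# Minimal CNN + BiLSTM hybrid acoustic model (illustrative sketch).
import torch
import torch.nn as nn

class HybridCnnBiLstm(nn.Module):
    def __init__(self, n_feats=13, n_classes=29):      # e.g., 26 letters + space + apostrophe + CTC blank
        super().__init__()
        # 2D convolutions treat the (time, feature) MFCC matrix as a one-channel image
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # A bidirectional LSTM models temporal context in both directions
        self.rnn = nn.LSTM(input_size=32 * n_feats, hidden_size=256,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, n_classes)

    def forward(self, x):                               # x: (batch, time, n_feats)
        x = x.unsqueeze(1)                              # (batch, 1, time, n_feats)
        x = self.conv(x)                                # (batch, 32, time, n_feats)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (batch, time, 32 * n_feats)
        x, _ = self.rnn(x)
        return self.classifier(x)                       # per-frame class scores for CTC decoding

model = HybridCnnBiLstm()
dummy = torch.randn(4, 200, 13)                         # 4 utterances, 200 frames, 13 MFCCs
print(model(dummy).shape)                               # torch.Size([4, 200, 29])

A comparable sketch could swap the recurrent layers for a transformer encoder to explore the transformer-based direction highlighted above.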
Authors’ Contributions
Conceptualization, DAF; Funding acquisition, AQ; Investigation and methodology, DAF; Project
administration, DAF; Resources, FA; Supervision, DAF; Software, YS; Validation, YS; Formal analysis,
DAF; Data curation, FA; Writing of the original draft, DAF; Writing of the review and editing, DAF,
AQ, MT, AM, YS.
Funding
This research has been funded by the Deanship of Scientific Research at King Khalid University (No.
RGP.1/209/43).
Competing Interests
The authors declare that they have no competing interests.
References
[1]D. Yu and L. Deng, Automatic Speech Recognition. London, UK: Springer, 2016.
https://doi.org/10.1007/978-1-4471-5779-3
[2]L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic speech recognition for under-resourced
languages: a survey," Speech Communication, vol. 56, pp. 85-100, 2014.
https://doi.org/10.1016/j.specom.2013.07.008
[3]A. Mathur and R. Sultana, "A study of machine learning algorithms in speech recognition and language
identification system," in Innovations in Computer Science and Engineering. Singapore: Springer, 2021,
pp. 503-513. https://doi.org/10.1007/978-981-33-4543-0_54
[4]C. Deuerlein, M. Langer, J. Seßner, P. Heß, and J. Franke, "Human-robot-interaction using cloud-based
speech recognition systems," Procedia CIRP, vol. 97, pp. 130-135, 2021.
https://doi.org/10.1016/j.procir.2020.05.214
[5]N. Sandhya, R. V. Saraswathi, P. Preethi, K. A. Chowdary, M. Rishitha, and V. S. Vaishnavi, "Smart
attendance system using speech recognition," in Proceedings of 2022 4th International Conference on
Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2022, pp. 144-149.
https://doi.org/10.1109/ICSSIT53264.2022.9716261
[6]Y. Chen, J. Zhang, X. Yuan, S. Zhang, K. Chen, X. Wang, and S. Guo, "SoK: a modularized approach to
study the security of automatic speech recognition systems," ACM Transactions on Privacy and Security,
vol. 25, article no. 17, 2022. https://doi.org/10.1145/3510582
[7]K. Xia, X. Xie, H. Fan, and H. Liu, "An intelligent hybrid-integrated system using speech recognition and
a 3D display for early childhood education," Electronics, vol. 10, no. 15, article no. 1862, 2021.
https://doi.org/10.3390/electronics10151862
[8]A. Ahmad, P. Mozelius, and K. Ahlin, "Speech and language relearning for stroke patients-understanding
user needs for technology enhancement," in Proceedings of the 13th international Conference on eHealth,
Telemedicine, and Social Medicine (eTELEMED), Nice, France, 2021, pp. 31-38.
[9]K. Avazov, M. Mukhiddinov, F. Makhmudov, and Y. I. Cho, "Fire detection method in smart city
environments using a deep-learning-based approach," Electronics, vol. 11, no. 1, article no. 73, 2021.
https://doi.org/10.3390/electronics11010073
[10] L. V. Kremin, J. Alves, A. J. Orena, L. Polka, and K. Byers-Heinlein, "Code-switching in parents’ everyday
speech to bilingual infants," Journal of Child Language, vol. 49, no. 4, pp. 714-740, 2022.
https://doi.org/10.1017/S0305000921000118
[11] D. O’Shaughnessy, "Automatic speech recognition: history, methods and challenges," Pattern Recognition,
vol. 41, no. 10, pp. 2965-2979, 2008. https://doi.org/10.1016/j.patcog.2008.05.008
[12]M. H. Ali, M. M. Jaber, S. K. Abd, A. Rehman, M. J. Awan, D. Vitkute-Adzgauskiene, R. Damasevicius,
and S. A. Bahaj, "Harris Hawks sparse auto-encoder networks for automatic speech recognition system,"
Applied Sciences, vol. 12, no. 3, article no. 1091, 2022. https://doi.org/10.3390/app12031091
[13]N. Arjangi, "Applications of speech recognition using machine learning and computer vision,"
International Journal of Research Publication and Reviews, vol. 3, no. 11, pp. 998-1002, 2022.
[14]A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural
networks: a systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019.
https://doi.org/10.1109/ACCESS.2019.2896880
[15]C. Yu, M. Kang, Y. Chen, J. Wu, and X. Zhao, "Acoustic modeling based on deep learning for low-
resource speech recognition: an overview," IEEE Access, vol. 8, pp. 163829-163843, 2020.
https://doi.org/10.1109/ACCESS.2020.3020421
[16]W. Lee, J. J. Seong, B. Ozlu, B. S. Shim, A. Marakhimov, and S. Lee, "Biosignal sensors and deep
learning-based speech recognition: a review," Sensors, vol. 21, no. 4, article no. 1399, 2021.
https://doi.org/10.3390/s21041399
[17] W. Algihab, N. Alawwad, A. Aldawish, and S. AlHumoud, "Arabic speech recognition with deep learning:
a review," in Social Computing and Social Media: Design, Human Behavior and Analytics. Cham,
Switzerland: Springer, 2019, pp. 15-31. https://doi.org/10.1007/978-3-030-21902-4_2
[18]H. A. Alsayadi, I. Hegazy, Z. T. Fayed, B. Alotaibi, and A. A. Abdelhamid, "Deep investigation of the
recent advances in dialectal Arabic speech recognition," IEEE Access, vol. 10, pp. 57063-57079, 2022.
https://doi.org/10.1109/ACCESS.2022.3177191
[19]D. Dayal, F. Alam, H. Varun, and N. Singh, "Review on speech recognition using deep learning,"
International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 8, no. 5,
pp. 1-5, 2020.
[20]L. Zhang and X. Sun, "Study on speech recognition method of artificial intelligence deep learning,"
Journal of Physics: Conference Series, vol. 1754, no. 1, article no. 012183, 2021.
https://doi.org/10.1088/1742-6596/1754/1/012183
[21]K. I. Taher and A. M. Abdulazeez, "A deep learning convolutional neural network for speech recognition:
a review," International Journal of Science and Business, vol. 5, no. 3, pp. 1-14, 2021.
https://doi.org/10.5281/zenodo.4475361
[22]M. El-Shebli, Y. Sharrab, and D. Al-Fraihat, "Prediction and modeling of water quality using deep neural
networks," Environment, Development and Sustainability, 2023. https://doi.org/10.1007/s10668-023-
03335-5
[23] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin
of Mathematical Biophysics, vol. 5, pp. 115-133, 1943. https://doi.org/10.1007/BF02478259
[24]M. L. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge,
MA: MIT Press, 1988.
[25]D. E. Rumelhart, G. E. Hinton, and J. L. McClelland, "A general framework for parallel distributed
processing," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
Cambridge, MA: MIT Press, 1986, pp. 45-76. https://doi.org/10.7551/mitpress/5236.003.0005
[26]L. See and S. Openshaw, "Applying soft computing approaches to river level forecasting," Hydrological
Sciences Journal, vol. 44, no. 5, pp. 763-778, 1999. https://doi.org/10.1080/02626669909492272
[27]I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[28]Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
https://doi.org/10.1038/nature14539
[29]K. Ali, M. Alzaidi, D. Al-Fraihat, and A. M. Elamir, "Artificial intelligence: benefits, application, ethical
issues, and organizational responses," in Intelligent Sustainable Systems. Singapore: Springer, 2023, pp.
685-702. https://doi.org/10.1007/978-981-19-7660-5_62
[30]C. Dawson and R. Wilby, "Hydrological modelling using artificial neural networks," Progress in Physical
Geography, vol. 25, no. 1, pp. 80-108, 2001. https://doi.org/10.1177/030913330102500104
[31]D. Al-Fraihat, M. Alzaidi, and M. Joy, "Why do consumers adopt smart voice assistants for shopping
purposes? A perspective from complexity theory," Intelligent Systems with Applications, vol. 18, article no.
200230, 2023. https://doi.org/10.1016/j.iswa.2023.200230
[32]K. Zeinalzadeh and E. Rezaei, "Determining spatial and temporal changes of surface water quality using
principal component analysis," Journal of Hydrology: Regional Studies, vol. 13, pp. 1-10, 2017.
https://doi.org/10.1016/j.ejrh.2017.07.002
[33]N. Buduma, N. Buduma, and J. Papa, Fundamentals of Deep Learning. Sebastopol, CA: O'Reilly Media
Inc., 2022.
[34] S. Bhanja and A. Das, "Impact of data normalization on a deep neural network for time series forecasting,"
2018 [Online]. Available: https://arxiv.org/abs/1812.05519.
[35] B. Simpson, F. Dutil, Y. Bengio, and J. P. Cohen, "GradMask: reduce overfitting by regularizing saliency,"
2019 [Online]. Available: https://arxiv.org/abs/1904.07478.
[36]H. Pishro-Nik, Introduction to Probability, Statistics, and Random Processes. Blue Bell, PA: Kappa
Research, 2014.
[37]N. Zhang, S. L. Shen, A. Zhou, and Y. S. Xu, "Investigation on performance of neural networks using
quadratic relative error cost function," IEEE Access, vol. 7, pp. 106642-106652, 2019.
https://doi.org/10.1109/ACCESS.2019.2930520
[38]S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller,
"Audio self-supervised learning: a survey," 2022 [Online]. Available: https://arxiv.org/abs/2203.01205.
[39]S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence
Review, vol. 55, pp. 895-943, 2022. https://doi.org/10.1007/s10462-021-09996-w
[40] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, et al., "The Kaldi speech recognition
toolkit," in Proceedings of IEEE 2011 Workshop on Automatic Speech Recognition and Understanding,
Big Island, HI, USA, 2011.
[41]G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, et al., "Deep neural networks for
acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal
Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012. https://doi.org/10.1109/MSP.2012.2205597
[42]A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, et al., "Deep speech: scaling up end-
to-end speech recognition," 2014 [Online]. Available: https://arxiv.org/abs/1412.5567.
[43]D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, et al., "Deep speech 2: end-
to-end speech recognition in English and Mandarin," Proceedings of Machine Learning Research, vol. 48,
pp. 173-182, 2016.
[44]J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech
recognition," Advances in Neural Information Processing Systems, vol. 28, pp. 577-585, 2015.
[45]Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, et al., "Streaming end-to-end
speech recognition for mobile devices," in Proceedings of 2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pp. 6381-6385.
https://doi.org/10.1109/ICASSP.2019.8682336
[46]O. Tarawneh, M. Tarawneh, Y. Sharrab, and M. Altarawneh, "Mushroom classification using machine-
learning techniques," AIP Conference Proceedings, vol. 2979, article no. 030003, 2023.
https://doi.org/10.1063/5.0174721
[47]A. Hannun, "The history of speech recognition to the year 2030," 2021 [Online]. Available:
https://arxiv.org/abs/2108.00084.
[48] D. Khurana, A. Koli, K. Khatter, and S. Singh, "Natural language processing: state of the art, current trends
and challenges," Multimedia Tools and Applications, vol. 82, pp. 3713-3744, 2023.
https://doi.org/10.1007/s11042-022-13428-4
[49]J. Li, "Recent advances in end-to-end automatic speech recognition," APSIPA Transactions on Signal and
Information Processing, vol. 11, no. 1, article no. e8, 2022. https://doi.org/10.1561/116.00000050
[50]S. Keele, "Guidelines for performing systematic literature reviews in software engineering," Keele
University, Staffs, UK, Technical Report No. EBSE-2007-01, 2007.
[51]B. Dendani, H. Bahi, and T. Sari, "Speech enhancement based on a deep AutoEncoder for remote Arabic
speech recognition," in Image and Signal Processing. Cham, Switzerland: Springer, 2020, pp. 221-229,
2020. https://doi.org/10.1007/978-3-030-51935-3_24
[52]R. Amari, Z. Noubigh, S. Zrigui, D. Berchech, H. Nicolas, and M. Zrigui, "Deep convolutional neural
network for Arabic speech recognition," in Computational Collective Intelligence. Cham, Switzerland:
Springer, 2022, pp. 120-134. https://doi.org/10.1007/978-3-031-16014-1_11
[53]K. Nugroho, E. Noersasongko, and H. A. Santoso, "Javanese gender speech recognition using deep
learning and singular value decomposition," in Proceedings of 2019 International Seminar on Applications
for Information and Communication Technology (iSemantic), Semarang, Indonesia, 2019, pp. 251-254.
https://doi.org/10.1109/ISEMANTIC.2019.8884267
[54]T. F. Abidin, A. Misbullah, R. Ferdhiana, M. Z. Aksana, and L. Farsiah, "Deep neural network for
automatic speech recognition from Indonesian audio using several lexicon types," in Proceedings of 2020
International Conference on Electrical Engineering and Informatics (ICELTICs), Aceh, Indonesia, 2020,
pp. 1-5. https://doi.org/10.1109/ICELTICs50595.2020.9315538
[55]Z. Ling, "An acoustic model for English speech recognition based on deep learning," in Proceedings of
2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA),
Qiqihar, China, 2019, pp. 610-614. https://doi.org/10.1109/ICMTMA.2019.00140
[56]D. A. Rahman and D. P. Lestari, "Indonesian spontaneous speech recognition system using deep neural
networks," in Proceedings of 2020 7th International Conference on Advance Informatics: Concepts, Theory
and Applications (ICAICTA), Tokoname, Japan, 2020, pp. 1-3.
https://doi.org/10.1109/ICAICTA49861.2020.9429070
[57]X. Chu, "Speech recognition method based on deep learning and its application," in Proceedings of 2021
International Conference of Social Computing and Digital Economy (ICSCDE), Chongqing, China, 2021,
pp. 299-302. https://doi.org/10.1109/ICSCDE54196.2021.00075
[58]S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang,
"Quartznet: deep automatic speech recognition with 1D time-channel separable convolutions," in
Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Barcelona, Spain, 2020, pp. 6124-6128. https://doi.org/10.1109/ICASSP40776.2020.9053889
[59]S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, "Deep contextualized acoustic representations for semi-
supervised speech recognition," in Proceedings of 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6429-6433.
https://doi.org/10.1109/ICASSP40776.2020.9053176
[60] W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung, and M. Picheny, "Distributed deep learning
strategies for automatic speech recognition," in Proceedings of 2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 5706-5710.
https://doi.org/10.1109/ICASSP.2019.8682888
[61]M. A. Al Amin, M. T. Islam, S. Kibria, and M. S. Rahman, "Continuous Bengali speech recognition based
on deep neural network," in Proceedings of 2019 International Conference on Electrical, Computer and
Communication Engineering (ECCE), Cox's Bazar, Bangladesh, 2019, pp. 1-6.
https://doi.org/10.1109/ECACE.2019.8679341
[62]V. Bhardwaj and V. Kadyan, "Deep neural network trained Punjabi children's speech recognition system
using Kaldi toolkit," in Proceedings of 2020 IEEE 5th International Conference on Computing
Communication and Automation (ICCCA), Greater Noida, India, 2020, pp. 374-378.
https://doi.org/10.1109/ICCCA49541.2020.9250780
[63]Z. Chen and H. Yang, "Yi language speech recognition using deep learning methods," in Proceedings of
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference
(ITNEC), Chongqing, China, 2020, pp. 1064-1068. https://doi.org/10.1109/ITNEC48623.2020.9084771
[64]W. Wang, X. Yang, and H. Yang, "End-to-End low-resource speech recognition with a deep CNN-LSTM
encoder," in Proceedings of 2020 IEEE 3rd International Conference on Information Communication and
Signal Processing (ICICSP), Shanghai, China, 2020, pp. 158-162.
https://doi.org/10.1109/ICICSP50920.2020.9232119
[65]Y. Shan, M. Liu, Q. Zhan, S. Du, J. Wang, and X. Xie, "Speech recognition based on a deep tensor neural
network and multifactor feature," in Proceedings of 2019 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 2019, pp. 650-654.
https://doi.org/10.1109/APSIPAASC47483.2019.9023251
[66]S. T. Abate, M. Y. Tachbelie, and T. Schultz, "Deep neural networks-based automatic speech recognition
for four Ethiopian languages," in Proceedings of 2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 8274-8278.
https://doi.org/10.1109/ICASSP40776.2020.9053883
[67]A. Rista and A. Kadriu, "End-to-end speech recognition model based on deep learning for Albanian," in
Proceedings of 2021 44th International Convention on Information, Communication and Electronic
Technology (MIPRO), Opatija, Croatia, 2021, pp. 442-446.
https://doi.org/10.23919/MIPRO52101.2021.9596713
[68]Q. An, K. Bai, M. Zhang, Y. Yi, and Y. Liu, "Deep neural network based speech recognition systems
under noise perturbations," in Proceedings of 2020 21st International Symposium on Quality Electronic
Design (ISQED), Santa Clara, CA, USA, 2020, pp. 377-382.
https://doi.org/10.1109/ISQED48828.2020.9136978
[69]H. Xu, H. Yang, and Y. You, "Donggan speech recognition based on a deep neural network," in
Proceedings of 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence
Conference (ITAIC), Chongqing, China, 2019, pp. 354-358. https://doi.org/10.1109/ITAIC.2019.8785451
[70]A. Sirwan, K. A. Thama, and S. Suyanto, "Indonesian automatic speech recognition based on an end-to-
end deep learning model," in Proceedings of 2022 IEEE International Conference on Cybernetics and
Computational Intelligence (CyberneticsCom), Malang, Indonesia, 2022, pp. 410-415.
https://doi.org/10.1109/CyberneticsCom55287.2022.9865253
[71]R. G. Kodali, D. P. Manukonda, and R. Sundararajan, "Bilingual speech recognition based on deep neural
networks and directed acyclic word graphs," in Proceedings of 2019 International Conference on Data
Mining Workshops (ICDMW), Beijing, China, 2019, pp. 1-6.
https://doi.org/10.1109/ICDMW48858.2019.9024758
[72]H. Karunathilaka, V. Welgama, T. Nadungodage, and R. Weerasinghe, "Low-resource Sinhala speech
recognition using deep learning," in Proceedings of 2020 20th International Conference on Advances in
ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 2020, pp. 196-201.
https://doi.org/10.1109/ICTer51097.2020.9325468
[73]J. Jorge, A. Giménez, J. A. Silvestre-Cerda, J. Civera, A. Sanchis, and A. Juan, "Live Streaming speech
recognition using deep bidirectional LSTM acoustic models and interpolated language models," IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 148-161, 2021.
https://doi.org/10.1109/TASLP.2021.3133216
[74]C. W. Chen, Y. F. Yeh, C. L. Lin, S. P. Tseng, and J. F. Wang, "Hybrid deep neural network acoustic
model for Taiwanese speech recognition," in Proceedings of 2020 8th International Conference on Orange
Technology (ICOT), Daegu, South Korea, 2020, pp. 1-5.
https://doi.org/10.1109/ICOT51877.2020.9468762
[75]P. Wu and M. Wang, "Large vocabulary continuous speech recognition with deep recurrent network," in
Proceedings of 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing,
China, 2020, pp. 794-798. https://doi.org/10.1109/ICSIP49896.2020.9339455
[76]M. S. Chauhan, R. Mishra, and M. I. Patel, "A speech recognition and separation system using deep
learning," in Proceedings of 2021 International Conference on Innovative Computing, Intelligent
Communication and Smart Electrical Systems (ICSES), Chennai, India, 2021, pp. 1-5.
https://doi.org/10.1109/ICSES52305.2021.9633779
[77]R. Masumura, M. Ihori, A. Takashima, T. Tanaka, and T. Ashihara, "End-to-end automatic speech
recognition with deep mutual learning," in Proceedings of 2020 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 2020, pp.
632-637.
[78]R. Yang, J. Yang, and Y. Lu, "Indonesian speech recognition based on deep neural network," in
Proceedings of 2021 International Conference on Asian Language Processing (IALP), Singapore, 2021, pp.
36-41. https://doi.org/10.1109/IALP54817.2021.9675280
[79]W. Zhang, M. Zhai, Z. Huang, C. Liu, W. Li, and Y. Cao, "Towards end-to-end speech recognition with
deep multipath convolutional neural networks," in Intelligent Robotics and Applications. Cham,
Switzerland: Springer, 2019, pp. 332-341. https://doi.org/10.1007/978-3-030-27529-7_29
[80]H. Teng, S. Wang, X. Liu, and X. G. Yue, "Speech recognition model based on deep learning and
application to a pronunciation quality evaluation system," in Proceedings of the 2019 International
Conference on Data Mining and Machine Learning, Hong Kong, China, 2019, pp. 1-5.
https://doi.org/10.1145/3335656.3335657
[81]S. Zhao, C. Ni, R. Tong, and B. Ma, "Multi-task multi-network joint-learning of deep residual networks
and cycle-consistency generative adversarial networks for robust speech recognition," in Proceedings
of 20th Annual Conference of the International Speech Communication Association (INTERSPEECH),
Graz, Austria, 2019, pp. 1238-1242. https://doi.org/10.21437/interspeech.2019-2078
[82] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition,"
in Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), New Orleans, LA, USA, 2017, pp. 4845-4849. https://doi.org/10.1109/ICASSP.2017.7953077
[83]B. Gong, R. Cai, Z. Cai, Y. Ding, and M. Peng, "Selection of an acoustic modeling unit for Tibetan speech
recognition based on deep learning," MATEC Web of Conferences, vol. 336, article no. 06014, 2021.
https://doi.org/10.1051/matecconf/202133606014
[84]T. Watzel, L. Li, L. Kurzinger, and G. Rigoll, "Deep neural network quantizers outperforming continuous
speech recognition systems," in Speech and Computer. Cham, Switzerland: Springer, 2019, pp. 530-539.
https://doi.org/10.1007/978-3-030-26061-3_54
[85]E. D. Emiru, Y. Li, S. Xiong, and A. Fesseha, "Speech recognition system based on deep neural network
acoustic modeling for a poorly resourced language-Amharic," in Proceedings of the 3rd International
Conference on Telecommunications and Communication Engineering, Tokyo, Japan, 2019, pp. 141-145.
https://doi.org/10.1145/3369555.3369564
[86] Y. F. Yeh, B. H. Su, Y. Y. Ou, J. F. Wang, and A. C. Tsai, "Taiwanese speech recognition based on hybrid
deep neural network architecture," in Proceedings of the 32nd Conference on Computational Linguistics
and Speech Processing (ROCLING), Taipei, Taiwan, 2020, pp. 102-113.
[87]L. Toth and G. Gosztolya, "Adversarial multi-task learning of speaker-invariant deep neural network
acoustic models for speech recognition," in Proceedings of the 1st International Conference on Advances
in Signal Processing and Artificial Intelligence (ASPAI), Barcelona, Spain, 2019, pp. 82-86.
[88]H. A. Alsayadi, A. A. Abdelhamid, I. Hegazy, and Z. T. Fayed, "Arabic speech recognition using end-to-
end deep learning," IET Signal Processing, vol. 15, no. 8, pp. 521-534, 2021.
https://doi.org/10.1049/sil2.12057
[89]N. Zerari, S. Abdelhamid, H. Bouzgou, and C. Raymond, "Bidirectional deep architecture for Arabic
speech recognition," Open Computer Science, vol. 9, no. 1, pp. 92-102, 2019. https://doi.org/10.1515/comp-
2019-0004
[90]H. Alsayadi, A. Abdelhamid, I. Hegazy, and Z. Taha, "Data augmentation for Arabic speech recognition
based on end-to-end deep learning," International Journal of Intelligent Computing and Information
Sciences, vol. 21, no. 2, pp. 50-64, 2021.
[91]G. Cheng, X. Li, and Y. Yan, “Highway connections to enable deep small‐footprint LSTM‐RNNs for
speech recognition,” Chinese Journal of Electronics, vol. 28, no. 1, pp. 107-112, 2019.
https://doi.org/10.1049/cje.2018.11.008
[92]Y. R. Oh, K. Park, H. B. Jeon, and J. G. Park, "Automatic proficiency assessment of Korean speech read
aloud by non-natives using bidirectional LSTM-based speech recognition," ETRI Journal, vol. 42, no. 5,
pp. 761-772, 2020. https://doi.org/10.4218/etrij.2019-0400
[93]R. Marimuthu, "Speech recognition using a Taylor-gradient descent political optimization-based deep
residual network," Computer Speech & Language, vol. 78, article no. 101442, 2023.
https://doi.org/10.1016/j.csl.2022.101442
[94]J. R. Koya and S. V. M. Rao, "Deep bidirectional neural networks for robust speech recognition under
heavy background noise," Materials Today: Proceedings, vol. 46, no. 6, pp. 4117-4121, 2021.
https://doi.org/10.1016/j.matpr.2021.02.640
[95]Y. Dokuz and Z. Tufekci, "Mini-batch sample selection strategies for deep learning-based speech
recognition," Applied Acoustics, vol. 171, article no. 107573, 2021.
https://doi.org/10.1016/j.apacoust.2020.107573
[96]T. Zoughi, M. M. Homayounpour, and M. Deypir, "Adaptive windows multiple deep residual networks
for speech recognition," Expert Systems with Applications, vol. 139, article no. 112840, 2020.
https://doi.org/10.1016/j.eswa.2019.112840
[97] Z. Song, "English speech recognition based on deep learning with multiple features," Computing, vol. 102,
pp. 663-682, 2020. https://doi.org/10.1007/s00607-019-00753-0
[98] A. Shewalkar, D. Nyavanandi, and S. A. Ludwig, "Performance evaluation of deep neural networks applied
to speech recognition: RNN, LSTM and GRU," Journal of Artificial Intelligence and Soft Computing
Research, vol. 9, no. 4, pp. 235-245, 2019. https://doi.org/10.2478/jaiscr-2019-0006
[99]H. Veisi and A. Haji Mani, "Persian speech recognition using deep learning," International Journal of
Speech Technology, vol. 23, pp. 893-905, 2020. https://doi.org/10.1007/s10772-020-09768-x
[100]A. Santhanavijayan, D. Naresh Kumar, and G. Deepak, "A semantic-aware strategy for automatic speech
recognition incorporating deep learning models," in Intelligent System Design. Singapore: Springer, 2021,
pp. 247-254. https://doi.org/10.1007/978-981-15-5400-1_25
[101]N. Q. Pham, T. S. Nguyen, J. Niehues, M. Muller, S. Stuker, and A. Waibel, "Very deep self-attention
networks for end-to-end speech recognition," in Proceedings of the 20th Annual Conference of the
International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 66-70.
https://doi.org/10.21437/interspeech.2019-2702
[102]V. Passricha and R. K. Aggarwal, "A hybrid of deep CNN and bidirectional LSTM for automatic speech
recognition," Journal of Intelligent Systems, vol. 29, no. 1, pp. 1261-1274, 2020.
https://doi.org/10.1515/jisys-2018-0372
[103]M. D. Hassan, A. N. Nasret, M. R. Baker, and Z. S. Mahmood, "Enhancement automatic speech
recognition by deep neural networks," Periodicals of Engineering and Natural Sciences, vol. 9, no. 4, pp.
921-927, 2021. https://doi.org/10.21533/pen.v9i4.2450
[104]W. Ying, L. Zhang, and H. Deng, "Sichuan dialect speech recognition with a deep LSTM network,"
Frontiers of Computer Science, vol. 14, pp. 378-387, 2020. https://doi.org/10.1007/s11704-018-8030-z
[105]W. Zhang, X. Cui, U. Finkler, G. Saon, A. Kayi, A. Buyuktosunoglu, B. Kingsbury, D. Kung, and M.
Picheny, "A highly efficient distributed deep learning system for automatic speech recognition," 2019
[Online]. Available: https://arxiv.org/abs/1907.05701.
[106]L. Pipiras, R. Maskeliunas, and R. Damasevicius, "Lithuanian speech recognition using purely phonetic
deep learning," Computers, vol. 8, no. 4, article no. 76, 2019. https://doi.org/10.3390/computers8040076
[107]V. Kadyan, M. Dua, and P. Dhiman, "Enhancing the accuracy of long contextual dependencies for a
Punjabi speech recognition system using deep LSTM," International Journal of Speech Technology, vol.
24, pp. 517-527, 2021. https://doi.org/10.1007/s10772-021-09814-2
[108]D. Raval, V. Pathak, M. Patel, and B. Bhatt, "Improving deep learning-based automatic speech
recognition for Gujarati," Transactions on Asian and Low-Resource Language Information Processing, vol.
21, no. 3, article no. 47, 2021. https://doi.org/10.1145/3483446
[109]Y. Xu, "English speech recognition and evaluation of pronunciation quality using deep learning," Mobile
Information Systems, vol. 2022, article no. 7186375, 2022. https://doi.org/10.1155/2022/7186375
[110]A. Mukhamadiyev, I. Khujayarov, O. Djuraev, and J. Cho, "An automatic speech recognition method
based on deep learning approaches to the Uzbek language," Sensors, vol. 22, no. 10, article no. 3683, 2022.
https://doi.org/10.3390/s22103683
[111]M. Ali Humayun, I. A. Hameed, S. Muslim Shah, S. Hassan Khan, I. Zafar, S. Bin Ahmed, and J. Shuja,
"Regularized Urdu speech recognition with semi-supervised deep learning," Applied Sciences, vol. 9, no. 9,
article no. 1956, 2019. https://doi.org/10.3390/app9091956
[112]A. Alsobhani, H. M. A. Alabboodi, and H. Mahdi, "Speech recognition using convolution deep neural
networks," Journal of Physics: Conference Series, vol. 1973, article no. 012166, 2021.
https://doi.org/10.1088/1742-6596/1973/1/012166
[113]T. Zoughi and M. M. Homayounpour, "A gender-aware deep neural network structure for speech
recognition," Iranian Journal of Science and Technology, Transactions of Electrical Engineering, vol. 43,
pp. 635-644, 2019. https://doi.org/10.1007/s40998-019-00177-8
[114]T. Rajapakshe, R. K. Rana, S. Latif, S. Khalifa, and B. W. Schuller, "Pre-training in deep reinforcement
learning for automatic speech recognition," 2019 [Online]. Available: https://arxiv.org/abs/1910.11256.
[115] M. U. Farooq, F. Adeeba, S. Rauf, and S. Hussain, "Improving large vocabulary Urdu speech recognition
system using deep neural networks," in Proceedings of the 20th Annual Conference of the International
Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 2978-2982.
https://doi.org/10.21437/interspeech.2019-2629
[116]S. Isobe, S. Tamura, and S. Hayamizu, "Speech recognition using deep canonical correlation analysis in
noisy environments," in Proceedings of the 10th International Conference on Pattern Recognition
Applications and Methods (ICPRAM), Virtual Event, 2021, pp. 63-70.
https://doi.org/10.5220/0010268200630070
[117]H. Lei, Y. Xiao, Y. Liang, D. Li, and H. Lee, "DLD: an optimized Chinese speech recognition model
based on deep learning," Complexity, vol. 2022, article no. 6927400, 2022.
https://doi.org/10.1155/2022/6927400
[118]T. Rajapakshe, S. Latif, R. Rana, S. Khalifa, and B. Schuller, "Deep reinforcement learning with pre-
training for time-efficient training of automatic speech recognition," 2020 [Online]. Available:
https://arxiv.org/abs/2005.11172.
[119]S. Shukla and M. Jain, "A novel stochastic deep resilient network for effective speech recognition,"
International Journal of Speech Technology, vol. 24, pp. 797-806, 2021. https://doi.org/10.1007/s10772-
021-09851-x
[120]I. K. Tantawi, M. A. M. Abushariah, and B. H. Hammo, "A deep learning approach to automatic speech
recognition of The Holy Qur’ān recitations," International Journal of Speech Technology, vol. 24, pp. 1017-
1032, 2021. https://doi.org/10.1007/s10772-021-09853-9
[121]B. Tombaloglu and H. Erdem, "Deep learning-based automatic speech recognition for Turkish," Sakarya
University Journal of Science, vol. 24, no. 4, pp. 725-739, 2020.
https://doi.org/10.16984/saufenbilder.711888
[122]Y. R. Oh, K. Park, and J. G. Park, "Online speech recognition using multichannel parallel acoustic score
computation and deep neural network (DNN)-based voice-activity detector," Applied Sciences, vol. 10, no.
12, article no. 4091, 2020. https://doi.org/10.3390/app10124091
[123]T. G. Fantaye, J. Yu, and T. T. Hailu, "Investigation of automatic speech recognition systems via the
multilingual deep neural network modeling methods for a very low-resource language, Chaha," Journal of
Signal and Information Processing, vol. 11, no. 1, pp. 1-21, 2020. https://doi.org/10.4236/jsip.2020.111001
[124]H. Hugeng and E. Hansel, "Implementation of Android-based speech recognition for an Indonesian
geography dictionary," Ultima Computing: Jurnal Sistem Komputer, vol. 7, no. 2, pp. 76-82, 2015.
https://doi.org/10.31937/sk.v7i2.296
[125]G. Kaur, M. Srivastava, and A. Kumar, "Speech recognition using enhanced features with deep belief
network for real-time applications," Wireless Personal Communications, vol. 120, pp. 3225-3242, 2021.
https://doi.org/10.1007/s11277-021-08610-0
[126] Z. Niu, "Voice detection and deep learning algorithms application in remote English translation classroom
monitoring," Mobile Information Systems, vol. 2022, article no. 3340999, 2022.
https://doi.org/10.1155/2022/3340999
[127] J. Oruh, S. Viriri and A. Adegun, "Long short-term memory recurrent neural network for automatic
speech recognition," IEEE Access, vol. 10, pp. 30069-30079, 2022.
https://doi.org/10.1109/ACCESS.2022.3159339
[128]A. Dehghani and S. Seyyedsalehi, "Time-frequency localization using deep convolutional maxout neural
network in Persian speech recognition," Neural Processing Letters, vol. 55, pp. 3205-3224, 2023.
https://doi.org/10.1007/s11063-022-11006-1
[129] X. Xie, X. Sui, X. Liu, and L. Wang, "Investigation of deep neural network acoustic modelling approaches
for low-resource accented Mandarin speech recognition," 2021 [Online]. Available:
https://arxiv.org/abs/2201.09432.
[130] A. K. Singh, P. Singh, and K. Nathwani, "Using deep learning techniques and inferential speech statistics
for AI synthesised speech recognition," 2021 [Online]. Available: https://arxiv.org/abs/2107.11412.
[131]J. Yu, N. Ye, X. Du, and L. Han, "Automated English speech recognition using dimensionality reduction
with deep learning approach," Wireless Communications and Mobile Computing, vol. 2022, article no.
3597347, 2022. https://doi.org/10.1155/2022/3597347
[132]S. Girirajan and A. A. Pandian, "An acoustic model with a hybrid deep bidirectional single gated unit
(DBSGU) for low resource speech recognition," Multimedia Tools and Applications, vol. 81, pp. 17169-
17184, 2022. https://doi.org/10.1007/s11042-022-12723-4
[133] H. Dridi and K. Ouni, "Towards a robust combined deep architecture for speech recognition: experiments
on TIMIT," International Journal of Advanced Computer Science and Applications, vol. 11, no. 4, pp. 525-
534, 2020. https://doi.org/10.14569/ijacsa.2020.0110469
[134]T. J. Park and J. H. Chang, "Deep Q-network-based noise suppression for robust speech recognition,"
Turkish Journal of Electrical Engineering and Computer Sciences, vol. 29, no. 5, pp. 2362-2373, 2021.
https://doi.org/10.3906/elk-2011-144
[135] S. Lee, S. Han, S. Park, K. Lee, and J. Lee, "Korean speech recognition using deep learning," The Korean
Journal of Applied Statistics, vol. 32, no. 2, pp. 213-227, 2019.
https://doi.org/10.5351/KJAS.2019.32.2.213
[136]P. Dubey and B. Shah, “Deep speech based end-to-end automated speech recognition (ASR) for Indian-
English accents,” 2022 [Online]. Available: https://arxiv.org/abs/2204.00977.
[137]H. Seki, K. Yamamoto, T. Akiba, and S. Nakagawa, "Discriminative learning of filterbank layer within
deep neural network-based speech recognition for speaker adaptation," IEICE Transactions on Information
and Systems, vol. 102D, no. 2, pp. 364-374, 2019. https://doi.org/10.1587/transinf.2018EDP7252
[138]C. Bai, X. Cui, and A. Li, "Robust speech recognition model using multi-source federal learning after
distillation and deep edge intelligence," Journal of Physics: Conference Series, vol. 2033, no. 1, article no.
012158, 2021. https://doi.org/10.1088/1742-6596/2033/1/012158
[139]P. Shao, "Chinese speech recognition system based on deep learning," Journal of Physics: Conference
Series, vol. 1549, no. 2, article no. 022012, 2020. https://doi.org/10.1088/1742-6596/1549/2/022012
[140]A. M. Samin, M. H. Kobir, S. Kibria, and M. S. Rahman, "Deep learning-based large vocabulary
continuous speech recognition of an under-resourced language Bangladeshi Bangla," Acoustical Science
and Technology, vol. 42, no. 5, pp. 252-260, 2021. https://doi.org/10.1250/ast.42.252
[141]A. Rista and A. Kadriu, "A model for Albanian speech recognition using end-to-end deep learning
techniques," Interdisciplinary Journal of Research and Development, vol. 9, no. 3, article no. 1, 2022.
https://doi.org/10.56345/ijrdv9n301
[142] G. Savitha, B. N. Shankar, and S. Shahi, "Deep recurrent neural network based audio speech recognition
system," Information Technology in Industry, vol. 9, no. 2, 2021, pp. 941-949, 2021.
https://doi.org/10.17762/itii.v9i2.434
[143]H. Abera and S. H. Mariam, "Speech recognition for Tigrinya language using deep neural network
approach," in Proceedings of the 2019 Workshop on Widening NLP (WNLP), Florence, Italy, 2019, pp. 7-
9. https://aclanthology.org/W19-3603
[144]M. G. Al-Obeidallah, D. G. Al-Fraihat, A. M. Khasawneh, A. M. Saleh and H. Addous, "Empirical
Investigation of the Impact of the Adapter Design Pattern on Software Maintainability," in Proceedings of
2021 International Conference on Information Technology (ICIT), Amman, Jordan, 2021, pp. 206-211.
https://doi.org/10.1109/ICIT52682.2021.9491719
[145]L. Huang, Z. Xiang, J. Yun, Y. Sun, Y. Liu, D. Jiang, H. Ma, and H. Yu, "Target detection based on two-
stream convolution neural network with self-powered sensors information, " IEEE Sensors Journal, vol. 23,
no. 18, pp. 20681-20690, 2023. https://doi.org/10.1109/JSEN.2022.3220341