International Journal of Future Generation Communication and Networking
Vol. 13, No. 3, (2020), pp. 2547-2559
ISSN: 2233-7857 IJFGCN
Copyright 2020 SERSC
Speech Recognition System: A review
Rohit Pahwa, Harion Tanwar, Dr Sachin Sharma
Manav Rachna International Institute of Research and Studies, Faridabad, Haryana
Abstract:
Speech recognition systems play an essential role in everyday life. Speech recognition software allows users to interact with their mobile phones and other devices through speech. It splits the audio of an utterance into individual sound waveforms, analyzes each one, uses various algorithms to find the word in the language that fits best, and transcribes those sounds into text. This paper reviews popular existing systems, namely SIRI, CORTANA, GOOGLE ASSISTANT, ALEXA and BIXBY. It also analyzes the role of NLP (natural language processing) in speech recognition. Our main aim is to identify the most accurate speech recognition technique so that the best results can be achieved. A comparative analysis highlights the differences between various speech recognition systems and their drawbacks.
Keywords: Speech recognition, Deep learning, SIRI, CORTANA, ALEXA, BIXBY, GOOGLE ASSISTANT,
Acoustic Models, NLP
1. Introduction
Speech recognition is a technology that allows users to operate an electronic device through spoken words instead of other inputs such as keystrokes, buttons and keyboards. Speech recognition software converts the words and phrases spoken by the user into a machine-readable format so that the user can easily operate the device by voice. Speech recognition is also known as automatic speech recognition (ASR). The main objective of speech recognition is that anyone, technical or non-technical, can operate a device easily; even a person who cannot read and has no knowledge of the device and its parts can operate it. Speech recognition is basically designed for a single user. The block model of a speech recognition system is shown in Figure 1. Speech recognition is an emerging research area with important applications in banking, marketing, healthcare, language learning and many other fields. Speech processing involves various parameters such as pitch, duration, voice quality, intensity, signal-to-noise ratio, voice activity detection and strength of voice. A speech recognition system uses its algorithms to extract these parameters from the speech signal so that it can execute the user's query on the basis of them. Nowadays, speech recognition works together with other technologies such as machine learning, IoT and NLP.
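As a rough illustration of three of these parameters (intensity, voice activity detection and signal-to-noise ratio), the Python sketch below measures them on a synthetic signal. It assumes 16 kHz mono samples held in a plain list; the function names and the -30 dB threshold are hypothetical choices for this example, and real systems use considerably more robust VAD and SNR estimators.

```python
import math
import random

def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames; a short tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def frame_intensity_db(frame):
    """Intensity of a frame as RMS level in dB relative to amplitude 1.0."""
    return 10 * math.log10(frame_power(frame) + 1e-12)  # offset avoids log(0)

def energy_vad(frames, threshold_db=-30.0):
    """Hypothetical energy-threshold VAD: True where a frame is treated as speech."""
    return [frame_intensity_db(f) > threshold_db for f in frames]

def estimate_snr_db(frames, is_speech):
    """Rough SNR: speech-frame power minus noise power, over noise power, in dB."""
    speech = [frame_power(f) for f, s in zip(frames, is_speech) if s]
    noise = [frame_power(f) for f, s in zip(frames, is_speech) if not s]
    if not speech or not noise:
        return None  # both kinds of frames are needed
    noise_power = sum(noise) / len(noise)
    signal_power = max(sum(speech) / len(speech) - noise_power, 1e-12)
    return 10 * math.log10(signal_power / noise_power)

# Toy input: 0.5 s of faint noise, then 0.5 s of a 200 Hz tone plus the same noise
random.seed(0)
rate = 16000
noise = [random.gauss(0.0, 0.001) for _ in range(rate)]
signal = [noise[n] if n < rate // 2 else
          0.5 * math.sin(2 * math.pi * 200 * n / rate) + noise[n]
          for n in range(rate)]
frames = frame_signal(signal, 400)  # 25 ms frames at 16 kHz
is_speech = energy_vad(frames)
print(sum(is_speech), "of", len(frames), "frames flagged as speech")
print("estimated SNR (dB):", round(estimate_snr_db(frames, is_speech), 1))
```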
Figure 1: Speech Recognition System [8]
Speech processing addresses two questions: WHO is speaking and WHAT was spoken. Identifying WHO is speaking means recognizing a person by analyzing their tone, voice pitch and many other parameters.[8]
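Voice pitch, one of the speaker characteristics mentioned above, can be estimated in several ways. A minimal sketch, assuming a clean voiced frame and a 60-400 Hz search range (our assumption, not a value from the paper), picks the autocorrelation lag at which the signal best matches a delayed copy of itself:

```python
import math

def estimate_pitch_hz(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (F0) of a voiced frame by autocorrelation.

    The lag with the largest autocorrelation inside the allowed pitch range
    is taken as the pitch period.
    """
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]                 # remove DC offset
    min_lag = max(1, int(sample_rate / f_max))    # shortest period considered
    max_lag = min(int(sample_rate / f_min), len(x) - 1)
    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag

rate = 16000
frame = [math.sin(2 * math.pi * 200 * n / rate) for n in range(640)]  # 40 ms of a 200 Hz "voice"
print(round(estimate_pitch_hz(frame, rate)))  # close to 200
```

Real speech additionally needs a voiced/unvoiced decision and protection against octave errors, which this sketch omits.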
Voice recognition was developed as a faster method of typing up work and was originally designed with people with disabilities in mind: people with physical difficulties can find typing tedious, painful or even impossible, and voice recognition gives them the chance to work by speaking instead. Voice recognition software allows you to dictate to your computer; as you give instructions, they appear on the screen. It is advanced enough to predict what you want to say, so it can correct mis-spoken words or grammatical errors for you.
Working of Speech Recognition process
Figure 2: Working of Speech Recognition process [10]
1) Feature Extraction:
Speech recognition software analyzes the sound through various parameters such as pitch, variation and strength of voice.
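Feature extraction usually works on short, overlapping, windowed frames rather than on the raw waveform. The sketch below shows this common preprocessing (pre-emphasis, framing and a Hamming window); the 0.97 coefficient, 25 ms frame length and 10 ms hop are conventional values assumed here, not settings given in the paper.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_and_window(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]  # 1 s test tone
frames = frame_and_window(pre_emphasis(signal), rate)
print(len(frames), len(frames[0]))  # 98 400
```

Each windowed frame is then passed to a feature extractor such as MFCC or LPC.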
2) Acoustic Models:
An acoustic model is a computational file that contains the different forms of sound that make up a word. It represents the relationship between the audio signal and the basic language units (such as phonemes) that make up speech. These computational and mathematical representations are commonly hidden Markov models (HMMs).
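To show how an HMM acoustic model scores a piece of audio, here is the standard forward algorithm for a small discrete HMM. The two-state phone model and its symbols "a" and "b" are hypothetical toy values; practical acoustic models replace the emission table with continuous densities (GMMs or neural networks) over feature vectors.

```python
def forward_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(observation sequence | HMM).

    start[i]    -- probability of starting in state i
    trans[i][j] -- probability of moving from state i to state j
    emit[i][o]  -- probability that state i emits symbol o
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        # Sum over every predecessor state, then emit the next symbol
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical left-to-right model of one phone with two sub-states
start = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]
emit = [{"a": 0.7, "b": 0.3},
        {"a": 0.2, "b": 0.8}]
print(forward_likelihood(["a", "b"], start, trans, emit))  # 0.126 + 0.224 = 0.35 (up to float rounding)
```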
3) Decoder:
A neural network breaks the speech down into the activations of its output neurons. Different types of decoding algorithms then decode these neuron outputs so that the correct result can be obtained.
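The paper does not specify which decoding algorithm each system uses. One widely used option for per-frame neural network outputs is greedy CTC (connectionist temporal classification) decoding: take the most probable symbol in each frame, merge consecutive repeats, then drop the blank symbol. A minimal sketch with a made-up three-symbol alphabet:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank="_"):
    """Greedy CTC decoding of per-frame probabilities.

    frame_probs -- list of frames, each a list of probabilities aligned with alphabet
    """
    # Most probable symbol in every frame
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    decoded, previous = [], None
    for symbol in best:
        if symbol != previous and symbol != blank:  # merge repeats, skip blanks
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)

alphabet = ["_", "h", "i"]
frames = [
    [0.1, 0.8, 0.1],    # h
    [0.2, 0.7, 0.1],    # h (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.1, 0.2, 0.7],    # i (repeat, merged)
]
print(ctc_greedy_decode(frames, alphabet))  # hi
```

Beam search over the same outputs, optionally rescored with a language model, usually gives better results than this greedy pass.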
4) Language Model:
The language model in speech recognition helps differentiate between words and phrases that sound similar. By combining the evidence from the language model with the pronunciation and acoustic models, the system can resolve this ambiguity. [9]
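A bigram language model is a simple way to see how this disambiguation works. In the sketch below, a tiny hypothetical corpus trains add-one-smoothed bigram probabilities, and two similar-sounding hypotheses are ranked by log-probability; in a real recognizer this score would be combined with the acoustic and pronunciation scores, and the corpus would be far larger.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count history words and word pairs, adding <s> and </s> boundary markers."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:-1])
        histories.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return histories, bigrams, len(vocab)

def sentence_logprob(sentence, histories, bigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
               for prev, word in zip(words[:-1], words[1:]))

# Hypothetical training corpus
corpus = [
    "recognize speech with a computer",
    "it is hard to recognize speech",
    "we walk on a nice beach",
]
histories, bigrams, vocab_size = train_bigram(corpus)
for hypothesis in ["recognize speech", "wreck a nice beach"]:
    print(hypothesis, round(sentence_logprob(hypothesis, histories, bigrams, vocab_size), 2))
```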
The remainder of the paper is organized as follows: Section 2 describes various speech recognition systems. Section 3 discusses machine learning with speech recognition. Section 4 contains a comparative analysis of different speech recognition systems. Section 5 summarizes the conclusion.
Various Speech Recognition Systems
A speech recognition system is a machine or program that identifies words and phrases in spoken audio and converts them into a machine-understandable format. However, many speech recognition programs have a limited vocabulary of words and phrases and may only identify words that are available in their database or that are spoken very clearly.
Existing Systems that Use Speech Recognition
APPLE (SIRI):
Siri is a virtual assistant developed by Apple Inc. It is designed to offer multiple ways of interacting with your phone by speaking. It takes a query through the microphone and helps you resolve it promptly. It has some features that distinguish it from other speech recognition systems; for instance, it can activate low power mode and enable Do Not Disturb mode, and it offers non-English language options. Many people use it for entertainment. On the other hand, Siri has one drawback: it only works on iOS devices.[4]
GOOGLE ASSISTANT:
Google Assistant is the virtual assistant of Google Inc. Its important features include controlling your devices and smartphone, accessing information from your calendar and controlling your music system. One drawback is that it uses more battery power, which can sometimes slow down the device.[6]
MICROSOFT CORTANA
Like Siri and Google Assistant, Cortana is a voice assistant, developed by Microsoft. It was originally designed for Windows devices, but nowadays it is available on various devices. It can perform multiple tasks for users, such as setting reminders and scheduling calendar events, and many people also use Cortana to perform calculations. Cortana has an API (application programming interface) and can work with a variety of Windows apps as well as third-party apps such as Facebook and Twitter. However, it has several drawbacks; for example, a vulnerability has been found, and the listening button has to be hit again and again.
AMAZON ALEXA
Alexa is a virtual assistant technology designed and created by Amazon. It is based on machine learning and NLP (natural language processing). Alexa can perform various tasks; for example, it can inform the user about the weather. Furthermore, it can control your smartphone: when the user gives an instruction, Alexa carries it out promptly so that the user can do other work. However, it has several drawbacks; for example, it cannot send messages or emails through voice commands. Moreover, if a user wants to access health- or hobby-related data through Alexa, the results are often inaccurate.[1]
SAMSUNG BIXBY
Bixby is a virtual assistant developed by Samsung Electronics. With it you can send messages from one device to another and check cricket and other game scores. A hardware button on the side of the device brings up Bixby, and Bixby also supports some additional advanced features. However, Bixby has some limitations; for example, it supports only a limited number of languages, and because of this it sometimes fails to provide a direct answer to a query.[2]
3) Speech Recognition using Deep Learning and NLP (Natural Language Processing):
Figure 3: Speech recognition using NLP [11]
Speech recognition software uses various techniques such as NLP (natural language processing) and deep learning neural networks. "NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way." NLP is a branch of artificial intelligence that deals specifically with the interaction between humans and computers. It relies on machine learning algorithms and enhances the ability of a computer program to understand spoken human language. It helps the computer understand and manipulate human language and perform tasks such as question answering and language translation.
Deep learning is a subset of machine learning in AI that imitates the working of the human brain in processing data and creating patterns for use in decision making [5].
Related work
T. Al Smadi (May 2015) developed a neural-network-based algorithm for speech recognition [12]. Its objective is to capture and digitize each sound wave, convert the sound waves into basic language units, build words from these units, and analyze the words in context to ensure correct spelling.
Advantages of neural networks in speech recognition: Neural networks are one of the most essential techniques in speech recognition. A key advantage of this method is the possibility of parallel processing. Neural networks are fed with huge amounts of data: training is performed by providing inputs and teaching the network so that it can process the input and produce the desired output.
Disadvantages of neural networks in speech recognition: The main disadvantage is the black-box nature of neural networks. Neural networks require much more data than other machine learning algorithms, often millions of labelled samples. Processing this huge amount of data makes neural networks expensive in terms of size and time complexity. In addition, there is no specific rule for determining the structure of a neural network.
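The training process described above (feed labelled inputs, compare the outputs with the labels, adjust the network) can be illustrated with the smallest possible network: a single sigmoid neuron trained by gradient descent on cross-entropy loss. The two-dimensional features and labels are hypothetical toy data, for example normalized measurements for two sound classes; real acoustic networks have many layers and far more labelled frames.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, epochs=2000, lr=0.5):
    """Train one sigmoid neuron with batch gradient descent on cross-entropy loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), label in data:
            error = sigmoid(w[0] * x1 + w[1] * x2 + b) - label  # dLoss/dz for one sample
            grad_w[0] += error * x1
            grad_w[1] += error * x2
            grad_b += error
        n = len(data)
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0

# Hypothetical toy features for two sound classes (0 and 1)
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.15, 0.3), 0),
        ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.85), 1)]
w, b = train_neuron(data)
print([predict(w, b, x) for x, _ in data])
```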
Halageri, Bidappa, Arjun, Sarathy and Sultana (6 August 2017) developed an algorithm for speech recognition using deep learning [13]. Its objective is likewise to capture and digitize each sound wave, convert the sound waves into basic language units, build words from these units, and analyze the words in context to ensure correct spelling. The main objective of this paper is to review the pattern-matching abilities of neural networks on speech signals.
Advantages of the existing system: it is powerful, self-adjusting and capable of sophisticated pattern recognition.
Disadvantages of the existing system: it is not well suited to devices, because it requires extra memory to store the data of different individual voices and extra time to execute the task; it is therefore inefficient in terms of memory and computation time. GMMs are statistically inefficient for modelling data that lie on or near a nonlinear manifold in the data space. The HMM needs to be trained before it can execute user queries in time, and it requires a set of instructions to process the datasets.
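For context, in a GMM-HMM system each HMM state scores an acoustic feature vector with a Gaussian mixture density. A minimal sketch of that computation for diagonal-covariance Gaussians, using the log-sum-exp trick so that small probabilities do not underflow; the mixture parameters are hypothetical:

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_k w_k N(x; mu_k, diag(var_k)), computed with log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component mixture over 2-D feature vectors
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(gmm_log_likelihood([0.1, -0.2], weights, means, variances))  # near the first component
print(gmm_log_likelihood([6.0, -4.0], weights, means, variances))  # far from both: much lower
```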
Advantages of neural networks: they can be used to map an input space to any kind of output space. They are simple and, for this reason, commonly used. They are naturally discriminative. They are modular in design, so they can easily be integrated into larger systems. They have a probabilistic interpretation, so they can easily be combined with statistical techniques such as HMMs.
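The probabilistic interpretation is what allows hybrid neural network/HMM systems. If a network outputs state posteriors p(s|x), Bayes' rule gives p(x|s) = p(s|x) p(x) / p(s); since p(x) is the same for every state in a frame, the HMM can use p(s|x) / p(s) as a scaled likelihood. A sketch with hypothetical posteriors and priors:

```python
import math

def scaled_log_likelihoods(posteriors, priors):
    """Convert network state posteriors p(s|x) to log scaled likelihoods log(p(s|x)/p(s)).

    By Bayes' rule p(x|s) = p(s|x) p(x) / p(s); p(x) is the same for every state,
    so it can be dropped when the HMM compares states for the same frame.
    """
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]

# Hypothetical softmax output for one frame, and state priors from training alignments
posteriors = [0.5, 0.3, 0.2]
priors = [0.7, 0.2, 0.1]
scores = scaled_log_likelihoods(posteriors, priors)
print([round(s, 3) for s in scores])  # [-0.336, 0.405, 0.693]
```

Dividing by the priors stops frequent states (here the first one) from dominating simply because they were common in the training data.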
Purwar K. (2015) developed a smart home automation system based on the Internet of Things and speech recognition [14]. Its main objective is to control the appliances and devices at home, providing a better way of automating a home than conventional approaches.
Advantages of the smart home automation system: This technology is based on the Internet of Things and speech recognition. It uses a command converter that passes commands to the devices. A Raspberry Pi running the Raspbian operating system, with its network and DNS settings configured, operates and controls the devices, so the devices at home can be used through speech. Using a Raspberry Pi has many benefits compared to other devices: it is robust, supports automation and can run multiple programs. The system is a boon for users; it is fully functional and can be controlled wirelessly.
Disadvantages of this IoT-based speech system: incompatibilities sometimes occur when different kinds of devices try to connect to each other. This increases the complexity of the interaction between devices and also increases the chance of device failure.
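A command converter of this kind can be sketched as a mapping from recognized text to a (device, action) pair. Everything in the example below is hypothetical (the device names, keywords and the `convert_command` function are not taken from Purwar's paper); on a Raspberry Pi, the returned pair would be handed to whatever code switches the relay or smart plug.

```python
import re

# Hypothetical vocabulary of devices and actions the converter understands
DEVICES = {"light", "fan", "tv"}
ACTIONS = {"on": "ON", "off": "OFF"}

def convert_command(transcript):
    """Map a recognized sentence such as 'turn on the light' to (device, action).

    Returns None when no known device/action pair is found, so the caller can
    ask the user to repeat the command.
    """
    words = re.findall(r"[a-z]+", transcript.lower())
    device = next((w for w in words if w in DEVICES), None)
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    if device is None or action is None:
        return None
    return device, action

print(convert_command("Please turn ON the fan"))  # ('fan', 'ON')
print(convert_command("What's the weather?"))     # None
```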
Pal Singh R. and Arora S. (December 2012) worked on automatic speech recognition systems [15]. The objective of their paper is to study the various ASR technologies used in different countries and to shed light on the progress made in ASR.
1970s, independent approach: A merit of this automatic speech recognition approach is the use of a finite-state network to reduce computation and efficiently determine the closest matching string. A demerit is that the system recognizes speech with reasonable accuracy only for a vocabulary of 1011 words. 1990s, pattern recognition approach: A demerit is that several speech recognition errors occurred. A merit is that MCE (minimum classification error) and MMI (maximum mutual information) techniques are both used to reduce the error rate.
Deshmukh R. and Malik Abdullah Alasadi A. (22 May 2018) worked on automated speech recognition [16]. Their objective is to study different speech recognition systems and their recent progress, and to describe the characteristics of the various databases used in speech recognition.
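Error rates such as those reduced by MCE and MMI training are usually reported as word error rate (WER): the minimum number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The same edit-distance idea underlies finding the closest matching string. A minimal dynamic-programming sketch with made-up sentences:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn on the kitchen light", "turn the kitchen lights"))  # 0.4
```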
Feature extraction:
Feature extraction technique | Merit points | Demerit points
MFCC (Mel Frequency Cepstral Coefficients) | Provides better discrimination | Less correlation between coefficients
DWT (Discrete Wavelet Transform) | Successfully used for denoising tasks; capable of compressing a signal without major degradation | Not based on linear characteristics
WPT (Wavelet Packet Transform) | Same as DWT, but WPT also shows further detail present in the high-frequency band | Low-order coefficients
LPC (Linear Predictive Coding) | Easy to implement and mathematically precise | Static feature extraction
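LPC is the simplest technique in this table to sketch. Each sample is predicted as a weighted sum of the previous p samples, x[n] ≈ a_1 x[n-1] + ... + a_p x[n-p], and the weights are found from the frame's autocorrelation with the Levinson-Durbin recursion. The test signal below is hypothetical: the impulse response of a known second-order filter, so the recovered coefficients can be checked.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag (samples outside the frame are zero)."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]

def lpc(frame, order):
    """LPC coefficients a_1..a_p via the Levinson-Durbin recursion.

    Returns (coefficients, prediction_error) for the predictor
    x[n] ~ a_1 x[n-1] + ... + a_p x[n-p].
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the step from order i-1 to order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error

# Hypothetical test signal: impulse response of x[n] = 1.3 x[n-1] - 0.4 x[n-2]
signal = [1.0, 1.3]
for n in range(2, 200):
    signal.append(1.3 * signal[n - 1] - 0.4 * signal[n - 2])
coefficients, residual = lpc(signal, 2)
print([round(c, 3) for c in coefficients])  # [1.3, -0.4]
```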
5) Comparison between some popular Existing Speech Recognition systems
Table 1: Comparison between some popular Existing Speech Recognition systems [3]
Differentiable factors | Apple Siri | Google Assistant | Microsoft Cortana | Amazon Alexa | Samsung Bixby
Release date | 2011 | 2016 | 2014 | 2014 | 2017
Device compatibility | Apple devices | Android devices | Microsoft devices | Alexa and Echo | Samsung phones
Unsupported apps | Google mail services | None (supports all third-party apps) | YouTube | YouTube | None (supports all third-party apps)
Type of connection required | Wi-Fi and cellular data | Wi-Fi and cellular data | Wi-Fi and cellular data | Wi-Fi only | Wi-Fi and cellular data
Ways to interact | Manual and voice commands | Manual and voice commands | Manual and voice commands | Manual (in silent Echo) and voice commands | Manual and voice commands
Algorithms | Dynamic time warping, discretization algorithm | PLP features, Viterbi search, deep neural networks | Natural language processing (NLP); data is sent to Microsoft's servers to be analyzed | Natural language processing (NLP) | Discriminative training, WFST framework
Accuracy according to search results (%) | 74.6 | 88 | 63 | 72.5 | 78
Languages supported | 21 | 40 | 8 | 7 | 7
Major drawbacks | Only active while the home button is held | High battery usage; slows down the device | Vulnerability found; the listening button has to be hit again and again | Takes several days to install or schedule a new version of the software | Serverless extensibility, NLP platform
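Table 1 lists dynamic time warping (DTW) among Siri's algorithms. DTW aligns two feature sequences that were spoken at different speeds and returns the cost of the best alignment, so a stored word template can be matched against a slower or faster utterance. The sketch below is the classic DTW recurrence on toy 1-D feature values; it makes no claim about how any commercial assistant actually uses DTW.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between the two frames
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats a frame
                                 cost[i][j - 1],      # b advances, a repeats a frame
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [1, 2, 3, 4, 3, 2]          # stored word template (toy feature values)
slow = [1, 1, 2, 2, 3, 3, 4, 4, 3, 2]  # same shape, spoken more slowly
other = [4, 4, 1, 1, 4, 4]             # a different word
print(dtw_distance(template, slow), dtw_distance(template, other))
```

The slowed-down version aligns perfectly with the template, while the different word receives a positive cost; a template-matching recognizer would pick the template with the lowest DTW distance.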
The different speech recognition systems are compared in the table above. Studying these systems reveals several drawbacks. Many recognition systems cannot work with every language; they support only a limited set of languages. In addition, some individuals habitually speak very fast or with a strong accent, and in such cases some systems cannot handle the user's request. Some recognition systems also struggle with heavy background noise during a user's request; they then fail to capture the request at that moment, so the user has to repeat it again and again until the system responds.
ANALYTICAL STUDY FOR DIFFERENT TRAINING DATASETS:
First, we compared different research papers on speech recognition systems. Based on these papers, we identified different training datasets, which we then executed in IDLE.
The training datasets that we executed in IDLE and on the Jupyter Notebook platform are shown below as screenshots. We found that different datasets produce different speech scenarios. For example, when many individuals provide the same spoken input, running it with these training datasets produced several different outputs rather than a single output, because each individual comes from a different country and has their own accent. As a result, the speech recognition system is not able to recognize the voice input of every individual.[17]
ATTRIBUTES OF DIFFERENT TRAINING DATASETS:
Dataset | Wavelength (m) | Frequency (Hz) | Channels | Bitrate (kbps) | Sampling rate (kHz)
Afrikaans | 3.811 | 90 | 1 | 705 | 44.1
Arabic | 2.144 | 160 | 1 | 705 | 44.1
Bambara | 2.318 | 148 | 1 | 705 | 44.1
Catalan | 2.579 | 133 | 1 | 705 | 44.1
Dutch | 2.257 | 152 | 1 | 705 | 44.1
English | 3.333 | 103 | 1 | 705 | 44.1
Hindi | 2.433 | 141 | 1 | 705 | 44.1
Farsi | 2.701 | 127 | 1 | 705 | 44.1
German | 3.090 | 111 | 1 | 705 | 44.1
Italian | 2.199 | 156 | 1 | 705 | 44.1
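The constant columns are consistent with uncompressed PCM audio: assuming 16-bit samples (the table does not state the sample width), 44.1 kHz × 16 bits × 1 channel = 705.6 kbps, which matches the 705 kbps listed. The wavelength column appears to follow λ = v / f with a speed of sound v of about 343 m/s (for example, 343 / 90 ≈ 3.811 m for Afrikaans); this agrees to three decimals for every row except English (343 / 103 ≈ 3.330). The sketch below checks both relations and reads the same attributes from a WAV file with Python's standard `wave` module.

```python
import io
import wave

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed value)

def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

def wavelength_m(frequency_hz, speed=SPEED_OF_SOUND):
    """Wavelength of a sound wave: lambda = v / f."""
    return speed / frequency_hz

def wav_attributes(wav_file):
    """Read channels, sampling rate (kHz) and bitrate (kbps) from a WAV file."""
    with wave.open(wav_file, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        bits = 8 * w.getsampwidth()
    return channels, rate / 1000, pcm_bitrate_kbps(rate, bits, channels)

# Build a one-second silent 44.1 kHz mono WAV in memory; 16-bit samples are our assumption
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)
buffer.seek(0)

print(wav_attributes(buffer))      # (1, 44.1, 705.6)
print(round(wavelength_m(90), 3))  # 3.811 (Afrikaans row)
```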
DIFFERENT OUTPUTS OF DIFFERENT DATASETS:
TRAINING DATASET1 OUTPUT:
TRAINING DATASET2 OUTPUT:
TRAINING DATASET3 OUTPUT:
TRAINING DATASET4 OUTPUT:
TRAINING DATASET5 OUTPUT:
RESULT OF DIFFERENT TRAINING DATASETS:
From the screenshots above, we conclude that the limited training data used in our code cannot determine the exact output the user expects from the system. Each training dataset takes the same inputs but produces a different output. We provided the voices of several individuals, but the system was not able to distinguish each individual's voice because each individual has their own accent. In future, our objective is to remove these drawbacks by developing a new application that communicates easily with us, does not require much human effort to make speech machine-understandable, can be controlled entirely by voice and does not require any buttons to be pressed. The application should be light, handy and easy to use, meaning it should not use much battery, storage or memory, which would affect the performance of the device. The main aim of the application is to perform, by voice, every function on the device that our fingers can perform.
6) Conclusion and Future scope
In this paper, we gave a brief introduction to speech recognition, how it works, and some popular systems that use speech recognition algorithms. A comparative analysis of various speech recognition systems was also presented. This paper also gives an idea of the current best speech recognition system by analyzing performance according to different differentiable factors. The discussion may be used to improve existing systems or to develop a new or modified system that is more accurate and performs better than the systems currently in use.
References:
1) Gonfalonieri, A., 2018, Towards Data Science, <https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3>
2) Hall, C. & Tillman, M., 13 March 2020, Pocket-Lint, <https://www.pocket-lint.com/phones/news/samsung/140128-what-is-bixby-samsungs-assistant-explained-and-how-to-use-it>
3) Reachbyte, 9 August 2019, <https://reachbyte.com/siri-cortana-alexa-google-assistant-bixby>
4) Goel, A., 2 February 2018, Artificial Intelligence, <https://magoosh.com/data-science/siri-work-science-behind-siri>
5) Le, J., 20 September 2019, Deep learning-based automatic speech recognition, <https://heartbeat.fritz.ai/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices-37b891ddc380>
6) Prospero, M. & Kozuch, K., 24 February 2020, The best Google Assistant skills, <https://www.tomsguide.com/round-up/best-google-assistant-features>
7) Hui, J., 16 September 2019, Language model, <https://medium.com/@jonathan_hui/speech-recognition-acoustic-lexicon-language-model-aacac0462639>
8) Sarma, M. & Sarma, K., 12 September 2015, Acoustic Modelling of Speech Signal using Artificial Neural Network: A Review of Techniques and Current Trends, DOI:10.4018/978-1-4666-8493-5.ch012
9) Rudnicky, 1 July 2010, Language model, <http://www.voxforge.org/home/docs/faq/faq/what-is-a-language-model>
10) Saba, G., 22 July 2018, Speech recognition Python, <https://www.simplifiedpython.net/speech-recognition-python>
11) NDZ, 28 April 2017, Machine Learning Vs Artificial Intelligence, <https://ndimensionz.com/kb/theoretical-explanation-of-how-siri-works>
12) Al Smadi, T. et al., Journal of Signal and Information Processing, 2015, 6, 66-72, published online May 2015 in SciRes, <http://www.scirp.org/journal/jsip>, <http://dx.doi.org/10.4236/jsip.2015.62006>
13) Halageri, A. et al., International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 6 (3), 2015, 3206-3209
14) Purwar, K., International Journal of Computer Applications (0975-8887), Volume 172, No. 6, August 2017
15) Pal Singh, R. & Arora, S., December 2012, Automatic speech recognition system: A review, vol. 60
16) Deshmukh, R. & Malik Abdullah Alasadi, A., 22 May 2018, Automated speech recognition: A review
17) Tatman, R., 2017, machine training dataset, <https://www.kaggle.com/rtatman/speech-accent-archive>
... In terms of technological development, speech recognition has a long history in man technological applications. Recently, it has developed a lot when deep learning was used as well as when using big data, and it has been significantly improved not only in term of academic papers and research published in this regard but also in its use in the globa industry and information technology, such as in global companies such as Google, Face book, and Microsoft [29]. ...
... Therefore, scientists have been keen to change the policy and method o processing that deals with the nature of language [31], as shown in Figure 2. In terms of technological development, speech recognition has a long history in many technological applications. Recently, it has developed a lot when deep learning was used, as well as when using big data, and it has been significantly improved not only in terms of academic papers and research published in this regard but also in its use in the global industry and information technology, such as in global companies such as Google, Facebook, and Microsoft [29]. ...
... In terms of technological development, speech recognition has a long history in many technological applications. Recently, it has developed a lot when deep learning was used, as well as when using big data, and it has been significantly improved not only in terms of academic papers and research published in this regard but also in its use in the global industry and information technology, such as in global companies such as Google, Facebook, and Microsoft [29]. ...
Article
Full-text available
This study investigates the enhancement of automated driving and command control through speech recognition using a Deep Neural Network (DNN). The method depends on some sequential stages such as noise removal, feature extraction from the audio file, and their classification using a neural network. In the proposed approach, the variables that affect the results in the hidden layers were extracted and stored in a vector to classify them and issue the most influential ones for feedback to the hidden layers in the neural network to increase the accuracy of the result. The result was 93% in terms of accuracy and with a very good response time of 0.75 s, with PSNR 78 dB. The proposed method is considered promising and is highly satisfactory to users. The results encouraged the use of more commands, more data processing, more future exploration, and the addition of sensors to increase the efficiency of the system and obtain more efficient and safe driving, which is the main goal of this research.
... In speech recognition [260], [261], [262], [263] , SSOD aids in identifying and classifying speech patterns and phonetic elements within audio data, even with limited labeled samples. By leveraging both labeled and unlabeled speech data, these models can better discern speech signals from background noise and accurately transcribe spoken words into text. ...
Preprint
Full-text available
The impressive advancements in semi-supervised learning have driven researchers to explore its potential in object detection tasks within the field of computer vision. Semi-Supervised Object Detection (SSOD) leverages a combination of a small labeled dataset and a larger, unlabeled dataset. This approach effectively reduces the dependence on large labeled datasets, which are often expensive and time-consuming to obtain. Initially, SSOD models encountered challenges in effectively leveraging unlabeled data and managing noise in generated pseudo-labels for unlabeled data. However, numerous recent advancements have addressed these issues, resulting in substantial improvements in SSOD performance. This paper presents a comprehensive review of 27 cutting-edge developments in SSOD methodologies, from Convolutional Neural Networks (CNNs) to Transformers. We delve into the core components of semi-supervised learning and its integration into object detection frameworks, covering data augmentation techniques, pseudo-labeling strategies, consistency regularization, and adversarial training methods. Furthermore, we conduct a comparative analysis of various SSOD models, evaluating their performance and architectural differences. We aim to ignite further research interest in overcoming existing challenges and exploring new directions in semi-supervised learning for object detection.
Article
Мова є найбільш природною формою людського спілкування, тому реалізація інтерфейсу, який базується на аналізі мовленнєвої інформації є перспективним напрямком розвитку інтелектуальних систем управління. Система автоматичного розпізнавання мовлення – це інформаційна система, що перетворює вхідний мовленнєвий сигнал на розпізнане повідомлення. Процес розпізнавання мовлення є складним і ресурсоємним завданням через високу варіативність промови, яка залежить від віку, статі та фізіологічних характеристик мовця. У статті представлено узагальнений опис задачі розпізнавання мовлення, що складається з етапів: передискретизація, кадрування та застосування вікон, виділення ознак, нормалізація довжини голосового тракту та шумопригнічення. Попередня обробка мовленнєвого сигналу є першим і ключовим етапом у процесі автоматичного розпізнавання мови, оскільки якість вхідного сигналу суттєво впливає на якість розпізнавання і кінцевий результат цього процесу. Попередня обробка мови складається з очищення вхідного сигналу від зовнішніх і небажаних шумів, виявлення мовленнєвої активності та нормалізації довжини голосового тракту. Метою попередньої обробки мовленнєвого сигналу є підвищення обчислювальної ефективності систем розпізнавання мови та систем керування із природньомовним інтерфейсом. У статті запропоновано використання швидкого перетворення Фур’є для описування вхідного аудіо сигналу; вікна Hamming для створення сегментів аудіосигналу з подальшим визначенням ознак засобами Mel-Frequency Cepstral Coefficients. Описано використання алгоритму динамічного трансформування часової шкали для нормалізації довжини голосового тракту та рекурентної нейронної мережі для шумопригнічення. Наведено результати експерименту щодо попередньої обробки аудіо сигналу голосових команд для керування застосунками мобільного телефону з оперативною системою Android.
Article
The rapid evolution of technology has positioned Artificial intelligence (AI) at the forefront of innovation in various sectors, notably in healthcare. This review explores AI’s current and future roles in primary healthcare, highlighting its transformative impact and the challenges that accompany its integration. AI currently aids healthcare practitioners in a myriad of ways. Diagnostic procedures have been revolutionized by AI algorithms capable of analysing medical images with precision. Administrative tasks, such as patient scheduling and record-keeping, have become more efficient due to AI’s streamlining capabilities. Predictive analytics, a critical AI feature, plays a pivotal role in pre-empting health complications by analysing extensive patient data. Additionally, AI has significantly advanced telemedicine, offering wider access to healthcare, a crucial development amidst global health emergencies. Moreover, AI contributes substantially to personalized medicine, analysing large-scale data, including genetic information, to tailor treatment plans. Its integration into Electronic Health Records (EHR) systems enhances data processing, improving treatment outcomes and operational efficiency. Despite these advancements, challenges persist. Data privacy concerns, potential erosion of the human element in patient care, difficulties integrating AI with existing systems, and the risk of over-reliance on technology are pressing issues that require careful management. Ethical considerations, including algorithmic transparency and accountability, also pose significant challenges. Looking ahead, AI’s trajectory in primary healthcare is geared towards further advancements, with an emphasis on ethical implications and maintaining a balance between technology and human-centred care. The potential of AI to enhance areas like chronic disease management and mental health care is vast. 
As we approach an AI-driven era in primary healthcare, this review underscores the importance of merging technological innovation with empathetic patient care.
Article
Speech recognition is one of the prominent research topics in Natural Language Processing (NLP). It removes barriers and makes communication between humans and devices easier. The aim of this study is to analyze the Automatic Speech Recognition Systems (ASRS) that different researchers have proposed using machine learning and deep learning techniques. Speech recognition systems for Indian and foreign languages are considered, including Hindi, Marathi, Malayalam, Urdu, Sanskrit, Nepali, Kannada, Chinese, Japanese, Arabic, Italian, Turkish, French, and German. An integrated framework is presented and elaborated in light of recent advances. Platforms used to build speech recognition models, such as the Hidden Markov Model Toolkit (HTK), CMU Sphinx, and the Kaldi toolkit, are explained. Finally, several applications that illustrate the uses of ASRS are described.
Conference Paper
In computer science, a large number of studies on speech applications, particularly speech recognition, have been carried out over the last few decades. Through a process of identification and interpretation, speech recognition enables a system to turn received speech signals into instructions. Because speech recognition was far less capable in its early phases than it is now, many studies have focused on it, making it a distinctive research area. Automatic speech recognition has made substantial progress on a range of problems, one of which is determining the speaker's surroundings from the audio around the speaker. The core contribution of this project is to explore how extensive speech recognition knowledge can be used to identify the speaker's surrounding environment. The research also gives a brief review of speech recognition techniques and applications since their inception.
Chapter
Acoustic modeling of sound units is a crucial component of an Automatic Speech Recognition (ASR) system. It is the process of establishing statistical representations of the feature-vector sequences for each sound unit, so that a classifier covering all the sound units used in the ASR system can be designed. Current ASR systems use the Hidden Markov Model (HMM) to handle temporal variability and the Gaussian Mixture Model (GMM) for acoustic modeling. More recently, machine learning paradigms have been explored for speech recognition; in this regard, the Multilayer Perceptron (MLP), the Recurrent Neural Network (RNN), and similar models are used extensively. Artificial Neural Networks (ANNs) are trained by back-propagating error derivatives and therefore have the potential to learn much better models of nonlinear data. Deep Neural Networks (DNNs) with many hidden layers have recently gained favor among researchers and have been accepted as suitable for modeling speech signals. This chapter describes various techniques and works on ANN-based acoustic modeling.
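To make the HMM/GMM division of labor concrete, the following pure-Python sketch scores a sequence of feature vectors with the forward algorithm, using a GMM as each state's emission model. Diagonal covariances, the log-domain formulation and all function names are assumptions for illustration; real toolkits add training (for example Baum-Welch/EM), pruning, and far larger models.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def diag_gauss_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (one variance per dimension)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - mu) ** 2 / v
                      for xi, mu, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """log p(x | state) for a GMM emission: log sum_m w_m N(x; mu_m, diag(var_m))."""
    return logsumexp([math.log(w) + diag_gauss_logpdf(x, mu, v)
                      for w, mu, v in zip(weights, means, variances)])


def hmm_forward_loglik(frames, log_init, log_trans, state_gmms):
    """Forward algorithm in the log domain: returns log P(frames | HMM).

    log_init[s]     log probability of starting in state s (-inf if impossible)
    log_trans[p][s] log probability of moving from state p to state s
    state_gmms[s]   (weights, means, variances) of the GMM attached to state s
    """
    states = range(len(log_init))
    alpha = [log_init[s] + gmm_loglik(frames[0], *state_gmms[s]) for s in states]
    for x in frames[1:]:
        # The right-hand side is fully evaluated with the previous alpha before rebinding.
        alpha = [logsumexp([alpha[p] + log_trans[p][s] for p in states])
                 + gmm_loglik(x, *state_gmms[s])
                 for s in states]
    return logsumexp(alpha)
```

In the hybrid DNN-HMM systems the chapter alludes to, the per-state GMM score is typically replaced by a scaled likelihood (the network's state posterior divided by the state prior), while the forward and Viterbi recursions stay the same.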
Article
Speech recognition, or speech-to-text, involves capturing and digitizing sound waves, transforming them into basic linguistic units (phonemes), constructing words from the phonemes, and analyzing the words in context to ensure the correct spelling of words that sound the same. Approach: the study examines the possibility of designing a software system based on one artificial intelligence technique, neural networks, that can distinguish the sound signals of irregular users. Fixed weights are first trained on those forms, and the system then outputs a match for each of these forms at high speed. The proposed neural network study builds on solutions to speech recognition tasks, signal detection using angular modulation, and detection of modulation techniques.
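The last step in this pipeline, choosing between words that sound the same, can be illustrated with a toy bigram language model. This is a hedged sketch, not the article's method (the article uses neural networks): the pronunciation lexicon, the ARPAbet-style phoneme strings and the bigram counts are hypothetical, add-one smoothing is an assumed choice, and the exhaustive search over candidates stands in for the Viterbi or beam search a real decoder would use.

```python
import math
from itertools import product

# Hypothetical toy pronunciation lexicon: phoneme string -> homophone spellings.
LEXICON = {
    "AY": ["I", "eye"],
    "W AA N T": ["want"],
    "T UW": ["to", "too", "two"],
    "G OW": ["go"],
}

# Hypothetical bigram counts standing in for a trained language model.
BIGRAMS = {("<s>", "I"): 5, ("<s>", "eye"): 1, ("I", "want"): 4,
           ("want", "to"): 6, ("want", "two"): 1, ("to", "go"): 5}


def pick_spellings(phoneme_words):
    """Return the spelling sequence with the highest add-one-smoothed bigram log-probability."""
    vocab_size = len({w for spellings in LEXICON.values() for w in spellings})
    prev_totals = {}
    for (prev, _), count in BIGRAMS.items():
        prev_totals[prev] = prev_totals.get(prev, 0) + count

    def log_prob(prev, word):
        # P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)
        return math.log((BIGRAMS.get((prev, word), 0) + 1) /
                        (prev_totals.get(prev, 0) + vocab_size))

    best, best_score = None, float("-inf")
    # Exhaustive search: exponential in sentence length, acceptable only for a toy example.
    for candidate in product(*(LEXICON[p] for p in phoneme_words)):
        history = ("<s>",) + candidate[:-1]
        score = sum(log_prob(p, w) for p, w in zip(history, candidate))
        if score > best_score:
            best, best_score = list(candidate), score
    return best


print(pick_spellings(["AY", "W AA N T", "T UW", "G OW"]))
```

With these counts the sketch prefers "I want to go" over alternatives such as "eye want two go", because the context "want" makes "to" far more likely than "two" or "too".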
Gonfalonieri, A. (2018). Towards Data Science. <https://towardsdatascience.com/how-amazon-alexaworks-your-guide-to-natural-language-processing-ai-7506004709d3>
Goel, A. (February 2, 2018). Artificial Intelligence. <https://magoosh.com/data-science/siri-workscience-behind-siri>
Halageri, A., et al. (2015). International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 6 (3), pp. 3206-3209.
Purwar, K. (August 2017). International Journal of Computer Applications (0975-8887), Vol. 172, No. 6.
Deshmukh, R., & Alasadi, A. M. A. (May 22, 2018). Automated speech recognition: A review.