Scientific Research and Essays Vol. 6(2), pp. 341-350, 18 January, 2011
Available online at
ISSN 1992-2248 ©2011 Academic Journals
Full Length Research Paper
Hidden Markov model/Gaussian mixture models
(HMM/GMM) based voice command system: A way to
improve the control of remotely operated robot arm
Ibrahim M. M. El-emary1*, Mohamed Fezari2 and Hamza Attoui3
1Information Technology Deanship, King Abdulaziz University, Kingdom of Saudi Arabia.
2Department of Electronics, University of Annaba, Faculty of Engineering, Laboratory of Automatic and Signals,
Annaba, BP.12, Annaba, 23000, Algeria.
Accepted 11 November, 2010
A speech control system for a didactic manipulator arm TR45 is designed as an agent in a tele-
manipulator system command. Robust Hidden Markov Model (HMM) and Gaussian Mixture models
(GMM) are applied in spotted words recognition system with Cepstral coefficients with energy and
differentials as features. The HMM and GMM are used independently in automatic speech recognition
agent to detect spotted words and recognize them. A decision block will generate the appropriate
command and send it to a parallel port of the Personal Computer (PC). To implement the approach on a
real-time application, a PC parallel port interface was designed to control the movement of robot motors
using a wireless communication component. The user can control the movements of the robot arm using normal speech containing spotted words.
Key words: Human-machine interaction, hidden Markov model, Gaussian mixture models, artificial intelligence,
automatic guided vehicle, voice command, robot arm and robotics.
Manipulator robots are used in industry to reduce or eliminate the need for humans to perform tasks in dangerous environments; examples include space exploration, mining, and toxic-waste cleanup. However,
the motion of articulated robot arms differs from the
motion of the human arm. While robot joints have fewer
degrees of freedom, they can move through greater
angles. For example, the elbow of an articulated robot
can bend up or down whereas a person can only bend
their elbow in one direction with respect to the straight
arm position (Beritelli et al., 1998; Bererton and Khosla,
2001). There have been many research projects dealing with robot control and tele-operation of arm manipulators; among these, some build intelligent systems (Kwee, 1997; Buhler et al., 1994; Ibrahim et al., 2010). Since we have seen human-like robots in science-fiction movies such as "I, Robot", making intelligent robots or intelligent systems has become an obsession within the research group.

*Corresponding author. E-mail:
In addition, speech or voice command as a human-robot interface has a key role in many application fields, and various studies made in the last few years have given good results in both research and commercial applications for speech recognition systems (Bererton and Khosla, 2001; Rao et al., 1998; Yussof et al., 2005). In this paper, we present a new approach to the recognition of spotted words within a phrase, using statistical approaches based on HMM and GMM (Gu and Rose, 2001; Rabiner, 1989). By combining the two methods, the system achieves considerable improvement in the recognition phase, thus facilitating the final decision and reducing the number of decision errors taken by the voice-command-guided system.
Speech recognition systems constitute the focus of a
large research effort in Artificial Intelligence (AI), which
has led to a large number of new theories and new techniques. However, it is only recently that the field of robot
and Automatic Guided Vehicle (AGV) navigation has
started to import some of the existing techniques
developed in AI for dealing with uncertain information.
HMM is a robust technique widely applied in pattern recognition. Very interesting results were obtained in isolated-word, speaker-independent recognition systems, especially with limited vocabularies. However, the
rate of recognition is lower in continuous speaking
system. The GMM is also a statistical model that has
been used in speaker recognition and in isolated word
recognition systems. These two techniques HMM and
GMM were experimented independently and then
combined in order to increase the recognition rate. The
approach proposed in this paper is to design a system that detects specific words within a large or small phrase, processes the selected words (spots) and then executes an order (Djemili et al., 2004; Rabiner, 1989). As an application of this approach, a set of four reduction motors is activated via a wireless system installed on a Personal Computer (PC) parallel-port interface. The application uses a set of twelve Arabic command words, divided into two subsets: one subset contains the names of the main parts of the robot arm (arm, fore-arm, wrist (hand), and gripper); the second subset contains the actions that can be taken by one of the parts in subset one (left, right, up, down, stop, open and close). A specific word, "yade" (which means arm),
is also used at the beginning of the phrase as a
“password”. Voice command needs the recognition of
spotted words from a limited vocabulary used in AGV
system (Ferrer et al., 2000; Heck, 1997) and in
manipulator arm control (Rodriguez et al., 2003).
Our application is based on voice command for a set of four reduction motors. It therefore involves the recognition of spotted words from a limited vocabulary used to recognise the part and the action of a robot arm. The vocabulary is limited to twelve words divided into two subsets: an object-name subset, needed to select the part of the robot arm to move, and a command subset, needed to control the movement of the arm, for example turn left, turn right and stop for the base (shoulder), and open, close and stop for the gripper. The number of words in the vocabulary was kept to a minimum to make the application both simpler and easier for the user.
The user selects the robot arm part by its name and then gives the movement order through a microphone connected to the sound card of the PC. The user can give the order in a natural-language phrase, for example: "Yade, gripper, open, execute". A speech recognition agent based on the HMM technique detects the spotted words within the phrase, recognises the main word "Yade", which is used as a keyword in the phrase, and recognises the other spotted words; the system then generates a byte where the four most significant bits represent a code for the part of the robot arm and the four least significant bits represent the action to be taken by the robot arm. Finally, the byte is sent to the parallel port of the PC and then transmitted to the robot through a wireless transmission module.
The application is first simulated on the PC. It includes three phases: the training phase, where a reference pattern file is created; the recognition phase, where the decision to generate an accurate action is taken; and the code generation phase, where the system generates an 8-bit code on the parallel port. In this code, the four higher bits are used to codify the object names and the four lower bits are used to codify the actions. The action is shown in real time on a parallel-port interface card that includes a set of four stepper motors, to show which command is taken, and the radio-frequency emitter.
The speech recognition agent is based on HMM. In this section, a brief definition of HMM is presented and the main speech processing blocks are explained. However, a pre-requisite phase is necessary to process a database composed of the twelve vocabulary words repeated twenty times by fifty persons (twenty-five male and twenty-five female). So, before starting the creation of parameters, 50*20*12 "wav" files are recorded in a repository. Files from 35 speakers are saved in DB1 to be used for training, and files from the remaining 15 speakers are used for tests and saved in DB2; these tests are done off-line.
In the training phase, each utterance (saved wav file)
is converted to a Cepstral domain (MFCC features,
energy, and first and second order deltas) which
constitutes an observation sequence for the estimation
of the HMM parameters associated to the respective
word. The estimation is performed by optimisation of
the likelihood of the training vectors corresponding to
each word in the vocabulary. This optimisation is
carried out by the Baum-Welch algorithm (Rabiner, 1989; Ibrahim et al., 2010).

Figure 1. Presentation of a left-right (Bakis) HMM.
An HMM is a type of stochastic model appropriate for non-stationary stochastic sequences whose statistical properties undergo distinct random transitions among a set of different stationary processes. In other words, the HMM models a sequence of observations as a piecewise stationary process. Over the past years, HMMs have been widely applied in pattern recognition (Djemili et al., 2004) and speech recognition (Djemili et al., 2004; Ferrer et al., 2000). HMMs are suitable for the classification of one- or two-dimensional signals and can be used when the information is incomplete or uncertain. To use an HMM, we need a training phase and a test phase. For the training phase, we usually work with the Baum-Welch algorithm to estimate the parameters (Π, A, B) of the HMM (Rabiner, 1989; Ferrer et al., 2000). This method is based on the maximum-likelihood criterion. To compute the most probable state sequence, the Viterbi algorithm is the most suitable.
The HMM is basically a stochastic finite-state automaton which generates an observation string, that is, the sequence of observation vectors O = O_1, ..., O_T. Thus, an HMM consists of a number N of states S = {S_1, ..., S_N} and of the observation string produced as a result of emitting a vector O_t for each successive transition from a state S_i to a state S_j. O_t is d-dimensional and, in the discrete case, takes its values in a library of M symbols. The state transition probability distribution between states S_i and S_j is A = {a_ij}, the observation probability distribution of emitting any vector O_t at state S_j is given by B = {b_j(O_t)}, and the probability distribution of the initial state is Π = {π_i}:

a_ij = P(q_{t+1} = S_j | q_t = S_i)   (1)

b_j(O_t) = P(O_t | q_t = S_j)   (2)

π_i = P(q_1 = S_i)   (3)
Given an observation sequence O and an HMM model λ = (A, B, Π), the probability of the observed sequence P(O|λ) can be computed by the forward-backward procedure (Kwee, 1997). The forward variable α_t(i) is defined as the probability of the partial observation sequence O_1 O_2 ... O_t (until time t) and of the state S_i at time t, given the model λ. The backward variable β_t(i) is defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model λ. The probability of the observation sequence is computed as follows:

P(O|λ) = Σ_{i=1..N} α_t(i) β_t(i) = Σ_{i=1..N} α_T(i)   (4)

and the probability of being in state S_i at time t (given the observation sequence O and the model λ) is computed as follows:

γ_t(i) = α_t(i) β_t(i) / P(O|λ)   (5)

A connected (ergodic) HMM is an HMM with all the states linked together (every state can be reached from any state). The Bakis HMM is a left-to-right HMM with a transition matrix defined as shown in Figure 1.
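The forward recursion behind Equation 4 can be sketched for a discrete-observation HMM as follows; the two-state, two-symbol model and its probabilities are toy values chosen for illustration, not parameters from the paper:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda).
    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial distribution, obs: sequence of symbol indices."""
    N = A.shape[0]
    T = len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
    return alpha[-1].sum()                            # P(O | lambda), Equation 4

# Toy 2-state, 2-symbol model (illustrative values only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
p = forward(A, B, pi, [0, 1, 0])
```

In practice the recursion is run in the log domain (or with per-step scaling) to avoid underflow on long observation sequences.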
The GMM can be viewed as a hybrid between parametric and non-parametric density models, as shown in Figure 2. Like a parametric model, it has structure and parameters that control the behaviour of the density in known ways. Like a non-parametric model, it has many degrees of freedom to allow arbitrary density modelling. The GMM density is defined as a weighted sum of Gaussian densities, given by Equation 6 as follows:

P_G(x) = Σ_{m=1..M} w_m g(x, μ_m, C_m)   (6)

Here m indexes the Gaussian components (m = 1...M), and M is the total number of Gaussian components. The w_m are the component probabilities (Σ w_m = 1), also called weights. We consider K-dimensional densities, so the argument is a vector x = (x_1, ..., x_K)^T. The component probability
Figure 2. Speech recognition agent based on HMM/GMM.
Figure 3. Phrase test "yade diraa fawk tabek", with silence at the beginning and at the end.
density density function (pdf), g(x, μ_m, C_m), is a K-dimensional Gaussian probability density function given by Equation 7 as follows:

g(x, μ_m, C_m) = (2π)^(−K/2) |C_m|^(−1/2) exp( −(1/2)(x − μ_m)^T C_m^(−1) (x − μ_m) )   (7)

where μ_m is the mean vector and C_m is the covariance matrix. A Gaussian mixture model probability density function is thus completely defined by the parameter list λ = {w_m, μ_m, C_m}, m = 1...M.
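Equations 6 and 7 translate directly to code; the sketch below evaluates a toy two-component, two-dimensional mixture whose parameters are illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """K-dimensional Gaussian density g(x, mu, C) of Equation 7."""
    K = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (K / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density P_G(x) of Equation 6: weighted sum of Gaussians."""
    return sum(w * gaussian_pdf(x, mu, C)
               for w, mu, C in zip(weights, means, covs))

# Illustrative two-component, 2-D mixture.
weights = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
density = gmm_pdf(np.zeros(2), weights, means, covs)
```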
Organising the data for input to the GMM is important, since the components of the GMM play a vital role in making the word models. For this purpose, we use the K-means clustering technique to break the data into 256 cluster centroids. These centroids are then grouped into sets of 32 and passed into each component of the GMM. As a result, we obtain a set of 8 GMM components. Once the component inputs are decided, the GMM modelling can be implemented (Figure 3).
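A scaled-down sketch of this clustering scheme is shown below, with a plain k-means stand-in and 16 centroids grouped into sets of 2 (instead of the paper's 256 centroids grouped into sets of 32); the random data and reduced sizes are assumptions made to keep the example small:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: illustrative stand-in for the centroid-extraction step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # stand-in for the feature vectors
centroids = kmeans(X, 16)              # the paper uses 256 centroids
groups = centroids.reshape(8, 2, 2)    # 16 centroids in sets of 2 -> 8 components
```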
The expectation-maximization (EM) algorithm is an iterative method for calculating maximum-likelihood parameter estimates from incomplete data (elements missing in feature vectors). The EM update equations give a procedure to iteratively maximize the log-likelihood of the training data given the model. The EM algorithm is a two-step process:

Estimation step, in which the current iteration values of the mixture are used to determine the component posteriors (responsibilities) for the next iteration, as given in Equation 8:

p(m|x_n) = w_m g(x_n, μ_m, C_m) / Σ_{k=1..M} w_k g(x_n, μ_k, C_k)   (8)

Maximization step, in which the predicted values are then maximized to obtain the parameter values for the next iteration, as given in Equations 9, 10 and 11:

w_m = (1/N) Σ_{n=1..N} p(m|x_n)   (9)

μ_m = Σ_{n=1..N} p(m|x_n) x_n / Σ_{n=1..N} p(m|x_n)   (10)

C_m = Σ_{n=1..N} p(m|x_n) (x_n − μ_m)(x_n − μ_m)^T / Σ_{n=1..N} p(m|x_n)   (11)

The EM algorithm is well known and highly appreciated for its numerical stability under minimum threshold values. Using the final re-estimated w_m, μ_m and C_m, the likelihood L is calculated with respect to all the word models available to the recognition engine, as in Equation 12:

L = Σ_{n=1..N} log P_G(x_n)   (12)
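A single EM iteration of Equations 8 to 11 can be sketched for scalar (1-D) data, so that a variance plays the role of C_m; the data points and starting parameters are illustrative assumptions:

```python
import numpy as np

def em_step(x, w, mu, var):
    """One EM iteration for a 1-D Gaussian mixture (Equations 8-11)."""
    # E-step (Equation 8): responsibilities p(m | x_n).
    g = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = w * g
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step (Equations 9-11): re-estimate weights, means, variances.
    Nm = resp.sum(axis=0)
    w_new = Nm / len(x)
    mu_new = (resp * x[:, None]).sum(axis=0) / Nm
    var_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nm
    return w_new, mu_new, var_new

# Two well-separated clusters; one iteration pulls the means toward them.
x = np.array([-2.0, -1.8, -2.2, 2.0, 1.9, 2.1])
w, mu, var = em_step(x, np.array([0.5, 0.5]),
                     np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
```

Iterating this step until the log-likelihood of Equation 12 stops improving (or a minimum-change threshold is reached) gives the final word-model parameters.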
Figure 4. Windowing diagram.
The HMM/GMM hybrid model has the ability to find the joint maximum probability among all possible reference words given the observation sequence O. In practice, combining the GMMs and the HMMs with a weighting coefficient may be a good scheme because of the difference in training methods. The i-th word-independent GMM produces the likelihood L_GMM(i), i = 1, 2, ..., W, where W is the number of words. The i-th word-independent HMM also produces the likelihood L_HMM(i), i = 1, 2, ..., W. All these likelihood values are passed to the likelihood decision block, where they are transformed into the new combined likelihood L'(i):

L'(i) = (1 − x) L_GMM(i) + x L_HMM(i)   (13)

where x denotes a weighting coefficient. The value of x is calculated during training of the hybrid model. In hybrid testing, a subset of the training data is used, and its HMM and GMM likelihood values are calculated and combined using the weighting coefficient. Static values of the weighting coefficient are also used in order to get a higher recognition rate. In this scheme, 12 HMM models (one per vocabulary word) and 12 GMM models (one for each word) are built, and the result of both models is taken by the decision block.
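The decision rule of Equation 13 reduces to a few lines; the per-word log-likelihood values and the weight x = 0.5 below are illustrative assumptions:

```python
def hybrid_decide(l_gmm, l_hmm, x=0.5):
    """Combine per-word GMM and HMM log-likelihoods with a weighting
    coefficient x (Equation 13) and return the best word index."""
    combined = [(1 - x) * g + x * h for g, h in zip(l_gmm, l_hmm)]
    return max(range(len(combined)), key=lambda i: combined[i])

# Illustrative log-likelihoods for a 3-word vocabulary: word 1 wins
# under both models, so the hybrid also picks it.
best = hybrid_decide([-10.0, -4.0, -9.0], [-8.0, -5.0, -12.0], x=0.5)
```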
Once the phrase is acquired through a microphone and the PC sound card, the samples are stored in a wav file. Then the speech processing phase is activated. During this phase, the signal (samples) goes through different steps: pre-emphasis, frame blocking, windowing, feature extraction and Mel-Frequency Cepstral Coefficient (MFCC) computation.
Pre-emphasis step
In general, the digitized speech waveform has a high dynamic range. In order to reduce this range, pre-emphasis is applied. By pre-emphasis, we imply the application of a high-pass filter, which is usually a first-order FIR filter of the form H(z) = 1 − a z^(−1). The pre-emphasis is implemented as a fixed-coefficient filter or as an adaptive one, where the coefficient a is adjusted with time according to the autocorrelation values of the speech. The pre-emphasis block has the effect of spectral flattening, which renders the signal less susceptible to finite-precision effects (such as overflow and underflow) in any subsequent processing of the signal. The value selected for a in our work is 0.9375.
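The fixed-coefficient pre-emphasis filter with a = 0.9375 can be sketched as:

```python
import numpy as np

def pre_emphasis(x, a=0.9375):
    """First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1]."""
    y = np.empty_like(x)
    y[0] = x[0]               # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y

# A constant (DC) signal is flattened to small values after the first
# sample, illustrating the high-pass behaviour.
y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
```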
Frame blocking
Since the vocal tract moves mechanically slowly, speech
can be assumed to be a random process with slowly
varying properties. Hence, the speech is divided into
overlapping frames of 20 ms every 10 ms. The speech
signal is assumed to be stationary over each frame and
this property will prove useful in the following steps.
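The 20 ms frames taken every 10 ms can be sketched as follows; the 16 kHz sampling rate is an assumption, as the paper does not state it:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, step_ms=10):
    """Split a signal into overlapping frames: frame_ms-long frames
    taken every step_ms (50% overlap with the defaults)."""
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // step
    return np.stack([x[i * step:i * step + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000), fs=16000)  # one second at 16 kHz
```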
Figure 5. MFCC block diagram.
Windowing

To minimize the discontinuity of the signal at the beginning and the end of each frame, we window each frame. The windowing tapers the signal towards zero at the beginning and end of each frame. A typical window is the Hamming window of the form:

W(n) = 0.54 − 0.46 cos( 2πn / (N − 1) ),  0 ≤ n ≤ N − 1   (14)
Feature extraction

In this step, the speech signal is converted into a stream of feature-vector coefficients which contain only the information about the given utterance that is important for its correct recognition. An important property of feature extraction is the suppression of information irrelevant for correct classification, such as information about the speaker (e.g. fundamental frequency) and information about the transmission channel (e.g. the characteristics of the microphone). The feature measurements of speech signals are typically extracted using one of the following spectral analysis techniques: Mel-frequency filter-bank analysis, LPC analysis or discrete Fourier transform analysis. Currently, the most popular features are the Mel-frequency cepstral coefficients, MFCC (Rabiner, 1989).
MFCC analysis
The MFCC are extracted from the speech signal as
shown in Figure 4. The speech signal is pre-emphasized,
framed and then windowed, usually with a Hamming
window. Mel-spaced filter banks are then utilized to get
the Mel spectrum. The natural logarithm is then taken to
transform into the cepstral domain and the discrete
cosine transform is finally computed to get the MFCCs as
shown in the block diagram of Figure 5.
c_i = Σ_{k=1..K} log(S_k) cos( i (k − 1/2) π / K )

where S_k is the output of the k-th Mel filter bank and the acronyms in Figure 5 signify:

- PE-FB-W: Pre-Emphasis, Frame Blocking and Windowing
- FFT: Fast Fourier Transform
- LOG: Natural Logarithm
- DCT: Discrete Cosine Transform
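The final DCT step that turns log filter-bank energies into cepstral coefficients can be sketched as follows; the filter-bank size (20) and number of coefficients (12) are illustrative assumptions:

```python
import numpy as np

def cepstra_from_filterbank(S, n_ceps=12):
    """DCT of the log filter-bank energies (the last MFCC step):
    c_i = sum_k log(S_k) * cos(i * (k - 1/2) * pi / K)."""
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(i * (k - 0.5) * np.pi / K))
                     for i in range(1, n_ceps + 1)])

# With equal energies in every band the log spectrum is flat, so all
# cepstral coefficients vanish (the DCT basis is orthogonal to a constant).
c = cepstra_from_filterbank(np.ones(20) * np.e)
```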
The speech recognition agent based on HMM detects the words and processes each word. Depending on the probability of recognition of the object name and the command word, a code is transmitted to the parallel port of the PC. The vocabulary to be recognised by the system and its meanings are listed in Table 1. Within these words, some are object names and others are command names. The code to be transmitted is composed of 8 bits: the four most significant bits are used to code the object name and the four least significant bits are used to code the command to be executed by the selected object. Example: "yade diraa fawk tabek".
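The code-generation step can be sketched as follows, using the parenthesised codes of Table 1 for the nibble values; the Latin spellings of the Arabic words follow the transliterations in Table 1 and should be treated as assumptions:

```python
# Nibble codes from Table 1: object names in the four most significant
# bits, actions in the four least significant bits.
OBJECTS = {"diraa": 2, "saad": 3, "meassam": 4, "mikbath": 5}
ACTIONS = {"yamine": 1, "yassar": 2, "fawk": 3, "tahta": 4,
           "iftah": 5, "ighlak": 6, "kif": 7}

def command_byte(phrase):
    """Spot the object and action words in a phrase (which must start with
    the keyword 'yade') and pack them into the 8-bit parallel-port code."""
    words = phrase.lower().split()
    if not words or words[0] != "yade":
        return None                       # keyword missing: ignore phrase
    obj = next((OBJECTS[w] for w in words if w in OBJECTS), None)
    act = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    if obj is None or act is None:
        return None
    return (obj << 4) | act

# "diraa" (upper limb, code 2) + "fawk" (up, code 3); the extra word
# "tabek" is simply ignored, as in word spotting.
code = command_byte("yade diraa fawk tabek")
```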
A parallel-port interface was designed to display the real-time commands. It is based on the following TTL ICs (integrated circuits): a 74LS245 buffer, a PIC16F84 microcontroller and a RADIOMETRIX TX433-10 radio-frequency transmitter (modulation frequency 433 MHz, transmission rate 10 kb/s) (Table 1; Data Sheet PIC16F876, 2001).

Table 1. Meaning of the vocabulary voice commands, assigned code and controlled motor.

1) Yade (1): name of the manipulator (keyword)
2) Diraa (2): upper limb motor (M1)
3) Saad (3): limb motor (M2)
4) Meassam (4): wrist (hand) motor (M3)
5) Mikbath (5): gripper motor (M4)
6) Yamine (1): left turn (M0)
7) Yassar (2): right turn (M0)
8) Fawk (3): up movement (M1, M2 and M3)
9) Tahta (4): down movement (M1, M2 and M3)
10) Iftah (5): open grip, action on M4
11) Ighlak (6): close grip, action on M4
12) Kif (7): stop the movement, stops M0, M1, M2, M3 or M4
As shown in Figures 6b and 6c, the structure of the mechanical hardware and the computer board of the robot arm in this paper is similar to MANUS (Kwee, 1997; Buhler et al., 1994), although the robot arm needs to perform simpler tasks than those in Heck (1997). The robot arm is composed of four feedback-controlled movements for the elements base, upper limb, limb and wrist; each movement is realised by a gear-motor (moto-reductor) block (1/500) powered by +12 and −12 V. The position-copy voltage is given by a linear rotary potentiometer fixed on the gear-motor block and powered by +10 and −10 V. One open-loop controlled movement, for the gripper, uses the same type of command. The displacement characteristics are given by the following angle values:

Base: 290°
Upper limb: 108°
Limb: 280°
Wrist: 290°
Gripper: 100°
The computer board of the robot arm consists of a PIC16F876 with 8K-instruction Electrically Erasable Programmable Read-Only Memory (EEPROM), three timers and 3 ports (Larson, 1999), four power circuits to drive the gear-motors, one H-bridge driver using BD134 and BD133 transistors for the DC motor controlling the gripper, and an RF receiver module from RADIOMETRIX, the SILRX-433-10 (modulation frequency 433 MHz, transmission rate 10 kb/s) (Radiometrix components, 2010), as shown in Figure 6b.
Each motor in the robot arm performs the task corresponding to a received command (example: "yamin", "kif" or "fawk"), as in Table 1. Commands and their corresponding tasks in autonomous robots may be changed in order to enhance or change the application.
In the recognition phase, the speech recognition agent gets the sentence to be processed, treats the spotted words, then takes a decision by setting the corresponding bit on the parallel-port data register, and hence the corresponding LED is lit. The code is also transmitted in serial mode to the TXM-433-10 (Yamano et al., 2005).
The speech recognition agent was tested within the L.A.S.A laboratory, where two different test conditions were run: off-line, using DB2, and in real time. There are three types of tests: the HMM model, the GMM model, and the HMM/GMM model, each tested off-line and in real time.
The results are presented in Figure 7, after testing the recognition of the command and object words 100 times in the following conditions: (a) off-line, which means the test words are selected from DB2; (b) in real time, which means some users command the system in real time. The results are shown in Figures 7a and b. It is obvious that the real-time test results are lower than those of the off-line tests; this is mainly due to changes in environment and material conditions.
The recognition of spotted words from a limited vocabulary in the presence of background noise was addressed in this paper. The application is speaker-independent; therefore, it does not need a training phase for each user. It should, however, be pointed out that this property does not depend on the overall approach but only on the method with which the reference patterns were chosen. So, by leaving the approach unaltered and choosing the reference patterns appropriately (based on speakers), the application can be made speaker-dependent.
The effect of the environment is also taken into consideration. Examining the results of the HMM/GMM model when only the microphone is changed, we notice that with the microphone Mic1, used in recording the database, we get a better rate than with a new microphone Mic2. The HMM-based model gives better results than the GMM independently; by combining GMM and HMM and using MFCC and their differentials as features, we increased the recognition rate. The application is speaker-independent; however, by computing parameters based on speakers' pronunciation, the system can be made speaker-dependent.
Figure 6a. Parallel interface circuit and a photo of the
designed card.
Figure 6b. Robot arm block diagram (Computer board and
A voice command system for robot arm is proposed and
implemented in this paper based on a hybrid model
HMM/GMM for spotted words.

Figure 6c. Overview of the robot arm and parallel interface.

Figure 7a. HMM, GMM and HMM/GMM model results, off-line tests.

The results of the tests show that a better recognition rate can be achieved using hybrid techniques, especially if the phonemes of the words selected for voice command are quite different. The effect of the microphone used for the tests is
proved in the results presented in Figure 7c. However, a
good position of the microphone and additional filtering
may enhance the recognition rate.
Spotted-word detection is based on speech detection followed by processing of the detected words. Once the parameters have been computed, the idea can be implemented easily
Figure 7b. HMM, GMM and HMM/GMM models results, on real-time tests.
Figure 7c. Microphone effect on the results.
within a hybrid design using a DSP and a microcontroller
since it does not need too much memory capacity.
Finally, since the designed electronic command for the
robot arm consists of a microcontroller and other low-cost
components namely wireless transmitters, the hardware
design can easily be carried out as future work (Kim et al., 1998; Fezari et al., 2005). Also, the application can be implemented on a DSP or a microcontroller in the future in order to be autonomous (Hongyu et al., 2004).
REFERENCES

Beritelli F, Casale S, Cavallaro A (1998). A Robust Voice Activity Detector for Wireless Communications Using Soft Computing. IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Signal Processing for Wireless Communications, 16(9).
Bererton C, Khosla P (2001). Towards a team of robots with reconfiguration and repair capabilities. Proceedings of the 2001 IEEE International Conference on Robotics and Automation, pp. 2923-
Rao RS, Rose K, Gersho A (1998). Deterministically Annealed Design of Speech Recognizers and Its Performance on Isolated Letters. Proceedings IEEE ICASSP'98, pp. 461-464.
Gu L, Rose K (2001). Perceptual Harmonic Cepstral Coefficients for Speech Recognition in Noisy Environment. Proc. ICASSP 2001, Salt Lake City.
Djemili R, Bedda M, Bourouba H (2004). Recognition of Spoken Arabic Digits Using Neural Predictive Hidden Markov Models. Int. Arab J. Inform. Technol. (IAJIT), 2: 226-233.
Rabiner LR (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Readings in Speech Recognition, pp. 267-295.
Hongyu LY, Zhao Y, Dai, Wang Z (2004). A Secure Voice Communication System Based on DSP. IEEE 8th International Conference on Control, Automation, Robotics and Vision, Kunming, China, pp. 132-137.
Ferrer MA, Alonso I, Travieso C (2000). Influence of Initialization and Stop Criteria on HMM-Based Recognizers. Electronics Lett., IEE, 36.
Kwee H (1997). Intelligent Control of Manus Wheelchair. Proceedings of the Conference on Rehabilitation Robotics, ICORR'97, Bath, 1997, pp. 91-94.
Yussof JM (2005). A Machine Vision System Controlling a Lynxarm Robot along a Path. University of Cape Town, South Africa, October.
Yamano HM, Nasu Y, Mitobe K, Ohka M (2005). Obstacle Avoidance in Groping Locomotion of a Humanoid Robot. Int. J. Adv. Robotic Syst., 2(3): 251-258.
Buhler C, Heck H, Nedza J, Schulte D (1994). MANUS Wheelchair-Mountable Manipulator - Further Developments and Tests. Manus Usergroup Mag., 2(1): 9-22.
Heck H (1997). User Requirements for a Personal Assistive Robot. Proc. of the 1st MobiNet Symposium on Mobile Robotics Technology for Health Care Services, Athens, pp. 121-124.
Rodriguez E, Ruiz B, Crespo AG, Garcia F (2003). Speaker Recognition Using a HMM/GMM Hybrid Model. Proceedings of the First International Conference on Audio- and Video-Based Biometric Person Authentication, pp. 227-234.
Larson M (1999). Speech Control for Robotic Arm within Rehabilitation. Master thesis, Division of Robotics, Dept. of Mechanical Engineering, Lund University, Sweden.
Data Sheet PIC16F876 (2001). Microchip Inc. User's Manual.
Radiometrix components (2010). TXM-433 and SILRX-433 Manual. HF Electronics Company.
Kim WJ, Lee JM, Kang SY, Shin JC (1998). Development of a Voice Remote Control System. Proceedings of the 1998 Korea Automatic Control Conference, Pusan, Korea, pp. 1401-1404.
Fezari MM, Bousbia-S, Bedda M (2005). Hybrid Technique to Enhance Voice Command System for a Wheelchair. Proceedings of the Arab Conference on Information Technology ACIT'05, Jordan.
Ibrahim M, El-Emary M, Fezari M (2010). Speech as a High Level Control for Teleoperated Manipulator Arm. The Second International Conference on Advanced Computer Control, China, 27-29 March.
... These statistical methods use probability distribution density and build a model with the entire data. So, it will have the knowledge and complete description of the actual problems [7]. Next to these approaches, the machine learning techniques like Artificial Neural Networks(ANNs), SVM, Deep Belief Networks (DBNs) are proposed to replace the conventional HMM/GMM systems [8]. ...
... GMM is commonly used in pattern matching problems since it involves an efficient mathematical straightforward analysis with a series of good computational properties [7]. GMM is a mixture of several Gaussian distributions and can therefore represent different subclasses inside one class [14,15]. ...
In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) system and have achieved great success in many applications. Among them, Template Matching techniques like Dynamic Time Warping (DTW), Statistical Pattern Matching techniques such as Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM), Machine Learning techniques such as Neural Networks (NN), Support Vector Machine (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for Tamil language using the above speech recognition techniques. The background of ASR system, the steps involved in ASR, merits and demerits of the conventional and machine learning algorithms and the observations made based on the experiments are presented in this paper. For the above developed system, highest word recognition accuracy is achieved with HMM technique. It offered 100% accuracy during training process and 97.92% for testing process.
... Even though the exclusive signals have various correctness measures, we can still identify the region with more accurately. Based on the higher similarity of the values, the methods classify speech recognition without noise signals (Choudhary, 2013b;Cohen, 2002;Cohen & Baruch, 2009;El-Emary, 2011;Elminir, 2012). ...
Full-text available
Automatic Speech Recognition (ASR) is a self-governing, computer-based spoken language transcript for real-time applications. It is used in various real time applications and it listens the speech signals through a microphone, identifies the words, and assists a network in converting the written text. When we use the ASR system in multiple environments there is a possibility of ambient noise captured by a microphone unit and ASR system doesn’t predicting correct words. The Non-linear Acoustic Noise Cancellation (NANC) approach based automatic speech recognition method focused on the properties of non-linear sound noise cancellation. There are several distinct small segments in this approach, such as speech signal sounds, syllables, and so on. As an acrylic symbol associated with organs, these units analyze syllables to find acoustic properties of speech signals. This experimental study has adopted Convolutional Neural Network (CNN) based noise reduction in the speech recognition system with an accuracy of 98.5%. Finally, a speech signal has been identified through the ASR's vocabulary, which has been obtained with correct words after all phonetic signs are present.
... ARM-1 implements the HMM-GMM (Gaussian Mixture Model) speech recognition methodology (El-emary et al. 2011). The speech recognition process is a sequence of the following three phases: digital signal processing (DSP), decoding and rescoring, as shown in Figure 1. ...
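The decoding phase of a GMM-HMM recognizer is commonly implemented with the Viterbi algorithm, which finds the most likely hidden-state path given per-frame emission scores. A minimal textbook-style sketch (not the ARM-1 implementation; the emission log-likelihoods would in practice come from each state's GMM):

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most likely HMM state path given per-frame emission log-likelihoods.

    log_A:  (S, S) log transition matrix
    log_pi: (S,)   log initial state probabilities
    log_B:  (T, S) log-likelihood of each frame under each state's model
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A              # scores[from, to]
        back[t] = np.argmax(scores, axis=0)          # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_B[t]
    # Trace the best path backwards from the best final state
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Working in log-space keeps the products of many small probabilities from underflowing, which matters once utterances run to hundreds of frames.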
In this paper I present a summary of my results from the competition that took place this year and was organized by PolEval. One of the tasks of this competition was the detection of offensive comments in social media. By joining this competition, I set myself the goal of comparing some of the popular text classification models used on Kaggle or recommended by Google. That is why, during the competition, I went through models such as: n-grams and MLP, word embeddings and sepCNN, Flair from Zalando with different embeddings, and a combination of LSTM and GRU with word embeddings trained from scratch.
... El-Emary et al. designed a spotted-word recognition system to control a didactic manipulator arm TR45. In this work, a robust Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM) are used [8]. ...
... This method can be applied in several fields, such as controlling smart homes [6] [5], mobile robots [7], wheelchairs [8], robots [9], [10], biometrics [11], speaker identification [2], and arm robots [12] [13]. To learn and classify speech, methods such as Artificial Neural Networks (ANN) [2] [3] [14] [15], Neuro-Fuzzy systems [6], and other soft computing techniques [16] [17] have been investigated. ...
This research shows the implementation of speech recognition to control an arm robot. The method identifies speech using Linear Predictive Coding (LPC) and an Adaptive Neuro-Fuzzy Inference System (ANFIS). The LPC method is used for feature extraction from the speech signal, and the ANFIS method is used to learn the speech patterns. The learning data used in the ANFIS processing consist of 6 features. The speech identification system was examined using trained and untrained data. The results show a success rate of 88.75% for trained speech data and 78.78% for untrained data. The speech recognition system was applied to control an arm robot based on an Arduino microcontroller.
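LPC features of the kind used above are classically computed from a frame's autocorrelation via the Levinson-Durbin recursion. A minimal generic sketch (not the cited implementation; the predictor order would normally be 8-14 for speech):

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients of a speech frame via autocorrelation + Levinson-Durbin.

    Returns `order` predictor coefficients a[1..order] such that
    x[n] is approximated by sum_k a[k] * x[n-k].
    """
    # Autocorrelation lags r[0..order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]                                   # prediction error energy
    for i in range(order):
        # Reflection coefficient for order i+1
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]      # update lower-order coefficients
        a = a_new
        err *= (1.0 - k * k)
    return a
```

For a decaying exponential x[n] = 0.5^n, a first-order predictor recovers the generating coefficient 0.5, which is the sanity check used below.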
... El-emary et al. (2010) built a voice command system based on Hidden Markov Models and Gaussian Mixture Models, using cepstral coefficients with energy and differentials as features. ...
Speaker verification is one of the biometric verification techniques used to verify the claimed identity of a speaker. It is mainly applied for security reasons and for managing user authentication. A voiceprint can be used as a unique password to prove the user's identity. In this paper, we propose a new Arabic text-dependent speaker verification system for mobile devices using artificial neural networks (ANN) to recognize an authorized user and unlock devices for him/her. We describe the system components and demonstrate how it works. We present the performance of our system and analyze its results.
This paper proposes a novel human-machine interface (HMI) and electronics system design to control a rehabilitation robotic exoskeleton glove. Such a system can be activated by the user's voice, take voice commands as input, recognize the command and perform biometric authentication in real time with limited computing power, and execute the command on the exoskeleton. The electronics design is a stand-alone, plug-and-play, modular design independent of the exoskeleton design. This personalized voice-activated grasping system achieves better wearability, lower latency, and improved security than any existing exoskeleton glove control system.
After an investigation of state-of-the-art research in Chap. 3, Chap. 4 proposed a new two-stage audiovisual speech enhancement system that makes use of both audio and visual information to filter speech. The results of comprehensive testing in Chap. 5 identified a number of key strengths and weaknesses. It was concluded that although good results were found, demonstrating the feasibility of speech enhancement using visual information, there were also limitations. Chapter 5 concluded by identifying some potential refinements to the system, one of which was to extend the initial system with the use of fuzzy logic to make more cognitively inspired use of audio and visual speech information. This chapter presents a multimodal fuzzy logic based speech enhancement framework. Firstly, some limitations of the initial system are discussed. The decision to make use of a fuzzy logic based system is then justified. The chapter then presents a novel, multimodal, fuzzy logic based speech filtering framework. The utilisation of the audio and visual input data, the inputs to the fuzzy inference system, and the resulting fuzzy sets are described. The rules for the fuzzy logic based system, based on these fuzzy sets, are then discussed. Finally, the challenges of thoroughly evaluating this initial system are briefly discussed. While the work presented in this chapter does not represent a final and completed system, it is intended to demonstrate the feasibility of such an approach as an extension of the initial system presented previously in this work, showing that making more intelligent use of multimodal information is viable.
Conference Paper
With the widespread use of wireless sensor networks (WSN), more types of applications are emerging. However, controlling a wireless sensor network after deployment still requires expertise in embedded system programming and administration. In this paper, we study how to exploit the widely available smartphones to reduce these demands on users' expertise. In particular, we present a system called PhoneCon, which stands for voice-driven smartphone-controllable wireless sensor networks. Through PhoneCon, users may simply speak to their phones to operate a deployed wireless sensor network, such as getting the status of deployed motes, without being exposed to any system- or programming-level details. We have implemented PhoneCon on Android phones for a testbed of MicaZ sensor nodes. Our evaluation results show that PhoneCon can interpret users' intentions well enough through the microphone and perform corresponding actions with acceptable delays.
Details of designing and developing a voice guiding system for a robot arm are presented. A feature combination technique is investigated, and a hybrid classification method is applied. Research and experimental results show that more features increase the recognition rate in automatic speech recognition. Thus, combining classical components used in ASR systems, such as zero-crossing rate, energy, and Mel-frequency cepstral coefficients, with the wavelet transform (to extract meaningful formant parameters), followed by pipelined GMM and HMM classifiers, contributed to reducing the error rate considerably. To implement the approach in a real-time application, a PC interface was designed to control the movements of a four-degree-of-freedom robot arm by transmitting the orders via RF circuits. The voice command system for the robot was designed, and tests showed an improvement from combining the techniques.
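Two of the classical components mentioned here, zero-crossing rate and short-time energy, are straightforward to compute per frame. A minimal sketch (the frame length and hop below are illustrative values, not parameters taken from the cited work):

```python
import numpy as np

def frame_features(signal, frame_len=256, hop=128):
    """Per-frame zero-crossing rate and log-energy for a 1-D signal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Zero-crossing rate: fraction of adjacent samples whose sign differs
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # Log-energy, floored to avoid log(0) on silent frames
        energy = np.log(np.sum(frame ** 2) + 1e-10)
        feats.append((zcr, energy))
    return np.array(feats)
```

Together these two scalars already separate voiced speech (high energy, low ZCR) from unvoiced fricatives (lower energy, high ZCR) and silence (low energy), which is why they survive as features alongside the cepstral coefficients.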
Discontinuous transmission based on speech/pause detection represents a valid solution for improving the spectral efficiency of new-generation wireless communication systems. In this context, robust voice activity detection (VAD) algorithms are required, as traditional solutions present a high misclassification rate in the presence of the background noise typical of mobile environments. This paper presents a voice detection algorithm that is robust to noisy environments, thanks to a new methodology adopted for the matching process. More specifically, the proposed VAD is based on a pattern recognition approach in which the matching phase is performed by a set of six fuzzy rules, trained by means of a new hybrid learning tool. A series of objective tests performed on a large speech database, varying the signal-to-noise ratio (SNR), the type of background noise, and the input signal level, showed that, compared with the VAD standardized by ITU-T in Recommendation G.729 Annex B, the fuzzy VAD on average achieves a reduction of about 25% in the activity factor and of about 43% in the clipping introduced. Informal listening tests also confirm an improvement in the perceived speech quality.
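The fuzzy-rule matching of the cited VAD is beyond a short sketch, but the overall framing it improves upon (frame-level speech/pause decisions, with a hangover to limit end-of-word clipping) can be illustrated with a simple energy-threshold stand-in. The threshold and hangover values below are illustrative only:

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-30.0, hangover=3):
    """Toy energy-threshold VAD: 1 = speech frame, 0 = pause frame.

    A hangover counter keeps the decision active for a few frames after the
    energy drops, mimicking how standard VADs avoid clipping word endings.
    """
    peak = np.max(np.abs(signal)) + 1e-12
    decisions, hang = [], 0
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Frame level in dB relative to the signal peak
        level_db = 10 * np.log10(np.mean((frame / peak) ** 2) + 1e-12)
        if level_db > threshold_db:
            hang = hangover
            decisions.append(1)
        elif hang > 0:
            hang -= 1
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions
```

The misclassification the paper targets shows up exactly here: in noise, a fixed energy threshold either clips speech or inflates the activity factor, which is what the six fuzzy rules are trained to trade off better.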
This paper describes the development of an autonomous obstacle-avoidance method that operates in conjunction with groping locomotion on the humanoid robot Bonten-Maru II. Present studies on groping locomotion consist of basic research in which a humanoid robot recognizes its surroundings by touching and groping with its arm along the flat surface of a wall. The robot responds to the surroundings by performing corrections to its orientation and locomotion direction. During groping locomotion, however, the existence of obstacles within the correction area creates the possibility of collisions. The objective of this paper is to develop an autonomous method to avoid obstacles in the correction area by applying suitable algorithms to the humanoid robot's control system. In order to recognize its surroundings, six-axis force sensors were attached to both robotic arms as end effectors for force control. The proposed algorithm refers to the rotation angle of the humanoid robot's leg joints due to trajectory generation. The algorithm relates to the groping locomotion via the measured groping angle and the motions of the arms. Using Bonten-Maru II, groping experiments were conducted on a wall's surface to obtain wall orientation data. By employing these data, the humanoid robot performed the proposed method autonomously to avoid an obstacle present in the correction area. Results indicate that the humanoid robot can recognize the existence of an obstacle and avoid it by generating suitable trajectories with its legs.
Perceptual harmonic cepstral coefficients (PHCC) are proposed as features to extract from speech for recognition in noisy environments. A weighting function, which depends on the prominence of the harmonic structure, is applied to the power spectrum to ensure accurate representation of the voiced speech spectral envelope. The harmonics' weighted power spectrum undergoes mel-scaled band-pass filtering, and the log-energy of the filters' output is discrete cosine transformed to produce cepstral coefficients. Lower spectral clipping is applied to the power spectrum, followed by within-filter root-power amplitude compression to reduce amplitude variation without compromise of the gain invariance properties. Experiments show significant recognition gains of PHCC over MFCC, with 23% and 36% error rate reduction for the Mandarin digit database in white and babble noise environments.
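The final steps of the PHCC pipeline described above (band-pass filter energies, log compression, then a discrete cosine transform) are the same filterbank-to-cepstra path used by MFCC. A compact sketch of that shared step only; the harmonic weighting, spectral clipping, and root-power compression specific to PHCC are omitted:

```python
import numpy as np

def cepstra_from_filterbank(power_spectrum, filterbank, n_ceps=13):
    """Filterbank -> log-energy -> DCT-II cepstra (the step MFCC and PHCC share).

    power_spectrum: (n_bins,) one frame's power spectrum
    filterbank:     (n_filters, n_bins) triangular (e.g. mel-scaled) filters
    """
    energies = filterbank @ power_spectrum       # band-pass filter outputs
    log_e = np.log(energies + 1e-10)             # compress dynamic range
    n = len(log_e)
    # DCT-II basis: decorrelates the log filter energies into cepstra
    k = np.arange(n)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n)
    return dct @ log_e
```

A spectrally flat frame excites all filters equally, so only the zeroth cepstral coefficient (overall log-energy) is nonzero; that property is what the test below checks.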
Conference Paper
In this paper, a voice-based speaker recognition system is presented. We have implemented it on a Sun platform. We train (and test) the system using a database recorded in several sessions, in order to counter the large effect that the variability of speech over time has on the system's recognition rate. Several experiments have been made in order to find the best configuration of the system set-up. This is an important point to take into account in a real-world system in which users train the system once and the models generated in the training process are not updated for strategic reasons. The recognition rate obtained for the proposed system is around 93% when the speech comes from a microphone and around 90% when the speech comes from a phone line.
In this study, we propose an algorithm for Arabic isolated digit recognition. The algorithm is based on extracting acoustic features from the speech signal and using them as input to multi-layer perceptron neural networks. Each word in the vocabulary of digits (0 to 9) is associated with a network. The networks are implemented as predictors of the speech samples over a certain duration of time. The back-propagation algorithm is used to train the networks. A hidden Markov model (HMM) is implemented to extract temporal features (states) of the speech signal. The input vector to the networks consists of twelve Mel-frequency cepstral coefficients, the log of the energy, and five elements representing the state. Our results show that we are able to reduce the word error rate compared with an HMM word recognition system.
Conference Paper
We attack the general problem of HMM-based speech recognizer design, and in particular, the problem of isolated letter recognition in the presence of background noise. The standard design method based on maximum likelihood (ML) is known to perform poorly when applied to isolated letter recognition. The minimum classification error (MCE) approach directly targets the ultimate design criterion and offers substantial improvements over the ML method. However, the standard MCE method relies on gradient descent optimization which is susceptible to shallow local minima traps. We propose to overcome this difficulty with a powerful optimization method based on deterministic annealing (DA). The DA method minimizes a randomized MCE cost subject to a constraint on the level of entropy which is gradually relaxed. It may be derived based on information-theoretic or statistical physics principles. DA has a low implementation complexity and outperforms both standard ML and the gradient descent based MCE algorithm by a factor of 1.5 to 2.0 on the benchmark CSLU spoken letter database. Further, the gains are maintained under a variety of background noise conditions
A study is presented into the importance of two commonly overlooked factors influencing generalisation ability in the field of hidden Markov model (HMM) based recogniser training algorithms by means of a comparative study of four initialisation methods and three stop criteria in different applications. The results show that better results have been found with the equal-occupancy initialisation method and the fixed-threshold stop criterion
Yussof JM (2005). A Machine Vision System Controlling a Lynxarm Robot along a Path. University of Cape Town, South Africa, October 28.
Rabiner LR (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In: Readings in Speech Recognition, chapter A, pp. 267-295.
Hongyu LY, Zhao Y, Dai, Wang Z (2004). A Secure Voice Communication System Based on DSP. IEEE 8th International Conference on Control, Automation, Robotics and Vision, Kunming, China, pp. 132-137.
Radiometrix Components (2010). TXm-433 and SILRX-433 Manual. HF Electronics Company.