Haodong Chen (Corresponding Author)
Department of Mechanical and Aerospace Engineering,
Missouri University of Science and Technology,
Rolla, MO 65409
e-mail: h.chen@mst.edu
Ming C. Leu
Department of Mechanical and Aerospace
Engineering,
Missouri University of Science and Technology,
Rolla, MO 65409
e-mail: mleu@mst.edu
Zhaozheng Yin
Department of Biomedical Informatics &
Department of Computer Science,
Stony Brook University,
Stony Brook, NY 11794
e-mail: zyin@cs.stonybrook.edu
Real-Time Multi-Modal Human–Robot Collaboration Using Gestures and Speech
As artificial intelligence and industrial automation are developing, human–robot collaboration (HRC) with advanced interaction capabilities has become an increasingly significant area of research. In this paper, we design and develop a real-time, multi-modal HRC system using speech and gestures. A set of 16 dynamic gestures is designed for communication from a human to an industrial robot. A data set of dynamic gestures is designed and constructed, and it will be shared with the community. A convolutional neural network is developed to recognize the dynamic gestures in real time using the motion history image and deep learning methods. An improved open-source speech recognizer is used for real-time speech recognition of the human worker. An integration strategy is proposed to integrate the gesture and speech recognition results, and a software interface is designed for system visualization. A multi-threading architecture is constructed for simultaneously operating multiple tasks, including gesture and speech data collection and recognition, data integration, robot control, and software interface operation. The various methods and algorithms are integrated to develop the HRC system, with a platform constructed to demonstrate the system performance. The experimental results validate the feasibility and effectiveness of the proposed algorithms and the HRC system. [DOI: 10.1115/1.4054297]
Keywords: human–robot collaboration, gesture recognition, speech recognition, real-time, multi-modal, multiple threads
1 Introduction
During the era of modern manufacturing, human–robot collaboration (HRC) has emerged as one of the next-generation technologies in automated systems because of its ability to integrate the flexibility of humans and the productivity of robots. An HRC system should ideally function like a human–human collaboration using multiple communication channels, and the system can monitor, interact with, control, and respond to the physical environment in real time. Despite the existence of various communication methods and object recognition algorithms, it remains challenging for a robot to achieve natural, precise, and real-time identification of and response to human actions in a manufacturing environment, so as to maintain the HRC's efficient performance. This is because factory-based human–robot communication methods and real-time multi-modal collaboration systems are still lacking. Without the ability to operate in real time, it is difficult for an HRC system to conduct realistic applications and perform tasks with concurrent recognition of human actions and decision-making of robot responses [1–6]. In order to mitigate the communication issues on the factory floor, we have designed a real-time, multi-modal HRC system, which includes a series of HRC communication protocols that combine the designed arm gestures and natural speech for humans to communicate with robots in real time on the factory floor.
1.1 Related Work
1.1.1 Human–Robot Communication. Researchers have been using a wide variety of methods, both verbal and non-verbal, to facilitate communication between humans and robots. Through verbal communication technology, speech signals can be understood and expressed fairly clearly. Zinchenko et al. [7] designed a speech recognition interface for surgical robot control. Bingol and Aydogmus [8] proposed an interactive speech control with an industrial robot using speech recognition software in the Turkish language. Two studies were conducted by Kuhn et al. [9] for the HRC to explore the intentional influence of speech metaphors on human workers. Unlike verbal communication, non-verbal communication usually uses body language and biological signals to convey information, such as gestures, electroencephalogram (EEG), and electromyography (EMG). Among these communication methods, gestures are widely used for human–robot communication. Coupeté et al. [10] designed a user-adaptive method which increased the accuracy rate of a dynamic gesture recognition algorithm by around 3.5% by incorporating less than 1% of the previous database. Unhelkar et al. [11] designed the prediction of motion with a look-ahead time of up to 16 s. Pinto et al. [12] compared different networks in static hand gesture recognition. Li et al. [13] constructed an attention-based model for human action recognition, which could effectively capture the long-range and long-distance dependencies in actions. Apart from gestures, wearable sensors are used for specific communication systems. Tao et al. [14] proposed a method for worker activity recognition on a factory floor using inertial measurement unit and EMG signals. Treussart et al. [15] designed an upper-limb exoskeleton robot using a wireless EMG signal to help carry unknown loads without the use of force sensors. Although wearable sensors are widely adopted and proved to be useful in HRC systems, it needs to be pointed out that they often limit the movements of human workers due to their attachments. Regarding gesture and speech communication, the lack of an associated gesture vocabulary for intuitive HRC, the heavy reliance on features of the interaction objects' environments (such as the skin color model), the effects of lighting, the high computing power requirement for speech processing with hosted services, and the fluctuating manufacturing background noise limit the flexibility and robustness of their applications. This paper focuses on gesture and speech
command only as they are the two most common modalities.
Although they are rather mature technologies, there is still impor-
tant research to be done to improve either of the two modalities
as well as their integration.
1.1.2 Multi-Modal Sensors. Recent applications of hybrid
control systems combine different sensors into a multi-modal com-
munication, which provides higher accuracy and robustness than
the use of a single type of sensor. Ajoudani et al. [16] combined EMG and EEG sensors to extract motion features of human workers in a collaborative setup. A human–robot interaction system was developed by Yongda et al. [17], which combined finger movements and speech for training robots to move along a pre-designed path. A multi-modal-based network architecture to improve the fused decision-making of human–robot interaction
was devised by Lin et al. [18]. It used various sensors, including
EEG, blood pressure, body temperature, and other sensors. These
sensor fusions used in HRC show the robustness of multi-modal
communication. However, when humans and robots interact in a
shared working environment, the lack of intuitive and natural multi-
modal communication between humans and robots limits the trans-
ferability of an HRC system between different tasks, which also
undermines the symbiotic relationship between humans and
robots to facilitate environmental and task changes during collabo-
ration [19].
1.1.3 Real-Time Recognition in Human–Robot Collaboration.
To collaborate with humans on a factory floor in a seamless
manner, robots in HRC systems should recognize human activities
in a real-time modality. Yu et al. [20] proposed a real-time recogni-
tion of human object interaction using depth sensors, and it
achieved an accuracy of 73.80%. Shinde et al. [21] proposed a
you only look once-based human action recognition and localiza-
tion model with an accuracy of 88.37%, which used real-time
video streams as inputs. Sun et al. [22] designed a locally aggre-
gated kinematic-guided skeleton-based method to recognize
human gestures in real time, and it achieved an accuracy of
around 95%. Yu et al. [23] proposed a discriminative deep learning
model to recognize human actions based on real-time feature fusion
and temporal attention, and the model achieved 96.40% accuracy on
a real data set. Although these studies show that HRC systems with
real-time recognition modality can powerfully perform many
dynamic tasks, the single frame or channel-based recognition with
limited features affects recognition performance of human
actions, and the real-time recognition with high accuracy of multi-
ple communication modalities in an HRC system remains a chal-
lenging task. This is because the computational load of deep networks for recognizing multiple human behaviors, as well as the requirement for integrating and coordinating multiple concurrent processes, may impact the system's efficiency.
1.2 Contributions of This Article. This paper presents the
development of a real-time, multi-modal HRC system that can
detect and recognize a human worker's dynamic gesture and natural speech commands, thereby allowing the worker to collaborate with a designated robot based on the gesture and speech inputs.
An overview of the proposed HRC system is depicted in Fig. 1. This
system operates several parallel threads simultaneously in real time,
including input data capturing, gesture/speech recognition, multi-
modal integration, robot control, and interface operation.
The main contributions of this paper are as follows:
- Overall, we propose a real-time, multi-modal HRC system which combines our designed dynamic gestures and natural speech commands. This system can recognize gesture and speech signals in real time and conduct collaboration between a human and a robot in real time.
- Regarding the communication of the HRC system, a data set of dynamic gestures is designed and constructed, which will be shared with the community via our GitHub website.
- A gesture recognizer is built using a multi-view data set and a convolutional neural network (CNN). Robust speech recognition is achieved through improvements to an online open-source speech recognizer.
- Regarding the operation of the HRC system, a real-time motion history image (MHI) generation method is designed to extract dynamic features of gestures to represent motion information in real time. A multi-threading architecture is built to conduct the seven threads shown in Fig. 1 simultaneously in real time.
1.3 Organization of This Article. The remainder of this paper is organized as follows. Section 2 presents the designed dynamic gestures, real-time MHI for feature extraction, and a deep learning model for gesture recognition. Section 3 describes our method of speech recognition. Section 4 describes the integration of the gesture and speech recognition results, and the construction of a multi-threading model for real-time system operation. Section 5 presents evaluation and demonstration with the HRC system and details the system performance through experiments between a human worker and a robotic arm. Section 6 presents the conclusion.
2 Design and Recognition of Dynamic Gestures
This section describes the design of dynamic gestures, the con-
struction of a gesture data set, and the design of a real-time
gesture recognition model. A set of dynamic gestures is designed
and an associated data set is constructed in Sec. 2.1. The MHI
method is used to extract dynamic gesture features. The video-based
MHI is applied in Sec. 2.2 to extract action features from gesture
videos, allowing for the construction of a training data set for the
recognition model. Then, a real-time MHI method is proposed in
Sec. 2.3 to realize real-time gesture recognition.
2.1 Gesture Design and Data Collection
2.1.1 Design of Dynamic Gestures. Dynamic gestures are generated by moving the two upper limbs with both the upper and lower arms but without movement of fingers. Different from static gestures, which mainly rely on the shapes and flexure angles of limbs, dynamic gestures rely on limb trajectories, orientations, and motion speeds saved in the temporal sequence, as well as shapes and limb angles. These features allow dynamic gestures to have a higher number of configuration options and contain more information than static gestures [24].
The gestures used in HRC systems should be simple to sign,
socially acceptable, and minimize the cognitive workload.
McNeill [25] proposed a classification scheme of gestures with four categories: iconic (gestures present images of concrete entities and/or actions), metaphoric (gestures are not limited to depictions of concrete events), deictic (the prototypical deictic gesture is an extended index finger, but almost any extensible body part or
held object can be used), and beats (using hands to generate time beats).

Fig. 1 System overview

The metaphoric gestures put abstract ideas into a more literal
and concrete form, which is not straight-forward. The beats gestures
are only used to keep the rhythm of speech, and they usually do not
convey any semantic content [26]. Therefore, we create 16 dynamic
gestures in iconic and deictic categories for the HRC system, as
shown in Fig. 2.
The gestures in Fig. 2 are mainly designed for robot calibration and operation in an HRC system. Calibration gestures allow the robot to perform routine procedures, including gestures 1–4 (Start, Stop, Go Home, and Disengage). The Start gesture commands a robot to calibrate the kinematic structure, which is the initialization process of identifying the initial values of parameters in the robot's kinematic structure, and move the robot into a working position for upcoming tasks. The Stop gesture commands a robot to stop its movement and keep it in its current pose. The Go Home gesture commands the robot to go to its home position, such as keeping the robotic arm straight up. When the robot gets stuck because it is commanded to move outside a joint limit, the Disengage gesture command can disengage the robot, move its joints away from the end of stroke, and return it to the standard workspace. In addition to calibration gestures, an operation gesture (gestures 5–16) instructs the robot to move its end-effector in a different direction, change its speed of motion, open or close the gripper, conduct clockwise (CW) or counterclockwise (CCW) rotation of the end-effector, etc. The instructions for performing the proposed gestures are given in Appendix A.
2.1.2 Multi-view Data Collection. As shown in Fig. 3, two red–green–blue (RGB) cameras in Fig. 3(a) are used to collect gesture features from multiple views. Each human subject stays in the 3D cubic space of both camera views while turning around arbitrarily and changing locations during the gesture data collection (Fig. 3(b)). The gesture data set includes the 16 dynamic gestures of eight human subjects. In particular, the data sets of seven unilateral gestures (Stop, Up, Down, Inward, Outward, Clockwise (CW), Counterclockwise (CCW)) include gestures performed by the left or right upper limb. Image augmentation is applied through brightness variation, horizontal/vertical shifts, zooms, and perspective transformations.
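For illustration, one such augmentation pass might be sketched as follows; the parameter ranges below are assumptions, since the paper does not specify them:

    import cv2
    import numpy as np

    rng = np.random.default_rng()

    def augment(img):
        """Randomly perturb one gesture image; the parameter ranges are illustrative only."""
        h, w = img.shape[:2]
        # brightness variation
        out = cv2.convertScaleAbs(img, alpha=1.0, beta=float(rng.uniform(-30, 30)))
        # zoom plus horizontal/vertical shift via a single affine transform
        scale = rng.uniform(0.9, 1.1)
        tx, ty = rng.uniform(-0.1, 0.1) * w, rng.uniform(-0.1, 0.1) * h
        affine = np.float32([[scale, 0, tx], [0, scale, ty]])
        out = cv2.warpAffine(out, affine, (w, h))
        # mild perspective transformation of the four image corners
        src = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
        jitter = rng.uniform(-0.05, 0.05, size=(4, 2)) * np.array([w, h])
        dst = (src + jitter).astype(np.float32)
        out = cv2.warpPerspective(out, cv2.getPerspectiveTransform(src, dst), (w, h))
        return out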
2.2 Feature Extraction and Data Set Construction Using
Video-Based MHI. In order to extract the features of the designed
dynamic gestures into static images, the motion history image
(MHI) approach is used because it can show the cumulative
object motion with a gradient trail. This approach is a view-based
template approach that records the temporal history of a movement
and converts it into a static scalar-valued image, i.e., the MHI,
where more recently moving pixels are brighter and vice versa
[27]. As shown in Fig. 4, a video-based MHI H_τ(x, y, t) is generated from binary images of the input video's sequential frames using the following formula:

H_τ(x, y, t) = τ if Ψ(x, y, t) = 1, and H_τ(x, y, t) = max(0, H_τ(x, y, t − 1) − δ) otherwise   (1)

where x and y are pixel coordinates of the image and t is the time. Ψ(x, y, t) denotes the binary image which saves the movement of an object in the current video frame, and which is recomputed for each new video frame analyzed in the sequence, where white pixels have a value of 255 and black pixels have a value of 0. The duration τ denotes the temporal extent of a movement (e.g., in terms of frames), which is set to the same value as the number of frames in the video clips. This is because a τ smaller than the number of frames decays earlier motion information in an MHI, whereas an excessively large τ obscures brightness changes (pixel value changes) in the MHI. The decay parameter δ denotes the decrease in a previous pixel value when a new frame is given. If no new motion covers some specific pixels that were covered by an earlier motion while loading the frames, the values of these pixels are reduced by δ [28]. The decay parameter δ is usually set to 1 [29].
In Fig. 4, frame subtraction is conducted to obtain the binary image. If the difference D(x, y, t) between two adjacent frames is greater than the threshold ξ, a binary image Ψ(x, y, t) is generated as follows:

Ψ(x, y, t) = 1 if D(x, y, t) ≥ ξ, and Ψ(x, y, t) = 0 otherwise   (2)

where Ψ(x, y, t) denotes the binary image at the tth frame, and ξ is a threshold removing the background noise from MHIs. The value of ξ is set to 10 based on the validation experiments described later in Sec. 5.1.1 [30]. The frame difference D(x, y, t) is defined as

D(x, y, t) = |I(x, y, t) − I(x, y, t − Δ)|   (3)

where I(x, y, t) represents the intensity value of the pixel at coordinates (x, y) in the tth frame of the image sequence, and the intensity value range is [0, 255]. Δ denotes the temporal difference between two pixels in the same location, which is set to 1 to take all frames into account [27].

Fig. 2 Illustration of the designed 16 dynamic gestures (CW, clockwise; CCW, counterclockwise): (a) gesture 1, (b) gesture 2, (c) gesture 3, (d) gesture 4, (e) gesture 5, (f) gesture 6, (g) gesture 7, (h) gesture 8, (i) gesture 9, (j) gesture 10, (k) gesture 11, (l) gesture 12, (m) gesture 13, (n) gesture 14, (o) gesture 15, and (p) gesture 16

Fig. 3 Illustration of the multi-view dynamic gesture collection: (a) two camera views and (b) multi-view gestures

Fig. 4 Generation of a video-based MHI
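As a minimal NumPy sketch of Eqs. (1)–(3) applied to one gesture clip (with Δ = 1 and δ = 1 as stated above; the final scaling to [0, 255] for display is our addition):

    import numpy as np

    def video_mhi(frames, tau, delta=1.0, xi=10):
        """Video-based MHI of a gesture clip; frames is a list of grayscale images (uint8)."""
        mhi = np.zeros(frames[0].shape, dtype=np.float32)
        for t in range(1, len(frames)):
            # Eq. (3): frame difference with temporal step Delta = 1
            diff = np.abs(frames[t].astype(np.float32) - frames[t - 1].astype(np.float32))
            # Eq. (2): binary motion mask, thresholded at xi to suppress background noise
            psi = diff >= xi
            # Eq. (1): stamp moving pixels with tau, decay the rest by delta (not below 0)
            mhi = np.where(psi, float(tau), np.maximum(0.0, mhi - delta))
        return (mhi * 255.0 / tau).astype(np.uint8)   # scale to [0, 255] for display

    # e.g., mhi = video_mhi(clip_frames, tau=75) for a 75-frame gesture clip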
Figure 5 shows consistent features in MHIs of the gesture Left in
different views, i.e., front (F) and top-front (TF) views, which
shows the feasibility of multi-view data collection for dynamic ges-
tures in Sec. 2.1. After setting up all parameters for MHI generation,
the MHIs for the 16 gestures are obtained, as illustrated in Fig. 6,
where these MHIs successfully exhibit appearance differences for
different dynamic gestures.
2.3 Real-Time Gesture Recognition. To enable real-time
system operations in the feature extraction and recognition of
dynamic gestures, MHIs should be generated in real time from
input camera frames. Therefore, a real-time MHI method is pro-
posed in this section.
2.3.1 Generation of Real-Time Motion History Image. According to Eqs. (1)–(3), there are three key parameters in the generation of a video-based MHI, i.e., the duration τ, the decay parameter δ, and the threshold ξ. The primary goal of this section is to generate the MHI in real time using continuous input frames by expressing the decay parameter δ in terms of the duration τ during the MHI generation. Since the real-time MHI method takes each new input frame into account, the difference D(x, y, t) between adjacent frames and the threshold ξ are removed from the real-time MHI calculation, which means that each new input frame triggers a new binary image Ψ(x, y, t). The process of the real-time MHI generation is illustrated in Fig. 7, in which the camera continuously captures intensity value images I(x, y, t) of input frames, and each new recorded input frame triggers a new binary image Ψ(x, y, t) by subtracting the intensity values of each pair of adjacent frame images. Next, the program continuously combines the new binary image Ψ(x, y, t) into the constantly updating real-time MHI H_τ(x, y, t). In the meantime, the value of every previous pixel in the updating MHI is reduced by δ until it equals zero. The value of a pixel in an eight-bit grayscale image ranges from 0 (black) to 255 (white) [28].

Since each new input frame triggers a new binary image, the number of frames, i.e., the duration τ, needs to be determined. The value of the duration τ is determined by the time it takes to perform a gesture. The length of all the gesture videos in the data set is 75 frames, where the frame rate in this system is 30 frames per second (fps). Therefore, the duration τ is set to 75 frames to record all gestures completely.

The initial value of white dynamic pixels in binary images is 255, and these white pixels turn black after τ = 75 decay steps to maximize the brightness difference between different binary images in an MHI, i.e., the motion history features. Since each new input frame updates the real-time MHI by adding a new binary image Ψ(x, y, t) and reducing the values of previous pixels by δ, the decay parameter δ can be expressed in terms of the duration τ as δ = 255/τ ≈ 3.40.
The real-time MHI generator enables the real-time and continu-
ous MHI generation of current frames. Furthermore, the fact that
a new input frame triggers a new real-time MHI means that every input frame is recorded in the real-time MHI, i.e., every movement (spanning more than one frame) of the human worker can update more than one similar MHI, which is critical in providing high error tolerance and recognition accuracy for later real-time gesture recognition. The pseudo-code for the real-time MHI generation is given in Algorithm 1.

Fig. 5 Illustration of the MHIs of the gesture Left in two different views (F, front camera view; TF, top-front camera view)

Fig. 6 Illustration of the 16 dynamic gesture MHIs

Fig. 7 Generation of real-time MHIs
Algorithm 1 Real-time MHI generation

Input: I_p /* previous frame image */, I_c /* current frame image */
Output: MHI_f /* frame-based real-time MHI */
1: function MHI_f(I_p, I_c)
2:   MHI_f = O;  /* initialization of MHI_f (all-zero image) */
3:   for each I_c
4:     I_d = |I_c − I_p|;
5:     set the pixels of I_d that differ between I_c and I_p to 255;
6:     I_b = I_d;  /* I_b: binary image */
7:     MHI_f = max(0, MHI_f − 3.40);  /* δ = 3.40 */
8:     MHI_f = MHI_f + I_b;
9:     I_p = I_c;
10:   end for
11: end function
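For illustration, a runnable OpenCV version of Algorithm 1 might look as follows (a minimal sketch; the camera index and the display loop are assumptions, and the accumulated MHI is explicitly clipped at 255, which the pseudo-code leaves implicit):

    import cv2
    import numpy as np

    TAU = 75
    DELTA = 255.0 / TAU          # decay per frame, approximately 3.40

    cap = cv2.VideoCapture(0)    # assumed camera index
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    mhi = np.zeros(prev.shape, dtype=np.float32)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        binary = np.where(cv2.absdiff(gray, prev) > 0, 255.0, 0.0)   # I_b: changed pixels -> 255
        mhi = np.maximum(0.0, mhi - DELTA)                           # decay previous motion
        mhi = np.minimum(255.0, mhi + binary)                        # add the new binary image
        prev = gray
        cv2.imshow("real-time MHI", mhi.astype(np.uint8))            # fed to the CNN in practice
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()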
To demonstrate the robustness of the MHI generation, we col-
lected gesture data under five different environments (Fig. 8).
According to the MHI results in Fig. 8, only dynamic gesture
pixels are extracted, and the stationary environments have no
effect on MHI generation. In future work, we will combine the skel-
eton model and the MHI method to process gestures in dynamically
changing environments.
2.3.2 Gesture Recognition. A deep learning model is built using convolutional neural networks (CNNs) to recognize the designed dynamic gestures. Figure 9 depicts the overall architecture of the CNN model. The input images are MHIs that have been resized to 32 × 32 (width × height). The model consists of two convolution layers, and each layer is followed by a max-pooling layer. The sizes of the convolution kernels, the feature maps at each layer, and the pooling operators are shown in Fig. 9. Dropout is applied after the second pooling layer, which randomly drops units from the neural network during training and can effectively avoid over-fitting issues [31]. Following the second pooling layer, a 5 × 5 × 40 feature map is obtained. It is flattened into a 400-dimensional feature vector, and then followed by a 100-neuron fully connected layer.
The output of this network is a 16-dimensional score vector transformed by the softmax function. The softmax function carries out the normalization process by limiting the function output to the interval [0, 1] and making the components add up to 1. A very confident event is denoted by 1, while an impossible event is denoted by 0 [32]. This process allows for the computation of class-membership probabilities for the 16 gestures, and larger output components correspond to larger confidence [33]. The softmax function is calculated as follows:

P(x_i) = e^{x_i} / Σ_{k=1}^{16} e^{x_k}, for i = 1, …, 16   (4)

where P(x_i) denotes the confidence value of the sample x belonging to class i, x_k denotes the weighted input of the softmax layer, and 16 is the number of gestures.
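For reference, the described architecture can be sketched in Keras as follows; the filter counts and kernel sizes below are assumptions chosen only to reproduce the reported 5 × 5 × 40 feature map (the exact values are given in Fig. 9):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(32, 32, 1)),              # resized grayscale MHI
        layers.Conv2D(20, (5, 5), activation="relu"), # first convolution layer (assumed 20 filters)
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(40, (5, 5), activation="relu"), # second convolution layer (assumed 40 filters)
        layers.MaxPooling2D((2, 2)),                  # yields a 5 x 5 x 40 feature map
        layers.Dropout(0.5),                          # dropout after the second pooling layer
        layers.Flatten(),
        layers.Dense(100, activation="relu"),         # 100-neuron fully connected layer
        layers.Dense(16, activation="softmax"),       # 16-way class probabilities, Eq. (4)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])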
The distribution with a single spike yields a more confident result, and the spike label is therefore the final recognized gesture label. When the result is a uniform distribution without spikes, it indicates that the input gesture does not belong to any of the 16 gestures, such as walking in front of the camera. In our case, if the confidence value is greater than 90%, we assume that the associated gesture has been successfully recognized. Due to the fact that the real-time MHI generator can create a new MHI for each input frame, a series of continuous MHIs for a single gesture can be obtained. While the real-time recognition discerns each of these MHIs individually, the final result is the most frequently occurring label of the high-confidence recognitions.

Fig. 8 MHI generation in five different environments

Fig. 9 Overview of the CNN architecture for the recognition of the designed dynamic gestures (the terms "Conv." and "Pool." refer to the convolution and pooling operations, respectively)
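For illustration, this per-MHI thresholding and majority-vote step can be sketched as follows (the 75-frame window length and the helper names are our assumptions, not from the paper):

    from collections import Counter, deque

    import numpy as np

    CONF_THRESHOLD = 0.90
    window = deque(maxlen=75)          # roughly one gesture (75 frames) of per-MHI predictions

    def update_decision(probs):
        """probs: length-16 softmax output for the current MHI; returns the running gesture label."""
        label = int(np.argmax(probs))
        if probs[label] >= CONF_THRESHOLD:          # keep only high-confidence recognitions
            window.append(label)
        if not window:
            return None                             # no valid gesture yet
        most_common_label, _ = Counter(window).most_common(1)[0]
        return most_common_label                    # most frequently occurring high-confidence label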
3 Recognition of Speech Commands
To improve the robustness and flexibility of the proposed HRC system, speech data from human workers can be combined with the aforementioned real-time gestures to generate a more reliable recognition result. The Google open-source speech recognizer (https://pypi.org/project/SpeechRecognition/) is used to recognize speech data because it is capable of converting more than 120 different languages from speech to text. The log-Mel filterbank is used in the Google open-source speech recognizer to reduce the dimension of all speech log-spectral magnitude vectors, resulting in improved speech recognition performance for a deep neural network with complex parameters [34,35]. To evaluate the effectiveness of this speech recognizer in a factory floor environment, experiments are conducted in which speech is recognized under ten different background noises. The waveforms of the 10 background noises used in our speech experiments are depicted in Fig. 10. These audio files are extracted from video and audio files on the internet and recorded at a level of 60–70 decibels (dB). Volume levels greater than 70 dB are not included in the tests because normal human speaking and communication are around 60 dB, and background noise levels greater than 70 dB can submerge the normal speech tone [36].
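For reference, a minimal sketch of driving this recognizer through the SpeechRecognition package is shown below; the wrapper code is ours (not the paper's) and assumes a PyAudio-backed microphone:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)   # first 0.5 s treated as noise
        audio = recognizer.listen(source, phrase_time_limit=3)      # roughly one spoken command

    try:
        result = recognizer.recognize_google(audio, show_all=True)  # all transcripts + confidence
    except sr.UnknownValueError:
        result = None                                               # nothing intelligible was heard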
Figure 11 displays waveforms of our speech commands in a quiet
environment. Figure 12 compares the waveforms of the Start
command without background noise, a drilling noise without speech
commands, and the Start command in the drilling noise environment.
To reduce the influence of background noise on speech recognition, we use the spectral subtraction method, which estimates the speech signal spectrum and the noise spectrum, and subtracts the noise spectrum from the speech signal, resulting in an improved speech signal [37,38]. The first 0.5 s of our speech data set contains only the background noise. Therefore, we use the first 0.5 s of an audio file to represent the background noise spectrum, and the rest of the audio file as the speech signal spectrum. The spectral subtraction technique then deducts the background noise spectrum from the speech signal spectrum. To achieve real-time denoising, the microphone is activated 0.5 s before the camera is turned on, i.e., prior to the worker giving gestures and speech, and the additional 0.5 s of audio is recorded as the background noise.
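A minimal magnitude-domain sketch of this spectral subtraction step is shown below (frame size, overlap, and windowing are assumptions; the paper does not give implementation details):

    import numpy as np

    def spectral_subtraction(signal, sample_rate, noise_seconds=0.5, n_fft=512):
        """Subtract the noise magnitude spectrum, estimated from the first 0.5 s, from the speech."""
        n_noise = int(noise_seconds * sample_rate)
        noise, speech = signal[:n_noise], signal[n_noise:]
        hop, window = n_fft // 2, np.hanning(n_fft)
        # average magnitude spectrum of the leading background-noise segment
        noise_mag = np.mean([np.abs(np.fft.rfft(noise[i:i + n_fft] * window))
                             for i in range(0, len(noise) - n_fft, hop)], axis=0)
        cleaned = np.zeros(len(speech))
        for i in range(0, len(speech) - n_fft, hop):
            spectrum = np.fft.rfft(speech[i:i + n_fft] * window)
            magnitude = np.maximum(np.abs(spectrum) - noise_mag, 0.0)   # subtract, floor at zero
            frame = np.fft.irfft(magnitude * np.exp(1j * np.angle(spectrum)), n_fft)
            cleaned[i:i + n_fft] += frame                               # overlap-add reconstruction
        return cleaned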
The audio data set contains 16 different speech commands of two native English-speaking subjects (one male and one female) under ten different background noises. The data set contains about 1600 audio files. To evaluate the performance of the recognition model, we define the speech recognition accuracy of a particular class C as the ratio of the number of correct recognitions to the total number of samples within C. The recognition results are shown in Fig. 13, in which most speech commands have >90% recognition accuracy. However, the command speed up has a recognition accuracy of around 75%, while the command inward has a recognition accuracy of less than 15%.
Table 1 displays low-accuracy transcriptions of the above two commands. For each sample, the Google open-source speech recognizer gives several possible transcripts during recognition of an input command and calculates the confidence of all possible transcripts. Confidence is a value between 0% and 100% used to evaluate the reliability of recognition results. The transcription with the highest level of confidence is output as the recognized result. In the first sample of the command inward in Table 1, it is recognized as several possible transcripts, including "uworld", "in word", "inward", "keyword", and "in world". The "uworld" is the output with the highest confidence (69.70%). The command speed up in its first sample is recognized possibly as "beat up", "speed up", "beat-up", "bead up", and "the beat up". The output "beat up" is the one with the highest confidence (94.07%). In the second sample of these two commands, even though they are correctly recognized, their confidence levels (61.21% for inward and 86.89% for speed up) are lower than those of the other commands as shown in Fig. 13 (>90%).
According to the literature on speaking habits, the reason for the low accuracy of the command inward is that the "in" is stressed, and there is a short pause between "in" and "ward". For "speed up", the "s" is generally pronounced at a low volume [39]. These speaking habits reduce the recognition accuracy of these two speech commands. To address this issue, we use the fuzzy matching strategy [40,41] for the commands "inward" and "speed up", in which all possible transcripts in Table 1 are set as correct recognition results in the speech database. This strategy is feasible because those possible transcripts are very distinct and generally cannot be confused with other commands in HRC scenarios on the factory floor.

Fig. 10 Waveform samples of the ten different background noises

Fig. 11 Waveform samples of the 16 commands

Fig. 12 Waveform samples of the Start command in the drilling noise environment: (a) Start command, (b) drilling noise, and (c) Start command in the drilling noise environment

Fig. 13 Performance (%) of the speech recognizer on our speech data set
We further design a real-time keyword extraction technique for the 16 speech commands, which allows the speech recognizer to extract keywords from a short sentence, such as "speed up" from the sentence "speed up the process" and "inward" from the short sentence "go inward". The experimental results in Sec. 5 show that the recognition accuracy of the two commands inward and speed up is increased to >90% using our designed technique.
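A simple sketch of the combined fuzzy matching and keyword extraction is shown below; the spoken command strings and the accepted variants are assumptions drawn from Table 1:

    # Accepted mis-transcriptions for the two problematic commands (from Table 1).
    FUZZY_VARIANTS = {
        "inward": {"inward", "in word", "in-word", "uworld", "in world", "keyword", "any word"},
        "speed up": {"speed up", "beat up", "beat-up", "bead up", "the beat up",
                     "subitup", "to beat up", "who beat up"},
    }

    # Keyword list for the 16 commands (spoken forms assumed here);
    # longer phrases are listed first so substrings do not shadow them.
    COMMANDS = ["go home", "disengage", "counterclockwise", "clockwise", "outward", "inward",
                "speed up", "slow down", "start", "stop", "open", "close",
                "up", "down", "left", "right"]

    def match_command(transcript):
        """Map a recognized sentence, e.g. 'speed up the process', to one of the 16 command labels."""
        text = transcript.lower()
        for command, variants in FUZZY_VARIANTS.items():
            if any(variant in text for variant in variants):   # fuzzy matching
                return command
        for command in COMMANDS:
            if command in text:                                # keyword extraction
                return command
        return None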
4 System Integration
This section describes the integration strategy of gesture and
speech results under several different cases. Following that, the con-
struction of a multi-threading architecture is demonstrated for
system operation on several parallel tasks.
4.1 Integration of Gesture and Speech Recognition Results. In the recognition of human commands, our system evaluates both the recognition result of gesture and/or speech commands and the corresponding recognition confidence. The recognition confidence is a value between 0% and 100% used to evaluate the reliability of recognition results, which represents the probability that the output is correct. Our system assumes that a gesture and/or speech command is successfully recognized only if the confidence level is ≥90%. If the confidence is <90%, the system shows on the interface that the result is invalid, gives a beep sound to remind the worker that a new gesture/speech needs to be given, and no robot movement is performed. Using the recognition confidence enables the system to prevent most false-positive and false-negative results from occurring and maintain the system robustness.
The integration strategy of gesture and speech recognition results is shown in Algorithm 2. The human worker can use gesture and speech in the collaboration with the robot. First, the intensity of environmental noise is assessed, and only if it is less than 70 dB is the speech result considered in the integration. Following that, the results of gesture and speech recognition, as well as the corresponding probability/confidence, are input, and five cases are evaluated:

Case 1: If the gesture and speech results are identical and valid (i.e., the recognition result is within the designed 16 labels), the corresponding result label is given.
Case 2: If the gesture and speech results are different but both valid, the integration operator compares the confidence of the speech result and the probability of the gesture result. The integration result is the one with the larger value.
Case 3: If the gesture result is valid but the speech result is not, the integration yields the gesture result.
Case 4: If the speech result is valid but the gesture result is not, the speech result is selected.
Case 5: If both the gesture and speech results are invalid, the system displays "invalid", indicating that it is waiting for the next input, and the system continuously collects new gesture and speech data.

Only the gesture is considered if the environmental noise is greater than 70 dB, and cases 6 and 7 are evaluated:

Case 6: If the gesture result is valid, it is output as the result.
Case 7: If the gesture result is invalid, no result is obtained and transmitted to the robot, and the system stands by to continuously collect new gesture and speech data.
Algorithm 2 Integration of gesture and speech recognition results

Input: G_r /* gesture result */ and associated confidence, S_r /* speech result */ and associated confidence, C /* 16 command labels */
Output: I_r /* integrated result */
1: function Integration(G_r, S_r, C)
2:   if Sound Intensity ≤ 70 dB; then  /* high speech confidence */
3:     if G_r == S_r and G_r and S_r both ∈ C; then  /* case 1 */
4:       I_r = G_r (or S_r);
5:     else if G_r != S_r, G_r and S_r both ∈ C; then  /* case 2 */
6:       I_r = the one with the larger probability/confidence value;
7:     else if G_r ∈ C, and S_r ∉ C; then  /* case 3 */
8:       I_r = G_r;
9:     else if S_r ∈ C, and G_r ∉ C; then  /* case 4 */
10:      I_r = S_r;
11:    else  /* no valid result, i.e., G_r and S_r are not ∈ C */  /* case 5 */
12:      I_r = [ ];  /* no output */
13:    end if
14:  else  /* low speech confidence */
15:    if G_r ∈ C; then  /* case 6 */
16:      I_r = G_r;
17:    else  /* case 7 */
18:      I_r = [ ];  /* no output and wait for the next input */
19:    end if
20:  end if
21: end function
4.2 Multi-Threading Architecture of the HRC System. To
achieve high-speed operation and response, a multi-threading
model is designed to manage multiple tasks simultaneously. The
structure and procedure of the multi-threading model are shown
in Fig. 14. Seven threads are created in this model to handle
seven concurrent tasks, including interface operation, gesture and
speech capturing, real-time gesture and speech recognition, result
integration, and robot control. To meet the requirements of real-time system operation, all threads must operate continuously and simultaneously,
Table 1 Performance (%) of the speech recognizer in recognizing the "inward" and "speed up" commands

Command    Sample   Possible transcripts                                          Output       Confidence of the output (0–100)
inward     1        "uworld", "in word", "inward", "keyword", "in world"          "uworld"     69.70
           2        "inward", "in-word", "in word", "uworld", "any word"          "inward"     61.21
speed up   1        "beat up", "speed up", "beat-up", "bead up", "the beat up"    "beat up"    94.07
           2        "speed up", "subitup", "to beat up", "who beat up",           "speed up"   86.89
                    "the beat up"

Note: The recognizer outputs the transcription with the highest confidence, and lowercase/capitalization does not affect the recognition result.
and the functions and algorithms used in real-time operation must be isolated from other functions and algorithms to minimize interference. As a result, all concurrent threads are independent. The queues shown in Fig. 14 are linear data structures that store items in a first-in, first-out manner [42].

We use the PyCharm integrated development environment and the Python language in our system. In prediction, our hardware (24-core CPU, two Nvidia GeForce GTX 1080Ti GPUs, and 64 GB memory) can conduct the recognition of one gesture MHI in 0.026 s (using camera input at 30 fps, or 0.033 s per frame), which means the video processing is real time. One speech command is about 3 s in our system, and the Google speech recognition API can recognize it in 0.218 s (i.e., a 3-s audio signal can be processed in 0.218 s, faster than real time). Therefore, the CNN and the Google speech recognition API do not affect the real-time capability of our framework.
Thread 1 is constantly active when the system is running to ensure that the interface functions properly. Threads 2 and 3 read data from the RGB camera and microphone, respectively. The camera and microphone turn off after each execution time and automatically turn back on when threads 6 and 7 finish. Threads 2 and 3 collect gesture and speech data within an execution time, which is a set variable that depends on the time required for the worker to perform a command (gesture and/or speech). The real-time MHI is displayed on the interface window and saved in a gesture data queue. The speech data in an execution time are also stored in a speech data queue. Given that our data set shows that all gestures and speech take less than 2.5 s to complete, the execution time for collecting gesture and speech data is set to 2.5 s to collect complete gesture and speech information. To record background noise for speech denoising, the microphone is turned on 0.5 s before the camera is activated. Meanwhile, threads 4 and 5 begin recognizing the MHIs and speech data popped from the data queues. Along with labeling the recognition results, threads 4 and 5 display them on the interface and store them in label queues. After obtaining the gesture and/or speech results from the label queues, thread 6 integrates them and saves the final result in a result queue. Thread 7 continuously pops the results from the result queue and sends them to the robotic arm to perform the associated tasks.
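The thread-and-queue layout can be illustrated with a toy producer/consumer sketch; in the real system the thread bodies run the capture, recognition, integration, and robot-control code described above:

    import queue
    import threading
    import time

    gesture_label_q = queue.Queue()    # recognition thread -> integration thread (first in, first out)
    result_q = queue.Queue()           # integration thread -> robot-control thread

    def recognize_gestures():          # stands in for threads 2 and 4 (capture + CNN recognition)
        for label in ["Start", "Left", "Stop"]:
            time.sleep(0.1)            # placeholder for capture and inference time
            gesture_label_q.put(label)

    def integrate():                   # stands in for thread 6 (Algorithm 2)
        for _ in range(3):
            result_q.put(gesture_label_q.get())   # get() blocks until a label is available

    def control_robot():               # stands in for thread 7
        for _ in range(3):
            print("robot command:", result_q.get())

    workers = [threading.Thread(target=f) for f in (recognize_gestures, integrate, control_robot)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()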
5 System Evaluation and Demonstration
The experimental platform is Ubuntu 16.04 on a computer
equipped with 64343M memory and two NVIDIA GeForce GTX
1080 Ti graphics cards. Gestures and speech are captured using a
Logitech HD Webcam C920, which consists of an RGB camera
and a microphone. The following experiments are carried out: (i)
evaluation of the proposed gesture recognition model, including
the hold-out and leave-one-out experiments, (ii) evaluation of the
improved techniques proposed for speech recognition, and (iii)
demonstration of the HRC system.
5.1 Evaluation of the Proposed Gesture Recognition Model
5.1.1 Validation Experiments of the Proposed Gesture
Recognition Model. Evaluation Results of Hold-Out Experiment.
There are approximately 57,000 MHI samples after data augmenta-
tion, with approximately 3500 samples for each gesture class. The
hold-out experiment is carried out to evaluate the performance of
the deep learning model constructed in Sec. 2, in which the data
set is randomly divided into a training data set (80% of the total)
and a testing data set (20% of the total). The threshold ξ in Eq. (2) of Sec. 2 is determined using an eight-fold validation experiment. The training data set is divided into eight folds. Each fold is treated as a pseudo-test set in turn, and the other seven folds are pseudo-train sets. We calculate the average recognition accuracy of the 16 gestures for each threshold ξ, which shows the ability to classify a sample gesture x from a certain class C correctly. The results are summarized in Table 2, which indicates that the threshold ξ should be set to 10 for maximum accuracy.

Fig. 14 Multi-threading model of the proposed real-time HRC system with an example
Several widely used metrics are used to assess classification performance:

Accuracy = (TP + TN) / (TP + FN + FP + TN)   (5)

Precision = TP / (TP + FP)   (6)

Recall = TP / (TP + FN)   (7)

F-score = (2 · Precision · Recall) / (Precision + Recall)   (8)

where true positive (TP) refers to a sample x belonging to a particular class C that is correctly classified as C. True negative (TN) indicates that a sample x from a class other than C is correctly classified as a member of the "not C" class. A false positive (FP) is defined as a sample x from a class other than C that is incorrectly classified as class C.
Table 4 Performance (%) of the gesture recognition model in the leave-one-out experiment

Left-out subject   Accuracy   Precision   Recall   F-score
1 99.36 90.89 99.80 95.14
2 99.57 95.15 98.10 96.60
3 99.06 91.26 94.00 92.61
4 98.82 90.11 91.13 90.62
5 99.63 97.96 96.01 96.97
6 99.00 94.68 89.00 91.75
7 99.82 97.74 99.40 98.56
8 98.94 91.09 91.98 91.53
Table 3 Performance (%) of the gesture recognition model in
the hold-out experiment
Class Accuracy Precision Recall F-score
Start 99.50 97.00 94.98 95.98
Stop 99.41 95.21 95.28 95.24
Go Home 99.54 95.46 97.23 96.34
Disengage 99.64 96.33 98.01 97.16
Up 99.45 95.53 95.64 95.59
Down 98.71 88.65 90.93 89.78
Left 99.85 98.36 99.27 98.81
Right 99.65 96.37 98.14 97.25
Inward 98.82 91.00 90.03 90.51
Outward 98.83 92.53 88.80 90.63
Open 99.59 97.59 95.76 96.66
Close 99.34 94.37 95.12 94.75
CW 98.75 89.29 91.13 90.20
CCW 99.11 93.76 91.79 92.77
Speed Up 99.43 96.17 94.68 95.42
Slow Down 99.66 96.87 97.66 97.26
Table 2 Performance (%) of different ξ in gesture recognition

Threshold ξ   5       10      15      20      25      30      35      40
Accuracy      92.24   98.83   96.56   95.08   95.05   93.44   91.09   91.04
Fig. 15 Confusion matrix and the most confusing pairs of the hold-out experiment
A false negative (FN) describes a situation in which a sample x from class C is misclassified as belonging to a class other than C. The F-score is the harmonic mean of the precision and recall, which ranges in the interval [0, 1] [43,44]. The values of these metrics for the classification results on the testing data set are shown in Table 3. It can be observed that the recognition accuracy is >98% for all gestures, and the other metrics also yield good results (all >88%). The results demonstrate how well the trained model recognized the various gestures.
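For reference, Eqs. (5)–(8) can be computed per class directly from the confusion matrix, e.g.:

    import numpy as np

    def per_class_metrics(cm):
        """Accuracy, precision, recall, and F-score per class (Eqs. (5)-(8)) from a confusion
        matrix whose rows are ground-truth classes and whose columns are predicted classes."""
        cm = np.asarray(cm, dtype=float)
        total = cm.sum()
        tp = np.diag(cm)
        fp = cm.sum(axis=0) - tp           # predicted as the class but actually another class
        fn = cm.sum(axis=1) - tp           # belonging to the class but predicted as another class
        tn = total - tp - fp - fn
        accuracy = (tp + tn) / total
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f_score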
Evaluation Results of Leave-One-Out Experiment. The leave-one-out experiment treats the data set of each individual human subject as a test set in turn, and the rest of them as training sets. The leave-one-out experiment results are summarized in Table 4. The metrics for the test results of each "left out" subject are the average values of the 16 class metrics of that subject. As illustrated in Table 4, the gestures of the eight subjects are correctly recognized with an accuracy >98%, and the other metrics for the subjects are ≥89%. The precision, recall, and F-scores for subjects 3, 4, 6, and 8 are below 95% due to their high subject-wise variability. The results show how well the trained model recognized the gestures of a new untrained subject.
5.1.2 Failure Analysis of Validation Experiments. This subsection discusses validation experiment failures. As illustrated in Fig. 15, we compute the confusion matrix of the hold-out experiment for our multi-view gesture classification model.

The confusion matrix is an error matrix that is used to visualize classification performance. Each column of the matrix denotes instances belonging to a predicted class, while each row denotes instances belonging to a ground-truth class. The most confusing pairs of gestures are marked by squares in Fig. 15, which include Down–Start, Down–Stop, Down–Inward, Down–Outward, Inward–Up, Inward–Outward, Open–Go Home, CW–Inward, CW–Outward, and CW–CCW.
By reviewing the failure cases, we found that the high degree of
similarity between the confusing pairs makes them challenging to
distinguish. For example, Fig. 16(a) illustrates three samples of the gesture Start that were incorrectly recognized as Down. A
likely reason is that they share a similar texture and shape in the
square marks shown in Fig. 16(a). Besides, the subject-wise differ-
ence for the same gesture makes it challenging to learn these unseen
feature variations in advance, and this subject-wise variation can be
seen in the samples of the gesture Up performed by different sub-
jects shown in Fig. 16(b). To address these failure cases and
further improve the performance of gesture recognition, the follow-
ing future work will be considered: (i) more subjects can be added
to include additional gesture features in model training and (ii)
depth sensors can be used in the MHI generation process to
obtain depth information of gestures.
5.2 Evaluation of Speech Recognition. To assess the proposed fuzzy matching strategy in Sec. 3, we constructed a data set of two non-native English-speaking subjects under the 10 noise backgrounds in Sec. 3 to show the robustness of our speech recognition on different English accents. The speech test data set contains 160 samples under every background noise for each command, including short sentences such as "move inward", "go left", etc. The performance of the speech recognition is shown in Table 5, where the recognition accuracy of "Inward" and "Speed Up" is increased greatly to >90%, and all the other commands have high accuracy of >95%.
Fig. 16 Analysis of failure cases in the hold-out experiment:
(a) some failure cases of the Start incorrectly recognized as
Down and (b) gesture Up performed by different subjects
Table 5 Performance (%) of the speech recognition on non-native English-speaking subjects (labels 1–10 are the 10 different background noises in Fig. 10 of Sec. 3)

Class        1       2       3       4       5       6       7       8       9       10      Average accuracy
Start 98.75 96.88 98.13 96.88 93.75 99.38 98.75 97.50 98.13 97.50 97.57
Stop 98.13 98.75 99.38 98.75 96.88 98.75 97.50 93.75 98.13 98.13 97.82
Go Home 99.38 98.75 100.00 95.00 93.75 95.00 93.75 98.75 99.38 94.38 96.81
Disengage 99.38 96.88 98.75 96.88 93.13 97.50 93.75 95.00 95.00 97.50 96.38
Up 99.38 98.75 98.75 93.75 93.75 95.00 99.38 93.75 98.75 99.38 97.06
Down 96.88 96.88 95.00 99.38 98.75 99.38 97.50 98.75 98.75 99.38 98.07
Left 99.38 98.75 100.00 96.88 94.38 98.75 98.75 98.75 98.75 98.75 98.31
Right 99.38 100.00 99.38 98.75 95.00 98.75 98.75 94.38 100.00 96.88 98.13
Inward 91.25 92.50 90.63 90.63 93.75 90.63 91.88 91.25 91.25 90.63 91.44
Outward 95.00 96.88 95.00 93.75 94.38 94.38 93.75 98.75 98.75 97.50 95.81
Open 97.50 100.00 94.38 94.38 99.38 98.75 95.00 97.50 94.38 98.75 97.00
Close 95.00 97.50 93.75 95.00 95.00 99.38 98.75 99.38 99.38 97.50 97.06
CW 98.75 96.88 93.75 99.38 95.00 98.75 97.50 95.00 97.50 100.00 97.25
CCW 95.00 99.38 94.38 93.75 95.00 99.38 99.38 93.75 95.00 100.00 96.50
Speed Up 97.50 97.50 98.75 96.88 98.75 94.38 99.38 95.00 95.00 97.50 97.06
Slow Down 93.75 93.75 94.38 99.38 95.00 95.00 99.38 97.50 99.38 97.50 96.50
Table 6 Performance of the proposed integration technique of gesture and speech results (sound intensity: 60–70 dB)

                           Number of experiments      Number of correct recognitions   Correct rate
Input commands             Subject 1    Subject 2     Subject 1    Subject 2           Subject 1    Subject 2
Gesture only               213          176           198          166                 92.96%       94.32%
Speech only                297          223           277          209                 93.27%       93.72%
Gesture and speech both    210          199           205          190                 97.62%       95.48%
The performance data in Table 5 show that the fuzzy matching strategy increases the performance of the speech recognition in our HRC system.
5.3 Evaluation of Gesture and Speech Integration. To
evaluate the proposed multi-modal integration strategy in Sec. 4,
comparison experiments were carried out to compare system perfor-
mance when both gesture and speech are included or only one of
them is used. The background sound intensity is 60–70 dB. The experimental results for two human subjects are shown in Table 6, and the correct rate of our proposed integration method is higher than when only gesture or speech is used.
5.4 Demonstration of the Human–Robot Collaboration System. In this section, we apply the proposed real-time, multi-
modal HRC system to perform a pick-and-place task using a
six-degree-of-freedom (6-DOF) e.DO robotic arm to demonstrate
the system performance. A view of this system is shown in
Fig. 17, where camera A and a microphone are equipped to
collect gestures and speech from the human worker, and camera
B is equipped to monitor the platform. The two-camera setting
for MHI data collection in Sec. 2.1 is to have a more comprehensive
data set on human gestures from different views in order to
develop a robust CNN to recognize gestures. The platform
includes a white block, an obstacle, and a target location. After
the system is started, it continuously runs in the background to
listen to and watch the human operator. The goal is to command
the robot using gestures and speech to move the block to the
target location along a collision-free path. As described in
Sec. 4.2, the execution time for collecting gesture and speech data
is set to 2.5 s. The microphone is activated 0.5 s prior to the
camera A being activated, and the additional 0.5 s of audio is
recorded as background noise.
Figures 18(a)–18(c) illustrate the experimental result of the Start command. To realize system visualization, a software interface is designed, as shown in Fig. 18(a), where two windows are used to update camera frames and real-time MHIs, respectively. The sound intensity level and the recognition probability distribution of the current gesture are displayed in real time. The final identification results, i.e., the gesture, speech, and integration results, are displayed at the bottom of the interface. In Fig. 18(a), the sound intensity is less than 70 dB. The speech of the human worker, i.e., Start, is selected as the final result since no gesture is given. In Fig. 18(b), the sound intensity is greater than 70 dB, and the gesture of the human worker is recognized as the final result of Start. Note that a gesture result is considered valid only when it is the most frequently occurring label of the high-confidence recognitions. After the integration result is transmitted to the robot, the end-effector performs the movements as shown in Fig. 18(c).
The experimental result of multiple commands for a pick-and-place task is demonstrated in Fig. 19, in which the robot delivers the white block to the target location along a collision-free path based on commands from a human worker. During the delivery, the robot keeps the end-effector straight down, facing the platform. The human worker can use gestures and/or speech commands to communicate with the robot in real time. If a command is not recognized by the system, the human worker can continue to give it during the next execution time when the camera and microphone are turned on. When a command (such as Left) is received, the robot keeps moving (to the left) until it reaches its joint limit or receives the Stop command. A video of the demonstration is available at https://youtu.be/GHGbFNu3lpQ.

Fig. 17 Overview of the system demonstration

Fig. 18 Demonstration of the command Start: (a) sound intensity ≤70 dB, (b) sound intensity >70 dB, and (c) robot response for the command Start
6 Conclusion
We describe in this paper the design and development of a real-time, multi-modal human–robot collaboration (HRC) system using the integration of speech and gestures. For the communication, a set of 16 dynamic gestures is designed. A real-time motion history image (MHI) method is developed to extract dynamic gesture features. A convolutional neural network (CNN)-based recognition model is built for dynamic gesture recognition, and an open-source speech recognizer is adopted and improved for natural speech recognition. An integration strategy for combining gesture recognition and speech recognition is proposed, for which a multi-threading architecture is designed to enable the system to perform parallel tasks. The validation experiments demonstrate that the proposed gesture recognition model has an accuracy of >98% for the 16 designed dynamic gestures, and the speech experiments show that the improved speech recognizer achieves >95% accuracy for speech commands corresponding to the 16 dynamic gestures. A system demonstration using a 6-DOF robotic arm illustrates the performance of the proposed gesture–speech integration strategy and the feasibility of the proposed real-time HRC system.
Our system enables the worker to use both gesture and voice commands to control a robot, and the integration of the two increases the recognition accuracy. When a human is performing a task alongside a robot, the human will not be able to sign a gesture during the human–robot collaboration but will still have the option of using voice to command the robot, which has >91% accuracy (>96% on average) in our system. This still provides a natural and effective way for HRC. To further improve the proposed approach, future studies will include exploring other modalities to enhance HRC, such as brain waves and eye gaze, experimenting with other gesture representation methods to fully exploit the discriminative gesture features in dynamic backgrounds with moving objects and/or co-workers, performing object segmentation to generate the MHI on the target human worker only, exploring speech recognition using a headset with a superb noise cancelation function in a noisy environment, developing system responses to low-confidence cases by saying something like "Could you do it again?" for gesture and/or speech commands, and exploring energy-efficiency issues by designing sleep and active modes for the system.
Acknowledgment
This research work was financially supported by the National Science Foundation grants CMMI-1646162 and CMMI-1954548 and also by the Intelligent Systems Center at Missouri University of Science and Technology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Conflict of Interest

There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article
are obtainable from the corresponding author upon reasonable
request.
Appendix A
The ways to perform the proposed gestures as illustrated in Fig. 2
of the main text are as follows:
Gesture 1 (Start): Clap in front of the chest.
Gesture 2 (Stop): Raise the left/right arm until the hand reaches the height of the shoulder, and extend the arm with the palm facing the front, like a "stop" gesture used in directing traffic.
Gesture 3 (Go Home): Straighten the two arms so that they are at an angle of 45 deg to the body, and then swing both arms inward and lock the fingers in front of the stomach.
Gesture 4 (Disengage): Rotate the two arms and move the two hands in front of the chest, and then extend the two hands out to the sides until they reach shoulder level.
Gesture 5 (Up): Extend the left/right arm straight up.
Gesture 6 (Down): Bend the left/right hand, and raise the wrist
to the height of the chest, and then extend the hand straight
down.
Gesture 7 (Left): Swing the left arm straight out and up to the
side until the arm reaches the height of the shoulder.
Gesture 8 (Right): Swing the right arm straight out and up to
the side until the arm reaches the height of the shoulder.
Gesture 9 (Inward): Rotate the left/right forearm up around the
elbow joint on the same side with the hand open, until the hand
reaches the chest level. The palm faces back.
Gesture 10 (Outward): Rotate the left/right forearm up around the elbow joint on the same side with the hand open until the hand reaches the height of the chest, and then rotate the arm around the elbow joint on the same side until the arm is straight, at about 30 deg from the body line. The palm faces back.
Gesture 11 (Open): Bend each of the two arms up around the elbow on the same side of the body until each hand touches the shoulder on that side.
Gesture 12 (Close): Bend the two arms and cross them in front of the chest, with each hand on the opposite shoulder. The palms face backwards and the fingers are open.
Gesture 13 (Clockwise (CW)): Rotate the forearm of the left/
right arm clockwise around the elbow joint of the same side
until the forearm reaches the shoulder level.
Gesture 14 (Counterclockwise (CCW)): Rotate the forearm of
the left/right arm counterclockwise around the elbow joint of
the same side until the forearm reaches the height of the
shoulder.
Gesture 15 (Speed Up): Swing the two arms straight up in front of the body.
Gesture 16 (Slow Down): Swing the two forearms up around the elbow joints until both hands reach the height of the shoulder, and then swing the forearms back to the start position in front of the body.