Gesture-based human-robot interaction for human
assistance in manufacturing
Pedro Neto, Miguel Simão, Nuno Mendes, Mohammad Safeea
Abstract—The paradigm for robot usage has changed in the
last few years, from a scenario in which robots work isolated to a
scenario where robots collaborate with human beings, exploiting
and combining the best abilities of robots and humans. The
development and acceptance of collaborative robots is highly
dependent on reliable and intuitive human-robot interaction
(HRI) on the factory floor. This paper proposes a gesture-based
HRI framework in which a robot assists a human co-worker
delivering tools and parts, and holding objects to/for an assembly
operation. Wearable sensors, inertial measurement units (IMUs),
are used to capture the human upper body gestures. Captured
data are segmented into static and dynamic blocks using an unsupervised sliding-window approach. Static and dynamic data
blocks feed an artificial neural network (ANN) for static, dynamic
and composed gesture classification. For the HRI interface, we propose a parameterization robotic task manager (PRTM), in which, according to the system's speech and visual feedback, the co-worker selects/validates robot options using gestures. Experiments in an assembly operation demonstrated the efficiency of
the proposed solution.
Index Terms—Human-Robot Interaction, Collaborative
Robotics, Gesture Recognition, Intuitive Interfaces
Collaborative robots are increasingly present in the manufacturing domain, sharing the same workspace and collaborating with human co-workers. This collaborative scenario makes it possible to exploit the best abilities of robots (accuracy, repetitive work, etc.) and humans (cognition, management,
etc.) [1][2]. The development and acceptance of collabora-
tive robots in industry is highly dependent on reliable and
intuitive human-robot interaction (HRI) interfaces [3], i.e.,
making robots accessible to human beings without major
skills in robotics. Collaborative robots and humans have
to understand each other and interact in an intuitive way,
creating a co-working partnership. This will allow a greater
presence of collaborative robots in industrial companies which
are struggling to achieve ever more flexible production due to
consumer demand for customized products [4]. For example, a human-robot collaborative platform was proposed for constructing panels from preimpregnated carbon fibre fabrics, in which the human and robot share the workspace, promoting situation awareness, danger perception and enrichment of communication [5].
Instructing and programming an industrial robot by the traditional teaching method (text- and teach-pendant-based) is a tedious and time-consuming task that requires
Pedro Neto is with the Department of Mechanical Engineering, University
of Coimbra, Coimbra, Portugal, e-mail:
Miguel Simão, Nuno Mendes and Mohammad Safeea are with the Depart-
ment of Mechanical Engineering, University of Coimbra, Coimbra, Portugal,
Fig. 1. Overview of the proposed gesture-based HRI framework.
technical expertise [6]. In addition, these modes of robot inter-
facing are hard to justify for flexible production where the need
for robot re-configuration is constant. Recently, human-robot interfaces based on robot hand-guiding (kinesthetic teaching) and haptic interfaces have proven intuitive to use by humans without deep skills in robotics [7]. However, advanced and natural HRI interfaces such as human gestures and speech still lack reliability in industrial/unstructured environments [8]. An interesting study reports the impact of human-robot interfaces used to intuitively teach a robot to recognize objects [9]. The study demonstrated that a smartphone interface allows non-expert users to intuitively interact with the robot, with good usability and user experience when compared to a gesture-based interface. The efficiency of a conventional keyboard and
a gesture-based interface in controlling the display/camera of a
robot is presented in [10]. The gesture-based interface allowed
smoother and more continuous control of the platform, while
the keyboard provided superior performance in terms of task
completion time, ease of use, and workload.
Making an analogy with the way humans interact with and teach each other allows us to understand the importance of gesture-based HRI. Static gestures are human postures in which the human is static (small motion like body shaking can occur), while dynamic gestures are represented by a dynamic behaviour of part of the human body (normally the arms). Gestures can be used as an interface to teleoperate a robot, allowing the user to set up robot configurations and to combine gestures with other interfaces such as kinesthetic teaching and speech. For instance, a human co-worker can point to indicate a grasping position to the robot, use a dynamic gesture to move the robot to a given position and use a static gesture to stop the robot [11], [12]. This scenario allows the human co-worker to focus on the process task and not on robot programming [13].
Fig. 1 illustrates the proposed framework. Static and dy-
namic gesture data are acquired from upper body IMUs,
segmented by motion, and different ANNs are employed
to classify static and dynamic gestures. Recognized gesture
patterns are used to teleoperate/instruct a collaborative robot
in a process conducted by a parameterization robotic task
manager (PRTM) algorithm. The system provides visual and
speech feedback to the human co-worker, indicating to the user
what gesture was recognized, or if no gesture was recognized.
Depending on the industrial domain and the company it-
self, the shop floor presents restrictions to the technologies
used in the manufacturing processes. The implementation of
human-robot collaborative manufacturing processes is today
a main challenge for industry. Beyond the related human
factors, the advanced human-robot interfaces (gestures, speech,
hybrid, etc.) are constrained by the shop floor conditions. In
noisy environments, human-human verbal communication is difficult or even impossible in some cases, especially when the workers are using earplugs. In this scenario, speech
interfaces are not efficient and gesture interfaces are a valid
alternative. On the other hand, confined spaces hamper the
use of arm gestures. In these conditions, the design of the
collaborative robotic system has to be adapted according to
the specific manufacturing conditions.
In this study we assume that the shop floor environment is
noisy and not confined in space, so that gestures are used
to interface with the robot. Our proposed approach brings
benefits and is practically relevant in the context of flexible
production in small lot sizes [8], [14], namely:
1) The human co-worker and robot work in parallel, while
the robot is ready to assist the human when required;
2) The use of the robot reduces the exposure of the human co-worker to poor ergonomic conditions and possible injuries (through hand-guiding, the robot can be adjusted online to the human body dimensions);
3) The use of the robot reduces errors in production since the work plan is strictly followed and managed by the system;
4) The robot assists the human in complex tasks that cannot
be fully automated, reducing the cycle time;
5) The introduction of the collaborative robot improves the quality of some tasks when compared with manual human work;
6) The collaborative robot drastically reduces the setup time for a new product or variant of a product.
This is critical in small lot production.
This work was developed according to the needs of the ColRobot project, which aims to develop a collaborative robot for assembly operations in the automotive and spacecraft industries. The robot should be able to assist workers, acting as a third hand, by delivering parts and tools for the assembly operations.
Section II presents the segmentation by motion process.
Section III details the proposed classifiers and the feature
dimensionality reduction and regularization. The robot task
manager is presented in section IV, while experiments and
results are shown in section V. Finally, the conclusion and
directions for future work are in section VI.
A. Challenges, Proposed Approach and Contributions
The problems and challenges to address in collaborative
HRI are multiple. Special attention has to be devoted to the reliability of the existing interfaces, the accuracy of continuous, real-time gesture classification, and the
interface with the robot. This is especially important in a
situation where a wrong classification of a gesture may lead
to accidents/collisions. The HRI interface has to be prepared
to manage this situation, having validation procedures and
hardware capable to ensure safety in all circumstances. In
presence of an unstructured/industrial environment, several
challenges can be pointed out:
1) Achieve high gesture recognition rates (close to 100%)
and assure the generalization capability in respect to
untrained samples and new users (user independent).
The appearance of false positives and false negatives
should be reduced to a minimum;
2) Combine and fuse sensor data to better describe the human behavior (hands, arms and body in general) with accuracy, without occlusions and independently of environmental conditions (light, magnetic fields, etc.), selecting proper gesture features according to each specific application;
3) Intuitive and modular interfacing with the robot, ensuring the management and coordination of human and robot actions. The human co-worker has to receive anticipatory feedback about future robot actions.
In this paper we propose a gesture-based HRI framework in
which a collaborative robot acts as a “third hand” by delivering
to the human shared workplace tools, parts, and holding work
pieces while the human co-worker performs assembly working
operations on it. This framework was tested in a standard manufacturing assembly operation.
The proposed gesture-based HRI, Fig. 1, relies on IMUs to capture human upper body gestures and an ultra-wideband (UWB) positioning system to provide an indication of the relative position between human and robot. Static and dynamic
segments are obtained automatically with a sliding-window
motion detection method. Static segments will input the clas-
sification of static gestures (SGs) and dynamic segments will
input the classification of dynamic gestures (DGs) after up- or
down-sampling of gesture frames using bicubic interpolation.
To avoid false positives/negatives, we implemented what we call composed gestures, which combine SGs and DGs in a given sequence. We demonstrated that ANNs reliably classify both SGs and DGs, and consequently the composed gestures. A
PRTM correlates the classified gestures with actual commands
to be sent to a robot and automatic speech and visual feedback
for the co-worker.
Inspired by the way humans interact with a phone auto
attendant (digital receptionist) in which computer speech
feedback indicates to the human the phone number to select
according to the desired service (navigate in the menus), our
proposed gesture-based HRI interface works in a similar way.
The PRTM uses computer speech and visual feedback to
indicate the options available to the human co-worker (for example bringing a tool or a part, or holding a part by setting up a kinesthetic teaching mode), and the human uses gestures to select and validate the existing options. This is a modular solution (other functionalities can be added), intuitive (the co-worker does not have to remember a large number of gestures), and
flexible (adapted to different scenarios, users and robots). The
PRTM can be customized to run with speech recognition commands or robot touch commands instead of gestures. Due to
the advances in speech recognition in the last two decades
it is expected that such a solution will work with a high
level of reliability in silent environments. Nevertheless, the use
of automatic speech feedback (using headphones) combined
with visual feedback (using a monitor installed in the robotic
cell) to the human proved effective. The feedback information is redundant so that, when the level of noise is too high, the human co-worker can follow the information on the monitor screen. Both audio and visual feedback provide information about the robot state, the next task of the sequence and whether the task has ended.
The human co-worker is free to move in the workspace,
which may lead to gesture false positives (human behaviors unexpectedly classified as gestures). To avoid this scenario, since the UWB provides
human positional data, gesture classification is only activated
when the human is in a specific place in front of the robot
(other places may be defined). In addition, the classifiers only
act when the PRTM is expecting a given gesture during a
parameterization phase. The human co-worker selects from the available library which gestures to associate with the robot actions managed by the PRTM, customizing the human-robot interaction.
The experiments performed in an assembly operation
demonstrated the following contributions:
1) The proposed unsupervised segmentation detects all static and dynamic motion blocks, i.e., when a given static or dynamic gesture starts and ends;
2) Gesture recognition accuracy is relatively high (90%-100%) for a library of 8 SGs and 4 DGs. These results were obtained in continuous, real-time operation with seven different subjects (user independent);
3) A good generalization can be achieved with respect to
untrained samples and new subjects using the system;
4) The PRTM demonstrated efficiency, reliability, and ease of use. Several users indicated in questionnaires that it is easy to understand the speech and visual instructions to select robot tasks and use the robot as a “tool”, without skills in robot programming.
B. Related Work
Collaborative robotics is an emerging and multidisciplinary
research field, in which gesture-based HRI is an active research
topic. Gestures are a meaningful part of human commu-
nication, sometimes providing information that is hard to
convey in speech [15]. Gestures can be categorized according
to their functionality [16]. Communicative gestures provide
information that is hard to convey in speech, for example
command gestures [17], pointing [18], gestures to represent
meaningful objects or actions, and mimicking gestures [19],
[17]. Gestures have been proven to be one of the most effective
and natural mechanisms for reliable HRI, promoting a natural
interaction process. In the context of HRI, they have been
used for robot teleoperation, and to coordinate the interaction
process and cooperation activities between human and robot.
As stated in [7], a gesture-based robotic task generally consists
of individual actions, operations, and gestures that are arranged
in a hierarchical order. Also, there is not necessarily a one-to-one relationship between gestures and actions; one gesture can encode several actions. Therefore, a hierarchical chain of
gestures is required to perform a certain task. For example,
the user can point to an object in order to select it, but the
action to be taken in respect to that object is unknown to the
system. The actions can be picking up the object, painting it,
welding it or inspecting it, among others.
Recognized human gestures and actions can be applied to
define robot motion directions [20] and to coordinate the inter-
action process and cooperation activities [21]. Some authors
discuss what gestures are the most effective in improving
human robot interaction processes [22], [23].
Some gestures, although not all, can be defined by their
spatial trajectory. This is particularly true for pantomimic
gestures [19], which are often used to demonstrate a certain
motion to be done, e.g., a circle. Burke and Lasenby successfully used Principal Component Analysis (PCA) and Bayesian filtering to classify these time series. In [24],
Shao and Li propose the use of an estimation of integral
invariants – line integrals of a class of kernel functions
along a motion trajectory – to measure the similarity between
trajectories. They also propose boosting the classification using
machine learning methods such as Hidden Markov Models
(HMMs) and Support Vector Machines (SVMs).
Gesture spotting, either static or dynamic, is an active area
of research with many possible applications. The problem be-
comes more challenging when gestures are recognized in real-
time [13]. The difficulty is that gestures typically appear within
a continuous stream of motion. Temporal gesture segmentation
is the problem of determining when a gesture starts and ends
in a continuous stream of data. Segmentation should also
decrease the number of classifications performed, reducing the
processing load and enhancing the real-time characteristic of a
system. When the segmentation is incorrect the recognition is
more likely to fail [25]. Solving spatial and temporal segmentation in continuous image streams is a challenge [26].
The input features for gesture recognition are normally the
hand/arm/body position, orientation and motion [27], often
captured from vision sensors. However, it is difficult to con-
struct reliable features from only vision sensing due to occlu-
sions, varying light conditions and free movement of the user
in the scene [28], [17]. With this in mind, several approaches
to gesture recognition rely on wearable sensors such as data
gloves, magnetic tracking sensors, inertial measurement units
(IMUs), electromyography (EMG), etc. In fact, these interac-
tion technologies have been proven to provide reliable features
in unstructured environments. Nevertheless, they also place an
added burden on the user since they are wearable. Data from
commercial off-the-shelf devices like a smartwatch can be
used to recognize gestures and for defining velocity commands
for a robot in an intuitive way [29].
Researchers have used various methods such as HMMs,
Artificial Neural Networks (ANNs), SVMs, Dynamic Time
Warping (DTW), deep learning, among other techniques, to
recognize gesture patterns. HMMs can be used to find time
dependencies in skeletal features extracted from image and
depth data (RGB-D) with a combination of Deep Belief
Networks (DBNs) and 3D Convolutional Neural Networks
(CNNs) [30]. Deep learning combined with recurrent networks
demonstrated state of the art performance in the classifica-
tion of human activities from wearable sensing [31]. ANNs
demonstrated superior performance in the classification of a high
number of gesture patterns, for example an accuracy of 99%
for a library of 10 dynamic gestures and 96% for 30 static
gestures [13]. Field et al. used a Gaussian Mixture Model
(GMM) to classify human body postures (gestures) with
previous unsupervised temporal clustering [32]. A Gaussian
temporal smoothing kernel is incorporated into a Hidden-State
Conditional Random Fields (HCRF) formulation to capture
long-range dependencies and make the system less sensitive
to input noise data [33].
Hand detection is critical for reliable gesture classification.
This problem has been approached using wearable and vision
sensing. Recent studies report interesting results in hand detec-
tion and gesture classification from RGB-D video using deep
learning [34]. Boosting methods, based on ensembles of weak
classifiers, allow multi-class hand detection [35]. A gesture-
based interface based on EMG and IMU sensing reports the
classification of 16 discrete hand gestures which are mapped
to robot commands [12]. This was materialized in point-to-
goal commands and a virtual joystick for robot teleoperation.
A challenging study deals with a situation in which users
need to manually control the robot but their hands are not
available (when users are holding tools or objects in their
hands) [23]. In this scenario, hand, body and elbow gestures
are recognized and used to control primitive robot motions.
Gestures can also be used to specify the relevant action
parameters (e.g. on which object to apply the action) [36]. The
study reports that, according to experiments with 24 people, the system makes robot programming intuitive, even for a robotics novice [36]. The required HRI reliability and efficiency can
be achieved through a multimodal interactive process, for
example combining gestures and speech [37]. Multimodal
interaction has been used to interact with multiple unmanned
aerial vehicles from sparse and incomplete instructions [38].
Gesture recognition associated with HRI is today an important research topic. However, it faces important challenges such as the large amount of training data required for gesture classification (especially for deep learning) and problems related to the appearance of false positives and false negatives in online classification. Moreover, many studies approach gesture-based HRI in an isolated fashion and not as an integrated
framework that includes segmentation, classification and the
interface with the robot.
The segmentation of continuous data streams in static and
dynamic blocks depends on several factors: (1) interaction
technologies, (2) classification method (supervised or unsu-
pervised), (3) if gestures are static, dynamic or both, (4) if
the inter-gesture transitions (IGT) were previously trained or
not, among other factors. Another problem relates to the difficulty of eliminating false positives and false negatives. In the context of gesture segmentation, it can be stated that false negatives are more costly than false positives since they divide the data representing a dynamic gesture into two sections, corrupting the meaning of that gesture. False positives are more easily accommodated by the classifier, which can report that the pattern is not a trained gesture.
Real-time segmentation relies on the comparison of the current state (frame) $f_i$ with the previous states, $\{f_{i-1},\dots,f_{i-\eta}\}$.
We propose a method to segment a continuous data stream
into dynamic and static segments in an unsupervised fashion,
i.e., without previous training or knowledge of the gestures; the input sequence is unsegmented and unbounded [25]. The method
detailed in [25] was partially implemented and customized to
the specific sensor data used in this study (input data, sliding
window size and thresholds). We propose establishing a fea-
sible (optimal or not) single threshold for each motion feature
using a genetic algorithm (GA) – because the performance
function is non linear and non smooth – fed by a set of
calibration data. The GA parameters were obtained by manual
search. Gesture patterns with sudden inversions of movement
direction are analyzed using the available velocities and
accelerations. The proposed method deals with upper body
gesture motion patterns varying in scale, rate of occurrence and
different kinematic constraints. A sliding window addresses
the problem of spatio-temporal variability.
We consider that there is motion if there are motion features above the defined thresholds. The threshold is a vector, $t_0$, with a length equal to the number of motion features chosen, $p$. The features obtained from a frame are represented by the vector $t$. The sliding window $T$ is composed of $w$ consecutive frames of $t$. At an instant $i$, the real-time sliding window $T(i)$ is

$$T(i) = \begin{bmatrix} t(i-w+1) & \cdots & t(i-1) & t(i) \end{bmatrix} \quad (1)$$
At each instant $i$, the $w$-sized window slides forward one frame and $T(i)$ is updated and evaluated. A static frame is only acknowledged as such if none of the motion features exceed the threshold within the sliding window. This way, we guarantee that a motion start is acknowledged with minimal delay (real-time). On the other hand, this also causes a fixed delay on the detection of a gesture end, equal to the size of the window $w$.
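This window logic can be sketched as follows. The sketch is a minimal illustration, not the authors' implementation: it takes per-frame binary motion flags (the output of the motion function described next) and emits static/dynamic segments, flagging a motion start immediately and a gesture end only after $w$ consecutive static frames.

```python
from collections import deque

def segment_stream(motion_flags, w):
    """Split a stream of per-frame binary motion flags into
    (start, end, kind) segments, kind in {"static", "dynamic"}.
    A dynamic segment starts as soon as one frame moves; it only
    ends after w consecutive static frames (fixed delay of w)."""
    window = deque(maxlen=w)
    segments, seg_start, in_motion = [], 0, False
    for i, moving in enumerate(motion_flags):
        window.append(moving)
        if not in_motion and moving:            # motion start: no delay
            if i > seg_start:
                segments.append((seg_start, i - 1, "static"))
            seg_start, in_motion = i, True
        elif in_motion and len(window) == w and not any(window):
            # the whole window is static -> the gesture ended w frames ago
            segments.append((seg_start, i - w, "dynamic"))
            seg_start, in_motion = i - w + 1, False
    segments.append((seg_start, len(motion_flags) - 1,
                     "dynamic" if in_motion else "static"))
    return segments
```

For a stream with motion in frames 2-4 and `w = 3`, this yields a static segment, a dynamic segment bounded by the true motion frames, and a trailing static segment.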
The proposed method to achieve the motion function $m(i)$ relies on the computation of the infinity norm of a vector $\varrho$ that contains feature-wise binary motion functions:

$$m(i) = \begin{cases} 1, & \text{if } \|\varrho\|_\infty \geq 1 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where the vector $\varrho$, for each instant of time $i$, is calculated by comparing the sliding window with the threshold vector:

$$\varrho_m = \left( \max_{g} |T_{mg}| \geq k_s \cdot t_{0m} \right), \quad m = 1,\dots,p, \quad g = 1,\dots,w \quad (3)$$
in which $k_s$ represents a user-defined threshold sensitivity factor and $t_{0m}$ the vector of thresholds of the motion features. $t_{0m}$ is determined by an initial calibration process in which
two sets of data with equal length/time are acquired: static
samples (recorded with the user performing a static pose)
and motion samples (recorded with the user performing slow
movements that activate the selected motion features). These
data are used to estimate the segmentation error caused by an
arbitrary threshold vector, which is then optimized by a GA
with variables bounded by their maximum and minimum value
in these data, a population size of 100 and mutation rate of
0.075. The sensitivity factor is then adjusted online for each
user when needed by trial and error according to the human
body shaking behaviour and the speed at which a dynamic gesture is
performed, especially for gestures with sudden inversions of
movement direction.
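Equations (2)-(3) can be sketched in a few lines. This is a minimal illustration with hypothetical feature values; the function name and the example numbers are not from the paper:

```python
import numpy as np

def motion_function(T, t0, ks=1.0):
    """Binary motion function m(i) of Eqs. (2)-(3).

    T  : (p, w) sliding window, one row per motion feature
         (angular-velocity norms and accelerations from the IMUs)
    t0 : (p,) calibrated threshold vector
    ks : user-defined threshold sensitivity factor
    """
    # Eq. (3): feature-wise binary flags over the whole window
    rho = np.max(np.abs(T), axis=1) >= ks * np.asarray(t0)
    # Eq. (2): motion iff the infinity norm of rho reaches 1
    return int(np.linalg.norm(rho.astype(float), np.inf) >= 1)

# Hypothetical window: feature 1 stays below threshold, feature 2 exceeds it
T = np.array([[0.01, 0.02, 0.01, 0.02],
              [0.05, 0.40, 0.30, 0.05]])
m = motion_function(T, t0=[0.1, 0.1])  # one feature fires, so m = 1
```

A single feature exceeding its (sensitivity-scaled) threshold anywhere in the window is enough to flag motion, which matches the minimal-delay property discussed above.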
In an ideal system, the absence of movement would be
defined by null differences of the system variables between
frames. Therefore, the simplest set of features that can be used for this method is the frame differences, $\Delta f$, that at an instant $i$ are given by:

$$\Delta f(i) = f(i) - f(i-1) \quad (4)$$
However, these features do not yield consistently reliable
results. For example, if we consider as input a position in
Cartesian coordinates, this approach performs poorly, since the
differences would be relative to the coordinated axis. A motion
pattern with a direction oblique to an axis would have lower
coordinate differences compared to a pattern parallel to an axis
with similar speed, thus producing different results. This issue
can be solved by replacing the three coordinate differences
with the respective Euclidean length, directly acquired from the IMUs' angular velocity $\omega(i)$:

$$\|\omega(i)\| = \sqrt{\omega_x(i)^2 + \omega_y(i)^2 + \omega_z(i)^2}, \quad i \in \mathbb{R}^+ \quad (5)$$
In the presence of gesture patterns with sudden inversions of
direction false negatives are very detrimental to the classifier
accuracy. The proposed solution is adding an extra motion
feature, the acceleration, a(i). The acceleration is at its highest
when an inversion of direction occurs, which solves the low
velocity problem. This feature does not cause false positives
in a static gesture and deals successfully with the inversions of
movement on dynamic gestures. The accelerations are directly
acquired from the IMUs.
In summary, the features for segmentation by motion are
the IMUs parameters representing motion, namely the accel-
erations and angular velocity. They are organized in a feature
vector $t$:

$$t(i) = \begin{bmatrix} \omega_1(i) & a_1(i) & \cdots & \omega_u(i) & a_u(i) \end{bmatrix}^T \quad (6)$$

where $\omega_u(i)$ is the angular velocity for IMU number $u$, and $a_u(i)$ is the acceleration for IMU number $u$.
Fig. 2. Architecture of a feed-forward MLNN with $n$ layers.
In case the segmentation process detects a static block (all
frames are identical with no variation), a single, randomly chosen frame from the static block (a single feature vector) serves as
input for Static Gestures (SGs) classification. If the segmenta-
tion detects a dynamic block, the frames composing the block
are the input for Dynamic Gestures (DGs) classification. DGs are defined by a large set of features and may also have a variable number of frames (the same DG can have different durations).
A. Multi-Layer Neural Networks
A two-hidden-layer Multi-Layer Neural Network (MLNN) is proposed, Fig. 2. The state $y^{(q+1)}$ of each layer $(q+1)$ is defined by the state $y^{(q)}$ of the previous layer:

$$y^{(q+1)} = f^{(q+1)}(y^{(q)}) = s^{(q+1)}\left( b^{(q+1)} + W^{(q+1)} y^{(q)} \right) \quad (7)$$
where $s$ is the transfer function, $b$ is the bias vector and $W$ is the weight matrix. The estimation of $b$ and $W$ is obtained by training the network with samples of which we know the classification result a priori (training samples). Given a set of training samples $X$ with known target classes $t_g$ (supervised learning), the objective is obtaining weights and biases that optimize a performance parameter $E$, e.g., the squared error $E = (t - y)^2$. The optimization is very often done with a
gradient descent method in conjunction with the backward
propagation of errors, method called Backpropagation (BP).
Specifically, we used the Scaled Conjugate Gradient (SCG)
BP method [39] which has the benefits of not requiring user-
dependent parameters and of being fast to converge.
The performance function used was cross-entropy, $E_{ce} = -t_g \cdot \log y$, which heavily penalizes very inaccurate outputs ($y \to 0$) and penalizes very little fairly accurate classifications ($y \to 1$). This is valid assuming a softmax transfer function was used on the last layer, so that for $y \in \mathbb{R}^k$, $\sigma(y) \in [0,1]$ and $\Sigma\sigma(y) = 1$. A log-sigmoid transfer function is also often used: $s_{\mathrm{logsig}}(x) = 1/(1+e^{-x})$, $s \in [0,1]$.
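As a sketch, the forward pass of Eq. (7) with log-sigmoid hidden layers, a softmax output layer and the cross-entropy cost might look like the following. The layer sizes and random weights are hypothetical; this is an illustration of the equations, not the authors' implementation:

```python
import numpy as np

def logsig(x):                        # s(x) = 1 / (1 + e^-x), in [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))         # shift for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Eq. (7): y(q+1) = s(q+1)(b(q+1) + W(q+1) y(q)).
    `layers` is a list of (W, b, transfer) tuples."""
    y = x
    for W, b, s in layers:
        y = s(b + W @ y)
    return y

def cross_entropy(y, tg):             # Ece = -sum(tg * log y)
    return -np.sum(tg * np.log(y))

# Tiny example: 4 inputs, one hidden layer of 3 units, 2 output classes
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 4)), np.zeros(3), logsig),
          (rng.normal(size=(2, 3)), np.zeros(2), softmax)]
y = forward(rng.normal(size=4), layers)   # softmax output sums to 1
```

Because the last layer is a softmax, `y` is a valid class-probability vector and `cross_entropy` is well defined on it.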
BP is an iterative method that relies on the initialization (often done randomly) of the weight and bias vector, $\tilde{w}_1$ ($k=1$). The next step is determining the search direction $\tilde{p}_k$ and step size $\alpha_k$ so that $E(\tilde{w}_k + \alpha_k \tilde{p}_k) < E(\tilde{w}_k)$. This leads to the update $\tilde{w}_{k+1} = \tilde{w}_k + \alpha_k \tilde{p}_k$. If the first derivative $E'(\tilde{w}_k) \neq \tilde{0}$, meaning that we are not yet at a minimum/maximum, then a new iteration is made ($k = k+1$)
Fig. 3. Control architecture highlighting the central role of the PRTM. The
PRTM receives information from the gesture recognition system and sends
commands to the robot. In addition, the PRTM manages the feedback provided
to the human co-worker.
and a new search direction is found. Else, the process is
over and $\tilde{w}_k$ should be returned as the desired minimum. BP variations typically rely on different methods to find $\tilde{p}_k$, the determination of $\alpha_k$ or new terms in the weight update
equation. This often leads to the introduction of user-defined
parameters that have to be determined empirically.
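The generic iterative scheme can be illustrated with plain steepest descent standing in for SCG (the direction and step-size rules of SCG are more elaborate; this is only a sketch of the update loop, with a hypothetical function name and backtracking rule):

```python
import numpy as np

def descend(E, grad_E, w, alpha=0.1, tol=1e-8, max_iter=1000):
    """Skeleton of the iterative scheme described above: while the
    gradient is non-zero, pick a search direction p_k (here simply
    -grad, instead of SCG's conjugate direction) and a step alpha_k
    such that E decreases, then update w_{k+1} = w_k + alpha_k p_k."""
    for _ in range(max_iter):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:      # E'(w) ~ 0: minimum reached
            break
        p = -g                           # search direction
        a = alpha
        while E(w + a * p) >= E(w) and a > 1e-12:
            a *= 0.5                     # backtrack until E decreases
        w = w + a * p
    return w

# Minimize E(w) = ||w||^2, whose unique minimum is w = 0
w_min = descend(lambda w: w @ w, lambda w: 2 * w, np.array([3.0, -2.0]))
```

The loop mirrors the text: direction, step size satisfying the descent condition, update, and a stopping test on the gradient.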
B. Feature Dimensionality Reduction and Regularization
For the SGs no dimensionality reduction is proposed, since
the feature space is still small. To solve the issue of unde-
termined feature size of the DGs, we propose re-sampling
with bicubic interpolation. It makes it possible to transform a DG sample $X(i)$, $i \in i_D$, $X \in M_{d \times \eta}$, which has a variable number of frames $\eta$, into a fixed-dimension sample $X'$, $X' \in M_{d \times \eta_0}$. Usually $\eta_0 \geq \eta$, with $\eta_0$ arbitrarily defined as the maximum $\eta$ over all the samples $i$ such that $i \in i_D$. So, although in almost every case the proposed transformation up-samples the sample, it is also valid for new cases where $\eta_0 < \eta$, effectively down-sampling it.

$$\mathrm{interp}: \mathbb{R}^{d \times \eta} \rightarrow \mathbb{R}^{d \times \eta_0}$$
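A sketch of this re-sampling, using SciPy's cubic-spline `zoom` along the time axis as a stand-in for the bicubic interpolation used in the paper (the function name and test signal are hypothetical):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_gesture(X, eta0):
    """Map a dynamic-gesture sample X (d features x eta frames) to a
    fixed d x eta0 matrix by cubic interpolation along the time axis,
    up- or down-sampling as needed."""
    d, eta = X.shape
    # zoom factor 1 keeps the feature dimension, eta0/eta rescales time
    return zoom(X, (1, eta0 / eta), order=3, mode="nearest")

# A 3 x 17 sample (three features tracing half a sine over 17 frames)
X = np.sin(np.linspace(0, np.pi, 17))[None, :].repeat(3, axis=0)
X_up = resample_gesture(X, 40)    # up-sampled to 3 x 40
X_down = resample_gesture(X, 10)  # down-sampled to 3 x 10
```

Either way the classifier always receives a $d \times \eta_0$ input, which is what the fixed-size ANN input layer requires.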
The gesture recognition acts in parallel with the so-called Parameterization Robotic Task Manager (PRTM), which is used
to parametrize and manage robotic tasks with the human co-
worker in the loop, Fig. 3. Additionally, PRTM is used to
provide speech feedback to the user through computer text-
to-speech (TTS), and visual feedback using a monitor. The
gesture recognition has implemented the methods presented
in previous sections such as data sensory acquisition, raw
data processing, segmentation, and static and dynamic gesture
classification. The communication between PRTM and the
gesture classification module is achieved by using sockets
TCP/IP. The PRTM communicates with the robot through
When a gesture (static or dynamic) is recognized, a socket message is sent to the PRTM with information about the recognized gesture. The PRTM works as a phone auto attendant, providing options to the human (speech feedback), who selects the intended robot service using gestures. The proposed PRTM includes two options in the first layer, BRING and KINESTHETIC, Fig. 4. The BRING option refers to the ability of the robot to deliver parts, tools, and consumables to the human co-worker, while KINESTHETIC is related to the operation mode in which the co-worker can physically guide the robot to desired poses in space to teach a specific task or to hold a part while he/she is working on it. In the second
layer, and for the BRING option, the user can select Tools
or Parts, with different possibilities in each one (third layer).
The BRING functionalities and operation actions related with
the human co-worker, robot and user feedback are detailed in
Fig. 5. The robot poses were previously defined using teach-in
programming, i.e., moving the robot end-effector to the target
poses and saving them.
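The socket exchange between the gesture classifier and the PRTM can be sketched as below. The JSON message format, host address, and port are illustrative assumptions of ours, not details taken from the paper.

```python
import json
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 56789  # illustrative address, not from the paper

def prtm_listener(results):
    """PRTM side: accept one TCP connection and decode a JSON gesture
    message from the gesture classification module."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            results.append(json.loads(conn.recv(1024).decode()))

def send_gesture(name, kind, retries=50):
    """Classifier side: report a recognized gesture to the PRTM,
    retrying until the listener is up."""
    for _ in range(retries):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
                cli.connect((HOST, PORT))
                cli.sendall(json.dumps({"gesture": name, "type": kind}).encode())
                return
        except ConnectionRefusedError:
            time.sleep(0.05)

results = []
listener = threading.Thread(target=prtm_listener, args=(results,))
listener.start()
send_gesture("Select", "SG")
listener.join()
```

In the real system the message would carry whichever gesture identifiers the classifier emits; the point is that recognition and task management run as separate processes coupled only by the socket.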
The interactive process starts with the user performing a gesture called “Attention”. This gesture signals the system that the user wants to perform a given robotic task parametrization. The speech and visual feedback inform the human user about the selection options in the first layer. The user has a few seconds (a predefined time) to perform a “Select” gesture to select the desired option. After this process, the PRTM, through images and text displayed in the monitor and TTS, asks the user to validate the selected option with a “Validation” gesture. If validated, the PRTM goes to the next layer; if not validated, the system continues in the current layer. If the user does not perform the “Select” gesture during the predefined time period, the PRTM continues with the other options within the layer. The procedure is repeated until the user selects one of the options or until the PRTM, through TTS, has repeated all of the options three times. The process is similar for the second and third layers. In the third layer the PRTM sends a socket message to the robot to perform the parametrized task. If required, at any moment the user can perform the “Stop” gesture so that the system returns to the initial layer and the robot stops.
The above interactive process consumes a significant amount of time. In response to this problem, the PRTM can be set up with a pre-established sequence of operations, so that human intervention is reduced to accepting or rejecting the PRTM suggestions at some critical points of the task being performed. The pros and cons of this mode of operation are discussed in the Experiments and Results section.
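The layered, auto-attendant-style dialogue described above can be sketched as a small state machine. The first- and second-layer option names follow Fig. 4; the third-layer items and the simplified select/validate logic are hypothetical placeholders of ours.

```python
# Hypothetical sketch of the PRTM layer navigation: a gesture-driven
# menu where "Select" proposes an option and "Validation" confirms it.
MENU = {
    "ROOT": ["BRING", "KINESTHETIC"],
    "BRING": ["PARTS", "TOOLS"],
    "PARTS": ["part 1", "part 2"],   # illustrative third-layer items
    "TOOLS": ["tool 1", "tool 2"],
}

class PRTM:
    def __init__(self):
        self.layer = "ROOT"
        self.pending = None          # option announced, awaiting validation

    def on_gesture(self, gesture, option=None):
        if gesture == "Stop":        # return to the first layer
            self.layer, self.pending = "ROOT", None
        elif gesture == "Select":
            self.pending = option    # wait for a "Validation" gesture
        elif gesture == "Validation" and self.pending:
            if self.pending in MENU:                 # descend one layer
                self.layer, self.pending = self.pending, None
            else:                                    # third layer: run task
                return ("EXECUTE", self.pending)
        return None

prtm = PRTM()
prtm.on_gesture("Select", "BRING")
prtm.on_gesture("Validation")
prtm.on_gesture("Select", "TOOLS")
prtm.on_gesture("Validation")
prtm.on_gesture("Select", "tool 1")
cmd = prtm.on_gesture("Validation")
```

In the sketch `cmd` ends up as the command that would be sent to the robot over the socket; the timeout-driven announcement of options and the TTS/visual feedback are omitted.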
Fig. 4. The three layers of the proposed PRTM. The BRING and KINESTHETIC options are in the first layer. For the BRING option we have in the second
layer two options to select PARTS and TOOLS. In the third layer we have all the parts and tools available to be selected.
Fig. 5. BRING architecture with the role of the human co-worker, robot and feedback.
A. Setup and Data Acquisition
Five IMUs and a UWB positioning system were used to capture the human upper body shape and position in space, respectively, Fig. 6. The collaborative robot is a KUKA iiwa with 7 DOF, equipped with the Sunrise controller.
The 5 IMUs (Technaid Tech-MCS) comprise 3-axis accelerometers, magnetometers, and gyroscopes. The IMUs are synchronized in the Technaid Tech-MCS HUB, and an extended Kalman filter is applied to fuse sensor data and estimate each IMU's orientation as Euler angles α, β, and γ. In Bluetooth connection mode and for 5 IMUs, the system outputs data at 25 Hz. These data are the input for the gesture recognition system.
The UWB (ELIKO KIO) provides the relative position of
the human co-worker in relation to the robot. This information
is used to define if the human is close to the robot. If the
human is at less than 1 meter from the robot the interactive
mode is valid. The UWB tag is in the human’s pocket and the
4 anchors are installed in the working room.
The sensors are connected to a computer running MATLAB.
Sensor data are captured and stored in buffers. A script reads
the newest samples from the buffers and processes them. The
stream of data is segmented by the motion-threshold method
detailed in section II, Eq. (6), considering a sensitivity factor of 3.0, and with the following segmentation features related to the 5 IMUs:

t = [ω_1 a_1 ω_2 a_2 … ω_5 a_5]^T     (9)
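A simplified sketch of this sliding-window, motion-threshold segmentation is shown below. The exact threshold of Eq. (6) is defined in section II and not reproduced here; we use windowed variance against a rest-noise level scaled by the sensitivity factor as an illustrative stand-in, with synthetic data.

```python
import numpy as np

def segment_motion(T, window=20, k=3.0, sigma0=0.05):
    """Mark frame t as dynamic (m=1) when the mean per-feature variance
    of the segmentation features over a sliding window exceeds k times
    a rest-noise level sigma0. Simplified stand-in for Eq. (6)."""
    n = T.shape[1]
    m = np.zeros(n, dtype=int)
    for t in range(n):
        w = T[:, max(0, t - window + 1):t + 1]
        if w.var(axis=1).mean() > k * sigma0:
            m[t] = 1
    return m

# Synthetic stream of 10 segmentation features: rest, motion burst, rest.
rng = np.random.default_rng(0)
T = 0.01 * rng.standard_normal((10, 300))
T[:, 100:200] += 2.0 * np.sin(np.linspace(0, 20, 100))
m = segment_motion(T)
```

The window size (20) matches the best-performing value reported later in the paper; `sigma0` is a tunable calibration constant of our sketch, corresponding to the user-specific threshold calibration discussed in the results.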
Concerning the classification features, a full frame of data
from the IMUs is represented by f, Eq. (10), namely the IMUs
accelerations and Euler angles in a total of 30 DOF. These
features represent almost all representative data from IMUs
and were selected by manual search. A binary segmentation
variable m, Eq. (2), represents whether the frame belongs to
a dynamic segment or not.
Fig. 6. Wearable sensors applied for the proposed HRI interface, 5 IMUs and
a UWB tag.
f = [a_x1 a_y1 a_z1 … a_z5 α_1 β_1 γ_1 … γ_5 m]     (10)

where a_xh, a_yh and a_zh represent the accelerations (including the gravity effect) from IMU_h, with h = 1, …, 5, along the coordinate axes of IMU_h. The Euler angles α_h, β_h and γ_h are relative to IMU_h, with h = 1, …, 5. The frames f are arranged in static and dynamic samples according to the segmentation output m.
B. Gesture Data Set
According to the functionalities to be achieved and industry
feedback, a dataset of continuous static/dynamic gestures is
used. It contains 8 SGs, Fig. 7, and 4 DGs, Fig. 8. These gestures are composed of upper body arm data captured from the IMUs, Eq. (10). Industry feedback was provided by production engineers from the automotive sector and by two shop floor workers who experienced the system. They indicated that the number of gestures to be memorized by the robot co-workers should be relatively small, that the co-workers should be able to
customize each gesture to a given robot functionality, that errors in gesture classification should not cause a safety problem or be detrimental to the work being done, and that the co-workers should have feedback about the process (for example, they need to know whether the robot is moving or waiting for a command). They selected these gestures from a library of possible gestures we provided.
Fig. 7. Representation of the 8 SGs.
Fig. 8. Representation of the 4 DGs.
To avoid false positives/negatives, we
implemented what we call composed gestures, which are a mix
of the SGs and DGs mentioned above. The composition of a
composed gesture can be customized by each different user
according to the following rules: (1) the composed gesture
begins with a static pose with the beginning of a selected
dynamic gesture B-DG, (2) a DG, (3) a static pose with the end
of the dynamic gesture E-DG, (4) an inter-gesture transition
(IGT), and (5) a SG. Three examples of composed gestures
are detailed in Fig. 9. However, several other combinations
may be selected/customized by each different user.
The training samples for SGs, S^(i_S), and DGs, S^(i_D), were obtained from two different subjects, subjects A and B (60 samples per subject for each gesture (8 SGs and 4 DGs), in a total of 720 trained patterns). These two subjects participated in the development of the proposed framework.
Fig. 9. Example of 3 composed gestures. B-DG indicates the beginning of
a DG, E-DG indicates the end of a DG and IGT the inter-gesture transition
between gestures.
Fig. 10. Human arms described by 2 unit vectors each (representing
orientation of arm and forearm).
Fig. 11. DG 2 gesture data before (at left) and after compression and regularization (at right).
Fig. 12. LSTM network architecture.
C. Features
For the SGs, m = 0, the features for classification are all the elements of f, excluding m. The notation for the ith SG feature vector is z′_S ∈ ℝ^30:

z′_S = (a_x1 a_y1 a_z1 … a_z5 α_1 β_1 γ_1 … γ_5)     (11)
For the DGs, the features are derived from f, Eq. (10), namely the unit vectors representing the orientation of the human arms. Each arm is described by two rigid links, with a unit vector representing each: arm o_a^(i) = (o_ax, o_ay, o_az)_i and forearm o_f^(i) = (o_fx, o_fy, o_fz)_i. From the Euler angles of each IMU we can define the spherical joints between each two sensors, such that we get three orthogonal rotation angles between each pair of sensors. From these we can construct the direct kinematics for each arm of the human body and obtain the unit vectors, Fig. 10. The notation for the ith DG feature vector is z′_D ∈ ℝ^12:

z′_D = (o_a1 o_a2 o_f1 o_f2)     (12)
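The direct-kinematics step can be sketched as follows: each IMU's Euler angles define a rotation that is applied to a reference direction along the limb segment. The Z-Y-X Euler convention and the reference vector are assumptions of ours; the Tech-MCS convention is not specified here.

```python
import numpy as np

def euler_to_R(alpha, beta, gamma):
    """Rotation matrix from Z-Y-X Euler angles (assumed convention;
    the actual IMU convention may differ)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

def link_unit_vector(alpha, beta, gamma, ref=np.array([1.0, 0.0, 0.0])):
    """Unit vector describing a limb segment: the IMU orientation
    applied to an assumed reference direction along the segment."""
    v = euler_to_R(alpha, beta, gamma) @ ref
    return v / np.linalg.norm(v)

o_a = link_unit_vector(0.3, -0.5, 0.1)  # e.g. upper-arm direction
```

Stacking four such unit vectors (arm and forearm of each side) yields the 12-dimensional frame feature of Eq. (12).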
Each DG, including gestures in the same class, normally has a variable number of frames. For classification purposes, we need to establish a fixed dimension for all DGs, recurring to the bicubic interpolation detailed in the previous section. Given a sample X^(i): i ∈ i_D with η frames (X^(i) ∈ M^(12×η)), the objective is to resample it to a fixed size η′. The value of η′ can be chosen arbitrarily, but higher values have a detrimental effect on the classification accuracy. For that reason, η′ should have an upper bound such that η′ ≤ η, ∀η | X^(i_D) ∈ M^(12×η). For the proposed gesture dataset, the gesture length varies between 42 and 68 frames. Therefore, we choose the lowest η of the DG samples, η′ = 42. Applying the bicubic interpolation, the result is a matrix Z ∈ ℝ^(12×42). Fig. 11 shows an example of gesture data before and after compression and regularization for DG 2 (length reduced from 48 to 42 frames). It is visible that the data significance is maintained. By concatenating every frame vertically, Z is transformed into a vector z ∈ ℝ^504:

z^(i) = concat(Z^(i))     (13)
The last feature processing step is feature scaling. It is es-
sential for achieving smaller training times and better network
performance with less samples. It harmonizes the values of
different features so that all of them fall within the same range.
This is especially important when some features have distinct
orders of magnitude. Applying linear rescaling, l:

l(x) = (2x − b) / (max X_T − min X_T)     (14)

where b is the max+min operator defined in Eq. (15), and X_T = {z^(i): i ∈ i_T} is the set of unscaled features of the training set. This operator is valid both for static and dynamic gestures, but the sample subsets used should be exclusive.

b_i = max X_i + min X_i,  i = 1, …, d     (15)
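A minimal sketch of this rescaling, assuming the standard mapping of each training feature into [−1, 1] (the extracted equation lost its denominator, so the per-feature range is our reconstruction of the standard form):

```python
import numpy as np

def fit_rescale(X_T):
    """Fit the linear rescaling on the training set X_T
    (rows = samples, columns = features): b_i = max X_i + min X_i,
    so that training features map into [-1, 1] per feature."""
    b = X_T.max(axis=0) + X_T.min(axis=0)   # Eq. (15)
    r = X_T.max(axis=0) - X_T.min(axis=0)   # per-feature range
    return lambda x: (2.0 * x - b) / r      # l(x), Eq. (14)

# Two features with very different orders of magnitude.
X_T = np.array([[0.0, 10.0],
                [2.0, 30.0],
                [1.0, 20.0]])
l = fit_rescale(X_T)
scaled = l(X_T)
```

The returned closure is then applied unchanged to test samples, which is why the static and dynamic sample subsets must be kept exclusive when fitting.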
D. Results and Discussion: Gesture Recognition
Experiments were conducted to verify the performance and
effectiveness of the proposed framework. It was tested by two
subjects (subject A and B) that contributed to the development
of the system and created the gesture training data set, and
five subjects (subject C, subject D, subject E, subject F and
subject G) that are not robotics experts and are using the
system for the first time. Subjects F and G are automotive plant workers, aged 25 to 30, with expertise in the assembly of components for gear boxes. For the testing
dataset, each subject performed each SG 60 times (for the 8
SGs we have a total of 480 testing patterns for each subject)
and each DG 60 times (for the 4 DGs we have a total of 240
testing patterns for each subject).
The proposed solution for gesture segmentation aims to accurately divide a continuous data stream into static and dynamic segments. The conducted experiments consisted of the analysis of samples containing sequences of static and dynamic behaviours. For each subject, ten composed gestures were analysed, each with 2 motion blocks and 3 static blocks, Fig. 9.

TABLE I: Segmentation error per subject.
Subject                   A    B    C    D    E    F    G
Proposed (unsupervised)   0%   0%   8%   4%   10%  7%   9%
ANN (supervised)          0%   0%   6%   4%   6%   2%   8%
LSTM (supervised)         0%   2%   8%   4%   10%  4%   8%
Segmentation performance depends largely on the size of
the sliding window. The segmentation accuracy was measured
for different sliding window sizes. Considering small sliding
windows, there is excessive segmentation (oversegmentation),
leading to low accuracy. The best results were achieved for a
window size of 20.
The proposed unsupervised solution was compared with two
supervised methods, a one-class feed-forward neural network
(ANN) and a Long Short-Term Memory (LSTM) network,
Fig. 12. For both networks, the inputs are the sliding-window data, and a single output neuron outputs a motion index. They were trained with the same calibration data used in the unsupervised method to achieve an optimal sliding window size (data from subjects A and B).
Table I shows the segmentation error results. Results indi-
cate that the segmentation error for the supervised methods
(ANN and LSTM) is identical to the proposed unsupervised
solution. For subjects A and B the segmentation error is
almost zero, justified by the fact that they tested a system
calibrated/trained with data they produced. The error detected
for the other subjects (C, D, E, F and G) is mainly due to
oversegmentation. Generally, oversegmentation occurs in the
IGT phase and is not critical for the classification. The pro-
posed unsupervised method is effective, especially if calibrated
(threshold parameters) with data from the user.
Concerning SG classification, S^(i), i ∈ i_S, 60 samples from subjects A and B were used for the training set (i ∈ i_ST) and 60 samples from subjects A, B, C, D, E, F and G for the validation set (i ∈ i_SV). The validation set is not used for optimization purposes. The loss of the network on this set is monitored during optimization, which is halted when this loss stops decreasing, in order to prevent overfitting.
The MLNN architecture for SG classification, Fig. 13, has 30 neurons in the input layer, which is the size of the SG feature vector (Eq. 11). It is composed of one hidden layer with 50 neurons, with the hyperbolic tangent as the transfer function. The output layer has 8 nodes, the number of classes, with the softmax function as the transfer function.
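The forward pass of this 30-50-8 network can be sketched as below. The weights here are random placeholders (in the paper they are learned with backpropagation); only the layer sizes and transfer functions follow the described architecture.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax: outputs sum to 1."""
    e = np.exp(y - y.max())
    return e / e.sum()

def sg_network_forward(z, W1, b1, W2, b2):
    """Forward pass of the described SG classifier: 30 inputs,
    tanh hidden layer with 50 neurons, softmax output over 8 classes."""
    h = np.tanh(W1 @ z + b1)
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(0)
z = rng.standard_normal(30)                       # SG feature vector, Eq. (11)
W1, b1 = rng.standard_normal((50, 30)), np.zeros(50)
W2, b2 = rng.standard_normal((8, 50)), np.zeros(8)
p = sg_network_forward(z, W1, b1, W2, b2)         # 8 class probabilities
```

The predicted class is simply `argmax(p)`; the softmax output doubles as a confidence estimate over the 8 SG classes.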
The accuracy results, Table II, indicate an overall classification accuracy on the testing set for subject A of 99.0% (475/480) and for subject B of 98.50% (473/480). For subjects C, D, E, F and G, who did not train the system, the accuracy was reduced, Table II. SG1 was mistaken for SG3, which are very similar gestures if the user does not position the right arm correctly.
Fig. 13. ANN architecture for SG classification.
Fig. 14. ANN architecture used for DG classification.
For DG classification, S^(i): i ∈ i_D, the training set is composed of 60 samples from subjects A and B (i ∈ i_DT) and 60 samples from subjects A, B, C, D, E, F and G for the validation set (i ∈ i_DV). The network architecture, Fig. 14, has 504 input neurons, one hidden layer with 20 neurons, and an output layer with 4 output neurons; the transfer function is the hyperbolic tangent in the first layer and the softmax function in the last layer.
The gesture classification accuracy, Table II, shows good accuracy for subjects A and B. Even for subjects C, D, E, F and G, who did not train the system, the accuracy is relatively high. These good results are due to the relatively small number of DG classes. It should be noted that the model was not trained with data from subjects C, D, E, F and G, and no calibration was performed.
For the composed gestures, the accuracy is directly related to the accuracy of the SGs and DGs.
Deep learning algorithms require large amounts of training data, being more suitable for the classification of images and sequences of images. The results we obtained with the proposed ANN-based classification solution are satisfactory, especially considering that we have few training data from wearable sensors and from only two subjects. Nevertheless, the results are acceptable for subjects C, D, E, F and G, and excellent for subjects A and B. In this context, we compared the proposed MLNN method with a common classification method, SVM. The SVM was not optimized.
TABLE II: Classification accuracy (%) per subject.
Subject              A     B     C     D     E     F     G
SGs (proposed ANN)   99.0  98.5  94.2  93.5  89.6  95.0  90.1
DGs (proposed ANN)   99.5  99.0  95.8  92.5  94.6  96.9  95.4
SGs (SVM)            98.6  97.6  92.4  88.3  84.3  89.5  87.5
DGs (SVM)            98.2  97.4  92.1  88.7  91.1  92.1  88.1
We tried different SVM methods using the MATLAB Classification Learner, obtaining the best results with the Medium Gaussian SVM (a Gaussian kernel function). Results indicate that the SVM presents interesting results but compares unfavourably with the proposed MLNN method, Table II. The results for subjects F and G are in line with, or even better than, the results for the other three subjects that did not train the system. This can be justified by the fact that these workers are relatively young (average age of 25 years) and familiar with information and communications technologies (ICT).
E. Results and Discussion: Robot Interface
The collaborative robot acts as a “third hand”, assisting the human co-worker in an assembly operation by delivering tools and parts to the shared workplace and holding workpieces, Fig. 15. After a gesture is recognized, it serves as
input for the PRTM that interfaces with the robot and provides
speech and visual feedback to the human co-worker (section
IV), Fig. 3.
The framework was tested by the seven subjects mentioned above. Subjects C, D, E, F and G received a 15-minute introduction to the system from subjects A and B, who contributed to the system development and created the gesture dataset. From the library of 8 SGs and 4 DGs, the seven subjects chose the gestures that best suited them to associate with the PRTM commands: “attention”, “select”, “validation”, “stop”, “abort” and “initialize” (according to the functionalities detailed in section IV). Finally, subjects C, D, E, F and G were briefed on the assembly sequence and the components involved.
The complete assembly task is composed of subtasks: manipulation of parts, tools and consumables, holding actions, and screwing. Some tasks are more suited to be executed by humans, others by robots, and others by collaborative work between human and robot. When requested by the human co-worker (using gestures), the robot has to deliver to the human workplace the parts, consumables (screws and washers) and tools for the assembly process. The parts and tools are placed in known fixed positions. Moreover, the human can set up the robot in kinesthetic precision mode [40] to manually guide it to hold workpieces while tightening the elements, Fig. 15.
Although the gesture recognition rate is high, the occurrence of false positives and negatives was analysed. Our experiments demonstrated that if a given gesture is wrongly classified, the “validation” procedure allows the user to know from the speech and visual feedback that this happened, so that he/she can adjust the interactive process.
The collaborative activities may present the risk of potential collisions between human and robot. Based on the UWB positional data, when a threshold separation distance is reached the robot stops. The estimation of the separation distance accounts for the velocity and reach of both robot and human (dimensions of the human upper limbs), and for the UWB error (about 15 cm). In our experiments we considered a separation distance of 1 meter. This is valid when the robot is delivering the tools and consumables to the human co-worker. The robot performs these actions with a velocity according to safety standards, so that this stop operation is not mandatory. For kinesthetic teaching, the separation distance is not considered.
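The stop condition based on the UWB position can be sketched as a simple threshold check. The planar positions and the way the UWB error is folded in as a fixed margin are illustrative simplifications of ours; the paper's estimate also accounts for velocities and reach.

```python
import numpy as np

UWB_ERROR = 0.15   # m, UWB accuracy reported in the experiments
THRESHOLD = 1.0    # m, separation distance used in the experiments

def must_stop(robot_xy, human_xy):
    """Stop the robot when the measured separation, reduced by the UWB
    error margin, falls below the threshold distance."""
    d = np.linalg.norm(np.asarray(robot_xy, float) - np.asarray(human_xy, float))
    return bool(d - UWB_ERROR < THRESHOLD)

stop_near = must_stop((0.0, 0.0), (0.8, 0.0))  # 0.8 m apart: stop
stop_far = must_stop((0.0, 0.0), (1.5, 0.0))   # 1.5 m apart: keep moving
```

Subtracting the sensor error makes the check conservative: the robot stops as soon as the human could plausibly be inside the threshold.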
During the interactive process, the reached target points can
be saved and used in future robot operations. The impedance
controlled robot compensates positioning inaccuracies, i.e., the
co-worker can physically interact with the robot (kinesthetic
mode) to adjust positioning.
On average, the time that passes between the recognition of a gesture and the completion of the associated PRTM/robot command is about 1 second. If the setup of the PRTM is taken into account, with the selection of the desired options, it takes more than 5 seconds.
The seven subjects filled a questionnaire about the proposed
interface, resulting in the following main conclusions:
1) The gesture-based interface is intuitive but delays the
interactive process. It can be complemented with a tablet
to select some robot options faster;
2) It was considered by all the subjects that the “validation” procedure slows the interactive process. Subjects F and G indicated that this is discouraging from an industrial point of view. Nevertheless, they indicated that the problem is attenuated when a given sequence is set up in the PRTM, avoiding the validations;
3) The shop floor workers (subjects F and G) indicated that their main concerns are safety (emergency buttons recommended) and the need to make the interactive process as simple as possible. They adapted easily to the system but pointed out that this can be a difficult task for older workers. At this stage we can assume that these systems mainly have to be operated by young workers familiar with basic ICT technologies;
4) Operating a version of the PRTM without all the validations proved to be faster. Nevertheless, the system presents lower flexibility, i.e., it requires an initial setup of the task sequence, so that human intervention is reduced to accepting or rejecting the PRTM suggestions with the NEXT option;
5) The composed gestures are more complex to perform
compared to SGs and DGs. Nevertheless, they are more
reliable than SGs and DGs;
6) The automatic speech and visual feedback is considered
essential for a correct understanding of the interactive
process, complementing each other;
7) The subjects that were not familiar with the system (subjects C, D, E, F and G) considered that working with the robot without fences presents some degree of danger (they did not feel totally safe). The industry workers indicated the need for one or several emergency buttons placed close to the robotic arm;
8) All subjects reported that the proposed interface allows the human co-worker to abstract from the robot programming, saves time in collecting parts and tools for the assembly process, and provides better ergonomic conditions by adjusting the robot as desired. The ergonomics factor was reinforced by subjects F and G from industry.
Fig. 15. Human-robot collaborative process. In this use case the robot delivers tools and parts to the human co-worker (top and middle) and the robot holds the workpiece while the co-worker is working on it (bottom). For better ergonomics, the co-worker adjusts the workpiece position and orientation through robot hand-guiding. The monitor that provides visual feedback to the user is also represented, indicating the task being performed, the next task, and asking for human intervention if required at each moment.
The task completion time was analysed for the presented assembly use case. The task completion time of the collaborative robotic solution (eliminating the validation procedures) is about 1.4 times longer than when performed by the human
worker alone. The collaborative robotic solution is not yet attractive from an economic perspective and needs further research. This result is in line with similar studies reporting that collaborative robotic solutions are more costly in terms of cycle time than manual processes [41]. Nevertheless, the system demonstrated to be intuitive to use and to offer better ergonomics for the human.
This paper presented a novel gesture-based HRI framework
for collaborative robots. The robot assists a human co-worker
by delivering tools and parts, and holding objects to/for an
assembly operation. It can be concluded that the proposed
solution accurately classifies static and dynamic gestures,
trained with a relatively small number of patterns, and with an
accuracy of about 98% for a library of 8 SGs and 4 DGs. These
results were obtained having IMU data as input, unsupervised segmentation by motion, and an MLNN as classifier. The proposed parameterization robotic task manager (PRTM) demonstrated intuitiveness and reliability in managing the recognized gestures together with robot action control and speech/visual feedback.
Future work will be dedicated to testing the proposed solution with other interaction technologies (vision) and to adapting the PRTM so that a novel assembly task is easier to set up. In addition, we will perform more tests with industry workers.
[1] G.-Z. Yang, J. Bellingham, P. E. Dupont, P. Fischer, L. Floridi,
R. Full, N. Jacobstein, V. Kumar, M. McNutt, R. Merrifield,
B. J. Nelson, B. Scassellati, M. Taddeo, R. Taylor, M. Veloso,
Z. L. Wang, and R. Wood, “The grand challenges of science
robotics,” Science Robotics, vol. 3, no. 14, 2018.
[2] L. Johannsmeier and S. Haddadin, “A hierarchical human-robot
interaction-planning framework for task allocation in collaborative in-
dustrial assembly processes,” IEEE Robotics and Automation Letters,
vol. 2, no. 1, pp. 41–48, Jan 2017.
[3] B. Sadrfaridpour and Y. Wang, “Collaborative assembly in hybrid man-
ufacturing cells: An integrated framework for human-robot interaction,
IEEE Transactions on Automation Science and Engineering, vol. PP,
no. 99, pp. 1–15, 2017.
[4] K. Kaipa, C. Morato, J. Liu, and S. Gupta, “Human-robot collaboration
for bin-picking tasks to support low-volume assemblies.” Robotics
Science and Systems Conference, 2014.
[5] E. Matsas, G.-C. Vosniakos, and D. Batras, “Effectiveness and
acceptability of a virtual environment for assessing human–robot
collaboration in manufacturing,” The International Journal of Advanced
Manufacturing Technology, vol. 92, no. 9, pp. 3903–3917, Oct 2017.
[6] I. E. Makrini, K. Merckaert, D. Lefeber, and B. Vanderborght, “Design
of a collaborative architecture for human-robot assembly tasks,” in 2017
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), Sept 2017, pp. 1624–1629.
[7] T. Ende, S. Haddadin, S. Parusel, T. Wüsthoff, M. Hassenzahl, and
A. Albu-Schäffer, “A human-centered approach to robot gesture based
communication within collaborative working processes.” IROS 2011,
25-30 Sept. 2011, San Francisco, California.
[8] S. Sheikholeslami, A. Moon, and E. A. Croft, “Cooperative gestures
for industry: Exploring the efficacy of robot hand configurations
in expression of instructional gestures for human-robot interaction,”
The International Journal of Robotics Research, vol. 36, no. 5-7, pp. 699–720, 2017.
[9] P. Rouanet, P. Oudeyer, F. Danieau, and D. Filliat, “The impact of human–robot interfaces on the learning of visual objects,” IEEE Transactions on Robotics, vol. 29, no. 2, pp. 525–541, April 2013.
[10] S. Radmard, A. J. Moon, and E. A. Croft, “Interface design and usability
analysis for a robotic telepresence platform,” in 2015 24th IEEE Inter-
national Symposium on Robot and Human Interactive Communication
(RO-MAN), Aug 2015, pp. 511–516.
[11] M. Simao, P. Neto, and O. Gibaru, “Natural control of an industrial robot
using hand gesture recognition with neural networks,” in IECON 2016
- 42nd Annual Conference of the IEEE Industrial Electronics Society,
Oct 2016, pp. 5322–5327.
[12] M. T. Wolf, C. Assad, M. T. Vernacchia, J. Fromm, and H. L. Jethani,
“Gesture-based robot control with variable autonomy from the JPL
BioSleeve,” in 2013 IEEE International Conference on Robotics and
Automation. IEEE, may 2013, pp. 1160–1165.
[13] P. Neto, D. Pereira, J. N. Pires, and A. P. Moreira, “Real-time and
continuous hand gesture spotting: An approach based on artificial
neural networks,” 2013 IEEE International Conference on Robotics and
Automation, pp. 178–183, 2013.
[14] B. Gleeson, K. MacLean, A. Haddadi, E. Croft, and J. Alcazar, “Gestures
for industry intuitive human-robot communication from human observa-
tion,” in 2013 8th ACM/IEEE International Conference on Human-Robot
Interaction (HRI), March 2013, pp. 349–356.
[15] S. Goldin-Meadow, “The role of gesture in communication and thinking,” Trends in Cognitive Sciences, vol. 3, no. 11, pp. 419–429, 1999.
[16] R. S. Feldman, Fundamentals of Nonverbal Behavior. Cambridge
University Press, 1991.
[17] S. Waldherr, R. Romero, and S. Thrun, “A Gesture Based Interface for
Human-Robot Interaction,” Autonomous Robots, vol. 9, no. 2, pp. 151–
173, sep 2000.
[18] C. P. Quintero, R. T. Fomena, A. Shademan, N. Wolleb, T. Dick, and
M. Jagersand, “Sepo: Selecting by pointing as an intuitive human-robot
command interface,” in Robotics and Automation (ICRA), 2013 IEEE
International Conference on, May 2013, pp. 1166–1171.
[19] M. Burke and J. Lasenby, “Pantomimic gestures for human–robot
interaction,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1225–
1237, Oct 2015.
[20] Y. Okuno, T. Kanda, M. Imai, H. Ishiguro, and N. Hagita, “Providing
route directions: Design of robot’s utterance, gesture, and timing,” pp.
53–60, 2009.
[21] M. Salem, S. Kopp, I. Wachsmuth, K. Rohlfing, and F. Joublin, “Generation and Evaluation of Communicative Robot Gesture,” International Journal of Social Robotics, vol. 4, no. 2, pp. 201–217, Feb 2012.
[22] C.-M. Huang and B. Mutlu, “Modeling and evaluating narrative gestures
for humanlike robots,” in In Proceedings of Robotics: Science and
Systems, 2013, p. 15.
[23] M. Wongphati, H. Osawa, and M. Imai, “Gestures for manually controlling a helping hand robot,” International Journal of Social Robotics, vol. 7, no. 5, pp. 731–742, 2015.
[24] Z. Shao and Y. Li, “Integral invariants for space motion trajectory matching and recognition,” Pattern Recognition, vol. 48, no. 8, pp. 2418–2432, 2015.
[25] M. A. Simao, P. Neto, and O. Gibaru, “Unsupervised gesture segmenta-
tion by motion detection of a real-time data stream,” IEEE Transactions
on Industrial Informatics, vol. PP, no. 99, pp. 1–1, 2016.
[26] J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, “A unified framework
for gesture recognition and spatiotemporal gesture segmentation.” IEEE
transactions on pattern analysis and machine intelligence, vol. 31, no. 9,
pp. 1685–99, sep 2009.
[27] R. Yang, S. Sarkar, and B. Loeding, “Handling movement epenthesis and
hand segmentation ambiguities in continuous sign language recognition
using nested dynamic programming.” IEEE transactions on pattern
analysis and machine intelligence, vol. 32, no. 3, pp. 462–77, mar 2010.
[28] B. Burger, I. Ferrané, F. Lerasle, and G. Infantes, “Two-handed gesture
recognition and fusion with speech to command a robot,” Autonomous
Robots, vol. 32, no. 2, pp. 129–147, dec 2011.
[29] V. Villani, L. Sabattini, G. Riggio, C. Secchi, M. Minelli, and C. Fan-
tuzzi, “A natural infrastructure less human robot interaction system,
IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1640–1647,
July 2017.
[30] D. Wu, L. Pigou, P. J. Kindermans, N. LE, L. Shao, J. Dambre, and
J. M. Odobez, “Deep dynamic neural networks for multimodal gesture
segmentation and recognition,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2016.
[31] F. J. Ordonez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, 2016.
[32] M. Field, D. Stirling, Z. Pan, M. Ros, and F. Naghdy, “Recognizing human motions through mixture modeling of inertial data,” Pattern Recognition, vol. 48, no. 8, pp. 2394–2406, 2015.
[33] Y. Song, D. Demirdjian, and R. Davis, “Continuous body and hand
gesture recognition for natural human-computer interaction,” ACM
Transactions on Interactive Intelligent Systems, vol. 2, no. 1, pp. 1–28,
mar 2012.
[34] C. Monnier, S. German, and A. Ost, A Multi-scale Boosted Detector
for Efficient and Robust Gesture Recognition. Springer International
Publishing, 2015, pp. 491–502.
[35] K. Mei, J. Zhang, G. Li, B. Xi, N. Zheng, and J. Fan, “Training
more discriminative multi-class classifiers for hand detection,” Pattern
Recognition, vol. 48, no. 3, pp. 785–797, 2015.
[36] M. R. Pedersen and V. Krüger, “Gesture-based extraction of robot skill
parameters for intuitive robot programming,” Journal of Intelligent &
Robotic Systems, vol. 80, no. 1, pp. 149–163, 2015.
[37] S. Rossi, E. Leone, M. Fiore, A. Finzi, and F. Cutugno, “An extensible
architecture for robust multimodal human-robot communication,” in
2013 IEEE/RSJ International Conference on Intelligent Robots and
Systems, Nov 2013, pp. 2208–2213.
[38] J. Cacace, A. Finzi, V. Lippiello, M. Furci, N. Mimmo, and L. Marconi,
“A control architecture for multiple drones operated via multimodal
interaction in search & rescue mission,” in 2016 IEEE International
Symposium on Safety, Security, and Rescue Robotics (SSRR), Oct 2016,
pp. 233–239.
[39] M. F. Møller, “A scaled conjugate gradient algorithm for fast supervised
learning,” Neural Networks, vol. 6, no. 4, pp. 525–533, 1993.
[40] M. Safeea, R. Bearee, and P. Neto, End-Effector Precise Hand-
Guiding for Collaborative Robots. Cham: Springer International
Publishing, 2018, pp. xx–xx.
[41] O. Madsen, S. Bøgh, C. Schou, R. S. Andersen, J. S. Damgaard,
M. R. Pedersen, and V. Krüger, “Integration of mobile manipulators
in an industrial production,” Industrial Robot: An International
Journal, vol. 42, no. 1, pp. 11–18, 2015.