Gesture-based Human-robot Interaction for Field Programmable
Autonomous Underwater Robots
Pei Xu1
Abstract— The uncertainty and variability of the underwater environment impose the need to control underwater robots dynamically and in real time, especially in scenarios where humans and robots must work collaboratively in the field. However, the underwater environment places harsh restrictions on the application of typical control and communication methods. Considering that gestures are a natural and efficient way for humans to interact, we implement a real-time gesture recognition system based on a convolutional neural network, which can recognize 50 kinds of gestures from images captured by a single ordinary monocular camera, and apply this recognition system to human and underwater robot interaction. We design A Flexible and Extendable Interaction Scheme (AFEIS) through which underwater robots can be programmed in situ underwater by human operators using a customized gesture-based sign language. This paper elaborates the design of the gesture recognition system and AFEIS, and presents field trial results from applying this system and scheme on underwater robots.
I. INTRODUCTION
On land, keyboards, mice, and other physical input devices are commonly used as reliable means to control computers and robots or to interact with autonomous vehicles. Without physical input devices, interaction between humans and computers or robots can still be conducted smoothly through speech, infrared, or other wireless communication approaches and/or sensors. Underwater, however, most wireless communication approaches and sensors based on electromagnetic signals become useless due to attenuation caused by the water, which poses a huge challenge for interacting with robots underwater.
Usually, two methods are used for human and underwater robot interaction. For autonomous underwater vehicles (AUVs), we need to program the robot carefully before putting it in the water so that it can carry out its mission underwater independently, without operator intervention. For remotely operated vehicles (ROVs), a tether is needed to connect the robot to a control platform on land or on the water surface so that operators on the platform can control the robot in real time. However, both interaction schemes have limitations. In the first, due to the lack of direct interaction between operators and robots, the operators or programmers of an AUV have to consider all possible situations the robot may encounter underwater and ensure, when programming the robot, that it can properly deal with all kinds of accidents. This may be an impossible task due to the uncertainty of the underwater environment.
*This work was not supported by any organization
1Pei Xu is a student in the Department of Electrical and Computer En-
gineering, University of Minnesota, Twin Cities xuxx0884@umn.edu
In the interaction scheme for ROVs, the tether may become a big problem if the underwater scene has complex topography; and because the ROV is controlled by operators on a remote control platform, some complex tasks that require operators to cooperate with robots in the field, according to the on-site situation, may be impossible to conduct.
In this context, we present A Flexible and Extendable Interaction Scheme (AFEIS) through which commands can be issued by operators in situ to program and control AUVs in real time. Hand gestures are adopted as the input method for AFEIS when interacting with underwater robots. First, hand gestures are a natural way for most people to interact and can be used conveniently without the support of external equipment. Besides, even underwater, gesture information can be captured easily using optical cameras as long as the lighting conditions are adequate. Although AFEIS is designed for interaction with underwater robots through hand gestures, it can be applied in other human-computer or human-robot interaction scenarios and/or through other interaction approaches such as speech and picture-based signs or tags.
Gesture recognition with high accuracy is a prerequisite for implementing such an interaction scheme. In normal application scenarios, gesture recognition using wearable electromagnetic devices and infrared sensors can provide quite accurate results. However, neither of these methods is employable underwater due to the confines of water. Traditionally, gesture recognition can also be implemented, based on images captured from optical cameras, by means of orientation histograms [1], hidden Markov models [2], particle filtering [3], and support vector machines [4]. A common characteristic of these methods is that gesture features, like convexity defects, elongatedness, and eccentricity, must be extracted from images manually before being fed into classifiers. The final recognition result then depends heavily on which features are taken into account to describe gestures. Due to the lack of effective and comprehensive ways to describe various gestures based on certain manually extractable features, these methods usually provide satisfactory accuracy only when recognizing a very limited number of gestures. In this context, we use a convolutional neural network (CNN) to perform gesture recognition and thus avoid extracting features manually. The CNN we use has a simple structure so that the recognition system can run in real time on a platform with limited computational resources. We train a model that can recognize 50 kinds of gestures with an accuracy rate of over 99.6%. When combined with AFEIS, a probabilistic model can be introduced to further improve the robustness of the whole system.
While introducing gestures as the interaction approach, AFEIS has two distinguishing characteristics: it is flexible and extendable. AFEIS allows operators to define the meaning of gestures by themselves. The ‘gestures’ mentioned here are those used to control or program the robot in situ; they can be substituted by other control signals such as speech or visual signs or tags in certain application scenarios. In AFEIS, the meaning represented by each gesture, instead of being fixed in the code, is parsed through independent configuration files, which can be customized by each operator and specified with respect to the hardware platform of the robot. This design lets each operator control the robot in whatever way he or she finds most comfortable, and thus reduces the difficulty of learning and using such an interaction system. Meanwhile, AFEIS decouples the input system, i.e., the hand gesture or other input or recognition system, from the control system, i.e., the system that directly controls the robot's actions. Therefore, it can be easily deployed on various robot platforms with little additional work. Once AFEIS is deployed, modifying the configuration files is enough to make AFEIS support more control commands when the robot platform is extended. Furthermore, by dynamically loading different configuration files during the interaction, operators can issue different sets of control commands to the robot with a single set of gestures, and thus the number of control commands that can be expressed by gestures increases significantly. Another key characteristic of AFEIS is that it makes robots ‘programmable’ in situ. AFEIS parses signals obtained from the input systems based on a simple syntax similar to a programming language. It does not command the robot according to each single gesture, but guides the robot to complete a series of commands according to a sequence of gestures posed by operators. Besides directly issuing commands to robots, it allows operators to define functions and set variables in situ when interacting with robots, which makes the interaction process more flexible.
The rest of this paper is organized as follows: a brief survey of related work and our comments are given in Section II; the implementation of AFEIS using gesture-based sign language for underwater robots is elaborated in Section III; field test results are shown in Section IV; and some improvements that we are testing are presented in Section V.
II. BACKGROUND AND RELATED WORK
Gesture recognition is a fundamental link in human-robot interaction applications, since hand gestures are chosen as the means of interaction between humans and robots in our design. Besides the recognition part, a potential problem in a hand-based recognition system is hand detection and background removal. Without devices such as stereo cameras or infrared sensors to provide depth information, most hand detection methods using monocular cameras rely on the shape [5], color [6], Haar features [7], or context information [8] of hands. Due to the variety and variability of hand gestures, the application of these methods usually requires some restrictions on the background that detectors face. To the best of our knowledge, no hand detection method using monocular cameras can really work well in an arbitrary environment. This imposes a huge obstacle to applying gesture recognition with monocular cameras in general cases. Fortunately, compared to the uncertain indoor or outdoor environments that ‘on land’ applications may face, the underwater environment is usually much simpler, with a relatively monotonous background. Moreover, the interaction between humans and underwater robots usually happens in professional tasks. We can ask operators to wear special equipment, such as colored gloves, before performing the interaction. This is conducive to hand detection and facilitates the application of gesture recognition using monocular cameras in the underwater environment.
In [9], the authors use ARTags to interact with robots underwater. In order to use this interaction method, operators must print ARTags and bind them together into book form before going into the water. The ARTags must be waterproofed in advance, and it is also a problem for operators to pick out the expected one from dozens of ARTags, which convey no intuitive meaning to readers.
In [10], the authors use the motion of the hand and arm instead of hand gestures to perform interaction. A problem with this interaction scheme is that, during interaction, operators have to hold the arm up and draw shapes by moving the hand and arm aloft. Such an interaction method is not very natural, and it is difficult for operators to draw shapes with the hand and arm precisely. The result provided by the authors only reaches about a 97% recognition rate with just five kinds of ‘gestures’. Another problem that may occur underwater, and which the authors do not address, is that it is sometimes hard in the underwater environment to judge what causes the motion of the hand and arm. Underwater, operators often cannot keep their poses very stable, and the motion of the hand and arm is often in fact caused by the movement of the body rather than by the hand and arm themselves. Besides, water flow can push robots, especially smaller ones, and make them drift slowly. This also influences the judgment about the motion of the hand and arm. In addition, the algorithm proposed by the authors is based on the analysis of point clouds, which leads to higher computational demands.
In [11], the authors employ gloves with colored markers to implement gesture recognition underwater, rather than basing the recognition directly on the whole hand or on a glove covering the whole hand. The authors reveal neither the details of their gesture recognition algorithm nor its performance. However, a conceivable shortcoming is that the kinds of gestures their algorithm can cover are limited, since recognition may be achievable only when a sufficiently large area of the colored markers is visible to the camera.
Fig. 1. Structure of the CNN classifier for gesture recognition. This network is obtained by modifying LeNet-5.
Fig. 2. Thumbnails of all 50 kinds of gestures supported by the recognition model.
Fig. 3. Flow of the process to obtain the image fed into the CNN classifier.
Besides, in [11], the authors define a set of syntax in a form similar to natural language as the way to translate gestures into commands accepted by robots. An effort to interact with robots using natural sentences expressed by gestures is made in that paper. However, in practice, we find that a syntax similar to a programming language is more intuitive and acceptable to operators. Therefore, in AFEIS, we define a set of syntax similar to a programming language but with a simpler form. Besides commanding the robot to act through hand gestures, the syntax of AFEIS also supports function definition and variable setting, which are convenient for repetitive tasks.
III. METHODOLOGY
A. Gesture Recognition
The CNN that we use to perform hand gesture recognition is obtained by modifying LeNet-5 [12]. The network structure is shown in Fig. 1. Unlike most CNNs that are popular at present, the CNN in Fig. 1 performs recognition based on binary (black-and-white) images instead of color ones. Two concerns motivate the use of binary images and a CNN with such a simple structure. First, for hand gesture recognition, color, and even grayscale, is usually not a reliable feature. Most hand gestures are expressed by combinations of fingers posed at different positions in a quite small space. A subtle change in the lighting conditions, such as the lighting angle or illumination intensity, can significantly change how gesture images captured by monocular cameras look and thus blind a pre-trained model. The second concern is computational cost. Robots, especially small ones, usually provide only limited computational resources. We prefer a lightweight recognition system that lets the whole system run in real time using as few computational resources as possible. For this purpose, we also developed a lightweight CNN framework named PNet to conduct gesture recognition. PNet is written in C++ for performance reasons and is designed as a lightweight and flexible deep learning framework. The first version of PNet will be released later in 2017.
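As a rough point of reference for the scale of such a network, the sketch below shows a LeNet-5-style classifier for 64-by-64 binary inputs and 50 output classes. It is written in PyTorch purely for illustration; the layer sizes are assumptions and do not reproduce the exact architecture of Fig. 1, which is implemented in PNet.

import torch.nn as nn

class GestureNet(nn.Sequential):
    # Illustrative LeNet-5-style stack; the paper's exact layer sizes are given in Fig. 1.
    def __init__(self, num_classes=50):
        super().__init__(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x64x64 -> 6x60x60
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 6x30x30
            nn.Conv2d(6, 16, kernel_size=5),  # -> 16x26x26
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 16x13x13
            nn.Flatten(),
            nn.Linear(16 * 13 * 13, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),       # scores over the 50 gesture classes
        )

Feeding a batch of shape (N, 1, 64, 64) through such a network produces an (N, 50) tensor of class scores.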
Our model supports 50 kinds of static, single-hand gestures, as shown in Fig. 2. It covers most static single-hand gestures that people can pose naturally and is enough to satisfy the requirements of normal tasks in which operators need to interact with robots in the field underwater.
In our tests, hand detection is performed by color filtering. Operators are required to wear gloves of a certain color, and the recognition system is calibrated to that color before being deployed on the robot. In the detection process, we keep only those pixels whose color falls in a specific range and convert the image captured by the camera into a black-and-white one by thresholding. Then, after some augmentation, including Gaussian blurring and opening and closing morphological transformations, the binary image is cropped to keep only the largest contour region. Finally, the image is resized while keeping the aspect ratio before being fed to the recognition system. The whole image processing procedure is quite cheap, since the recognition system does not require precise contour information, so there is no need for high-resolution images or complex techniques to perform image processing.
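The pipeline described above can be sketched with OpenCV roughly as follows. The HSV color range, kernel sizes, and the 64-by-64 output size are illustrative assumptions rather than the exact values used in our system.

import cv2
import numpy as np

def preprocess(frame_bgr, lower_hsv, upper_hsv, out_size=64):
    # Keep only glove-colored pixels (color filtering + thresholding).
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_hsv, upper_hsv)
    # Augmentation: Gaussian blur plus opening and closing to clean up the mask.
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Crop to the largest contour region (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    crop = mask[y:y + h, x:x + w]
    # Resize while keeping the aspect ratio, then pad to a square input.
    scale = out_size / max(w, h)
    resized = cv2.resize(crop, (max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = np.zeros((out_size, out_size), np.uint8)
    oy = (out_size - resized.shape[0]) // 2
    ox = (out_size - resized.shape[1]) // 2
    canvas[oy:oy + resized.shape[0], ox:ox + resized.shape[1]] = resized
    return canvas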
B. A Flexible and Extendable Interaction Scheme
AFEIS is a scheme designed as middleware between the recognition system and the robot's execution system. Fig. 4 shows the grammar AFEIS uses to parse input signals, which in our tests are the recognition results produced by the gesture recognition system. AFEIS itself is concerned with neither how input signals are obtained nor how the robot executes each command. Its main task is to translate a series of input signals into a list of executable commands, based on which the robot can act sequentially. Besides, AFEIS lets operators define functions (<def-fn>) and variables (<set-var>) in the field. It can record a series of commands as a custom-defined function, which is callable later (<call-fn>) when needed, or store a number as a variable.
<explist> ::= <exp> | <set-var> | <def-fn>
            | <change-keymap>
<def-fn> ::= <def> <integer> CMD_SEP <cmdset> END
<def> ::= DEF | BEGIN
<set-var> ::= <set> <integer> PARAM_SEP <num> END
<change-keymap> ::= <set> <integer> END
<set> ::= SET | BEGIN
<exp> ::= BEGIN <integer> <do>
          <cmdset> [CMD_SEP] END
<do> ::= DO | BEGIN
<cmdset> ::= <cmd> [CMD_SEP <cmdset>]
<cmd> ::= <function> | <call-fn>
        | <load-keymap> | <set-var>
        | <math-fn> | <exp>
<function> ::= FN <arg-list>
<arg-list> ::= <arg> [PARAM_SEP <arg-list>]
<arg> ::= <num> | <load-var> | PARAM
<load-var> ::= CALL <integer>
<call-fn> ::= CALL <integer>
<load-keymap> ::= <set> <integer>
<set-var> ::= <set> <integer> PARAM_SEP <num>
<math-fn> ::= <math-op> <load-var> PARAM_SEP
              <math-arg-list>
<math-op> ::= + | - | * | /
<math-arg-list> ::= <load-var> | <num>
<num> ::= [<neg-sign>] <integer>
          [<decimal-point> <integer>]
<neg-sign> ::= -
<decimal-point> ::= .
<integer> ::= <digit> [<integer>]
<digit> ::= 1|2|3|4|5|6|7|8|9|0
Fig. 4. AFEIS grammar defined in BNF.
A stored variable can later be modified by mathematical functions (<math-fn>) or loaded as a function argument (<load-var>). The <integer> in <exp> after the BEGIN identifier indicates how many times the following <cmdset> will be executed repeatedly, like a for-loop in common programming languages. In theory, operators can define any number of functions and variables in the field through AFEIS. The grammar of AFEIS is quite close to that of common programming languages and was readily accepted by most operators with whom we worked. By combining a function with custom-defined variables, AFEIS provides a way for operators to define functions with arguments in the field and call them dynamically.
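To make the parsing step concrete, the following is a minimal sketch, not our actual implementation, of a recursive-descent parser for a simplified subset of the grammar in Fig. 4: a BEGIN <integer> DO <cmdset> END loop whose commands are FN symbols with numeric arguments or nested expressions. It assumes that raw input signals have already been mapped to symbolic tokens through a keymap, as discussed below.

class AFEISSubsetParser:
    # Parses a token list such as ["BEGIN", "3", "DO", "DOWN", "1", "CMD_SEP", "SNAPSHOT", "END"]
    # into a flat command list; DEF/SET/variables of the full grammar are omitted here.
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self, expected=None):
        tok = self.toks[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        self.pos += 1
        return tok

    def parse_exp(self):
        # <exp> ::= BEGIN <integer> DO <cmdset> [CMD_SEP] END
        self.take("BEGIN")
        repeat = int(self.take())
        self.take("DO")
        body = self.parse_cmdset()
        self.take("END")
        return body * repeat                      # unroll the loop

    def parse_cmdset(self):
        cmds = [self.parse_cmd()]
        while self.peek() == "CMD_SEP":
            self.take("CMD_SEP")
            if self.peek() == "END":              # optional trailing separator
                break
            cmds.append(self.parse_cmd())
        return cmds

    def parse_cmd(self):
        if self.peek() == "BEGIN":                # nested loop expression
            return ("BLOCK", self.parse_exp())
        name = self.take()                        # FN symbol, e.g. "DOWN", or "CALL"
        args = []
        while self.peek() not in ("CMD_SEP", "END", None):
            args.append(float(self.take()))
            if self.peek() == "PARAM_SEP":
                self.take("PARAM_SEP")
        return (name, args)

# "Dive 1 meter and take a photo, repeated three times":
cmds = AFEISSubsetParser(
    ["BEGIN", "3", "DO", "DOWN", "1", "CMD_SEP", "SNAPSHOT", "END"]).parse_exp()
# -> [("DOWN", [1.0]), ("SNAPSHOT", [])] repeated three times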
AFEIS first requires some input signals to be defined to represent the basic symbols: BEGIN, END, DEF, SET, DO, CMD_SEP (separator of commands), PARAM_SEP (separator of parameters), CALL, the 10 digits, the decimal point, and the negative sign. Among these symbols, DEF, SET, and DO can be replaced by BEGIN, and the symbols representing <num> can also be used as normal PARAM symbols. Therefore, AFEIS only needs to occupy 5 slots of input signals while still providing function definition, variable definition, and loops. Each of the remaining available input signals can be used as a symbol representing a command FN or a parameter PARAM. Because, when facing a <function>, AFEIS always parses the first input signal as the FN symbol and the rest as PARAM, an input signal can have two definitions at the same time, as an FN symbol and as a PARAM symbol respectively.
; first configuration file
[system]
BEGIN=A
END=B
CALL=C
CMD_SEP=D
PARAM_SEP=E
DO=
DEF=
SET=
[fn]
0=FORWARD
1=LEFT
2=RIGHT
[param]
0=0
1=1
2=2
3=3

; second configuration file
[system]
BEGIN=A
END=B
CALL=C
CMD_SEP=D
PARAM_SEP=E
DO=     ; =BEGIN by default
DEF=    ; =BEGIN by default
SET=    ; =BEGIN by default
[fn]
0=UP
1=DOWN
2=SNAPSHOT
[param]
0=0
1=1
2=2
3=3

Fig. 5. An example of two AFEIS configuration files written in INI format.
The two files can be set simultaneously, and the operator can manually load
either of them at a time when interacting with robots.
A 1 D    // DEF 1       // define a function in slot 1
A 1 D    // SET 1       // load keymap 1
1 1 D    // DOWN 1      // down 1 meter
2 D      // SNAPSHOT    // take a photo
A 0      // SET 0       // load default keymap
B        // END
A 3 A    // BEGIN 3 DO  // do 3 times
C 1      // CALL 1      // call the function in slot 1
B        // END         // execute

Fig. 6. An example of a signal sequence and its literal semantics for interacting
with a robot via AFEIS according to the configuration files in Fig. 5. The
input signals consist of two parts. First, they define a function, which
commands the robot to dive 1 meter and then take a photo. Then this
function is called 3 times, which is equivalent to commanding the robot to
dive 3 meters and take a photo each time it dives 1 meter.
(for i from 1 to 3 do
(inform_robot, DOWN, 1)
(inform_robot, SNAPSHOT)
)
(inform_robot, EXECUTE)
Fig. 7. The general way through which AFEIS communicates with a robot
after receiving the last END signal in Fig. 6.
The <math-fn> is optional and can be extended or disabled based on the requirements of the task. Since most robots are able to perform mathematical operations themselves, in practice <math-fn> can also be designed as a command accepted by the robot.
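As an illustrative sketch of how a <math-fn> could be interpreted (the variable store and token layout here follow the grammar in Fig. 4 but are assumptions, not our exact implementation):

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def apply_math_fn(tokens, variables):
    # <math-fn> ::= <math-op> <load-var> PARAM_SEP <math-arg-list>
    op = OPS[tokens[0]]
    assert tokens[1] == "CALL" and tokens[3] == "PARAM_SEP"
    slot = int(tokens[2])                 # <load-var> ::= CALL <integer>
    arg = float(tokens[4])                # a <num>; the full grammar also allows another <load-var>
    variables[slot] = op(variables[slot], arg)
    return variables[slot]

variables = {2: 1.0}
apply_math_fn(["+", "CALL", "2", "PARAM_SEP", "5"], variables)   # variables[2] becomes 6.0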
In AFEIS, the definition of each input signal is given in independent configuration (keymap) files instead of being hard-coded inside AFEIS, so that the recognition and execution systems are decoupled. In this way, the definition of each input signal can be set dynamically by loading different configuration files during the interaction with robots. Besides, operators can write their own configuration files and set the meaning of each input signal, i.e., the meaning of each gesture, by themselves, thereby interacting with the robot in the way most comfortable for them. This greatly reduces the difficulty for operators of mastering interaction with robots via AFEIS.
Fig. 8. Some transient gestures are captured by the camera when the
operator changes the gesture from ‘palm’ to ‘fist’.
Besides, thanks to this decoupling, AFEIS can be easily deployed on most robot or computer systems that provide an interface for accepting external commands.
Fig. 5 shows an example of two configuration files written in INI format. In the example, A, B, C and 0, 1, 2, 3 on the left side of the equals signs are the indices of the input signals, each of which corresponds to a hand gesture in our tests, while the entries on the right side in the fn and param sections are the definitions of the corresponding input signals when parsed as FN and PARAM symbols respectively. Additionally, there is no requirement that gestures and signals be paired strictly, nor that every gesture have a definition. One signal can be represented by multiple gestures, and we can leave some gestures with an empty definition if there is no need for so many functions or parameters in a task.
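A minimal sketch of how such keymap files might be consumed follows; the use of Python's configparser, the file name, and the fallback handling for empty values are illustrative assumptions rather than our actual implementation.

import configparser

def load_keymap(path):
    cfg = configparser.ConfigParser(inline_comment_prefixes=(";",))
    cfg.read(path)
    system = dict(cfg["system"])               # option names are lowercased, e.g. {"begin": "A", ...}
    # Empty values such as "DO=" fall back to the BEGIN signal, as in Fig. 5.
    for key in ("do", "def", "set"):
        if not system.get(key):
            system[key] = system["begin"]
    fn = dict(cfg["fn"])                       # input-signal index -> command name
    param = dict(cfg["param"])                 # input-signal index -> parameter value
    return system, fn, param

system, fn, param = load_keymap("keymap0.ini")  # hypothetical file name
print(fn.get("0"), param.get("3"))              # "FORWARD 3" with the first file in Fig. 5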
Take the configuration files in Fig. 5 as an example. If we want the robot to dive 3 meters and take a photo each time it dives 1 meter, the operator can program the robot by giving the input signals shown in Fig. 6. After the last END signal is received, AFEIS communicates with the robot and transfers the commands in the manner shown in Fig. 7. The robot simply stores all commands that it receives from AFEIS and executes them when it meets the command EXECUTE. The procedure shown in Fig. 7 is the general way in which AFEIS communicates with a robot. For robots that have an interface to parse a series of commands at once, the interaction between AFEIS and the robot can be done in just one communication.
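A sketch of the robot-side behavior this implies is shown below; the handler names and the dictionary-based dispatch are illustrative, not a specific robot API.

class CommandBuffer:
    def __init__(self, handlers):
        # handlers maps command names to robot actions,
        # e.g. {"DOWN": robot.dive, "SNAPSHOT": robot.snapshot}
        self.handlers = handlers
        self.pending = []

    def inform(self, name, *args):
        if name == "EXECUTE":
            for cmd, cmd_args in self.pending:   # run buffered commands sequentially
                self.handlers[cmd](*cmd_args)
            self.pending.clear()
        else:
            self.pending.append((name, args))    # store until EXECUTE arrives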
Moreover, AFEIS also introduces a confirmation mechanism to improve the robustness of the gesture recognition system. When using a CNN classifier, any input instance is classified as one of the classes that the classifier supports. This means that the gesture recognition system is bound to make a mistake when facing any unrecognized gesture or noise, since it cannot report that the image it faces does not belong to any class it recognizes. One may require operators to try their best to pose gestures accurately. However, noise aside, due to the high sampling rate, the camera will always capture some transient gestures when operators change one gesture into another, as shown in Fig. 8. These transient gestures lead the classifier to unexpected recognition results. This means that, in practice, it is unavoidable for the recognition system to encounter some unrecognized gestures and then make mistakes. In order to improve robustness, AFEIS does not directly accept an input signal according to the recognition result obtained from a single image captured by the camera, but instead bases the signal on a probabilistic model that relies on the recognition results obtained over a time interval.
Suppose that $G$ is a set of gestures recognized by the recognition system during a small time interval, i.e.,

$$G = \{g_1, g_2, \cdots, g_n\}$$

where $g_i$ is the gesture recognized by the system from the $i$-th image captured by the camera in the time interval. The discriminant function for AFEIS to accept the input signal as $s_k$ is defined as

$$f_k(G) = \Pr(s_k \mid G)$$

A Markov model or Bayes risk estimator is employable to evaluate $f_k$. If we have no idea about the distribution of gestures with respect to a signal, a simple method is to count the occurrence of each gesture during the time interval, i.e.,

$$f_k(G) = \frac{\sum_i r_{k,i}}{|G|} \quad \text{if} \quad \sum_i r_{k,i} > t$$

where $|\cdot|$ is the cardinality operation, $t$ is a threshold value, and

$$r_{k,i} = \begin{cases} 1, & \text{if } g_i \text{ is defined as signal } s_k \\ 0, & \text{otherwise} \end{cases}$$
Another key reason to introduce such a confirmation step in AFEIS is that human operators usually cannot change their poses as fast as the camera samples. Therefore, a buffer time is needed for operators to adjust their gestures when interacting with robots. The confirmation method employed by AFEIS introduces a delay in the recognition process while improving robustness. The length of the time interval used to collect $G$ must be adjusted to balance the system's sensitivity and robustness.
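A sketch of this counting scheme is given below. The window length and acceptance threshold are illustrative; the threshold $t$ above is a count, expressed here as a fraction of $|G|$.

from collections import Counter, deque

class SignalConfirmer:
    def __init__(self, window_size=30, threshold=0.6):
        self.window = deque(maxlen=window_size)   # G: recent per-frame recognition results
        self.threshold = threshold                # acceptance threshold as a fraction of |G|

    def update(self, gesture, gesture_to_signal):
        # Feed one per-frame recognition result; return an accepted signal or None.
        self.window.append(gesture)
        if len(self.window) < self.window.maxlen:
            return None
        counts = Counter(gesture_to_signal.get(g) for g in self.window)
        counts.pop(None, None)                    # gestures with empty definitions count as nothing
        if not counts:
            return None
        signal, votes = counts.most_common(1)[0]
        if votes / len(self.window) > self.threshold:
            self.window.clear()                   # start a fresh interval after accepting
            return signal
        return None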
Besides, when employing AFEIS, it is suggested to always leave some gesture definitions empty for the sake of robustness. As discussed above, when facing an unrecognized gesture or noise, the CNN classifier is bound to make a mistake. However, if some gestures have an empty definition, unrecognized gestures or noise have a chance of being labeled as those unused gestures, and AFEIS will then parse those false recognition results as nothing, thereby improving the robustness of the whole system.
IV. RESULTS
A. Gesture Recognition
We collected 75,000 static images of the right hand covering 50 kinds of gestures as samples. For each gesture, 1,200 samples are used for training and 300 samples for testing. Fig. 9 shows the loss curve during training and the accuracy rate when applying the model to the test set. Each image captured by the camera is processed in the way described in Section III-A. Finally, the obtained binary image is resized to 64-by-64 while keeping the aspect ratio before being fed to the CNN classifier.
OpenCV is employed to perform image processing, while PNet is used to run the gesture recognition model. No GPU is used during our practical tests.
Fig. 9. Loss and accuracy of our CNN model for hand gesture recognition
during training and testing.
Fig. 10. Time consumed by image processing and recognition accuracy with respect to different camera resolutions.
The time from capturing an image to obtaining a recognition result can be held within 0.02 seconds on a platform with a DragonBoard 410c, which provides a 1.2 GHz CPU. This speed is achieved by limiting the camera resolution and thus dramatically reducing the time used for image processing. Fig. 10 shows the time consumed by image processing and the recognition accuracy rate with respect to different camera resolutions. A key reason for the lower accuracy at low resolutions is that, given that the input size required by the recognition model is 64-by-64, the contour region of gestures becomes too small at lower resolutions and thus cannot be extracted effectively.
During training, after 18,000 iterations, the model provides a satisfactory recognition result, with an accuracy rate of more than 99.6% when applied to the test set. The error rate is distributed evenly across gestures, which means there is no significant difference in the accuracy with which the classifier recognizes individual gestures. Hence, in theory, operators can arbitrarily pick any of the 50 kinds of gestures to interact with robots.
B. Field Trials
A small ROV is used in our field trials so that we can monitor whether the vehicle receives correct commands. Figs. 11 and 12 show some pictures captured by the camera installed on the ROV during our swimming pool test.
Fig. 11. An exhibition of digits from 0 to 9 expressed by hand gestures
during our swimming pool test.
Fig. 12. A sequence of gestures to command the robot to turn left 30
degrees and then take a picture. The definition of each gesture from left to
right is Go Left, 3, 0, SEPARATOR, and SNAPSHOT.
TABLE I
TASKS

Task  Description                                                                    Complexity^a
1     Go Down 1 meter and Take a Photo.                                              8
2     Go Left 30 degree and Take a Photo.                                            9
3     Go to Water Surface, Take a Photo and Go Back.                                 9
4     Swim Circle 3 times, Go Forward 2 meter, Take a Photo and Go Back.             16
5     Go Down 3 meter and Take a Photo every time going down 1 meter.                8
6     Go to Location 1^b, Take a Photo, Go to Location 2, Take a Photo and Go Back.  13
7     Follow Operator^c and Take a Photo every 1 second.                             8
8     Define a Function in the field to complete Task 5.                             14

a. Complexity is the minimum number of input signals, i.e. the length of the sequence of gestures, needed to complete a task via AFEIS.
b. Location 1 and 2 are two pre-defined target destination coordinates.
c. Following action is achieved by tracking specific color.
An interesting observation from Fig. 12 is that the operator was hardly able to keep his body or hand stable when interacting with the robot underwater, even though he tried to do so. This implies that the motion of the hand is usually not a reliable signal during the interaction between humans and robots underwater.
Table I shows a subset of the tasks performed in our field trials. A note is warranted for Task 8. Basically, Task 8 does what we described in Fig. 6. It reaches the same operational goal as Task 5 but has a higher complexity. The complexity of Task 8 consists of two parts. The first part is defining a function, while the other is calling the defined function to command the robot to act. The latter part has a complexity of only 6, which is lower than that of Task 5 and which usually would not increase when facing a more complicated task. That is to say, by allowing operators to define functions in the field, AFEIS provides an easy way for operators to command robots to finish repetitive complex tasks.
AFEIS allows a gesture to have different definitions when it appears as FN and as PARAM, as shown in Fig. 4, and also allows gestures to have empty definitions so that they are parsed as nothing by AFEIS.
TABLE II
INTERACTION RESULT ASSOCIATED WITH EMPTY GESTURE DEFINITIONS

Task  Empty FN^a  Empty PARAM^b  S/F^c     Task  Empty FN  Empty PARAM  S/F
1     45          46             10/0      5     45        44           10/0
      10          10             10/0            10        10           10/0
      0           0              8/2             0         0            9/1
2     44          46             10/0      6     43        44           10/0
      10          10             10/0            10        10           9/1
      0           0              9/1             0         0            7/3
3     45          45             10/0      7     44        45           10/0
      10          10             10/0            10        10           10/0
      0           0              9/1             0         0            9/1
4     45          46             10/0      8     44        44           9/1
      10          10             10/0            10        10           10/0
      0           0              7/3             0         0            7/3

a. FN is the number of gesture definitions left empty for FN slots.
b. PARAM is the number of gesture definitions left empty for PARAM slots.
c. S is the number of successful interactions that reach the expected result, while F is the number of failures.
Leaving some gestures with an empty definition is considered a way to improve robustness, since noise or unrecognized gestures may be classified as those undefined gestures and thus be ignored by AFEIS. In the field trials, we tested the interaction performance with different numbers of gestures left undefined. Table II shows partial results of these tests. In Table II, the first row for each task is the case where we define as few gestures as possible; the second row is a general case where ten gestures are always left empty; and the last row is the case where all gestures have their own definitions. As can be seen from Table II, AFEIS performs poorly if no gesture has an empty definition.
Another issue of concern is the difficulty for operators to adapt to the interaction style provided by AFEIS. It is usually not a problem for operators to remember the meaning of each gesture, since they are allowed to define the meanings themselves. However, because the ROV used in our tests has nothing like a screen to provide feedback to operators during the interaction, the operators have to practice beforehand in order to adapt to the frequency at which AFEIS accepts input signals.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed an interaction scheme, named AFEIS, for human-robot interaction performed underwater in the field. This scheme uses a simple syntax similar to that of common programming languages. Besides providing a way for operators to directly issue commands to robots, it allows operators to define functions and set variables in the field and to call them later, thereby making robots programmable in situ. We employ hand gestures as the way to interact with robots underwater. A CNN model, supporting 50 kinds of static, single-hand gestures with more than 99.6% recognition accuracy, is trained as the recognition system to provide AFEIS with input signals so that AFEIS can control robots based on operators' gestures. AFEIS decouples the recognition system from the robot's execution system by using independent configuration files to interpret input signals, and thus can be deployed in most robot systems with little additional work. By means of configuration files, operators are allowed to define, by themselves, the input signal represented by each gesture, so the learning effort required for operators to adapt to AFEIS is quite low. In our field trials, AFEIS performed quite well and has the potential to be used in harsher underwater environments and for more complex tasks.
In future work, we plan to further test AFEIS in environments with poor lighting conditions and visibility. Besides, the CNN model that we trained is based on right-hand images, and we are collecting left-hand images in order to train a model for operators who prefer to use the left hand. We are also considering allowing AFEIS to work in a mode that supports two-hand gestures. However, there are still some challenges in applying AFEIS to two-hand based interaction in the underwater environment.
REFERENCES
[1] W. T. Freeman and M. Roth, Orientation histograms for hand gesture
recognition. International workshop on automatic face and gesture
recognition. 1995, 12: 296-301.
[2] T. Starner and A. Pentland, Real-time american sign language recog-
nition from video using hidden markov models. Motion-Based Recog-
nition. Springer Netherlands, 1997: 227-243.
[3] L. Bretzner, I. Laptev and T. Lindeberg, Hand gesture recognition
using multi-scale colour features, hierarchical models and particle
filtering. Automatic Face and Gesture Recognition, 2002. Proceedings.
Fifth IEEE International Conference on. IEEE, 2002: 423-428.
[4] N. H. Dardas and N. D. Georganas, Real-time hand gesture detection
and recognition using bag-of-features and support vector machine
techniques. IEEE Transactions on Instrumentation and Measurement,
2011, 60(11): 3592-3607.
[5] E. J. Ong and R. Bowden, A boosted classifier tree for hand shape de-
tection. Automatic Face and Gesture Recognition, 2004. Proceedings.
Sixth IEEE International Conference on. IEEE, 2004: 889-894.
[6] J. Fritsch, S. Lang, A. Kleinehagenbrock, G. A. Fink and G. Sagerer,
Improving adaptive skin color segmentation by incorporating results
from face detection. Robot and Human Interactive Communication,
2002. Proceedings. 11th IEEE International Workshop on. IEEE, 2002:
337-343.
[7] Q. Chen, N. D. Georganas and E. M. Petriu, Real-time vision-based
hand gesture recognition using haar-like features. Instrumentation and
Measurement Technology Conference Proceedings, 2007. IMTC 2007.
IEEE. IEEE, 2007: 1-6.
[8] P. Buehler, M. Everingham, D. P. Huttenlocher and A. Zisserman,
Long term arm and hand tracking for continuous sign language TV
broadcasts. Proceedings of the 19th British Machine Vision Confer-
ence. BMVA Press, 2008: 1105-1114.
[9] G. Dudek, J. Sattar, A. Xu, A visual language for robot control and
programming: A human-interface study. Robotics and Automation,
2007 IEEE International Conference on. IEEE, 2007: 2507-2513.
[10] A. Xu, G. Dudek, and J. Sattar, A Natural Gesture Interface for
Operating Robotic Systems. Robotics and Automation, 2008. ICRA
2008. IEEE International Conference on. IEEE, 2008: 3557-3563.
[11] D. Chiarella, M. Bibuli, G. Bruzzone, et al, Gesture-based language
for diver-robot underwater interaction. OCEANS 2015-Genova. IEEE,
2015: 1-9.
[12] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based
learning applied to document recognition. Proceedings of the IEEE,
1998, 86(11): 2278-2324.