Taking Advantage of Data Dimensionality
Reduction for Dynamic Gesture Recognition from
Incomplete Data
Miguel Simão, Pedro Neto, and Olivier Gibaru
Abstract—Continuous gesture spotting is a major topic in human-robot interaction research. Human gestures are captured by sensors that provide large amounts of data that can be redundant or incomplete, correlated or uncorrelated. Data dimensionality reduction (DDR) techniques allow such data to be represented in a low-dimensional space, making the classification process more efficient. This study demonstrates that DDR can improve classification accuracy and allows the classification of gesture patterns from incomplete data, i.e., from the initial 25%, 50% or 75% of the data representing a given dynamic gesture (DG) - a time series of positional and hand shape data. Re-sampling with bicubic interpolation and principal component analysis (PCA) were used as DDR methods. Experimental tests on a library of 10 hand/arm dynamic gestures indicate that after PCA-based DDR the classification accuracy is higher with 50% of the gesture data (100% accuracy) than with 100% of the gesture data (96% accuracy). Recognized gestures are used to control a robot in a collaborative process.
Index Terms—Gesture recognition, data dimensionality reduction, PCA, human-robot interaction
I. INTRODUCTION
THE ability to interact with a robot in a natural and
intuitive way, for example using speech and gestures,
has brought important advances to the way our societies look
to robots. The paradigm for robot usage has changed in the
last few years, from a concept in which robots work with
complete autonomy to a scenario in which robots cognitively
collaborate with human beings. This brings together the best of
each partner, robot and human, by combining the coordination
and cognitive capabilities of humans with the robots’ accuracy
and ability to perform monotonous tasks. To this end, robots and humans have to understand each other and interact in a natural way, creating a co-working partnership. This will allow a greater presence of robots in our companies, schools, hospitals, etc., with a consequent positive impact on society's living standards. The current problem is that the existing interaction
modalities are neither intuitive nor reliable. Instructing and
programming an industrial robot by the traditional teaching
Miguel Simão is with the Department of Mechanical Engineering at
University of Coimbra, Coimbra, Portugal, and Ecole Nationale Supérieure
d’Arts et Métiers, ParisTech, Lille, France e-mail: miguel.simao@uc.pt.
Pedro Neto is with the Department of Mechanical Engineering at University
of Coimbra, Coimbra, Portugal, e-mail: pedro.neto@dem.uc.pt.
Olivier Gibaru is with the Ecole Nationale Supérieure d’Arts et Métiers,
ParisTech, Lille, France, e-mail: Olivier.GIBARU@ENSAM.EU.
This work was supported in part by the European Commission under contract number HORIZON2020-688807-ColRobot and by the Foundation for Science and Technology (FCT), SFRH/BD/105252/2014.
method is a tedious and time-consuming task that requires
technical expertise.
The robot market is growing, and human-robot interaction (HRI) interfaces will play a major role in the acceptance of robots as co-workers. Gestures and other natural interaction modalities may decrease the need for technical expertise in robot programming, thereby decreasing the cost of owning a robot.
Multimodal HRI interfaces combining gestures, speech and tactile-based actions are expected to become, in the near future, the standard for a reliable and intuitive interaction process.
Nonverbal communication cues in the form of gestures are
considered to be an effective way to approach natural HRI.
For instance, a person can point to indicate a position to a
robot, use a dynamic gesture to instruct a robot to move and
a static gesture to stop the robot [1], [2]. In this scenario, the
user has little or nothing to learn about the interface, focusing
on the task and not on the interaction [3]. For all the reasons
mentioned above, continuous and real-time gesture spotting
(segmentation and recognition) are key factors to bridge the
gap between laboratory research and real world application of
novel HRI modalities.
Some gestures, although not all, can be defined by their
spatial trajectory. This is particularly true for pantomimic
gestures, which are often used to demonstrate a certain motion
to be done, e.g., a circle. Burke and Lasenby successfully used PCA and Bayesian filtering to classify these time series [4].
The importance of PCA-based DDR for reducing the dimensionality of a dataset representing human gestures for HRI has been studied in [5], [6]. In a different study, PCA is applied to a continuous stream of time-series data capturing body motions, with an accuracy of 91% [7]. In [8], a unified sparse learning framework is proposed by introducing sparsity or L1-norm learning, which further extends the locally linear embedding (LLE)-based methods to sparse cases. A DDR approach using the sparsified singular value decomposition (SSVD) technique to identify and remove trivial features before applying feature selection is proposed in [9]. A reference study presents a
sequence kernel dimension reduction approach (S-KDR) in
which spatial, temporal and periodic information is combined
in a principled manner and an optimal manifold is learned
[10].
Gesture classification has been studied over the years.
However, there remains the problem with reliability and intu-
itiveness, which are key factors for a system’s acceptance by
end-users. Other challenges are related to the ability to perform recognition in real time and continuously, to recognize gesture patterns from incomplete data, and to perform DDR while keeping or increasing the recognition accuracy. DDR reduces the amount of training data needed by supervised classification methods and, consequently, the training time.
To overcome the problem of classification of time se-
quences, i.e., DGs, we present two approaches to their feature
extraction and DDR: one based on up- or down-sampling
of gesture frames using interpolation, and another based on
PCA. The PCA approach has the advantage of being capable
of yielding a good classification even before the gesture is
finished – with incomplete/partial gesture data. Complex DGs
are defined by a large set of features, with a variable number
of frames, and with both trajectory and finger movement. The
classification was done with artificial neural networks (ANNs). Independently of the type of gesture, a feature vector for a sample is defined by the vector $z \in \mathbb{R}^d$:

$$z^{(i)} = [f_1 \; f_2 \; \ldots \; f_d] \qquad (1)$$

where $f_i$ is the $i$th element of $z$.
II. DATA DIMENSIONALITY REDUCTION
A. Overview
The dynamic blocks that compose DGs are segmented in
an unsupervised way. A segmentation function $s$, based on a motion-threshold algorithm [11], is applied to a stream or set of data $S$ of dimensionality $d$ and length $n$. This function is here generically defined by:

$$s : \mathbb{R}^{d \times n} \to \{0,1\}^n, \quad S \mapsto m \qquad (2)$$

where static frames are marked by $m = 0$ and dynamic frames by $m = 1$. The dynamic segments are then extracted by a search function that finds transitions in $m$ (from 0 to 1 and from 1 to 0). Given two consecutive transitions at frames $i$ and $i+k$ such that $m_{i-1} = 0$, $m_i = 1$, $m_{i+k-1} = 1$ and $m_{i+k} = 0$, the DG sample is defined by:

$$X_D = [S_i \; S_{i+1} \; \ldots \; S_{i+k-1}], \quad X_D \in \mathbb{R}^{d \times k} \qquad (3)$$

where the vector $S_i$ is the $i$th column (frame) of the data stream. In terms of notation, for a generic $A \in M_{m \times n}$, $A_{ij}$ represents the element of $A$ at row $i$ and column $j$, $A_i \equiv [A_{i1} \cdots A_{in}]$ and $A^j \equiv [A_{1j} \cdots A_{nj}]^T$.
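As an illustration of (2)-(3), the following Python sketch (our own; the paper lists no code, and the thresholding itself is described in [11]) extracts the dynamic segments $X_D$ from a stream $S$ given the binary mask $m$:

```python
import numpy as np

def extract_dynamic_segments(S, m):
    """Extract dynamic gesture samples X_D from a data stream.

    S : (d, n) array, one column per frame.
    m : (n,) binary array from the segmentation function s in (2)
        (1 = dynamic frame, 0 = static frame).
    Returns a list of (d, k) arrays, one per dynamic segment, as in (3).
    """
    m = np.asarray(m, dtype=int)
    # Pad with zeros so segments touching the stream ends are closed.
    edges = np.diff(np.concatenate(([0], m, [0])))
    starts = np.where(edges == 1)[0]   # transitions 0 -> 1
    ends = np.where(edges == -1)[0]    # transitions 1 -> 0
    return [S[:, i:j] for i, j in zip(starts, ends)]

# Example: a 2-DOF stream whose frames 2..4 are dynamic.
S = np.arange(14, dtype=float).reshape(2, 7)
m = np.array([0, 0, 1, 1, 1, 0, 0])
print(extract_dynamic_segments(S, m)[0].shape)  # (2, 3)
```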
A specific sample $i$ of a data set is represented by $X^{(i)}$. A function $f$ is then used to extract a feature vector $z''$ from each sample (4). This transformation can have multiple steps. In this work, the number of primes decreases with each transformation step, i.e., $z''' \to z'' \to z' \to z$; for example, the last step before classification is always scaling, represented by $z' \to z$.

$$f : \mathbb{R}^{d \times n} \to \mathbb{R}^b, \quad X \mapsto z'' \qquad (4)$$

in which $b$ is the dimensionality of the feature vector. The vector $z''$ is the input for the classifiers and $t \in \{0,1\}^{n_{classes}}$ is the target value of the classifier for that sample (supervised learning). If the target class has number $o$, the target vector $t^{(o)}$ is defined by:

$$t^{(o)}_j = \delta_{oj}, \quad j = 1, \ldots, n_{classes} \qquad (5)$$

where $\delta$ is the Kronecker delta and $t_j$ is the $j$th element of $t$.
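For instance, with $n_{classes} = 10$ and target class $o = 3$, (5) yields $t^{(3)} = [0 \; 0 \; 1 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0]$. A minimal sketch of this encoding (ours, for illustration):

```python
import numpy as np

def one_hot(o, n_classes):
    """Target vector t^(o) from (5): t_j = delta(o, j), classes 1-based."""
    t = np.zeros(n_classes)
    t[o - 1] = 1.0
    return t

print(one_hot(3, 10))  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```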
The transformation $f$ may still yield a vector of very high dimensionality $b$, which may be detrimental to the classification. Therefore, further processing techniques, such as PCA and interpolation, are proposed for DDR:

$$r : \mathbb{R}^{b''} \to \mathbb{R}^{b'}, \quad z'' \mapsto z', \quad b' < b'' \qquad (6)$$

Nevertheless, DDR can be done either before or after feature extraction, so in this work we consider the composition of these functions to be independent of their order, i.e., $z' = (f \circ r)(X) \equiv (r \circ f)(X)$.
B. Re-Sampling with Bicubic Interpolation
Re-sampling is done with bicubic interpolation to transform a DG sample $X^{(i)}, i \in i_D$, $X \in M_{d \times n}$, which has a variable number of frames $n$, into a fixed-dimension sample $X' \in M_{d \times k}$. Usually $k \geq n$, with $k$ arbitrarily defined as the maximum $n$ over all samples $i \in i_D$. So, although in almost every case the proposed transformation up-samples the data, it remains valid for new cases where $k < n$, effectively down-sampling them.

$$\mathrm{interp} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times k}, \quad X \mapsto X' \qquad (7)$$
The bicubic interpolation method [12] yields a surface $p$ described by 3rd-order polynomials in both dimensions of space. Given a patch of dimension 2×2, there are 4 data points at which we know the values $f$ and the derivatives $f_x$, $f_y$ and $f_{xy}$. The derivatives are not known at the boundaries, but they can be estimated using finite differences. The interpolated values inside the uniformized 2×2 sector are given by:

$$p(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij} \, x^i y^j \qquad (8)$$

A representation of the sector is presented in Figure 1.

The problem is determining the 16 coefficients $a_{ij}$. The function values and 3 derivatives at the 4 points provide $4 \times 4 = 16$ linear equations, which can be written as an equation system $A\alpha = x$ with:

$$\alpha = [a_{00} \; a_{10} \; a_{20} \; a_{30} \; a_{01} \; \ldots \; a_{33}]^T \qquad (9)$$

$$x = [f(0,0) \; f(1,0) \; \ldots \; f_x(0,0) \; \ldots \; f_y(0,0) \; \ldots \; f_{xy}(0,0) \; \ldots \; f_{xy}(1,1)]^T \qquad (10)$$
The matrix $A$ is nonsingular, so the equation system can be rewritten as $\alpha = A^{-1}x$. This process is used for all patches in the bi-dimensional grid. The derivatives at the boundaries of a patch are maintained across neighboring patches. In order to apply the method to the whole data grid efficiently, techniques such as Lagrange polynomials, cubic splines or cubic convolution algorithms are used. The resulting interpolated data points are smoother and have fewer artifacts than those produced by other interpolation methods, such as bilinear interpolation.
Fig. 1. Representation of the result of bicubic interpolation on a 2×2 grid of points f(0,0), f(1,0), f(0,1), f(1,1).
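To make the patch problem concrete, the sketch below (ours; the paper relies on Matlab built-ins rather than a hand-rolled solver) assembles the $16 \times 16$ system $A\alpha = x$ of (9)-(10) for a unit patch and solves it for the coefficients $a_{ij}$ of (8). The corner values and derivatives are assumed given, e.g., estimated by finite differences:

```python
import numpy as np

def _dpow(v, e, deriv):
    """v^e, or its derivative e*v^(e-1) when deriv is 1."""
    if deriv:
        return 0.0 if e == 0 else e * v ** (e - 1)
    return float(v ** e)

def bicubic_patch_coeffs(f, fx, fy, fxy):
    """Solve A @ alpha = x, as in (9)-(10), for the coefficients of (8).

    f, fx, fy, fxy : dicts mapping each corner (x, y) of the unit patch
    to the function value or derivative known there.
    Returns a (4, 4) array a with p(x, y) = sum_ij a[i, j] x^i y^j.
    """
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    A, rhs = [], []
    for values, dx, dy in ((f, 0, 0), (fx, 1, 0), (fy, 0, 1), (fxy, 1, 1)):
        for (x, y) in corners:
            # One linear equation: derivatives of each monomial x^i y^j,
            # ordered with the x exponent varying fastest, as in (9).
            A.append([_dpow(x, i, dx) * _dpow(y, j, dy)
                      for j in range(4) for i in range(4)])
            rhs.append(values[(x, y)])
    alpha = np.linalg.solve(np.array(A), np.array(rhs))
    return alpha.reshape(4, 4).T   # a[i, j] multiplies x^i y^j

# Sanity check with p(x, y) = x: only a[1, 0] should be 1.
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 0.0, (1, 1): 1.0}
zero, one = ({c: v for c in f} for v in (0.0, 1.0))
print(np.round(bicubic_patch_coeffs(f, one, zero, zero), 6))
```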
C. Principal Component Analysis

PCA is a mathematical tool that performs an orthogonal linear transformation of a set of $n$ $p$-dimensional observations, $X \in \mathbb{R}^{n \times p}$, into a space defined by the principal components (PCs). The number of PCs is necessarily less than or equal to the number of original dimensions, $p$. The first component has the largest possible variance observed in the observations. Each following PC is orthogonal to the preceding components and has the highest variance possible under this orthogonality constraint. The PCs are the eigenvectors of the covariance matrix, and its eigenvalues are a measure of the variance along each PC. Therefore, PCA can be used to reduce the dimensionality of data by projecting the data into the PC space and truncating the lowest-ranked dimensions. These dimensions have the lowest eigenvalues, so truncating them retains most of the variance present in the data.

The first step in PCA is centering the data, because PCA is sensitive to the scaling of the original dimensional space. This is done by subtracting from each dimension's values its overall average.
The PC transformation is very often determined by another matrix factorization method, the SVD of $X$:

$$X = U \Sigma V^T \qquad (11)$$

where $X \in M_{n \times p}$ is the original data matrix, $\Sigma \in M_{n \times p}$ is a diagonal matrix with the singular values of $X$, $U$ is an $n \times n$ matrix whose columns are orthogonal unit vectors, the left singular vectors of $X$, and $V \in M_{p \times p}$ is a matrix whose columns are unit vectors, the right singular vectors. Both $U$ and $V$ are orthogonal matrices, so that $U^T U = I_n$ and $V^T V = I_p$. The singular values $\sigma_1, \sigma_2, \ldots$ in the diagonal of $\Sigma$ are the positive square roots, $\sigma_i = \sqrt{\lambda_i} > 0$, of the nonzero eigenvalues of the Gram matrix $K = X^T X$, and are thus always positive.
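As a numeric illustration of (11) (our numpy sketch, standing in for Matlab's pca): the right singular vectors of the centered data are the PCs, and $\sigma_i^2/(n-1)$ equals the variance along each of them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # n = 200 observations, p = 5
Xc = X - X.mean(axis=0)                  # centering, the first PCA step

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Vt.T                               # columns = principal components
var_along_pcs = s**2 / (len(Xc) - 1)     # variance captured by each PC

# Cross-check against the eigenvalues of the covariance matrix W.
w_eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
assert np.allclose(var_along_pcs, w_eigvals)

# DDR: keep the 2 highest-variance directions, truncate the rest.
Z = Xc @ pcs[:, :2]
print(Z.shape)                           # (200, 2)
```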
The implementation used for this purpose was Matlab's pca function. Since the input data matrix $X$ is most often rectangular, the function uses the aforementioned SVD method (11) for the matrix decomposition. The squared singular values, scaled by $1/(n-1)$, are the eigenvalues of the covariance matrix of $X$, i.e., the variance along each PC. The covariance matrix of the $p$ variates $\{x_1\}, \ldots, \{x_p\}$, $x_i = X^i$, is defined by $W \in M_{p \times p}$:

$$W_{ij} = \mathrm{cov}(x_i, x_j) \equiv \langle (x_i - \mu_i)(x_j - \mu_j) \rangle, \quad i, j = 1, \ldots, p \qquad (12)$$

where $\mu$ and $\langle \cdot \rangle$ denote the mean value, with $\mu_i = \langle x_i \rangle$. For centered data, $W$ can also be written as $W \equiv \frac{1}{n-1} X^T X$. The product $X X^T$, in turn, has as eigenvectors the columns of $U$.
Although PCA is most often performed to reduce the dimensionality of the observations, in this work we preferred to use the PCs themselves as features. The first PC, or singular vector $U_1$, determines the direction in PC-space along which there is the most variance during a DG. That variance is measured by the respective singular value, $\Sigma_{11}$. Therefore, we expect these values to produce good features for DG classification, even if the gesture is incomplete. We also used PCA to represent gesture features in lower, two- and three-dimensional spaces for easier visualization.
III. EXPERIMENTAL RESULTS
A. Interaction technologies and data acquisition
The proposed system works with any source of positional
and hand shape data. For these tests, we used two sensors to
capture the hand shape, position and orientation: (1) a data
glove; (2) a magnetic tracker.
The data glove was a CyberGlove II, developed by CyberGlove Systems. The version used has 22 resistive bend sensors: three flexion sensors per finger, four abduction sensors, a palm-arch sensor, and two sensors measuring wrist flexion and abduction. The sensors output filtered joint-angle data in real time at an average rate of 100 Hz.
The magnetic tracker used was a Polhemus Liberty. It has very low latency and outputs at a rate of 120 Hz.
The glove transmits its data through a Bluetooth connection, while the tracker is connected by a physical serial connection. Serial objects read the available data in windows of 10 samples and store it with timestamps in buffers. A script then reads the $k$ newest samples from both buffers and processes them. A full frame of data with the 22 DOF of the glove and the 6 DOF of the tracker is represented by $f$ (13), where $g_k$ represents the $k$th DOF of the glove and $l_k$ the $k$th DOF of the tracker. Before further processing, the stream of data is segmented by a motion-threshold method [11]. The method appends to the frame a binary segmentation variable $m$ which indicates whether the frame belongs to a dynamic segment or not.

$$f = [g_1 \; g_2 \; \ldots \; g_{22} \; l_1 \; l_2 \; \ldots \; l_6 \; m] \qquad (13)$$
B. Dataset
A sample in the dataset $S_D$ is represented by:

$$S_D = \left\{ X^{(i)}, t^{(i)} \right\}, \quad X^{(i)} \in \mathbb{R}^{d \times n}, \ t^{(i)} \in \{1, \ldots, L_D\} \qquad (14)$$

where $d$ is the number of DOF of the system, $n$ is the number of data frames in the sample, $t$ is the target class and $L_D$ is the number of DG classes.
A total of 10 DGs combining hand/arm and finger motion were selected, Figure 2. For testing purposes, 20 samples of each gesture were acquired from a single subject, for a total of 200 gesture samples.
C. Features
To ensure reliable classification independently of the subject's position and orientation, every gesture sample has its feature data expressed in a local reference frame. A coordinate transformation $t$ is applied to the feature matrix:

$$X'^{(i)} = t\left(X^{(i)}\right), \quad i \in i_D, \ X^{(i)} \in M_{28 \times n}(\mathbb{R}) \qquad (15)$$

Following that, two distinct feature extraction methods from $X'$ are proposed:
1) Re-sampling the samples with bicubic interpolation (DG1);
2) Extracting principal vectors and values using PCA (DG2).
In the first case, DG1, given a sample $X^{(i)} : i \in i_D$ with $n$ frames ($X^{(i)} \in M_{28 \times n}$), the objective is to resample it to a fixed size $n'$. The number $n'$ can be chosen arbitrarily, but in order to retain as many of the gesture features as possible, $n'$ should be an upper bound such that $n' \geq n$ for every DG sample. Therefore, we opted for the highest $n$ among the DG samples, specifically $n' = 161$. Applying the bicubic interpolation algorithm, the result is a matrix $Z \in \mathbb{R}^{28 \times 161}$. The following step is to transform $Z$ into a vector $z \in \mathbb{R}^{4508}$, which is done by concatenating every frame vertically:

$$z^{(i)} = \mathrm{concat}(Z^{(i)}) = \begin{bmatrix} Z^{(i)}_1 \\ \vdots \\ Z^{(i)}_{161} \end{bmatrix} \qquad (16)$$
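A short sketch of the DG1 pipeline (ours; scipy's cubic spline resampling stands in for the cubic convolution method of [12], and $n' = 161$ is taken from the text):

```python
import numpy as np
from scipy.ndimage import zoom

def dg1_features(X, n_target=161):
    """DG1: resample a (28, n) gesture to (28, n_target) with cubic
    interpolation, then stack the frames into one vector, as in (16)."""
    d, n = X.shape
    # order=3 gives cubic spline interpolation along the time axis.
    Z = zoom(X, (1, n_target / n), order=3)[:, :n_target]
    return Z.T.reshape(-1)   # frame 1, frame 2, ..., frame n_target

X = np.random.default_rng(1).normal(size=(28, 97))   # a 97-frame sample
print(dg1_features(X).shape)                         # (4508,)
```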
In the second case, DG2, we use PCA to extract features. The advantage is that it yields coherent features even from incomplete gestures. From each sample $X^{(i)} : i \in i_D$, $X \in M_{28 \times n}$, 4 feature vectors $z^{(i)}_k : k \in \{0.25, 0.50, 0.75, 1\}$ are extracted, where $k$ defines the fraction of the number of frames that were used:

$$z^{(i)}_k = \left[ U_1 \; \Sigma_{11} \right] \qquad (17)$$

where $U_1$ and $\Sigma_{11}$ are the first singular vector and first singular value, respectively. The singular vector has 28 dimensions and the value just 1, for a total of 29. They are calculated using the partial sample $X^{(i)}_m$, so that:

$$\mathrm{pca}\left(X^{(i)}_m\right), \quad m = \{1, \ldots, \lceil nk \rceil\} \qquad (18)$$

where $\lceil \cdot \rceil$ represents the ceiling function, since $\lceil nk \rceil \in \mathbb{N}$.
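A sketch of the DG2 feature extraction (ours; the paper used Matlab's pca, and we follow the rows-as-observations convention, so the first right singular vector of the centered frame matrix plays the role of $U_1$; note that the sign of a singular vector is ambiguous and fixed only by convention):

```python
import numpy as np

def dg2_features(X, k):
    """29-dim feature vector (17) from the first ceil(n*k) frames of a
    (28, n) gesture sample: first principal direction and its
    singular value, computed as in (18)."""
    n = X.shape[1]
    Xm = X[:, : int(np.ceil(n * k))].T   # frames as rows (observations)
    Xm = Xm - Xm.mean(axis=0)            # center, as Matlab's pca does
    _, s, Vt = np.linalg.svd(Xm, full_matrices=False)
    return np.concatenate([Vt[0], [s[0]]])

X = np.random.default_rng(2).normal(size=(28, 120))
feats = {k: dg2_features(X, k) for k in (0.25, 0.50, 0.75, 1.0)}
print(feats[0.50].shape)                 # (29,)
```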
The last feature processing step is feature scaling. Feature scaling is essential for achieving shorter training times and better network performance with fewer samples. It harmonizes the values of different features so that all of them fall within the same range, which is especially important when features have distinct orders of magnitude. The scaling function chosen was a linear rescaling, $l$:

$$l(x) = \frac{2x - \hat{X}_T}{\hat{X}_T} \qquad (19)$$

where $\hat{\cdot}$ is the max+min operator defined in (20) and $X_T = \{z^{(i)} : i \in i_T\}$ is the set of unscaled features of the training set. This operator is valid both for static and dynamic gestures, but the sample subsets used should be exclusive.

$$\hat{X}_i = \max X_i + \min X_i, \quad i = 1, \ldots, d \qquad (20)$$
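A minimal sketch of (19)-(20), with $\hat{X}_T$ computed per feature over the training set only. The form of (19) follows our reconstruction of the garbled extraction; a conventional min-max variant would divide by $\max - \min$ instead:

```python
import numpy as np

def fit_max_plus_min(Z_train):
    """Per-feature max+min operator of (20), from training data only."""
    return Z_train.max(axis=0) + Z_train.min(axis=0)

def linear_rescale(Z, x_hat, eps=1e-12):
    """Linear rescaling l of (19), applied feature-wise."""
    return (2.0 * Z - x_hat) / (x_hat + eps)   # eps guards x_hat ~ 0

Z_train = np.random.default_rng(3).uniform(1.0, 3.0, size=(100, 29))
x_hat = fit_max_plus_min(Z_train)
Z_scaled = linear_rescale(Z_train, x_hat)
print(Z_scaled.min(), Z_scaled.max())      # roughly within [-0.5, 0.5]
```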
D. Results and discussion
The available samples $S^{(i)} : i \in i_D$ were randomly divided into two sets – a training set ($i \in i_{DT}$) and a validation set ($i \in i_{DV}$). In this case, we have 20 samples for each of the 10 classes, from a single subject. Therefore, each set has 10 samples per class.
The ANN architecture is composed of one hidden layer with 25 nodes and 10 output neurons (classes) in both approaches, as shown in Figure 3. The difference between DG1 and DG2 is the size of the input feature vector: 4508 for DG1 (16) and 29 for DG2 (17). In both cases the transfer function is the hyperbolic tangent in the first layer and the softmax function in the second layer.
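The stated topology maps directly onto a standard library classifier. A sketch with scikit-learn (our choice; the paper does not name its training toolbox), with one tanh hidden layer of 25 units and a softmax over the 10 classes:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 29))        # DG2-style features (29 inputs)
y = rng.integers(1, 11, size=200)     # placeholder labels, classes 1..10

# One hidden layer of 25 tanh units; for multiclass targets,
# MLPClassifier applies a softmax output layer, as in Figure 3.
net = MLPClassifier(hidden_layer_sizes=(25,), activation="tanh",
                    max_iter=2000, random_state=0)
net.fit(Z, y)
print(net.predict(Z[:5]))
```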
The classification results are displayed in Table I. For DG1, the accuracy on the validation data set was 99.0% (99/100); the only error was gesture 7 being classified as 8. This can be seen in the confusion table in Figure 5 (a). Figure 4 shows the distribution of the features in a reduced principal component space (first and third PCs) for DG1. In this $\mathbb{R}^2$ space, the classes show good separability, gesture 7 being the exception, with higher than normal dispersion. One of the samples representative of class 7 fell into the middle of the class 8 cluster, which led to it being misclassified. There are other very close clusters, such as gestures D1, D2, D4 and D5, but they have very low dispersion and show better separability in higher-dimensional spaces.
While in the DG1 case the gesture frames are interpolated after the gesture is finished, in DG2 the features can be extracted at any time during a gesture. This allows for recognition even before the gesture is finished. Given that, the network DG2 was trained and validated with four sets of features originating from $i_{DT}$, $z^{(i)}_k : k \in \{0.25, 0.50, 0.75, 1\}$,
Fig. 2. Representations of the 10 DGs.
Fig. 3. ANN architecture used for DG classification using interpolated frames
(DG1) and the first principal component (DG2) as features.
Fig. 4. Distribution of the validation samples $i_{DV}$ for the test case DG1 in the PC space. Each color denotes a class; the sample that originated the error is marked with an ×.
see (17). Therefore, the training and validation sets each had 40 feature vectors per gesture. The network has the same configuration as in DG1 except for the number of input nodes, which is 29, Figure 3. The accuracy results on the validation data set for this configuration are shown in Table I for each of the 4 sets of features.
Fig. 5. Confusion table for the classification of the DG samples from the validation set $i_{DV}$ using both approaches: (a) DG1, (b) DG2. Rows are target classes, columns are output classes.

(a) DG1:

          Output:  1   2   3   4   5   6   7   8   9  10
Target  1         10   0   0   0   0   0   0   0   0   0
        2          0  10   0   0   0   0   0   0   0   0
        3          0   0  10   0   0   0   0   0   0   0
        4          0   0   0  10   0   0   0   0   0   0
        5          0   0   0   0  10   0   0   0   0   0
        6          0   0   0   0   0  10   0   0   0   0
        7          0   0   0   0   0   0   9   1   0   0
        8          0   0   0   0   0   0   0  10   0   0
        9          0   0   0   0   0   0   0   0  10   0
       10          0   0   0   0   0   0   0   0   0  10

(b) DG2 (all four data-completeness levels pooled):

          Output:  1   2   3   4   5   6   7   8   9  10
Target  1         40   0   0   0   0   0   0   0   0   0
        2          0  39   0   0   0   0   0   0   1   0
        3          0   0  39   0   0   1   0   0   0   0
        4          0   0   0  40   0   0   0   0   0   0
        5          0   0   0   0  38   2   0   0   0   0
        6          0   0   0   0   1  39   0   0   0   0
        7          0   0   0   0   0   0  35   3   2   0
        8          0   0   0   0   0   0   0  39   1   0
        9          0   0   0   0   0   0   0   0  40   0
       10          0   0   0   0   0   0   0   0   0  40
TABLE I
FINAL RESULTS FOR THE CLASSIFICATION ACCURACY ON THE VALIDATION DATA SETS.

                 DG1     DG2 (0.25)   DG2 (0.5)   DG2 (0.75)   DG2 (1.0)
Accuracy (%)     99.00   95.00        100.00      98.00        96.00
With only the initial 25% of the gesture, the accuracy was 95%, and it increased to 100% when half of the gesture data was used. Nevertheless, when 75% and 100% of the available data were used, the accuracy decreased to 98% and 96%, respectively. The confusion table in Figure 5 shows that the gesture with the most errors was gesture 7, classified three times as 8 and twice as 9. All these gestures are short in duration (less data) and are similar; e.g., the starting and ending poses of 7 and 9 are the same (closed hand), the difference being the order in which the fingers are closed.
The evolution of the features obtained from 25% to 100% of the data can be seen in Figure 6. Even at 25% there is already good separation of the classes, although the dispersion is still high compared to the later stages. As more data becomes available, the dispersion decreases and the feature samples form well-defined intra-class clusters. In the case of gestures D6, D7 and D8 this also happens, but the clusters partially overlap, causing the decrease in the accuracy rate. This also means that the selected features are not ideal for the classification of these three gestures.
In both cases, DG1 and DG2, the features for classes D1 through D6 form well-defined albeit close clusters, sometimes generating errors. This happens because there is very little inter-gesture variation in most of the variables during these gestures. They are distinguished by path and hand orientation, which are defined by 6 of the 28 DOFs of the system. The classification accuracy remained high for these samples, but in the future these gestures should be represented by better features, e.g., trajectory direction, in order to improve the robustness of the classification.
E. Interacting with a real robot
We developed an HRI interface to an industrial robot using the classified gestures as input. The attempted task was preparing a breakfast meal: grabbing a cereal box, pouring its contents into a bowl, then grabbing a yogurt bottle and pouring its contents into the same bowl, Figure 7. The output gestures were mapped to robot actions for motion and other tasks, i.e., opening/closing the gripper.
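The gesture-to-action mapping can be as simple as a lookup table. The sketch below is purely illustrative: the actual assignment of gestures to commands, and the robot.execute interface, are not specified in the paper and are hypothetical here:

```python
# Hypothetical mapping from recognized gesture classes to robot actions;
# the paper does not document the actual assignment.
ROBOT_ACTIONS = {
    1: "move_forward", 2: "move_backward", 3: "move_left",
    4: "move_right",   5: "move_up",       6: "move_down",
    7: "open_gripper", 8: "close_gripper", 9: "rotate_end_effector",
    10: "stop",
}

def dispatch(gesture_class, robot):
    """Forward a classified gesture to the robot controller."""
    action = ROBOT_ACTIONS.get(gesture_class)
    if action is not None:
        robot.execute(action)   # robot.execute is an assumed interface
```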
IV. CONCLUSIONS
This paper demonstrated that dynamic gesture data can be subjected to DDR, making the classification process more efficient.
Fig. 6. Plots of the features obtained from the validation data set (including the sets with incomplete data – 25%, 50%, 75% and 100% of the data) in a
reduced principal component space. The features were centered and scaled. Each color represents a different class.
Fig. 7. Visualization of different stages of a teleoperation HRI process: (a) starting point, (b) virtual joystick guidance to a goal, (c) forceful stop command, (d) rotation of the end-effector, (e) gesture-command to open the gripper, (f) grabbing a bottle and pulling it up, (g) rotation of the end-effector, (h) safe collaboration with the robot. NOTE: The virtual joystick mode moves the end-effector in a direction defined by the vector that joins a center position, in which the hand is closed, and the position of the hand when it is moved.
PCA-based DDR can improve the classification accuracy and allows gesture patterns to be classified from incomplete data, i.e., from the initial 25%, 50% or 75% of the data representing a given dynamic gesture (DG) - time series data. Experimental tests indicate that after PCA-based DDR the classification accuracy is higher with 50% of the gesture data (100% accuracy) than with 100% of the gesture data (96% accuracy), tested on a library of 10 hand/arm dynamic gestures. Recognized hand/arm gestures proved to be a natural and intuitive HRI interface.
REFERENCES
[1] P. Neto, D. Pereira, J. N. Pires, and A. P. Moreira, “Real-time and continuous hand gesture spotting: An approach based on artificial neural networks,” in 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 178–183.
[2] M. T. Wolf, C. Assad, M. T. Vernacchia, J. Fromm, and H. L. Jethani,
“Gesture-based robot control with variable autonomy from the JPL
BioSleeve,” in 2013 IEEE International Conference on Robotics and
Automation. IEEE, May 2013, pp. 1160–1165.
[3] Y. Song, D. Demirdjian, and R. Davis, “Continuous body and hand
gesture recognition for natural human-computer interaction,” ACM
Transactions on Interactive Intelligent Systems, vol. 2, no. 1, pp. 1–28,
Mar. 2012.
[4] M. Burke and J. Lasenby, “Pantomimic gestures for human–robot inter-
action,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1225–1237,
Oct 2015.
[5] S. Calinon and A. Billard, “Incremental learning of gestures by imitation
in a humanoid robot,” in Human-Robot Interaction (HRI), 2007 2nd
ACM/IEEE International Conference on, March 2007, pp. 255–262.
[6] D. Kulic, W. Takano, and Y. Nakamura, “Online segmentation and
clustering from continuous observation of whole body motions,” IEEE
Transactions on Robotics, vol. 25, no. 5, pp. 1158–1166, Oct 2009.
[7] J. F. S. Lin, V. Joukov, and D. Kulic, “Full-body multi-primitive segmen-
tation using classifiers,” in 2014 IEEE-RAS International Conference on
Humanoid Robots, Nov 2014, pp. 874–880.
[8] Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate orthog-
onal sparse embedding for dimensionality reduction,” IEEE Transactions
on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 723–735,
April 2016.
[9] P. Lin, J. Zhang, and R. An, “Data dimensionality reduction approach
to improve feature selection performance using sparsified svd,” in 2014
International Joint Conference on Neural Networks (IJCNN), July 2014,
pp. 1393–1400.
[10] A. Shyr, R. Urtasun, and M. I. Jordan, “Sufficient dimension reduction
for visual sequence classification,” in The Twenty-Third IEEE Confer-
ence on Computer Vision and Pattern Recognition, CVPR 2010, San
Francisco, CA, USA, 13-18 June 2010, 2010, pp. 3610–3617.
[11] M. Simão, P. Neto, and O. Gibaru, “Unsupervised gesture segmentation
by motion detection on a real-time data stream,” Submitted to IEEE
Transactions on Industrial Informatics, 2016.
[12] R. Keys, “Cubic convolution interpolation for digital image processing,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29,
no. 6, pp. 1153–1160, Dec 1981.