Multimed Tools Appl
DOI 10.1007/s11042-015-2729-8
Fast-gesture recognition and classification using Kinect:
an application for a virtual reality drumkit
Alejandro Rosa-Pujazón¹ · Isabel Barbancho¹ · Lorenzo J. Tardón¹ · Ana M. Barbancho¹
Received: 30 October 2014 / Revised: 29 April 2015 / Accepted: 1 June 2015
© Springer Science+Business Media New York 2015
Abstract In this paper, we present a system for the detection of fast gestural motion by
using a linear predictor of hand movements. We also use the proposed detection scheme
for the implementation of a virtual drumkit simulator. A database of drum-hitting motions
is gathered and two different sets of features are proposed to discriminate different drum-
hitting gestures. The two feature sets are related to observations of different nature: the
trajectory of the hand and the pose of the arm. These two sets are used to train classifier mod-
els using a variety of machine learning techniques in order to analyse which features and
machine learning techniques are more suitable for our classification task. Finally, the system
has been validated by means of the Kinect application implemented and the participation
of 12 different subjects for the experimental performance evaluation. Results showed a suc-
cessful discrimination rate higher than 95 % for six different gestures per hand and good
user experience.
Keywords Drumkit simulator · Gesture recognition · Human-computer interaction ·
Machine learning · Classification techniques
Ana M. Barbancho
abp@ic.uma.es
Alejandro Rosa-Pujazón
alejandror@uma.es
Isabel Barbancho
ibp@ic.uma.es
Lorenzo J. Tardón
lorenzo@ic.uma.es
1 E.T.S.I. Telecomunicación, Dpto. Ingeniería de Comunicaciones, Universidad de Málaga, ATIC Research Group, Andalucía Tech, 29071, Málaga, Spain
1 Introduction
The evolution of sensing and motion-tracking technologies has allowed for the development
of new methods for human-computer interaction, offering a more ‘natural’ and immer-
sive experience than using conventional settings. In addition, the introduction of affordable,
off-the-shelf motion-tracking solutions such as Nintendo’s Wiimote or Microsoft’s Kinect
devices, has expanded the application contexts in which these interfaces can be used.
One such field is that of music interaction. Indeed, as attested by previous studies, the
use of advanced interfaces has proved to further enhance the users' experience and expres-
siveness in their performances. A typical case of use can be found in the application of
these technologies to provide new methods for musical expression, such as mapping body
motion to the control of certain acoustic or musical parameters in a given performance or
sound stream [4, 10, 45], using accelerometer data to dynamically combine a set of musical
excerpts [15] or translating musical scales into gestures [40], among others. The application
of advanced interfaces to musical interaction also lends itself naturally to the simulation of musi-
cal instruments [36, 42], the augmentation of the performance of existing instruments (the so-called
hyperinstruments [34]) or even the creation of new ones [23], as well as to the simulation
of other musical roles, e.g. the ensemble conductor [35, 41]. Aside from providing merely
expanded user experiences for entertainment, part of the research in this field has been devo-
ted to developing interfaces that provide a better experience for learning purposes [17, 24].
Many of the studies mentioned above present human-computer interfaces that involve
some link between human motion and music, yet there are some caveats in the use of
such interfaces. Sometimes these interfaces are developed as very specific solutions to very
concrete problems, making it difficult to generalize the solutions proposed. Other times,
the devices developed are commonly expensive, intrusive and/or bulky, partially hinder-
ing the user's movement and raising potential ergonomic issues. Furthermore, the use of more
generic, off-the-shelf solutions is preferred as it mitigates some of these issues, yet it brings
forth new problems in the form of less specialized interaction metaphors and a weaker link
between the motion and the intended musical meaning. When trying to simulate real musi-
cal experiences (such as musical instruments), these limitations can detract from the overall
experience of the user. In order to address these limitations, it is necessary to perform a
more profound analysis of user motion and gestures through the use of advanced motion
recognition techniques.
1.1 Human gesture classification
Research on human motion detection and gesture recognition has typically revolved around
the use of Hidden Markov Models (HMM). HMMs allow for the encoding of continuous
streams of data into perceived gestures and/or behaviours, by identifying the state sequence
associated to the motion represented in the observable events. Their main downside is that
they require large training databases, and are particularly sensitive to segmentation errors [3,
7]. HMMs have been applied to human motion recognition in a wide range of application
fields, from gait identification [12] to tennis stroke analysis [44].
In the particular field of human interaction, they have been used to build support learning
tools for conducting gestures [5].
Other approaches use Dynamic Programming techniques to match tracked trajectories
to the ones in the training database. The most commonly used technique [3] is Dynamic
Time Warping (DTW) [2, 11], which allows for the matching of sequences of different
lengths. Machine learning approaches have also been considered in previous works, such
as Support Vector Machines (SVM) [38] as in the particular case of gesture recognition
for timpani percussion [6], Neural Networks [14, 39], Logistic Regression [20], Principal
Components Analysis or Linear Discriminant Analysis [3]. An example of the use of
adaptive Bayesian models with particle filtering for motion recognition can be found in
the virtual violin described in [9]. Some studies have also shown that the use of 3D-based
features over more conventional 2D-features in recordings performed with a 3D sensor (e.g.
Kinect) provides a meaningful increase in the mean accuracy of human gesture classification
[48].
While previous studies have shown that it is indeed feasible to use these techniques to
recognize hand gestures and human motion, there are still some relevant issues to address.
More concretely, many of the available systems cannot offer a sufficiently
fast response, which limits their applicability; also, the great variability in the way dif-
ferent humans can perform the same type of gesture constrains the validity of the databases
or the models used in the implementation of these systems, especially in the case of HMM-
based classifiers [2]. Some researchers have found Dynamic Programming techniques to
allow for faster recognition speeds as well as a lower dependence on the size of the database
[2], yet the delay introduced is still too high for time-critical interaction applications, such
as the ones related to musical instrument simulation. Using previously proposed systems,
the simulation of, for example, a drumkit or a sword, is very unrealistic because the delay
between the gesture and the system’s response is too high. So, new methods are needed to
recognize and classify fast gestures.
1.2 System overview and manuscript structure
In this paper, a simplified model that allows for the detection of fast hitting moves is con-
sidered. Following the works depicted in [36], the system implemented uses a 3D infrared
camera to track user’s hand/arm motion, a movement linear predictor to compensate for the
lag introduced by the tracking system and algorithms, and a machine learning approach to
distinguish between different gestures according to a previously recorded database.
This work aims to present a system capable of detecting fast hand motion
in real time for musical expression using a drum-hitting interaction metaphor. This paper
depicts a simplified model used to extract motion descriptors to form a feature set that can
be analysed with a minimum delay in the human-computer interface. These features are then
used to classify a given gesture according to the intended drum-hitting motion. An analysis
of different supervised learning classifiers is performed to identify the optimal machine
learning technique in our context and the overall system is further validated through an
experimental usability test in a real scenario with a group of participants.
In order to adequately detect and classify the gestures performed by the users, the
system follows a set of steps to extract features, from the data provided by the sensors,
and assess whether a given gesture is being performed or not, according to a previously
trained machine-learning model. The scheme of the proposed system is presented in Fig. 1.
Each of the steps followed, and its relation to the paper structure, is briefly depicted
below:
– User motion modelling: in this step, the system analyzes the data provided by the
sensors to acquire relevant motion data. The system also needs to predict user motion in
order to compensate for the delay introduced by the sensing device. The outputs of this
step are the features for the detection of fast hitting motion (gesture detection) and the features
for gesture discrimination.
– Gesture detection: in this step, the system assesses whether the data read belongs to a
gesture of interest or not. In the first case, the data progresses to the next step; otherwise,
the data are discarded.
– Gesture classification: when a gesture of interest is found, the system identifies the type
of gesture performed by analyzing the other features considered.

Fig. 1 Proposed system
The manuscript is structured following the scheme in Fig. 1. Section 2 presents the user
motion modelling and Section 3 presents the gesture detection and classification schemes.
Then, Section 4 presents the experimental evaluation conducted along with the most relevant
results found. Finally, the last section draws the conclusions and outlines potential future
work.
2 User motion modelling
In this section the complete user motion model is presented. This model includes the user’s
skeleton data (Section 2.1), the model for the identification of the presence of fast hitting-
like gestures for real-time interaction (Section 2.2) and the definition of specific features for
gesture discrimination (Section 2.3).
2.1 User’s skeleton and data normalization
For the purpose of modelling user motion, we have considered that our system is capable
of tracking the position of up to 15 joints from the user's skeleton, as shown in Fig. 2. This
data, however, needs to be processed further, as it is strongly influenced by user position
and height.
In order to define a common reference point, the centre of mass $\vec{n}_0 = (c_x, c_y, c_z)$ of the
user silhouette is calculated as:

$$c_i = \frac{1}{N}\sum_{j=1}^{N} up_{ji} \qquad (1)$$

where $up_{ji}$ stands for the value of coordinate $i = (x, y, z)$ of the $j$-th user pixel. The
normalized vector for each node $\vec{nn}_i$ was calculated as:

$$\vec{nn}_i = \frac{1}{S}\,(\vec{n}_i - \vec{n}_0) \qquad (2)$$

where $S = n_{1y} - n_{3y}$ is a reference scale value defined upon the difference between the
height values of the head and the torso nodes. The purpose of this rescaling is to make
distance measures independent of users' size and their relative position with respect to the
camera.

Fig. 2 Nodes tracked by the system developed
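As an illustration of this normalization step, the following minimal Python sketch computes Eqs. (1) and (2); the joint ordering (head at index 1, torso at index 3) and the array shapes are assumptions made for the example, not part of the original implementation.

```python
import numpy as np

HEAD, TORSO = 1, 3   # assumed indices of the head and torso nodes

def normalize_skeleton(nodes, user_pixels):
    """Normalize tracked joints as in Eqs. (1)-(2) (illustrative sketch).

    nodes       : (15, 3) array with the (x, y, z) position of each joint n_i.
    user_pixels : (N, 3) array with the coordinates of the N user-silhouette pixels.
    """
    n0 = user_pixels.mean(axis=0)          # centre of mass of the silhouette, Eq. (1)
    S = nodes[HEAD, 1] - nodes[TORSO, 1]   # head-to-torso height difference (scale S)
    return (nodes - n0) / S                # normalized node vectors nn_i, Eq. (2)
```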
2.2 Features to detect fast hitting-like gestures for real-time interaction
The naïve approach to detect a simple hitting motion would be to set a given veloc-
ity/acceleration threshold along a given axis, assuming that the user is performing a
hitting-like motion whenever that threshold is exceeded, and triggering the corresponding
gesture detection algorithm when the motion ends. However, this kind of motion is usually
quite fast; this fact, together with the inherent latency of the tracking system, can give rise
to a meaningful delay in the system's response. Prior works [4, 29, 32] have shown that
full-body sensors can introduce a noticeable lag; it has been found that, depending on the
application workload, the latency introduced can reach values of almost 0.3 seconds [29].
According to previous studies on audio perception [26], the maximum delay should be
nearly 10 times lower to avoid being noticeable. Therefore, it is necessary to predict the
user’s motion to cope with the latency introduced.
Previous research on this topic has focused on the calculation of the autocorrelation of
the motion stream in order to be able to predict future motion gestures according to its
periodicity [42], as well as using velocity and acceleration descriptors or linear predictors
to forecast user’s movements [36]. In this paper, a linear predictor approach is followed,
treating the stream of tracked hand motion as an input signal to a Wiener filtering scheme
[43]. The Wiener l-samples predictor filter h[n] of order N solves equation (3) for an
input signal x[n] with autocorrelation R_xx[n], implementing a least minimum mean square
error estimator (LMMSE).
$$
\begin{bmatrix}
R_{xx}[0] & R_{xx}[1] & \cdots & R_{xx}[N-1] \\
R_{xx}[1] & R_{xx}[0] & \cdots & R_{xx}[N-2] \\
\vdots & \vdots & \ddots & \vdots \\
R_{xx}[N-1] & R_{xx}[N-2] & \cdots & R_{xx}[0]
\end{bmatrix}
\begin{bmatrix}
h[1] \\ h[2] \\ \vdots \\ h[N]
\end{bmatrix}
=
\begin{bmatrix}
R_{xx}[l] \\ R_{xx}[l+1] \\ \vdots \\ R_{xx}[l+N-1]
\end{bmatrix}
\qquad (3)
$$
In our context, we are not interested in the prediction of the exact position of the
hand at every instant. Instead, the main purpose of this filter is to compensate for the
latency introduced by the sensing device when performing a fast gesture along a given axis
and in one concrete direction. Thus, the signal x[n] does not actually correspond to the
exact stream of tracked hand positions, but only to the motion performed in the axis of
interest.
Let $\vec{d}$ represent the unit vector in the axis of interest, $\vec{nn}_i[n]$ the streamed position data
for the node of interest i, and $\vec{nv}_i[n]$ its corresponding normalized velocity signal. The
motion over the axis of interest, $\vec{d} \cdot \vec{nn}_i[n]$, can be decomposed as:

$$\vec{d} \cdot \vec{nn}_i[n] = \sum_j D_{j,T_j,s_j}[n - s_j] + \sum_k R_{k,M_k,s_k}[n - s_k] \qquad (4)$$
where each $D_{j,T_j,s_j}$ is a chunk of size $T_j$ of the original $\vec{d} \cdot \vec{nn}_i[n]$ signal that fulfils:

$$D_{j,T_j,s_j}[n] =
\begin{cases}
\vec{d} \cdot \vec{nn}_i[n+s_j], & \text{if } \vec{d} \cdot \vec{nv}_i[n+s_j] > \sigma,\; n \in [0, T_j - 1] \\
0, & \text{otherwise}
\end{cases}
\qquad (5)$$
Similarly, $R_{k,M_k,s_k}$ is defined as:

$$R_{k,M_k,s_k}[n] =
\begin{cases}
\vec{d} \cdot \vec{nn}_i[n+s_k], & \text{if } \vec{d} \cdot \vec{nv}_i[n+s_k] \le \sigma,\; n \in [0, M_k - 1] \\
0, & \text{otherwise}
\end{cases}
\qquad (6)$$
Then, the input signal x[n] can be defined as:

$$x[n] = \sum_j D_{j,T_j,s_j}[n - jT_j] \qquad (7)$$
Essentially, the signal x[n] is a windowed version of $\vec{d} \cdot \vec{nn}_i[n]$. The windows have
variable length and they are used to select the values of $\vec{d} \cdot \vec{nn}_i[n]$ such that the velocity
along the direction of interest is larger than a threshold σ.
The value of σ was empirically adjusted to minimize the influence of noise in the mea-
sures while preserving the highest possible sensitivity. The value of the threshold selected
was σ = 0.15 normalized units. Assuming that the sensing system introduces a delay of l
samples, the forecasting stage takes this fact into account, defining a bank of 5 Wiener fil-
ters, each of them predicting the future l-th sample. These predictions are used as features
to determine whether the user is performing a fast gesture or not.
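A minimal sketch of such an l-step LMMSE predictor is shown below; it estimates the autocorrelation of the thresholded motion signal and solves the Toeplitz system of Eq. (3) with SciPy. The default order N = 40 and the horizons l = 1..5 follow the configuration reported later in Section 3.2; the function names are illustrative and not taken from the original C/C++ implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def wiener_predictor(x, N=40, l=1):
    """Solve the Wiener-Hopf system of Eq. (3) for the l-step predictor h[1..N]."""
    x = np.asarray(x, dtype=float)
    # Biased autocorrelation estimate R_xx[0 .. N+l-1] of the input signal x[n]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(N + l)]) / len(x)
    # Toeplitz system: first column R_xx[0..N-1], right-hand side R_xx[l..l+N-1]
    return solve_toeplitz(r[:N], r[l:l + N])

def forecast(h, recent):
    """Predict x[n + l] from the N most recent samples (recent[-1] is the newest)."""
    N = len(h)
    return float(np.dot(h, np.asarray(recent, dtype=float)[-1:-N - 1:-1]))
```

A bank of five such filters (one per look-ahead l = 1, ..., 5) then yields the five forecast values used as input features for the fast-gesture detector.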
2.3 Features for gesture discrimination
In addition to the forecasting stage, a set of features must be defined to discriminate among
different hitting-gestures in order to, for example, differentiate the different drums the
user intends to play. We defined two feature sets to address this issue, one set is based
the information about the trajectory of the hand extracted and another one makes use of
the information about the current pose of the user’s arm whenever the system detects a
fast hitting gesture. We discarded relying on angular data alone as previous research had
proved this method to be unreliable when the number of gestures considered was high
[37].
Previous works have shown that, by analysing trajectory-based features, it is possible
to identify specific behaviours in camera recordings (e.g. identifying events of motion in
crowded areas [47]), and that such features also provide better computing times [48]. The trajectory of the
hand was described with a set of 25 features, corresponding to the normalized Cartesian
coordinates of the hand position when a fast gesture is detected and four previous position
values, as shown in Fig. 3. These 4 previous coordinates were calculated by recording the
samples and searching in this data for the sample when the motion in the direction of interest
started. After the starting point is found, the three remaining points are determined through
linear interpolation, so that the 5 final position values are pairwise equidistant in time. This
feature set has been labelled T_ix, T_iy, T_iz, with i ranging from 1 to 5, the former representing
the coordinates at the time the gesture is detected, and the latter corresponding to the start
of the motion.

Fig. 3 Features defining the trajectory of the gesture
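The resampling of the hand positions into five pairwise equidistant instants can be sketched as follows; the buffer layout and function name are assumptions made for the example.

```python
import numpy as np

def trajectory_features(positions, start_idx):
    """Resample the hand trajectory into 5 equidistant-in-time positions (sketch).

    positions : (M, 3) buffer of normalized hand positions; the last row is the
                sample at which the fast gesture was detected.
    start_idx : index of the sample where motion along the axis of interest began.
    Returns the 5 positions ordered from detection (T_1*) back to the start (T_5*).
    """
    seg = np.asarray(positions[start_idx:], dtype=float)
    t = np.linspace(0, len(seg) - 1, 5)        # 5 pairwise equidistant instants
    pts = np.array([[np.interp(ti, np.arange(len(seg)), seg[:, c]) for c in range(3)]
                    for ti in t])
    return pts[::-1]                           # detection first, motion start last
```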
Arm pose was measured using 8 features: the 3 coordinates of the unit vector for the
elbow-to-hand limb (EH_x, EH_y, EH_z), the 3 coordinates of the unit vector for the shoulder-
to-elbow limb (SE_x, SE_y, SE_z), and the 2 coordinates of the unit vector of the projection
of the normalized hand data onto the XZ plane (H_x, H_z).
3 Gesture detection and classification
First, in this section, the gesture database used and the gesture detection scheme for drumkit
simulation are presented (Section 3.1 and Section 3.2, respectively). After the detection
of a gesture of interest, the gesture discrimination stage is executed (Section 3.3). In
order to determine the classification technique that performs best according to the sets of
features considered, in Section 3.3 a comparison between different classifiers and combi-
nations of the trajectory and arm pose feature sets is performed. Finally, in Section 3.4 the
results obtained and a discussion and comparison with other related works are presented. In
Section 3.5 a deeper analysis of the arm pose feature set is shown.
3.1 Gesture database
The system has been designed to be able to discriminate between 6 different hitting ges-
tures for each hand, according to the approximate positions where each piece of the drumkit
would be in a real scenario. Thus, the 6 gestures correspond to motions aimed at hit-
ting either the snare, the left and right frontal toms, left and right cymbals, or the lateral
drum (low tom or hi-hat, depending on which hand triggered the sound). An example
is provided in Fig. 4 to better illustrate some of the gestures considered. These gestures
were labelled Snare, T_FrLeft, T_FrRight, C_Left, C_Right and T_Lateral, respectively.
Examples of the distribution of the six classes for some of the features can be found in
Fig. 5.
The gestures were performed in a controlled environment, in the ATIC research lab of
the University of Malaga. As seen in Fig. 4, the different gestures are meant to mimic the
gestures performed by a drummer when playing the drumkit. Depending on which drum
the user intends to play, the gestures require the user to perform drum-hitting strikes in
different positions. Thus, in order to hit the snare, users would normally need to perform
their gestures so that the hands hit the non-existent drum in a position closer to their torsos.
The left and right frontal toms would involve gestures within the frontal arc and with the
arms slightly more extended, while the left and right cymbals gestures are also performed
in the front arc, but their hitting positions are slightly higher and the arms must be extended
further than for hitting the toms. On the other hand, hitting the low tom or the hi-hat requires the
user to perform their drum-hitting gestures more laterally.

Fig. 4 Illustration of some example gestures: snare (Snare), left frontal tom (T_FrLeft) and left cymbal (C_Left)
3.2 Gesture detection for drumkit simulation
We applied the previously presented model to the implementation of an augmented reality
drumkit simulator. In this application, the user hands were used as mallets or drumsticks,
and depending on the position of the hands and the pose of the arms, a different drum or
cymbal was played whenever the drummer performed a hitting-like motion. Therefore, we
defined our direction of interest $\vec{d}$ along the negative Y axis, (0, -1, 0).
First, the system detects whether the user is performing a fast gesture in the negative Y
direction, according to the results from the Wiener predictor filter bank. Next, if a gesture is
detected, the system decides which piece of the drumkit is played according to the trajectory
and arm pose gesture discrimination features previously discussed.
For the implementation of the actual system, we used a Microsoft Kinect device along-
side the OpenNI framework in order to track user motion. The OpenNI framework provides
a very powerful tool for human motion tracking, as it can track the state of up to 24
joints of user’s skeleton, offering both the Cartesian coordinates of each joint as well as
a 3D orientation value. Concretely, our framework uses a 15-node skeleton configuration,
which is part of this 24-joint model. Also, only the Cartesian X,Y,Z coordinates are consid-
ered, as the orientation values were found to be significantly noisier. Regarding the delay
introduced by the sensing device, it was found to be of roughly 160 milliseconds (no mean-
ingful oscillations or jitter in the delay were found), with the application running at an
average of 28 frames per second. The delay introduced oscillated between 4 and 5 samples.
Thus, the prediction stage is designed with a bank of 5 Wiener filters of order N = 40,
each of them predicting the future l-th sample, with l ranging from 1 to 5. These 5 pre-
dictions are used as features to determine whether the user is performing a fast gesture or
not.

Fig. 5 Distribution of the 6 types of gestures for some feature combinations: H_x vs H_z (a), EH_x vs SE_z (b), H_x vs EH_y (c) and SE_x vs SE_z (d)
A database with 1108 gestures performed by 3 different individuals was gathered (145,
206, 224, 151, 155 and 227 samples for the 6 types of gestures considered, respectively), as
well as a total of 977 segments of tracked motion that did not correspond to a fast hitting
gesture (for training the fast gesture detector based on the designed Wiener filter prediction
scheme).
Since the detection of fast gestures along the Y axis is a fairly simple problem of binary
classification, we deemed that a Logistic Regression model [18] would suffice for the clas-
sification task. After training the model with a tenfold cross-validation scheme along the
whole database, the highest success rate found was 99.3 % for a regularization ridge value
of 0.7.
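A comparable binary detector can be sketched with scikit-learn as below; mapping the reported ridge value of 0.7 to scikit-learn's inverse regularization strength C = 1/0.7 is an assumption, since the original work used a different toolchain.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_fast_gesture_detector(X, y):
    """X: the 5 Wiener forecasts per segment; y: 1 = fast hitting gesture, 0 = other motion."""
    clf = LogisticRegression(C=1.0 / 0.7, max_iter=1000)   # ridge 0.7 ~ C = 1/0.7 (assumed)
    score = cross_val_score(clf, X, y, cv=10).mean()        # ten-fold cross-validation
    return clf.fit(X, y), score
```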
3.3 Gesture classification for drumkit simulation
For the identification of the type of gesture performed, we conducted a more detailed
analysis using different classification methods, as well as an evaluation of the trajectory
and arm pose feature sets both separately and combined. The classifiers considered were
Naïve Bayes [22] (Section 3.3.1), Support Vector Machines [13] (Section 3.3.2) with two
different kernels (polynomial and Gaussian), K-Nearest Neighbours classifier (k-NN) [1]
(Section 3.3.3), decision tree classification based on the C4.5 algorithm [33] (Section 3.3.4),
Logistic Regression [18] (Section 3.3.5) and Multilayer Perceptron [16] (Section 3.3.6).
This analysis aims to find the best classification technique for our task with the sets of
features considered.
In order to prevent over-fitting, a ten-fold cross-validation process was followed to
properly adjust the parameters of each classifier, according to the success rate (correctly
classified instances vs. total number of instances). The results found for each classifier
are illustrated in the following subsections. The corresponding confusion matrices are also
presented, showing the number of instances for each class vs. the classification labelled
by the classifier; the confusion matrices are also colour-coded, showing a background
colour in each cell that gets darker as the number of instances classified in each cell
grows.
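The comparison itself can be reproduced in spirit with scikit-learn substitutes for the classifiers listed above (e.g. DecisionTreeClassifier standing in for C4.5), as in this sketch; the hyperparameter values shown are only examples and the exact figures may differ from the reported ones.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def compare_classifiers(X, y):
    """Ten-fold cross-validated success rate for each candidate classifier."""
    models = {
        "Naive Bayes": GaussianNB(),
        "SVM (polynomial)": SVC(kernel="poly", degree=3),
        "SVM (Gaussian)": SVC(kernel="rbf", gamma=5.0),
        "k-NN": KNeighborsClassifier(n_neighbors=3),
        "Decision tree": DecisionTreeClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "MLP": MLPClassifier(hidden_layer_sizes=(7,), max_iter=2000),
    }
    return {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
```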
3.3.1 Naïve Bayes
A Naïve Bayes classifier is a simple probabilistic model that assumes that the different
features are independent. The success rates found after training the model with the data
gathered are summarized in Table 1, using ten-fold cross-validation. The corresponding
confusion matrices can be found in Fig. 6.
3.3.2 Support vector machine
Support Vector Machine (SVM) classifiers define n-dimensional boundaries or hyper-
planes to classify the training samples. In order to ensure that a hyperplane can be
found that separates the classes in the training set, the source data is typically trans-
formed by means of a kernel function K(x_i, x_j), where x_i and x_j represent input feature
vectors.
Two different kernel transformations were considered: polynomial and Gaussian radial
basis. Equation 8 illustrates the e-degree (homogeneous) polynomial kernel. This kernel
expands the feature space by considering polynomial combinations of the original feature set:

$$K(x_i, x_j) = (x_i \cdot x_j)^e \qquad (8)$$
Table 1 Classification performance of the Naïve Bayes classifier

  Feature set     Success rate
  Trajectory      83.75 %
  Arm pose        98.65 %
  All features    99.37 %
Fig. 6 Confusion matrix for Naïve Bayes classifier using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
The value of e was chosen so that the overall success rate after ten-fold cross-validation
was optimal. The values tested ranged from e = 1 to 10. The best success rates found, along
with the corresponding e value, are summarized in Table 2. The corresponding confusion
matrices for these cases can be found in Fig. 7.
The Gaussian radial basis kernel follows:

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \qquad (9)$$
In a similar way to e for the polynomial kernel, the best possible classifier in
terms of the success rate was chosen according to the parameter γ. The values of γ ranged
Table 2 Classification success rate for SVM classifier with polynomial kernel

  Feature set     Success rate    Polynomial degree (e)
  Trajectory      92.33 %         3
  Arm pose        99.01 %         1
  All features    99.28 %         1
Fig. 7 Confusion matrix for SVM classifier with polynomial kernel using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
from 0.001 to 1000. Table 3 illustrates the best results found for each configuration set after
ten-fold cross-validation. Figure 8 shows the associated confusion matrices.
3.3.3 K-Nearest Neighbours
K-Nearest Neighbours (k-NN) classification uses a lazy algorithm based on the classifica-
tion of each sample according to the classes of the k closest training samples in the feature
space, using a voting criterion. The value of k was taken as a parameter in the training
Table 3 Classification success rate for SVM classifier with Gaussian kernel

  Feature set     Success rate    Parameter (γ)
  Trajectory      84.84 %         170
  Arm pose        99.10 %         5
  All features    99.37 %         5.5
Fig. 8 Confusion matrix for SVM classifier with Gaussian kernel using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
process; it was chosen in order to maximize the success rate after ten-fold cross-validation;
the values of k tested ranged from 1 to 15. The best classification results found are shown
in Table 4 and Fig. 9.
3.3.4 Decision tree classifier
Decision tree classification based on the C4.5 algorithm works by training a logical tree
in which the leaves represent classes and each node represents a set of rules to decide, by
using an input feature vector, which child node to follow until a leaf is reached. Results
Table 4 Classification success rate for k-NN classifier

  Feature set     Success rate    Parameter (k)
  Trajectory      90.43 %         8
  Arm pose        99.10 %         3
  All features    99.37 %         2
Fig. 9 Confusion matrix for k-NN classifier using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
after conducting ten-fold cross-validation using this scheme are summarized in Table 5 and
Fig. 10.
3.3.5 Logistic regression
Logistic regression classifiers estimate the relationship between a given feature vector
$x = \{x_i\}_{i=1}^{N}$ and the categorical class variable by calculating a probability score:

$$H_\beta = \frac{e^{\beta_0 + \sum_i \beta_i x_i}}{1 + e^{\beta_0 + \sum_i \beta_i x_i}} \qquad (10)$$
Table 5 Classification success rate for C4.5 classifier

  Feature set     Success rate
  Trajectory      86.91 %
  Arm pose        99.01 %
  All features    99.10 %
Fig. 10 Confusion matrix for C4.5 classifier using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
A given sample x is classified as belonging to class y according to the value of the
hypothesis probability H_β. The different β_i values are the parameters of the model that
minimize the cost function J(β), defined as:

$$J(\beta) = \sum \left(H_\beta - y\right)^2 + \lambda \sum_i \beta_i^2 \qquad (11)$$

where λ is a regularization or ridge parameter that is used to ensure that the model does
not overfit the training set. Again, the best success rates were calculated, with λ ranging
between 10⁻⁸ and 100. Table 6 presents the best classification rates and their corresponding
λ parameters. Figure 11 shows the confusion matrices.
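A direct NumPy transcription of Eqs. (10) and (11) as written (the squared-error form of the cost follows the paper's formulation) could look like the following sketch; the function names are illustrative.

```python
import numpy as np

def hypothesis(beta0, beta, x):
    """Probability score H_beta of Eq. (10) for a feature vector x."""
    z = beta0 + np.dot(beta, x)
    return np.exp(z) / (1.0 + np.exp(z))

def ridge_cost(beta0, beta, X, y, lam):
    """Regularized cost J(beta) of Eq. (11) over a training set (X, y)."""
    H = np.array([hypothesis(beta0, beta, x) for x in X])
    return np.sum((H - y) ** 2) + lam * np.sum(beta ** 2)
```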
Table 6 Classification success rate for logistic regression classifier

  Feature set     Success rate    Parameter (λ)
  Trajectory      83.66 %         3·10⁻⁴
  Arm pose        99.01 %         0.01
  All features    99.37 %         10⁻⁴
Fig. 11 Confusion matrix for logistic regression classifier using the Trajectory features (a), Arm Pose features (b), and all features combined (c)
3.3.6 Multilayer Perceptron
The Multilayer Perceptron classifier is a feed-forward neural network of successive, fully
interconnected layers, in which each node or neuron implements a sigmoid function and the
weights of the edges are adjusted in the training stage to maximize the success rate. The
activation of each neuron follows the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (12)$$
with z the weighted sum of the inputs to that neuron.
The number of neurons per layer and the number of hidden layers in the neural net-
work were adjusted to define a topology that attains the optimal success rate for each
configuration set. The number of hidden layers ranged from 1 to 3. However, it was found
that the utilization of more than one layer did not yield better results in any of the cases
considered. The number of neurons per hidden layer oscillated between 1 and 25. The opti-
mal results found are presented in Table 7 for a single-layer Perceptron with n neurons.
Figure 12 shows the confusion matrices.
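The topology search described above can be sketched with a grid search over single-hidden-layer sizes; scikit-learn's MLPClassifier stands in here for the original implementation, so the results are only indicative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def tune_mlp(X, y):
    """Ten-fold cross-validated search over 1 to 25 neurons in a single hidden layer."""
    grid = {"hidden_layer_sizes": [(n,) for n in range(1, 26)]}
    search = GridSearchCV(MLPClassifier(max_iter=2000), grid, cv=10)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```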
Table 7 Classification success rate for Multilayer Perceptron classifier

  Feature set     Success rate    Number of neurons (n)
  Trajectory      90.88 %         14
  Arm pose        99.10 %         7
  All features    99.55 %         7
3.4 Features and classifiers. Discussion
In general terms, the three feature sets considered show high success rate in the
discrimination of gestures, with an average success rate across all the classifiers of 87.54 %
for Trajectory features, 99.00 % for Arm Pose features and 99.36 % for all the features com-
bined. The best results are found for the Multilayer Perceptron classifier, with success rates
of 99.55 % when all the features are used, and 99.10 % when only Arm Pose features are
considered. Regarding the Trajectory feature set, the best results are attained by the SVM
classifier with a polynomial kernel (92.33 %).
In the case of Arm Pose features and all the features combined, the type of classifier
used had no significant impact on the success rates yielded.

Fig. 12 Confusion matrix for Multilayer Perceptron classifier using the Trajectory features (a), Arm Pose features (b), and all features combined (c)

For the Trajectory feature
set, though, there are meaningful fluctuations in the classification rate, ranging from 83 %
to 92 %. These results confirm that the feature set used is representative of the motion
modelled, especially in the case of Arm Pose features. Also, although it was not illustrated
in the previous subsections, the selection of the different parameters considered for each
classifier is critical to avoid overfitting and attain good detection rates in the cross-validation
stage.
Previous studies have shown trajectory features to be a reliable type of features to identify
human motion in recorded sequences [47], capable of reaching accuracy levels of around
94 %. This has also been corroborated in this study, yet, according to the detection rates
found, it seems clear that the Arm Pose features are much more informative for fast drum-
hitting gesture discrimination than the Trajectory features used. Furthermore, the success
rates for Arm Pose features alone are nearly as high as the ones attained using all the
features combined, yet the number of features used in these two cases is quite different
(8 vs. 33).
The overall successful classification rate, combining both the gesture detection and ges-
ture discrimination stages, was 98.85 % when using Multilayer Perceptron classification and
all the features, and 98.41 % when using the Arm Pose features alone. It is worth noting that
the set of gestures detected is not as varied as in other systems (e.g. [2]), however, our system
discriminates these similar gestures with high success rate and, more importantly, it com-
pensates for the tracking system lag in real time. This characteristic allows the utilization of
the scheme in real-time gesture classification systems for interactive applications.
It should be highlighted that the application, in its current state, provides an immediate
musical experience: users can enjoy the application regardless of their musical knowledge.
Additionally, the proposed scheme is valid not only for the simulation of a drumkit but also
for all sorts of similar percussive movements and instruments, such as a xylophone. Also,
it is possible to create new interaction metaphors that provide unique musical experiences
by modifying the system response to user’s gestures, for example, by allowing the user
to trigger different types of sounds or dynamically combining different musical excerpts
associated to distinct gestures.
3.5 Further analysis of the Arm Pose feature set
The presented results seem to indicate that the Arm Pose feature set provides more infor-
mation for accurate classification of a given gesture than the Trajectory feature set. This
subsection aims to perform a further study of the Arm Pose feature set in order to find a
minimum feature set that provides optimal and accurate gesture discrimination rates. To this
end, a similar analysis to the one presented in the previous sections was performed, training
each classifier for each of the Arm Pose features separately. The results obtained from this
analysis are summarized in Table 8.
It can be observed that the performance of the classification schemes with the features
EH_y and H_z is rather low, with average discrimination rates of 35.82 % and 44.51 % respec-
tively. EH_z provides an average discrimination rate of 50.88 % and the remaining features
lead to better performance of the classification task.
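An analysis of this kind can be sketched by re-running the ten-fold cross-validation on one Arm Pose column at a time; the column order and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

ARM_POSE = ["EH_x", "EH_y", "EH_z", "SE_x", "SE_y", "SE_z", "H_x", "H_z"]

def per_feature_rates(model, X, y):
    """Ten-fold success rate of `model` using each Arm Pose feature on its own.

    X is an (n_samples, 8) array whose columns follow the ARM_POSE order above.
    """
    X = np.asarray(X)
    return {name: cross_val_score(model, X[:, [i]], y, cv=10).mean()
            for i, name in enumerate(ARM_POSE)}
```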
A similar analysis was repeated considering different combinations of the Arm Pose
features, in order to find the combinations leading to the best classification results. A sample
of some instances of this analysis can be found in Table 9 for the Multilayer Perceptron
classifier. Overall, it was found that the information contained in the features EH_y, EH_z and
H_z did not provide any improvement of the discrimination capability of the classifiers with
respect to the performance attained with the rest of the features in the Arm Pose feature set.
Table 8 Classification rate (%) for each Arm Pose feature individually

  Feature   Naïve Bayes   SVM polynomial   SVM Gaussian   k-NN    Decision Tree   Logistic Regression   MLP
  EH_x      71.21         68.23            75.63          72.29   77.53           72.83                 76.71
  EH_y      38.90         32.94            37.46          29.15   36.64           38.10                 37.54
  EH_z      52.07         46.12            54.24          45.49   53.97           51.08                 53.16
  SE_x      66.61         62.00            67.28          60.38   68.32           66.99                 66.79
  SE_y      64.35         50.27            65.52          59.30   63.10           63.18                 62.46
  SE_z      75.90         70.39            75.45          66.34   73.37           75.81                 73.47
  H_x       70.85         70.93            71.12          66.97   74.73           70.85                 70.58
  H_z       48.28         40.13            44.76          41.33   47.74           42.69                 46.66

Table 9 Classification rate (%) for some combinations of Arm Pose features

  Feature combination                   Success rate   Number of incorrectly classified instances
  H_x, H_z                              86.91          145
  EH_x, EH_y, EH_z                      87.00          144
  SE_x, SE_y, SE_z                      98.20          20
  EH_x, SE_z, H_x                       98.20          20
  EH_x, EH_y, EH_z, H_x, H_z            96.93          34
  SE_x, SE_y, SE_z, H_x, H_z            99.01          11
  EH_x, EH_y, EH_z, SE_x, SE_y, SE_z    99.01          11
  EH_x, SE_x, SE_y, SE_z, H_x           99.10          10
4 Experiments and evaluation
In order to corroborate the results found in the classification study described above, we
conducted an experiment in which the previously discussed feature sets and classification
techniques were used to implement an augmented reality application for drumkit simulation.
The details of the experiment conducted as well as the methods followed and the results
found are presented below.
4.1 Participants
A total of 12 participants took part in the experiment conducted: 1 female and 11 male sub-
jects, with ages ranging from 26 to 34 years (average 30.67 years, variance 10.55). There
were 6 graduate and 6 postgraduate participants. Among the 12 participants, 2 of them had
a strong musical background, one of these two and another participant were professional
musicians. Regarding the remaining 9 participants, 3 had played previously a musical instru-
ment regularly. The rest of the participants (6) were na
¨
ıve musical users, with no previous
formation or experience in music.
4.2 Experimental set-up
The proposed system is implemented using a Microsoft Kinect device alongside the OpenNI
framework in order to track the user motion. All the algorithms were implemented in a
C/C++ application. Regarding the experimental layout, the experiments were conducted
in the ATIC research lab at the E.T.S.I. de Telecomunicación of the University of Málaga.
The application showed a virtual representation of the user’s tracked skeletal nodes, and a
percussion sound was played whenever the user performed the corresponding drum-hitting-
like gesture.
Note that according to the results presented in Section 3, the Arm Pose feature set yielded
better recognition rates than the Trajectory set. Furthermore, in general terms, the addition
of the 25 features associated with the Trajectory feature set did not lead to a meaningful
increase of the detection capabilities. Thus, only the Arm Pose feature set was considered
in the experiment. This set leads to success rates in gesture discrimination of 99.01 % or
99.10 %, depending on the classifier used. The features that were found to be irrelevant as
per Section 3.5 were removed.
4.3 Procedure
Two independent experimental sessions were conducted for each participant. The purpose
of the first experimental session was to assess the success rate of the detection system, both
in terms of fast gesture recognition and gesture discrimination. The second experimental
session focused on the user’s evaluation of the perceived lag, if any, in the sound played.
Each participant performed the trials assisted by a researcher, who explained the details of
the tests and observed participants’ behaviour during the experiments. At the end of the
first experimental session, the researcher asked the participants to complete a questionnaire
concerning their opinion on the experience. Additionally, the researcher also had a casual
interview with the participants regarding their overall experience and their perception of the
strengths and weaknesses of the system.
4.3.1 Detection performance assessment
In the first experimental session, each user was asked to perform six successive trials. In
order to measure the detection rate, we defined an experimental factor gesture of six levels,
corresponding to each of the six types of gestures to detect. Each trial consisted in perform-
ing one of the six gestures considered 8 times in a row. For each trial, the gesture detection
error rate (undetected or miss-classified) was measured. Additionally, we kept track of the
classification results for each gesture (correct classification, non-detected, or incorrectly
classified) for comparison against the success rates and confusion matrices depicted in
Section 3. At the end of the first experimental session, the participants were asked to fill in
a questionnaire. As part of this questionnaire, participants indicated their perceived preci-
sion, delay in the response and realism for each of the six gestures considered, scoring them
between 0 (least satisfactory) and 10 (most satisfactory).
4.3.2 User perceived lag
In the second experimental session, each user was asked to perform 10 successive trials.
Each trial consisted in performing a drum-hitting gesture, a drum-like sound was played
subsequently. The users were then asked to indicate whether they had perceived lag in the
system’s response to their gesture, or not. More concretely, they were asked to assess the
latency with a score between 0 (no lag) and 10 (high lag). In 5 of the trials, the system played
the sounds without the forecasting capabilities of the predictor described in Section 2, and
thus, the corresponding sound was played with a delay of roughly 160 milliseconds; in the
other 5 trials, the predictor designed was used to compensate for this delay. An experimental
two-level delay factor registered whether there was delay or not in the system response. The
average score given for each of the two cases was used as indicator. In this case, no gesture
discrimination was performed, and the sound played was always the same one in order to
ensure minimal interference in the perception of the delay.
4.3.3 Data analysis and additional data extraction
A repeated measures approach was followed [19], so every participant was subject to all the
experimental conditions considered. In both experimental sessions, the order in which the
participants performed their trials was partially counter-balanced. In addition, prior to the
execution of the first session, participants were given up to 5 minutes to become familiar
with the application. In this preliminary session, the users were allowed to use both hands
to play the drums. However, in each of the trials in the two experimental sessions, the users
were asked to perform their gestures using only their dominant hand.
4.4 Results
A repeated measures one-factor ANOVA was performed on the factor gesture to analyse
its effect on the rate of badly classified gestures. The principal effects analysis showed a
significant effect on the error rate (F_{5,7} = 10.97, p < 0.003). This result can be explained
by the fact that the error rate for the sixth gesture (T_Lateral) was 0. This case was excluded
in a second repeated measures ANOVA, yielding no significant effects whatsoever (F_{4,8} =
0.764, p < 0.554). In order to address the analysis of the users' perception of the precision,
delay in the response and realism of each of the gestures implemented, another repeated
measures one-factor ANOVA was performed on the factor gesture. Again, no significant
effects were found for any of the variables considered (F_{5,7} = 0.795, p < 0.558; F_{5,7} =
1.017, p < 0.416; F_{5,7} = 0.429, p < 0.826).
Users' estimations of the delay perceived during the second session were analysed with
a paired-samples T-test on the factor delay. From the results yielded, it was found that the
prediction of the system did have a significant impact on the perception of lag (T_{1,11} =
14.53, p < 0.000). Overall, when the sound was played with a certain delay, the users
reported that the delay was noticeable (reporting a value of 6 or higher) in 85 % of the cases;
otherwise, delay was reported with a rate of only 5 %.
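Equivalent tests can be run with standard Python statistics tooling, as in the sketch below; the data layout and column names are illustrative assumptions, and statsmodels' AnovaRM together with SciPy's ttest_rel stand in for whatever analysis software was originally used.

```python
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

def analyse_results(error_df, lag_without_pred, lag_with_pred):
    """error_df: long-format pandas DataFrame with (assumed) columns
    'subject', 'gesture', 'error_rate'; lag_*: per-participant mean lag
    scores for the two levels of the delay factor."""
    # Repeated-measures one-factor ANOVA on the classification error rate
    anova = AnovaRM(error_df, depvar="error_rate",
                    subject="subject", within=["gesture"]).fit()
    # Paired-samples t-test on the perceived-lag scores (delay factor)
    t_stat, p_value = ttest_rel(lag_without_pred, lag_with_pred)
    return anova, (t_stat, p_value)
```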
Attending to the total of 576 gestures (12 participants, with 48 gestures per partici-
pant) performed during the first session, the rate of successfully detected fast gestures
was 98.09 %, and the rate of successfully discriminated gestures was 97.74 % (95.87 %
of overall successful detection rate). The corresponding confusion matrix can be found in
Fig. 13.
4.5 Subjective results
After the experimental sessions, each participant was asked to complete a questionnaire
scoring different aspects of their experience. The questionnaire is shown in Table 10.
Each item in the questionnaire received a score ranging from 0 (unsatisfactory) to 10
(most satisfactory). The average scores of the items in the questionnaire on general expe-
rience (Table 10) can be found in Table 11. The overall response of the participants was
quite positive and most of them reported having found the application interesting and
enjoyable.

Table 10 General experience questionnaire

  Opinion about                                  Score
  Overall satisfaction with the application      0 1 2 3 4 5 6 7 8 9 10
  How intuitive was the interaction              0 1 2 3 4 5 6 7 8 9 10
  Ease of use of the application                 0 1 2 3 4 5 6 7 8 9 10
  Level of realism perceived                     0 1 2 3 4 5 6 7 8 9 10

Table 11 Participants' scores of the users' experience with the application

  User experience   Score (0–10)
  Satisfaction      7.36
  Intuitiveness     8.72
  Ease of use       8.64
  Realism           7.50
Fig. 13 Confusion matrix for gesture discrimination in the experiment conducted
4.6 Gesture detection for drumkit simulation. Discussion and comparison
with previous works
The previously described experiments attained a successful detection rate of 95.87 %, which
is slightly lower than the one found for the ten-fold cross-validation tests in Section 3 with
Arm Pose detection (minimum 98.41 %). However, it is still a high success rate that fur-
ther corroborates the effectiveness of the system developed for fast gesture recognition. The
experimental analysis of the effects of the factor gesture showed that the type of gesture
being performed has a strong effect on the successful detection rate for that gesture. The
inspection of the confusion matrix reveals that the T_Lateral gesture shows the best detec-
tion rates. Furthermore, the repetition of the analysis, not considering this type of gesture,
confirmed that there was no significant effect of the gesture factor on the detection rate.
This fact indicates that the only gesture with a meaningful effect on this measure is precisely
T_Lateral. This result agrees with the behaviour found in Section 3, and it is a reasonable
finding since the T_Lateral gesture is the one that differs most in execution and is easier
to identify.
The success rate of the system implemented is similar to the ones reported in previous
studies. For example, prior studies yielded success rates of 93.14 % [27], 93.25 % [46],
95.42 % [25] or 99.1 % [30] using Hidden Markov Models, 96.7 % [31] and 96 % [28]
for Dynamic Time Warping, 94 % for SVM [8] and 94.45 % for artificial neural networks
[39]. In particular, previous studies on gesture recognition using a Kinect device attained
success rates of 92.26 % [21] and 96 % [20]. The works in [48] are of particular interest in
this regard, as they propose a 3D-shape-based tracking system along with trajectory features
to identify a vast set of gestures. While the cited work supports a much bigger variety of
gestures than the work discussed here, we still achieve very high classification rates for
gestures that are performed in a very similar way, as well as providing an almost instant
response, while the cited work shows a delay of a few seconds. Thus, we conclude that our
approach is more adequate for applications that require a seamless response on behalf of
the system (as in the case of a drumkit simulator), while the approaches proposed in works like
[48] or [2] are more adequate for applications in which the delay constraints are not tight.
From the user point of view, the results showed that the prediction system had a signifi-
cant effect on the perception of delay. In fact, with the prediction enabled, the participants
perceived some lag in only 3 out of the 60 cases evaluated; otherwise, the participants correctly
identified the presence of the delay most of the time. This indicates that the predictor imple-
mented does compensate for the delay introduced by the tracking system. A common
problem found in previous works is that the system delay heavily hinders the user's experi-
ence for real-time interaction [2]; yet, our experiments show that our system can overcome
ence for real-time interaction ([2]), yet, our experiments show that our system can overcome
this issue through the use of motion prediction.
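A minimal sketch of this kind of latency compensation is given below: a straight line is fitted to the most recent tracked hand positions and extrapolated ahead by an assumed sensor delay. The frame period, lookahead value and sample trajectory are hypothetical, and the predictor actually implemented in the system may differ in its details.

```python
import numpy as np

def predict_position(history, frame_period=1 / 30.0, lookahead=0.1):
    """Extrapolate the hand position ahead of time by linear prediction.

    history: (N, 3) array with the most recent 3D hand positions, oldest first.
    frame_period: assumed time between tracked frames, in seconds.
    lookahead: assumed tracking latency to compensate, in seconds.
    """
    history = np.asarray(history, dtype=float)
    t = np.arange(history.shape[0]) * frame_period   # sample time stamps
    # Least-squares linear fit (slope and intercept) for each coordinate.
    coeffs = np.polyfit(t, history, deg=1)           # shape (2, 3)
    t_future = t[-1] + lookahead
    return coeffs[0] * t_future + coeffs[1]          # extrapolated (x, y, z)

# Example: a hand moving downwards along y; the prediction anticipates where
# the hand will be when the drum sound should actually be triggered.
recent = [[0.0, 1.00, 2.0], [0.0, 0.90, 2.0], [0.0, 0.78, 2.0], [0.0, 0.64, 2.0]]
print(predict_position(recent))
```

Higher-order or Wiener-type predictors could be plugged in at the same point of the processing chain without changing the rest of the detection pipeline.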
Overall, the participants' assessment of the user experience was quite positive. Some of them, however, identified issues in the current implementation of the drumkit simulator. More specifically, two of the participants with experience as drummers found the system to perform well in general terms, but reported that it does not respond fast enough to recognize some typical drum-beat patterns, and reckoned that the interaction metaphor proposed was too physically demanding, potentially leading to drummer exhaustion.
5 Conclusions and future works
In this paper, we have presented a set of features for fast gesture recognition and we have used this set for the implementation of a virtual reality application for drumkit simulation. The system processes the stream of positional data of the selected tracked nodes in a linear prediction scheme to forecast future positions and extract features that allow for the instantaneous detection of fast gestures. Two different sets of features, depicting the hand trajectory and the arm pose, have been considered for gesture discrimination. The features registering the current pose of the arm have proved to be the more effective of the two, as well as being sufficiently accurate for gesture discrimination. The experiments conducted have confirmed the validity of the proposed scheme from both objective and subjective points of view. The results corroborate that the system proposed can indeed classify fast gestures with high detection rates and negligible or non-existent delay from the user point of view.
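To make the classification stage of such a pipeline concrete, the sketch below trains a classifier on arm-pose feature vectors and evaluates it with ten-fold cross-validation. The feature dimensionality, the synthetic data and the choice of an SVM are assumptions made for illustration only; the actual feature sets and classifier comparison are those described earlier in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features, n_gestures = 600, 6, 6

# Stand-in data: each detected hit is described by a small arm-pose feature
# vector (e.g. elbow/shoulder angles and relative hand position at the moment
# of the hit) and labelled with one of six gestures per hand.
centers = rng.normal(size=(n_gestures, n_features))
y = rng.integers(0, n_gestures, size=n_samples)
X = centers[y] + 0.3 * rng.normal(size=(n_samples, n_features))

# An SVM is only one of several classifiers that could be trained on these
# features; ten-fold cross-validation mirrors the evaluation protocol used.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
scores = cross_val_score(model, X, y, cv=10)
print(f"mean ten-fold accuracy: {100 * scores.mean():.2f} %")
```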
Future work will aim to expand the set of gestures that the system can recognize. In particular, some participants voiced concerns about a specific limitation of the system that prevents the drummer from playing quick beat patterns in a row. Supporting this kind of combined gesture would require expanding the database with gestures of lower amplitude and faster execution, as well as studying the features most adequate to represent this type of gesture. Also, the tracking module could be modified to account for the use of drumsticks, which could reduce the degree of exertion involved in long playing sessions with the application developed.
Acknowledgments This work has been funded by the Ministerio de Economía y Competitividad of the Spanish Government under Project No. TIN2013-47276-C6-2-R and by the Junta de Andalucía under Project No. P11-TIC-7154. This work has been done at Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech.
References
1. Aha D, Kibler D, Albert M (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66
2. Bandera J, Marfil R, Bandera A, Rodríguez JA, Molina-Tanco L, Sandoval F (2009) Fast gesture recognition based on a two-level representation. Pattern Recognit Lett 30(13):1181–1189
3. Bandera J, Rodríguez J, Molina-Tanco L, Bandera A (2012) A survey of vision-based architectures for robot learning by imitation. International Journal of Humanoid Robotics 9(1):1–40
4. Barbancho I, Rosa-Pujazón A, Tardón L, Barbancho A (2013) Human–computer interaction and music. In: Sound Perception-Performance, pp 367–389. Springer
5. Bevilacqua F, Guédy F, Schnell N, Fléty E, Leroy N (2007) Wireless sensor interface and gesture-follower for music pedagogy. In: Proceedings of the 7th International Conference on New Interfaces for Musical Expression, pp 124–129. ACM
6. Bouënard A, Wanderley MM, Gibet S (2010) Gesture control of sound synthesis: Analysis and classification of percussion gestures. Acta Acustica united with Acustica 96(4):668–677
7. Calinon S (2007) Continuous extraction of task constraints in a robot programming by demonstration framework. Unpublished doctoral dissertation, École Polytechnique Fédérale de Lausanne (EPFL)
8. Cao D, Masoud O, Boley D, Papanikolopoulos N (2009) Human motion recognition using support vector
machines. Comp Vision Image Underst 113(10):1064–1075
9. Caramiaux B (2014) Motion modeling for expressive interaction: A design proposal using bayesian
adaptive systems. In: Proceedings of the 2014 International Workshop on Movement and Computing,
p 76. ACM
10. Castellano G, Bresin R, Camurri A, Volpe G (2007) Expressive control of music and visual media by
full-body movement. In: Proceedings of the 7th International Conference on New Interfaces for Musical
Expression, pp 390–391 ACM
11. Celebi S, Aydin A, Temiz T, Arici T (2013) Gesture recognition using skeleton data with weighted
dynamic time warping. Computer Vision Theory and Applications. VISAPP
12. Chen C, Liang J, Zhao H, Hu H, Tian J (2009) Factorial HMM and parallel HMM for gait recognition.
IEEE Transaction on Systems, Man, and Cybernetics, Part C Applications and Reviews 39(1):114–123
13. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
14. El-Baz A, Tolba A (2013) An efficient algorithm for 3d hand gesture recognition using combined neural
classifiers. Neural Computing and Applications pp 1–8
15. Halpern M, Tholander J, Evjen M, Davis S, Ehrlich A, Schustak K, Baumer E, Gay G (2011) Moboo-
gie: creative expression through whole body musical interaction. In: Proceedings of the 2011, Annual
Conference on Human Factors in Computing Systems, pp 557–560. ACM (2011)
16. Haykin S (2007) Neural networks: a comprehensive foundation. Prentice Hall Englewood Cliffs, NJ
17. Holland S, Bouwer A, Dalgelish M, Hurtig T (2010) Feeling the beat where it counts: fostering multi-
limb rhythm skills with the haptic drum kit. In: Proceedings of the fourth International Conference on
Tangible, Embedded, and Embodied Interaction, pp 21–28. ACM
18. Hosmer D, Lemeshow S, Sturdivant R (2013) Applied logistic regression. Wiley
19. Howell D (2011) Statistical methods for psychology. Cengage Learning
20. Itauma I, Kivrak H, Kose H (2012) Gesture imitation using machine learning techniques. In: Signal
Processing and Communications Applications Conference (SIU), 2012 20th, pp 1–4. IEEE
21. Jacob M, Wachs J (2013) Context-based hand gesture recognition for the operating room. Pattern
Recognition Letters 36:196–203
22. John G, Langley P (1995) Estimating continuous distributions in bayesian classifiers. In: Proceedings
of the Eleventh conference on Uncertainty in Artificial Intelligence, pp 338–345. Morgan Kaufmann
Publishers Inc
23. Jordà S (2010) The reactable: tangible and tabletop music performance. In: Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems, pp 2989–2994. ACM
24. Khoo E, Merritt T, Fei V, Liu W, Rahaman H, Prasad J, Marsh T (2008) Body music: physical explo-
ration of music theory. In: Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games,
pp 35–42
25. Kim D, Song J, Kim D (2007) Simultaneous gesture segmentation and recognition based on forward
spotting accumulative. Pattern Recog 40(11):3012–3026
26. Lago N, Kon F (2004) The quest for low latency. In: Proceedings of the International Computer Music
Conference, pp 33–36
27. Lee H, Kim J (1999) An HMM-based threshold model approach for gesture recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence 21(10):961–973
28. Li H, Greenspan M (2011) Model-based segmentation and recognition of dynamic gestures in continuous
video streams. Pattern Recog 44(8):1614–1628
29. Livingston M, Sebastian J, Ai Z, Decker J (2012) Performance measurements for the microsoft Kinect
skeleton. In: Proceedings of the 2012 IEEE Virtual Reality Workshops, pp 119–120
30. Mannini A, Sabatini A (2010) Machine learning methods for classifying human physical activity from
on-body accelerometers. Sensors 10(2):1154–1175
31. Muhlig M, Gienger M, Hellbach S, Steil JJ, Goerick C (2009) Task-level imitation learning using variance-based movement optimization. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA'09, pp 1177–1184
32. Odowichuk G, Trail S, Driessen P, Nie W, Page W (2011) Sensor fusion: Towards a fully expressive 3d
music control interface. In: Proceedings of the 2011 IEEE Pacific Rim Conference on Communications,
Computers and Signal Processing (PacRim), pp 836–841
33. Quinlan J (1993) C4.5: programs for machine learning, vol 1. Morgan Kaufmann
34. Rasamimanana N, Fléty E, Bevilacqua F (2006) Gesture analysis of violin bow strokes. In: Gesture in Human-Computer Interaction and Simulation, pp 145–155. Springer
35. Rosa-Pujazón A, Barbancho I, Tardón L, Barbancho A (2013) Conducting a virtual ensemble with a Kinect device. In: SMAC 2013 - Stockholm Music Acoustics Conference 2013, pp 284–291
36. Rosa-Pujazón A, Barbancho I, Tardón L, Barbancho A (2013) Drum-hitting gesture recognition and prediction system using Kinect. In: I Simposio Español de Entretenimiento Digital SEED 2013, pp 108–118
37. Rosa-Pujazón A, Barbancho I, Tardón L, Barbancho A (2015) A virtual reality drumkit simulator system with a Kinect device. International Journal of Creative Interfaces and Computer Graphics 6(1):72–86
38. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol 3, pp 32–36
39. Stanton C, Bogdanovych A, Ratanasena E (2012) Teleoperation of a humanoid robot using full-body motion capture, example movements, and machine learning. In: Proc Australasian Conference on Robotics and Automation
40. Stierman C (2012) Kinotes: Mapping musical scales to gestures in a Kinect-based interface for musical expression. MSc Thesis, University of Amsterdam
41. Todoroff T, Leroy J, Picard-Limpens C (2011) Orchestra: Wireless sensor system for augmented
performances & fusion with Kinect. QPSR of the numediart research program 4(2)
42. Trail S, Dean M, Tavares T, Odowichuk G, Driessen P, Schloss W, Tzanetakis G (2012) Non-
invasive sensing and gesture control for pitched percussion hyper-instruments using the Kinect. In: 12th
International Conference on New Interfaces for Musical Expression. NIME’12
43. Wiener N (1964) Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, New York
44. Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential images using hidden
markov model. In: Computer Vision and Pattern Recognition, 1992. Proceedings CVPR’92., 1992 IEEE
Computer Society Conference on, pp 379–385. IEEE
45. Yoo M, Beak J, Lee I (2011) Creating musical expression using kinect. In: Proceedings of the 2011
Conference on New Interfaces for Musical Expression, Oslo Norway
46. Yoon H, Soh J, Bae Y, Seung Yang H (2001) Hand gesture recognition using combined features of
location, angle and velocity. Pattern Recog 34(7):1491–1501
47. Zhang Y, Huang Q, Qin L, Zhao S, Yao H, Xu P (2014) Representing dense crowd patterns using bag of
trajectory graphs. SIViP 8(1):173–181
48. Zhao S, Chen L, Yao H, Zhang Y, Sun X (2015) Strategy for dynamic 3d depth data matching towards
robust action retrieval. Neurocomputing 151:533–543
Alejandro Rosa-Pujazón was born in Málaga, Spain. He received the B.E. degree in Telecommunication Engineering from the University of Málaga, Málaga, Spain, in 2006, and the M.Tech. degree in Telecommunication Technologies from the University of Málaga in 2008. In 2006, he joined the Department of Electronic Technology, University of Málaga, as a Research Assistant. Since 2012, he has been with the Department of Communication Engineering, University of Málaga, where he is currently working as a Research Assistant while performing Ph.D. studies. His current research interests include signal processing, human-computer interfaces and interaction with music.
Isabel Barbancho received her degree in Telecommunication Engineering and her Ph.D. degree from the University of Málaga (UMA), Málaga, Spain, in 1993 and 1998, respectively, and her degree in Piano Teaching from the Málaga Conservatoire of Music in 1994. Since 1994, she has been with the Department of Communications Engineering, UMA, as an Assistant and then Associate Professor. During 2013, she was a Visiting Scholar at the University of Victoria, Victoria, BC, Canada. She has been the main researcher in several research projects on polyphonic transcription, optical music recognition, music information retrieval, and intelligent content management. Her research interests include musical acoustics, signal processing, multimedia applications, audio content analysis, and serious games. Dr. Barbancho received the Severo Ochoa Award in Science and Technology, Ateneo de Málaga-UMA, in 2009 and the 'Premio Málaga de Investigación 2011' Award from the Academies 'Bellas Artes de San Telmo' and 'Malagueña de Ciencias'.
Lorenzo J. Tardón received his degree in Telecommunication Engineering from the University of Valladolid, Valladolid, Spain, in 1995 and his Ph.D. degree from the Polytechnic University of Madrid, Madrid, Spain, in 1999. In 1999, he worked for ISDEFE on air traffic control systems at Madrid-Barajas Airport and at Lucent Microelectronics on systems management. Since November 1999, he has been with the Department of Communications Engineering, University of Málaga, Málaga, Spain. Lorenzo J. Tardón is currently the head of the Application of Information and Communications Technologies (ATIC) Research Group. He has worked as main researcher of several projects on audio and music analysis. He is a member of several international journal committees on communications and signal processing. In 2011, he was awarded the 'Premio Málaga de Investigación' by the Academies 'Bellas Artes de San Telmo' and 'Malagueña de Ciencias'. His research interests include serious games, audio signal processing, digital image processing, and pattern analysis and recognition.
Ana M. Barbancho received her degree in Telecommunication Engineering and her Ph.D. degree from the University of Málaga, Málaga, Spain, in 2000 and 2006, respectively. In 2001, she also received her degree in Solfeo Teaching from the Málaga Conservatoire of Music. Since 2000, she has been with the Department of Communications Engineering, University of Málaga, as an Assistant and then Associate Professor. Her research interests include musical acoustics, digital signal processing, new educational methods, and mobile communications. Dr. Barbancho was awarded the 'Second National University Prize to the Best Scholar 1999/2000' by the Spanish Ministry of Education in 2000, the 'Extraordinary Ph.D. Thesis Prize' by ETSI Telecomunicación of the University of Málaga in 2007 and the 'Premio Málaga de Investigación' by the Academies 'Bellas Artes de San Telmo' and 'Malagueña de Ciencias' in 2010.