Keep It Simple And Sparse: Real-Time Action Recognition
Journal of Machine Learning Research (Impact Factor: 2.47). 09/2013; 14:2617-2640.
Journal of Machine Learning Research 14 (2013) 2617-2640. Submitted 1/13; Revised 5/13; Published 9/13
Sean Ryan Fanello∗ SEAN.FANELLO@IIT.IT
Ilaria Gori∗ ILARIA.GORI@IIT.IT
Giorgio Metta GIORGIO.METTA@IIT.IT
iCub Facility
Istituto Italiano di Tecnologia
Genova, Via Morego 30, 16163, Italia
Francesca Odone FRANCESCA.ODONE@UNIGE.IT
Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi
Università degli Studi di Genova
Genova, Via Dodecaneso 35, 16146, Italia
Editors: Isabelle Guyon and Vassilis Athitsos
Abstract
Sparsity has been shown to be one of the most important properties for visual recognition purposes.
In this paper we show that sparse representation plays a fundamental role in achieving one-shot
learning and real-time recognition of actions. We start off from RGBD images, combine motion
and appearance cues and extract state-of-the-art features in a computationally efficient way. The
proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and
Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture
high-level patterns from data. We then propose a simultaneous online video segmentation and
recognition of actions using linear SVMs. The main contribution of the paper is an effective
real-time system for one-shot action modeling and recognition; the paper highlights the
effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on
three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture
Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures
differing by small details, and a data set created for human-robot interaction purposes. Finally we
demonstrate that our system is effective also in a human-robot interaction setting and propose a
memory game, “All Gestures You Can”, to be played against a humanoid robot.
Keywords: real-time action recognition, sparse representation, one-shot action learning,
human-robot interaction
1. Introduction
Action recognition as a general problem is a very fertile research theme due to its strong applicability
in several real world domains, ranging from video-surveillance to content-based video retrieval and
video classification. This paper refers specifically to action recognition in the context of
Human-Machine Interaction (HMI), and therefore it focuses on whole-body actions performed by a human
who is standing at a short distance from the sensor.
∗. S.R. Fanello and I. Gori contributed equally to this work.
© 2013 Sean Ryan Fanello, Ilaria Gori, Giorgio Metta and Francesca Odone.
Imagine a system capable of understanding when to turn the TV on, or when to switch the
lights off on the basis of a gesture; the main requirement of such a system is an easy and fast
learning and recognition procedure. Ideally, a single demonstration suffices to teach the system a new
gesture. More importantly, gestures are powerful tools, through which languages can be built. In
this regard, developing a system able to communicate with deaf people, or to understand paralyzed
patients, would represent a great advance, with impact on the quality of life of impaired people.
Nowadays these scenarios are becoming realistic thanks to the spread of imaging technologies that
provide real-time depth information at consumer prices (for example, the Kinect by Microsoft
(Shotton et al., 2011)); these depth-based sensors are drastically changing the field of action
recognition, enabling the achievement of high performance using fast algorithms.
Following this recent trend we propose a complete system based on RGBD video sequences,
which models actions from one example only. Our main goal is to recognize actions in real-time
with high accuracy; for this reason we design our system accounting for good performance as
well as low computational complexity. The method we propose can be summarized as follows:
after segmentation of the moving actor, we extract two types of features from each image, namely,
Global Histograms of Oriented Gradient (GHOGs) to model the shape of the silhouette, and 3D
Histograms of Flow (3DHOFs) to describe motion information. We then apply a sparse coding
stage, which allows us to take care of noise and redundant information and produces a compact and
stable representation of the image content. Subsequently, we summarize the action within adjacent
frames by building feature vectors that describe the feature evolution over time. Finally, we train a
Support Vector Machine (SVM) for each action class.
Our framework can segment and recognize actions accurately and in real-time, even though they
are performed in different environments, at different speeds, or combined in sequences of multiple
actions. Furthermore, thanks to the simultaneous appearance and motion description complemented
by the sparse coding stage, the method provides a one-shot learning procedure. These functions are
shown on three different experimental settings: a benchmark data set for one-shot action learning
(the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including
complex actions and gestures differing by small details, and an implementation of the method on a
humanoid robot interacting with humans.
In order to demonstrate that our system can be efficiently engaged in real world scenarios, we
developed a real-time memory game against a humanoid robot, called “All Gestures You Can” (Gori
et al., 2012). Our objective in designing this interaction game is to stress the effectiveness of our
gesture recognition system in complex and uncontrolled settings. Nevertheless, our long term goal
is to consider more general contexts, which are beyond the game itself, such as rehabilitation and
human assistance. Our game may also be used with children with memory impairment, for instance
those with Attention Deficit/Hyperactivity Disorder (ADHD) (Comoldi et al., 1999). These children
cannot memorize items under different conditions, and perform poorly in implicit and explicit
memory tests (Burden and Mitchell, 2005). Interestingly, Comoldi et al. (1999) show that when
ADHD children were assisted in the use of an appropriate strategy, they performed the memory
task as well as controls. The game proposed in this paper could therefore be used to train the memory
skills of children with attention problems, using the robot as main assistant. The interaction with the
robot may increase their motivation to maintain attention and help with the construction of a correct
strategy.
The paper is organized as follows: in Section 2 we briefly review the state of the art. In
Section 3 sparse representation is presented; Section 4 describes the complete modeling and recognition
pipeline. Section 5 validates the approach in different scenarios; Section 6 shows a real application
in the context of Human Robot Interaction (HRI). Finally, Section 7 presents future directions and
possible improvements of the current implementation.
2. Related Work
The recent literature is rich in algorithms for gesture, action, and activity recognition; we refer the
reader to Aggarwal and Ryoo (2011) and Poppe (2010) for a complete survey of the topic. Even
though many theoretically sound, well-performing and original algorithms have been proposed,
to the best of our knowledge, none of them fulfills at the same time real-time, one-shot learning
and high accuracy requirements, although such requirements are all equally important in real world
application scenarios.
Gesture recognition algorithms differ in many aspects. A first classification may be done with
respect to the overall structure of the adopted framework, that is, how the recognition problem is
modeled. In particular, some approaches are based on machine learning techniques, where each
action is described as a complex structure; in this class we find methods based on Hidden Markov
Models (Malgireddy et al., 2012), Coupled Hidden Semi-Markov models (Natarajan and Nevatia,
2007), action graphs (Li et al., 2010) or Conditional Markov Fields (Chatzis et al., 2013). Other
methods are based on matching: the recognition of actions is carried out through a similarity match
with all the available data, and the most similar datum dictates the estimated class (Seo and Milanfar,
2012; Mahbub et al., 2011).
The two approaches are different in many ways. Machine learning methods tend to be more
robust to intra-class variations, since they distill a model from different instances of the same
gesture, while matching methods are more versatile and adapt more easily to one-shot learning, since
they do not require a batch training procedure. From the point of view of data representation, the
first class of methods usually extracts features from each frame, whereas matching-based methods
try to summarize all information extracted from a video in a single feature vector. A recent and
prototypical example of a machine learning method can be found in Malgireddy et al. (2012), which
proposes to extract local features (Histograms of Flow and Histograms of Oriented Gradient) on
each frame and apply a bag-of-words step to obtain a global description of the frame. Each action is
then modeled as a multi-channel Hidden Markov Model (mcHMM). Although the presented
algorithm leads to very good classification performance, it requires a computationally expensive offline
learning phase that cannot be used in real-time for one-shot learning of new actions. Among the
matching-based approaches, Seo and Milanfar (2012) is particularly interesting: the algorithm
extracts a new type of features, referred to as 3D LSKs, from space-time regression kernels, particularly
appropriate to identify the spatio-temporal geometric structure of the action; it then adopts the
Matrix Cosine Similarity measure (Shneider and Borlund, 2007) to perform a robust matching. Another
recent method following the trend of matching-based action recognition algorithms is Mahbub et al.
(2011); in this work the main features are standard deviation on depth (STD), Motion History Image
(MHI) (Bobick and Davis, 2001) and a 2D Fourier Transformation in order to map all information
in the frequency domain. This procedure shows some benefits, for instance the invariance to camera
shifts. For the matching step, a simple and standard correlation measure is employed. Considering
this taxonomy, the work we propose falls within the machine learning approaches, but addresses
specifically the problem of one-shot learning. To this end we leverage the richness of the video
signal used as a training example and a dictionary learning approach to obtain an effective and
distinctive representation of the action.
An alternative way of classifying gesture recognition algorithms is based on the data representation
of gesture models. In this respect there is a predominance of features computed on local areas of
single frames (local features), but also holistic features are often used on the whole image or on
a region of interest. Among the best-known methods, it is worth mentioning the spatio-temporal
interesting points (Laptev and Lindeberg, 2003), spatio-temporal Hessian matrices (Willems et al.,
2008), Gabor Filters (Bregonzio et al., 2009), Histograms of Flow (Fanello et al., 2010), Histograms
of Oriented Gradient (Malgireddy et al., 2012), semi-local features (Wang et al., 2012), combination
of multiple features (Laptev et al., 2008), Motion History Image (MHI) (Bobick and Davis, 2001),
Space-Time shapes (Gorelick et al., 2007), Self-Similarity Matrices (Efros et al., 2003). Also,
due to the recent diffusion of real-time 3D vision technology, 3D features have been recently
employed (Gori et al., 2012). For computational reasons as well as the necessity of specific invariance
properties, we adopt global descriptors, computed on a region of interest obtained through motion
segmentation. We do not rely on a single cue but rather combine motion and appearance similarly
to Malgireddy et al. (2012).
The most similar works to this paper are in the field of HMI, as for example Lui (2012) and Wu
et al. (2012): they both exploit depth information and aim at one-shot learning trying to achieve
low computational cost. The first method employs a nonlinear regression framework on manifolds:
actions are represented as tensors decomposed via Higher Order Singular Value Decomposition.
The underlying geometry of tensor space is used. The second one extracts Extended-MHI as features
and uses Maximum Correlation Coefficient (Hirschfeld, 1935) as classifier. Features from RGB and
Depth streams are fused via a Multiview Spectral Embedding (MSE). Differently from these works,
our approach aims specifically to obtain an accurate real-time recognition from one video example
only.
We conclude the section with a reference to some works focusing on continuous action or
activity recognition (Ali and Aggarwal, 2001; Green and Guan, 2004; Liao et al., 2006; Alon et al.,
2009). In this case training and test videos contain many sequential gestures, therefore the temporal
segmentation of videos becomes fundamental. Our work deals with continuous action recognition
as well; indeed the proposed framework includes a novel and robust temporal segmentation
algorithm.
3. Visual Recognition with Sparse Data
One-shot learning is a challenging requirement as the small quantity of training data makes the
modeling phase extremely hard. For this reason, in one-shot learning settings a careful choice of
the data representation is very important. In this work we rely on sparse coding to obtain a compact
descriptor with a good discriminative power even if it is derived from very small data sets.
The main concept behind sparse coding is to approximate an input signal as a linear combination
of a few components selected from a dictionary of basic elements, called atoms. We refer to adaptive
sparse coding when the coding is driven by data. In this case, we require a dictionary learning stage,
where the dictionary atoms are learnt (Olshausen and Fieldt, 1997; Yang et al., 2009; Wang et al.,
2010).
Figure 1: Overview of the recognition system, where video segmentation and classification are
performed simultaneously.
The motivations behind the use of image coding arise from biology: there is evidence that
similar signal coding happens in the neurons of the primary visual cortex (V1), which produces sparse
and overcomplete activations (Olshausen and Fieldt, 1997). From the computational point of view
the objective is to find an overcomplete model of images, unlike methods such as PCA, which
aim at finding a number of components that is lower than the data dimensionality. Overcomplete
representation techniques have become very popular in applications such as denoising, inpainting,
super-resolution, segmentation (Elad and Aharon, 2006; Mairal et al., 2008b,a) and object
recognition (Yang et al., 2009). In this work we assess their effectiveness also for gesture recognition. Let
$X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ be the matrix whose $m$ columns
$x_i \in \mathbb{R}^n$ are the feature vectors. The goal of adaptive sparse coding is to learn a
dictionary $D$ (an $n \times d$ matrix, with $d$ the dictionary size and $n$ the feature vector size)
and a code $U$ (a $d \times m$ matrix) that minimize the reconstruction error:
\[ \min_{D,U} \|X - DU\|_F^2 + \lambda \|U\|_1, \tag{1} \]
where $\|\cdot\|_F$ is the Frobenius norm. As for the sparsity, it is known that the $L_1$ norm yields
sparse results while being robust to signal perturbations. Other penalties such as the $L_0$ norm
could be employed, however the problem of finding a solution becomes NP-hard and there is no guarantee
that greedy algorithms reach the optimal solution. Notice that fixing $U$, the above optimization
reduces to a least squares problem, whilst, given $D$, it is equivalent to linear regression with the
sparsifying norm $L_1$. The latter problem is referred to as a feature selection problem with a known
dictionary (Lee et al., 2007). One of the most efficient algorithms that converges to the optimal
solution of the problem in Equation 1 for a fixed $D$ is the feature-sign search algorithm (Lee et al.,
2007). This algorithm searches for the sign of the coefficients $U$; indeed, considering only
non-zero elements the problem is reduced to a standard unconstrained quadratic optimization problem
(QP), which can be solved analytically. Moreover it performs a refinement of the signs if they are
incorrect. For the complete procedure we refer the reader to Lee et al. (2007).
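To make the fixed-dictionary case concrete, the following minimal Python sketch minimizes $\|x - Du\|^2 + \lambda \|u\|_1$ for a single feature vector $x$ using iterative soft-thresholding (ISTA). Note that ISTA is a simpler solver swapped in here for illustration only; it is not the feature-sign search algorithm of Lee et al. (2007). The function names, the step size, the number of iterations and the toy dictionary are all assumptions.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def sparse_code_ista(D, x, lam, step, n_iter=500):
    """Approximately minimize ||x - D u||^2 + lam * ||u||_1 over u, with D fixed.

    D is a list of n rows with d entries each; x has n entries.
    `step` should not exceed 1 / (2 * largest eigenvalue of D^T D).
    """
    n, d = len(D), len(D[0])
    u = [0.0] * d
    for _ in range(n_iter):
        # Residual r = D u - x.
        r = [sum(D[i][j] * u[j] for j in range(d)) - x[i] for i in range(n)]
        # Gradient of the quadratic term: 2 D^T r.
        grad = [2.0 * sum(D[i][j] * r[i] for i in range(n)) for j in range(d)]
        # Gradient step followed by soft-thresholding (the L1 part).
        u = [soft_threshold(u[j] - step * grad[j], step * lam) for j in range(d)]
    return u


# Toy overcomplete dictionary with n = 2 and d = 3 (hypothetical values).
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [1.0, 0.0]
u = sparse_code_ista(D, x, lam=0.1, step=0.2)
```

In this toy case the signal coincides with the first atom, so the code concentrates on the first coefficient (slightly shrunk by the $L_1$ penalty) while the other two coefficients are driven to zero, which is the sparsity effect discussed above.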
Figure 2: Region of Interest detection. Left: RGB video frames. Center: depth frames. Right: the
detected ROI.
In the context of recognition tasks, it has been shown that a sparsification of the data
representation improves the overall classification accuracy (see for instance Guyon and Elisseeff, 2003;
Viola and Jones, 2004; Destrero et al., 2009 and references therein). In this case sparse coding is
often cast into a coding-pooling scheme, which finds its root in the Bag of Words paradigm. In
this scheme a coding operator is a function $f(x_i) = u_i$ that maps $x_i$ to a new space
$u_i \in \mathbb{R}^k$; when $k > n$ the representation is called overcomplete. The action of coding
is followed by a pooling stage, whose purpose is to aggregate multiple local descriptors in a single
and global one. Common pooling operators are the max operator, the average operator, or the geometric
$L_p$ norm pooling operator (Feng et al., 2011). More generally, a pooling operator takes the codes
located in $S$ regions (for instance cells of the spatial pyramid, as in Yang et al., 2009) and builds
a succinct representation. We define as $Y_s$ the set of locations within the region $s$. Defining the
pooling operator as $g$, the resultant feature can be rewritten as:
$p^{(s)} = g_{(i \in Y_s)}(u^{(i)})$. After this stage, a region $s$ of the image
is encoded with a single feature vector. The final descriptor of the image is the concatenation of
the descriptors $p^{(s)}$ among all the regions. Notice that the effectiveness of pooling is subject to
the coding stage. Indeed, if applied on non-coded descriptors, pooling would lead to a drastic loss of
information.
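As an illustration of the coding-pooling scheme, the sketch below pools a list of codes region by region and concatenates the pooled vectors into one image descriptor. It is only a sketch under stated assumptions: the names `pool` and `describe_image` are hypothetical, the regions are given as lists of code indices (playing the role of the sets $Y_s$), and pooling is applied to absolute values of the coefficients, which is a common convention but is not stated in the text. Only the max and average operators are shown.

```python
def pool(codes, operator="max"):
    """Aggregate a list of d-dimensional codes into one d-dimensional vector."""
    d = len(codes[0])
    # Column j collects the magnitude of coefficient j across all codes.
    columns = [[abs(u[j]) for u in codes] for j in range(d)]
    if operator == "max":
        return [max(col) for col in columns]
    if operator == "average":
        return [sum(col) / len(col) for col in columns]
    raise ValueError("unknown pooling operator: " + operator)


def describe_image(codes, regions, operator="max"):
    """Concatenate the pooled feature of every region.

    regions[s] lists the indices i of the codes located in region s.
    """
    descriptor = []
    for locations in regions:
        descriptor.extend(pool([codes[i] for i in locations], operator))
    return descriptor


# Three local codes of size d = 2, split into two regions (toy values).
codes = [[1.0, -2.0], [3.0, 0.5], [0.0, 4.0]]
regions = [[0, 1], [2]]
descriptor = describe_image(codes, regions, operator="max")
```

With $S$ regions and codes of size $d$, the final descriptor has $S \cdot d$ entries, independently of how many local codes fall in each region.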
4. Action Recognition System
In this section we describe the versatile real-time action recognition system we propose. The system,
depicted in Figure 1, consists of three layers that can be summarized as follows:
• Region Of Interest detection: we detect a Region of Interest (ROI), where the human subject
is actually performing the action. We use the combination of motion and depth to segment
the subject from the background.
• Action Representation: each ROI within a frame is mapped into a feature space with a
combination of 3D Histogram of Flow (3DHOF) and Global Histogram of Oriented Gradient
(GHOG) on the depth map. The resultant 3DHOF+GHOG descriptor is processed via a sparse
coding step to compute a compact and meaningful representation of the performed action.
• Action Learning: linear SVMs are used on frame buffers. A novel online video segmentation
algorithm is proposed which allows isolating different actions while recognizing the
action sequence.
4.1 Region Of Interest Segmentation
The first step of each action recognition system is to identify correctly where in the image the
action is occurring. Most of the algorithms in the literature involve background modeling techniques
(Stauffer and Grimson, 1999), or space-time image filtering in order to extract the interesting
spatio-temporal locations of the action (Laptev and Lindeberg, 2003). Other approaches require an a
priori knowledge of the body pose (Lv and Nevatia, 2007). This task is greatly simplified in our
architecture, since in human-machine interaction we can safely assume the human to stand in front
of the camera sensors and that there is no other motion in the scene. For each video in the data set,
we initially compute the frame differences within consecutive frames in a small buffer, obtaining
the set $P$ of pixels that are moving. Relying on this information, we compute the mean depth
$\mu$ of the pixels belonging to $P$, which corresponds to the mean depth of the subject within the
considered buffer. Thus, for the rest of the video sequence, we select the region of interest as
$\mathrm{ROI}(t) = \{p_{i,j}(t) : \mu - \varepsilon \le d(p_{i,j}(t)) \le \mu + \varepsilon\}$,
where $d(p_{i,j}(t))$ is the depth of the pixel $p_{i,j}(t)$ at time $t$ and $\varepsilon$ is a
tolerance value. In Figure 2 examples of segmentation are shown. We determined
empirically that this segmentation procedure performs better than classic
thresholding algorithms such as Otsu’s method (Otsu, 1979).
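The segmentation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: images are plain nested lists, the function names are hypothetical, the threshold on the frame difference (`diff_threshold`) is an assumption since the text does not specify how a pixel is declared moving, and the mean depth is read from a single depth frame of the buffer.

```python
def moving_pixels(frames, diff_threshold):
    """Set P of (i, j) positions whose intensity changes between consecutive frames."""
    moving = set()
    for prev, curr in zip(frames, frames[1:]):
        for i, row in enumerate(curr):
            for j, value in enumerate(row):
                if abs(value - prev[i][j]) > diff_threshold:
                    moving.add((i, j))
    return moving


def mean_depth(depth, pixels):
    """Mean depth mu over the moving pixels."""
    if not pixels:
        raise ValueError("no moving pixels found in the buffer")
    return sum(depth[i][j] for i, j in pixels) / len(pixels)


def roi_mask(depth, mu, eps):
    """Boolean mask of the pixels whose depth lies in [mu - eps, mu + eps]."""
    return [[mu - eps <= d <= mu + eps for d in row] for row in depth]


# Toy buffer of two 2x3 grayscale frames and one depth frame (in meters).
frames = [[[10, 10, 10], [10, 10, 10]],
          [[10, 50, 10], [10, 60, 10]]]
depth = [[3.0, 1.05, 3.1],
         [2.9, 1.15, 1.12]]
P = moving_pixels(frames, diff_threshold=20)
mu = mean_depth(depth, P)
mask = roi_mask(depth, mu, eps=0.1)
```

In the toy example the pixel at position (1, 2) does not move within the buffer, yet it is included in the ROI because its depth is close to $\mu$; this is how the depth band recovers the parts of the subject that are momentarily still.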
4.2 Action Representation
Finding a suitable representation is the most crucial part of any recognition system. Ideally, an
image representation should be both discriminative and invariant to image transformations. A
discriminative descriptor should represent features belonging to the same class in a similar way, while
it should show low similarity among data belonging to different classes. The invariance property,
instead, ensures that image transformations such as rotation, translation, scaling do not affect the
final representation. In practice, there is a trade-off between these two properties (Varma and Ray,
2007): for instance, image patches are highly discriminative but not invariant, whereas image
histograms are invariant but not discriminative, since different images could be associated to the same
representation. When a lot of training data is provided, one could focus on a more discriminative
and less invariant descriptor. In our specific case however, where only one training example is
provided, invariance is a necessary condition in order to provide discriminant features; this aspect is
carefully considered in our method.
From the neuroscience literature it is known that body parts are represented already in the early
stages of human development (Mumme, 2001) and that certainly adults have prior knowledge on the
body appearance. Many suggest that motion alone can be used to recognize actions (Bisio et al.,
2010). In artificial systems this developmental-scale experience is typically not available, although
actions can still be represented from two main cues: motion and appearance (Giese and Poggio,
2003). Although many variants of complex features describing human actions have been proposed,
many of them imply computationally expensive routines. Differently, we rely on simple features
in order to fulfill real-time requirements, and we show that they still have a good discriminative
power. In particular we show that a combination of 3D Histograms of Flow (3DHOFs) and Global
Histograms of Gradient (GHOGs) models human actions satisfactorily. When a large number of
training examples is available, these two features should be able to describe a wide variety of actions,
however in one-shot learning scenarios with noisy inputs, they are not sufficient. In this respect,
a sparse representation, which keeps only relevant and robust components of the feature vector,
greatly simplifies the learning phase making it equally effective.
Figure 3: The figure illustrates high level statistics obtained by the proposed scene flow description
(3DHOFs). Starting from the left we show the histogram of the scene flow directions
at time t, for a moving hand going Right, Left, Forward, Backward respectively.
Each cuboid represents one bin of the histogram; for visualization purposes we divided
the 3D space in n × n × n bins with n = 4. Filled cuboids represent high density areas.
4.2.1 3D HISTOGRAM OF FLOW
Whereas 2D motion vector estimation has been largely investigated and various fast and effective
methods are available today (Papenberg et al., 2006; Horn and Shunk, 1981), the scene flow
computation (or 3D motion field estimation) is still an active research field due to the required additional
binocular disparity estimation problem. The most promising works are the ones from Wedel et al.
(2010), Huguet and Devernay (2007) and Cech et al. (2011); however these algorithms are
computationally expensive and may require computation time in the range of 1.5 seconds per frame. This
high computational cost is due to the fact that scene flow approaches try to estimate both the 2D
motion field and disparity changes. Because of the real-time requirement, we opted for a simpler
and faster method that produces a coarser estimation, but is effective for our purposes.
For each frame $F_t$ we compute the 2D optical flow vectors $U(x, y, t)$ and $V(x, y, t)$ for the
$x$ and $y$ components with respect to the previous frame $F_{t-1}$, via the Farnebäck algorithm
(Farnebäck, 2003). Each pixel $(x_{t-1}, y_{t-1})$ belonging to the ROI of the frame $F_{t-1}$ is
reprojected in 3D space $(X_{t-1}, Y_{t-1}, Z_{t-1})$ where the $Z_{t-1}$ coordinate is measured
through the depth sensor and $X_{t-1}, Y_{t-1}$ are computed by:
\[ \begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} = \begin{pmatrix} \dfrac{(x_{t-1} - x_0) Z_{t-1}}{f} \\[2mm] \dfrac{(y_{t-1} - y_0) Z_{t-1}}{f} \end{pmatrix}, \]
where $f$ is the focal length and $(x_0, y_0)^T$ is the principal point of the sensor. Similarly, we can
reproject the final point $(x_t, y_t)$ of the 2D vector representing the flow, obtaining another 3D
vector $(X_t, Y_t, Z_t)^T$. For each pixel of the ROI, we can define the scene flow as the difference
of the two 3D vectors in two successive frames $F_{t-1}$ and $F_t$:
\[ D = (\dot{X}, \dot{Y}, \dot{Z})^T = (X_t - X_{t-1}, Y_t - Y_{t-1}, Z_t - Z_{t-1})^T. \]
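The reprojection and the scene flow difference defined above can be sketched per pixel as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the intrinsics in the example are made-up values, and the depths at the start and end points of the flow vector are passed in directly rather than looked up in the depth maps.

```python
def reproject(x, y, z, f, x0, y0):
    """Back-project pixel (x, y) with measured depth z into 3D camera coordinates."""
    return ((x - x0) * z / f, (y - y0) * z / f, z)


def scene_flow(pixel, flow, z_prev, z_curr, f, x0, y0):
    """3D displacement of a pixel that moves by the 2D optical flow vector `flow`.

    z_prev is the depth at the pixel in frame t-1, z_curr the depth at the
    end point of the flow vector in frame t.
    """
    x_prev, y_prev = pixel
    x_curr, y_curr = x_prev + flow[0], y_prev + flow[1]
    p_prev = reproject(x_prev, y_prev, z_prev, f, x0, y0)
    p_curr = reproject(x_curr, y_curr, z_curr, f, x0, y0)
    return tuple(c - p for c, p in zip(p_curr, p_prev))


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
f, x0, y0 = 500.0, 320.0, 240.0
# A pixel 100 px right of the principal point moves 50 px further right at 2 m.
d_lateral = scene_flow((420.0, 240.0), (50.0, 0.0), 2.0, 2.0, f, x0, y0)
# A pixel at the principal point with no 2D flow moves 0.5 m toward the sensor.
d_depth = scene_flow((320.0, 240.0), (0.0, 0.0), 2.0, 1.5, f, x0, y0)
```

The second example shows why the depth channel matters: a motion along the optical axis produces no 2D flow at the principal point, but it still yields a non-zero $\dot{Z}$ component.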
Once the 3D flow for each pixel of the ROI at time $t$ has been computed, we normalize it with respect
to the $L_2$ norm, so that the resulting descriptors $D_1, \ldots, D_n$ ($n$ pixels of the ROI) are
invariant to the overall speed of the action. In order to extract a compact representation we build a
3D Histogram of Flow (3DHOF) $z(t)$ of the 3D motion vectors, where $z(t) \in \mathbb{R}^{n_1}$ and
$\sqrt[3]{n_1}$ is the quantization parameter of the space (i.e., the number of bins along each axis).
In addition we normalize each 3DHOF $z(t)$ so that $\sum_j z_j(t) = 1$; hence we guarantee that these
descriptors are invariant to the subject of interest’s scale.
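A minimal sketch of the histogram construction is given below. It assumes that each unit-normalized flow vector is binned over the cube $[-1, 1]^3$ with $n$ bins per axis (so that $n_1 = n^3$), and that pixels with zero flow are skipped; both choices, as well as the function name `hof3d`, are assumptions made for illustration.

```python
import math


def hof3d(flows, n=4):
    """3D histogram of unit-normalized flow vectors, n bins per axis, summing to 1."""
    hist = [0.0] * (n ** 3)
    for dx, dy, dz in flows:
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        if norm == 0.0:
            continue  # a static pixel has no direction (assumption: skip it)
        index = 0
        for c in (dx / norm, dy / norm, dz / norm):
            # Map the component from [-1, 1] to a bin in {0, ..., n - 1}.
            b = min(int((c + 1.0) / 2.0 * n), n - 1)
            index = index * n + b
        hist[index] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Two rightward vectors with different speeds, one backward vector, one static pixel.
flows = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, -3.0), (0.0, 0.0, 0.0)]
z = hof3d(flows, n=4)
```

The two rightward vectors fall in the same bin although their speeds differ, which reflects the speed invariance obtained with the $L_2$ normalization, while dividing by the total count makes the descriptor independent of the number of ROI pixels.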
Figure 3 shows that the movements toward different directions turn out to be linearly separable,
and the main directions are accurately represented: each cuboid represents one bin of the histogram,
and the 3D space is divided in $n \times n \times n$ bins with $n = 4$. It is possible to notice how,
in the Right direction for example, all the filled bins lie in the half-space defined by $x < 0$.
Similar observations apply to all cases.
4.2.2 GLOBAL HISTOGRAM OF ORIENTED GRADIENT
In specific contexts, motion information is not sufficient to discriminate actions, and information
on the pose or appearance becomes crucial. One notable example is the American Sign Language
(ASL), whose lexicon is based mostly on the shape of the hand. In these cases modeling the shape
of a gesture as well as its dynamics is very important. Thus we extend the motion descriptor with
a shape feature computed on the depth map. If we assume the subject to be in front of the camera,
it is unlikely that the perspective transformation would distort his/her pose, shape or appearance,
therefore we can approximately work with invariance to translation and scale. We are interested
in characterizing shapes, and the gradient of the depth stream shows the highest responses on the
contours, thus studying the orientation of the gradient is a suitable choice. The classical Histograms
of Oriented Gradient (HOGs) (Dalal and Triggs, 2005) have been designed for detection purposes
and do not show the above-mentioned invariance; indeed dividing the image in cells makes each
sub-histogram dependent on the location and the dimension of the object. Furthermore, HOGs exhibit
a high spatial complexity, as the classical HOG descriptor belongs to
$\mathbb{R}^{(ncells \times nblocks \times n_2)}$. Since we
aim at preserving such invariance as well as limiting the computational complexity, we employed a
simpler descriptor, the Global Histogram of Oriented Gradient (GHOG). This appearance descriptor
produces an overall description of the appearance of the ROI without splitting the image in cells.
We compute the histogram of gradient orientations of the pixels on the entire ROI obtained from
the depth map to generate another descriptor $h(t) \in \mathbb{R}^{n_2}$, where $n_2$ is the number
of bins. The scale invariance property is preserved by normalizing the descriptor so that
$\sum_j h_j(t) = 1$. Computing this
descriptor on the depth map is fundamental in order to remove texture information; in fact, in this
context, the only visual properties we are interested in are related to shape.
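The following sketch shows one possible way to compute such a global histogram on a depth map. Several details are assumptions, since the text does not specify them: gradients are estimated with central differences, orientations are signed and binned over $[0, 2\pi)$, each pixel votes with its gradient magnitude, and border pixels are ignored. The function name `ghog` is hypothetical.

```python
import math


def ghog(depth, mask, n_bins=8):
    """Global histogram of depth-gradient orientations over the ROI, summing to 1."""
    rows, cols = len(depth), len(depth[0])
    hist = [0.0] * n_bins
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if not mask[i][j]:
                continue  # pixel outside the ROI
            # Central differences along the horizontal and vertical axes.
            gx = (depth[i][j + 1] - depth[i][j - 1]) / 2.0
            gy = (depth[i + 1][j] - depth[i - 1][j]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat area: no orientation
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = min(int(angle / (2.0 * math.pi) * n_bins), n_bins - 1)
            hist[b] += magnitude  # magnitude-weighted vote (assumption)
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Depth increasing along the columns: the gradient points along +x.
ramp_x = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
# Depth increasing along the rows: the gradient points along +y.
ramp_y = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
full_mask = [[True] * 3 for _ in range(3)]
h_x = ghog(ramp_x, full_mask)
h_y = ghog(ramp_y, full_mask)
```

Because a single histogram is accumulated over the whole ROI, the descriptor length is $n_2$ regardless of the ROI size and position, in contrast with the cell-based HOG layout discussed above.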
4.2.3 SPARSE CODING
At this stage, each frame $F_t$ is represented by two global descriptors: $z(t) \in \mathbb{R}^{n_1}$
for the motion component and $h(t) \in \mathbb{R}^{n_2}$ for the appearance component. Due to the
high variability of human actions and to the simplicity of the descriptors, a feature selection stage
is needed to capture the relevant information underlying the data and discard the redundant
information, such as background or body parts not involved in the action; to this aim we apply a
sparse coding stage to our descriptor.
Given the set of the previously computed 3DHOFs $Z = [z(1), \ldots, z(K)]$, where $K$ is the number
of all the frames in the training data, our goal is to learn one motion dictionary $D_M$ (an
$n_1 \times d_1$ matrix, with $d_1$ the dictionary size and $n_1$ the motion vector size) and the
codes $U_M$ (a $d_1 \times K$ matrix) that minimize the objective in Equation 1, so that
$z(t) \sim D_M u_M(t)$. In the same manner, we define the same optimization problem for a dictionary
$D_G$ (an $n_2 \times d_2$ matrix) and the codes $U_G$ (a $d_2 \times K$ matrix) for
the set of GHOG descriptors $H = [h(1), \ldots, h(K)]$. Therefore, after the Sparse Coding stage, we
can describe a frame as a code $u(i)$, which is the concatenation of the motion and appearance codes:
$u(i) = [u_M(i), u_G(i)]$.
Notice that we rely on global features, thus we do not need any pooling operator, which is
usually employed to summarize local features into a single one.
Figure 4: The figure illustrates on the left the SVMs scores (Equation 2) computed in real-time at
each time step t over a sequence of 170 frames. On the right the standard deviation of
the scores and its mean computed on a sliding window are depicted. The local minima of
the standard deviation function are break points that define the end of an action and the
beginning of another one. See Section 4.3.2 for details.
4.3 Learning and Recognition
The goal of this phase is to learn a model of a given action from data. Since we are implementing
a one-shot action recognition system, the available training data amounts to one training sequence
for each action of interest. In order to model the temporal extent of an action we extract sets of
sub-sequences from a sequence, each one containing $T$ adjacent frames. In particular, instead of
using single frame descriptors (described in Section 4.2), we move to a concatenation of frames: a
set of $T$ frames is represented as a sequence $[u(1), \ldots, u(T)]$ of codes. This representation
allows us to perform simultaneously detection and classification of actions.
The learning algorithm we adopt is the Support Vector Machine (SVM) (Vapnik, 1998). We
employ linear SVMs, since they can be implemented with constant complexity during the test phase,
thus fulfilling real-time requirements (Fan et al., 2008). Additionally, recent advances in the object
recognition field, such as Yang et al. (2009), showed that linear classifiers can effectively solve the
classification problem if a preliminary sparse coding stage has previously been applied. Our
experiments confirm these findings. Another advantage of linear SVMs is that they can be implemented
with a linear complexity in training (Fan et al., 2008); given this property, we can provide a real-time
one-shot learning procedure, extremely useful in real applications.
The remainder of the section describes in details the two phases of action learning and action
recognition.
Figure 5: The figure illustrates only the scores of the recognized actions via the method described
in Section 4.3.2. Blue dots are the break points computed by the video segmentation
algorithm that indicate the end of an action and the beginning of a new one.
4.3.1 ACTION LEARNING
Given a video $V_s$ of $t_s$ frames, containing only one action $A_s$, we compute a set of descriptors
$[u(1), \dots, u(t_s)]$ as described in Section 4.2. Then, action learning is carried out on a set of data that
are descriptions of a frame buffer $B_T(t)$, where $T$ is its length:
$$B_T(t) = (u(t - T), \dots, u(t - 1), u(t))^{\top}.$$
We use a one-versus-all strategy to train a binary linear SVM for each class $A_s$, so that at the end
of the training phase we obtain a set of $N$ linear SVM classifiers $f_1(\bar{B}), \dots, f_N(\bar{B})$, where $N$ is the
number of actions. In particular, in this one-shot learning pipeline, the set of buffers
$$B_s = [B_T(t_0), \dots, B_T(t_s)]$$
computed from the single video $V_s$ of the class $A_s$ are used as positive examples for the action $A_s$.
All the buffers belonging to $A_j$ with $j \neq s$ are the negative examples. Although we use only one
example for each class, we benefit from the chosen representation: indeed, descriptors are computed
per frame, therefore one single video of length $t_s$ provides a number of examples equal to $t_s - T$,
where $T$ is the buffer size. Given the training data $\{B, y\}$ where $B$ is the set of positive and negative
examples for the primitive $A_s$, $y_i = 1$ if the example is positive, $y_i = -1$ otherwise, the goal of SVM
is to learn a linear function $(w^{\top}, b)$ such that a new test vector $\bar{B}$ is predicted as:
$$y_{pred} = \mathrm{sign}(f(\bar{B})) = \mathrm{sign}(w^{\top}\bar{B} + b).$$
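The construction of the training set described above can be illustrated with a short sketch. It is not the authors' implementation: the function names (`make_buffers`, `one_vs_all_labels`, `decision`, `predict`) are hypothetical, the frame codes are assumed to be plain lists of floats, and the weights $(w, b)$ are assumed to come from an external linear SVM solver, which is not shown. Mapping a score of exactly zero to the positive class is also an arbitrary choice of this sketch.

```python
from typing import List, Sequence, Tuple

Code = List[float]    # one per-frame code u(t)
Buffer = List[float]  # flattened frame buffer B_T(t)


def make_buffers(codes: Sequence[Code], T: int) -> List[Buffer]:
    """Build B_T(t) = (u(t-T), ..., u(t)) for every t that has T predecessors."""
    buffers = []
    for t in range(T, len(codes)):
        window = codes[t - T:t + 1]  # T + 1 consecutive frame codes
        buffers.append([x for u in window for x in u])
    return buffers


def one_vs_all_labels(buffers_per_class: Sequence[Sequence[Buffer]],
                      s: int) -> Tuple[List[Buffer], List[int]]:
    """Buffers of class s are positives (+1); all other buffers are negatives (-1)."""
    X, y = [], []
    for j, buffers in enumerate(buffers_per_class):
        for b in buffers:
            X.append(list(b))
            y.append(1 if j == s else -1)
    return X, y


def decision(w: Sequence[float], b: float, buf: Sequence[float]) -> float:
    """Linear SVM score f(B) = w^T B + b."""
    return sum(wi * xi for wi, xi in zip(w, buf)) + b


def predict(w: Sequence[float], b: float, buf: Sequence[float]) -> int:
    """Sign of the score; a score of exactly zero is mapped to +1 here."""
    return 1 if decision(w, b, buf) >= 0 else -1
```

With a video of $t_s$ frames, `make_buffers` returns $t_s - T$ buffers, which matches the number of training examples stated in the text.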
4.3.2 ONLINE RECOGNITION: VIDEO SEGMENTATION
Given a test video $V$, which may contain one or more known actions, the goal is to predict the
sequence of the performed actions. The video is analyzed using a sliding window $B_T(t)$ of size $T$.
We compute the output score $f_i(B_T(t))$ of the $i = 1, \dots, N$ SVMs for each test buffer $B_T(t)$
and we filter these scores with a low-pass filter $W$ that attenuates noise. Therefore the new score at
time $t$ becomes:
$$H_i(B_T(t)) = W \star f_i(B_T(t)), \qquad i = 1, \dots, N, \qquad (2)$$
where $\star$ is the convolution operator. Figure 4 depicts an example of these scores computed
in real time. As the scores evolve we need to predict (online) when an action ends and
another one begins; this is achieved by computing the standard deviation $\sigma(H)$ for a fixed $t$ over all the
scores $H_i^t$ (Figure 4, right chart). When an action ends we can expect all the SVM output scores
to be similar, because no model should be predominant with respect to idle states; this produces a
local minimum in the function $\sigma(H)$. Therefore, each local minimum corresponds to the end of an
action and the beginning of a new one. Let $n$ be the number of local minima computed from the
standard deviation function; there will be $n + 1$ actions, and in particular the actions with the highest
score before and after each break point will be recognized. We can easily find these minima in
real time: we calculate the mean value of the standard deviation over time using a sliding window.
When the standard deviation falls below the mean, all the SVM scores are similar,
hence it is likely that an action has just ended. In Figure 5 the segmented and recognized actions
are shown together with their scores.
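The segmentation procedure of this section can be sketched as follows. This is an illustrative reading of the text, not the authors' code, and it makes several assumptions: the low-pass filter $W$ is replaced by a causal moving average, a break point is taken to be a local minimum of $\sigma$ that also lies below its sliding mean, and each segment is labeled with the class of highest mean score. All function names and the window sizes are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def low_pass(raw: Sequence[float], width: int) -> List[float]:
    """Causal moving average, standing in for the low-pass filter W (assumption)."""
    out = []
    for t in range(len(raw)):
        window = raw[max(0, t - width + 1):t + 1]
        out.append(sum(window) / len(window))
    return out


def score_spread(H: Sequence[Sequence[float]]) -> List[float]:
    """sigma(t): standard deviation across the N class scores at each time step."""
    return [pstdev(column) for column in zip(*H)]


def break_points(sigma: Sequence[float], window: int) -> List[int]:
    """Time steps where sigma is a local minimum and lies below its sliding mean."""
    points = []
    for t in range(1, len(sigma) - 1):
        recent = sigma[max(0, t - window + 1):t + 1]
        is_local_min = sigma[t] < sigma[t - 1] and sigma[t] <= sigma[t + 1]
        if is_local_min and sigma[t] < mean(recent):
            points.append(t)
    return points


def recognise(H: Sequence[Sequence[float]], points: Sequence[int]) -> List[int]:
    """Label each segment between break points with the class of highest mean score."""
    bounds = [0] + list(points) + [len(H[0])]
    labels = []
    for start, end in zip(bounds, bounds[1:]):
        if end > start:
            means = [mean(h[start:end]) for h in H]
            labels.append(means.index(max(means)))
    return labels
```

Here `H` holds one filtered score sequence per class. Because the local-minimum test looks one step ahead, this sketch reports a break point with a delay of one frame.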
5. Experiments
In this section we evaluate the performance of our system in three different settings:
• ChaLearn Gesture Data Set. The first experiment has been conducted on a publicly available
data set, released by ChaLearn (CGD2011). The main goal of the experiment is to
compare our method with other techniques.
• Kinect Data. In the second experiment we discuss how to improve the recognition rate using
all the functionalities of a real Kinect sensor. Gestures with a high level of detail are easily
captured by the system.
• Human-Robot Interaction. For the last experiment we considered a real HMI scenario: we
implement the system on a real robot, the iCub humanoid robot (Metta et al., 2008), showing
the applicability of our algorithm also in human-robot interaction settings.
To measure the accuracy of a sequence of estimated actions against the ground truth
sequence we use the normalized Levenshtein Distance (Levenshtein, 1966), defined as:
$$\mathrm{TeLev} = \frac{S + D + I}{M},$$
where each action is treated as a symbol in a sequence, S is the number of substitutions
(misclassifications), D the number of deletions (false negatives), I the number of insertions (false positives)
and M the length of the ground truth sequence. More specifically, this measure computes the minimum
number of modifications that are required to transform one sequence of events into another.
It is widely used in speech recognition contexts, where each symbol represents an event. In action
and gesture recognition, when sequences of gestures are to be evaluated, the Levenshtein Distance
is a particularly suitable metric, as it accounts not only for the accuracy of the single classifier,
but also for the capability of the algorithm to accurately distinguish different gestures in a
sequence (Minnen et al., 2006).
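As an illustration of the metric, the following sketch computes the edit distance with the standard dynamic-programming recurrence and normalizes it by the length of the ground truth sequence. The function names are hypothetical, and actions are assumed to be represented by any comparable symbols (for example integer class labels).

```python
from typing import Sequence


def levenshtein(truth: Sequence, predicted: Sequence) -> int:
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    prev = list(range(len(predicted) + 1))
    for i, a in enumerate(truth, start=1):
        curr = [i]
        for j, b in enumerate(predicted, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,           # deletion (missed action)
                            curr[j - 1] + 1,       # insertion (spurious action)
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(truth: Sequence, predicted: Sequence) -> float:
    """TeLev = (S + D + I) / M, with M the length of the ground truth sequence."""
    if len(truth) == 0:
        raise ValueError("ground truth sequence must not be empty")
    return levenshtein(truth, predicted) / len(truth)
```

For instance, if the ground truth is the sequence (1, 2, 3) and the system outputs (1, 3), one deletion is needed, so the normalized distance is 1/3. Note that the value can exceed 1 when the prediction contains many insertions.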
Figure 6: On the left, examples of 2 different batches from the ChaLearn Data Set (CGD2011).
On the right, the overall Levenshtein Distance computed in 20 batches with respect to
the buffer size parameter is depicted for both 3DHOF+GHOG features and descriptors
processed with sparse coding.
We empirically choose a quantization parameter for the 3DHOF, $n_1$, equal to 5, $n_2 = 64$ bins for
the GHOG descriptor, and dictionary sizes $d_1$ and $d_2$ equal to 256 for both motion and appearance
components. This led to a frame descriptor of size 189 for simple descriptors, which increases to
512 after the sparse coding processing. The whole system runs at 25 fps on a 2.4 GHz Core 2 Duo
processor.
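The two descriptor sizes can be checked with a short calculation. It assumes that the 3DHOF is a cubic histogram with $n_1$ bins along each of the three flow directions, so that it has $n_1^3$ entries; this assumption is not restated in this section, but it is consistent with the reported sizes.

```latex
\underbrace{n_1^3}_{\text{3DHOF}} + \underbrace{n_2}_{\text{GHOG}} = 5^3 + 64 = 125 + 64 = 189,
\qquad
\underbrace{d_1}_{\text{motion code}} + \underbrace{d_2}_{\text{appearance code}} = 256 + 256 = 512.
```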
5.1 ChaLearn Gesture Data Set
We first assess our method on the ChaLearn data set for the One-Shot Gesture Recognition Challenge
(Guyon et al., 2012), see Figure 6. The data set is organized in batches, where each batch
includes 100 recorded gestures grouped in sequences of 1 to 5 gestures arbitrarily performed at different
speeds. The gestures are drawn from a small vocabulary of 8 to 15 unique gestures called
lexicon, which is defined within a batch. For each video both RGB and Depth streams are provided,
but only one example is given for the training phase. In our experiments we do not use information
on the body pose of the human. We consider the batches from devel 01 to devel 20; each batch has
47 videos, where L (the lexicon size) videos are for training and the remaining are used as test data.
The main parameter of the system is the buffer size T; however, Figure 6 shows that
performance is stable for buffer sizes from 1 to 20, so it is not
a critical variable of our method. Furthermore, high performance over a wide range of buffer lengths
implies that our framework is able to handle different speeds implicitly. We compute the
Levenshtein Distance as the average over all the batches, which is 25.11% for features processed
with sparse coding, whereas simple 3DHOF+GHOG descriptors without sparse coding lead to an
error of 43.32%. Notably, each batch has its own lexicon and some of them are composed
only of gestures performed by hand or fingers; in these cases, if the GHOG is computed on the entire
ROI, the greatest contribution to the histogram comes from the body shape, whilst finger actions
(see Figure 2, bottom row) represent a small percentage of the final descriptor. If we consider batches
where the lexicon is not composed only of hand/finger gestures, the Levenshtein Distance reduces
to 15%.
We compared our method with several approaches. First of all a Template Matching technique,
where we used as descriptor the average of all depth frames for each action. The test video is split in
slices estimated using the average size of actions. In the recognition phase we classify each slice of
the video comparing it with all the templates. The overall Levenshtein Distance becomes 62.56%.
For the second comparison we employ the Dynamic Time Warping (DTW) method (Sakoe and Chiba,
1978) with 3DHOF + GHOG features. We manually divided test videos in order to facilitate the
recognition for DTW; nevertheless the global Levenshtein Distance is 49.41%. Finally we report
the results presented in some recent works in the field, which exploit techniques based on manifolds
(Lui, 2012), Motion History Image (MHI) (Wu et al., 2012), Bag of Visual Words (BoVW) (Wu
et al., 2012), 2D FFT-MHI (Mahbub et al., 2011) and Temporal Bayesian Model (TBM) with Latent
Dirichlet Allocation (LDA) (Malgireddy et al., 2012).
Method                              TeLev    TeLen
Sparse Representation (proposed)    25.11%   5.02%
3DHOF + GHOG                        43.32%   9.03%
Template Matching                   62.56%   15.12%
DTW                                 49.41%   Manual
Manifold LSR (Lui, 2012)            28.73%   6.24%
MHI (Wu et al., 2012)               30.01%   NA
Extended-MHI (Wu et al., 2012)      26.00%   NA
BoVW (Wu et al., 2012)              72.32%   NA
2D FFT-MHI (Mahbub et al., 2011)    37.46%   NA
TBM+LDA (Malgireddy et al., 2012)   24.09%   NA
Table 1: Levenshtein Distance on the ChaLearn Gesture Data Set. For SVM classification we chose
the appropriate buffer size for each batch according to the defined lexicon. TeLev is the
Levenshtein Distance, TeLen is the average error (false positives + false negatives) made
on the number of gestures (see text).
Table 1 shows that most of the compared approaches are outperformed by our method except for
Malgireddy et al. (2012); however the method proposed by Malgireddy et al. (2012) has a training
computational complexity of $O(n \times k^2)$ for each action class, where $k$ is the number of HMM states
and $n$ the number of examples, while the testing computational complexity for a video frame is
$O(k^2)$. Thanks to the sparse representation, we are able to use linear SVMs, which reduce the
training complexity with respect to the number of training examples to $O(n \times d)$ for each SVM,
where $d$ is the descriptor size. In our case $d$ is a constant value fixed a priori, and does not influence
the scalability of the problem. Therefore we may approximate the asymptotic behavior of the SVM
in training to $O(n)$. Similarly, in testing the complexity for each SVM is constant with respect
to the number of training examples when considering a single frame, and it becomes $O(N)$ for
the computation of all the $N$ class scores. This allows us to provide real-time training and testing
procedures with the considered lexicons.
Furthermore our online video segmentation algorithm shows excellent results with respect to
the temporal segmentation used in the compared frameworks; in fact it is worth noting that the
proposed algorithm leads to an action detection error rate $\mathrm{TeLen} = \frac{FP + FN}{M}$ equal to 5.02%, where
$FP$ and $FN$ are false positives and false negatives respectively, and $M$ is the number of all test
gestures. Considering the final results of the ChaLearn Gesture Challenge (Round 1, see footnote 1), we placed
9th out of 50 teams, but our method also fulfills real-time requirements for the entire pipeline, which
was not a requirement of the challenge.
5.1.1 MOTION VS APPEARANCE
In this section we evaluate the contribution of the frame descriptors. In general we notice that the
combination of both motion and appearance descriptors leads to the best results when the lexicon
is composed of actions where both motion and appearance are equally important. To show this, we
considered the 20 development batches from the ChaLearn Gesture Data Set. For this experiment,
we used only coded descriptors, since we have already observed that they obtain higher performance.
Using only the motion component, the Levenshtein Distance is equal to 62.89%, whereas
a descriptor based only on the appearance leads to an error of 34.15%. The error obtained using
only the 3DHOF descriptors was expected, due to the nature of the lexicons chosen: indeed in most
gestures the motion component has little significance. Considering instead batch devel 01, where
motion is an important component in the gesture vocabulary, the 3DHOF descriptors lead
to a Levenshtein Distance equal to 29.48%, the GHOG descriptors to 21.12% and the combination
to 9.11%. Results are consistent with previous findings, but in this specific case the gap
between the motion and the appearance components is not critical.
5.1.2 LINEAR VS NONLINEAR CLASSIFIERS
In this section we compare the performance of linear and nonlinear SVMs for the action recognition
task. The main advantage of a linear kernel is the computational time: nonlinear SVMs have a worst
case training computational complexity per class equal to $O(n^3 \times d)$ against the $O(n \times d)$ of linear
SVMs, where $n$ is the number of training examples, and $d$ is the descriptor size. In testing, nonlinear
SVMs show computational complexity of $O(n \times d)$ per frame, since the number of support vectors
grows linearly with $n$. Moreover, nonlinear classifiers usually require additional kernel parameter
estimation, which is not trivial, especially in one-shot learning scenarios. In contrast, linear SVMs
take $O(d)$ per frame. For this experiment we used coded features where both motion and appearance
are employed. A nonlinear SVM with RBF kernel has been employed, where the kernel parameter
and the SVM regularization term have been chosen empirically after 10 trials on a subset of the
batches. The Levenshtein Distance over the 20 batches is 35.11%; this result confirms that linear
classifiers are sufficient to obtain good results with low computational cost if an appropriate data
representation, such as the one offered by sparse coding, is adopted.
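The difference in per-frame test cost can be made concrete with a small sketch of the two decision functions. It is only illustrative: the function names are hypothetical, `alphas` is assumed to hold the signed dual coefficients of the support vectors, and `gamma` is the RBF kernel parameter; none of these values come from the paper.

```python
import math
from typing import Sequence


def linear_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """O(d): one dot product, independent of the number of training examples."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rbf_score(support_vectors: Sequence[Sequence[float]],
              alphas: Sequence[float], b: float, gamma: float,
              x: Sequence[float]) -> float:
    """O(n_sv * d): one kernel evaluation per support vector."""
    total = b
    for sv, alpha in zip(support_vectors, alphas):
        sq_dist = sum((si - xi) ** 2 for si, xi in zip(sv, x))
        total += alpha * math.exp(-gamma * sq_dist)
    return total
```

The linear score touches each of the $d$ descriptor entries once, whereas the RBF score repeats a $d$-dimensional distance computation for every support vector, which is where the $O(n \times d)$ cost per frame comes from when the number of support vectors grows with $n$.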
5.2 Kinect Data Set
In this section we assess the ability of our method to recognize more complex gestures captured by a
Kinect for Xbox 360 sensor. In Section 5.1, we noted that the resolution of the proposed appearance
descriptor is quite low and may not be ideal when actions differ by small details, especially on the
hands, therefore a localization of the interesting parts to model would be effective. The simplest way
to build in this specific information is to resort to a body part tracker; indeed, if a body tracker is
available it is easy to extract descriptors from different limbs and then concatenate
all the features to obtain the final frame representation. An excellent candidate to provide a reliable
body tracker is the Microsoft Kinect SDK, which implements the method in Shotton et al. (2011). This
tool retrieves the positions of the 20 principal body joints and the pose of the user. Given these
positions, we assign each 3D point of the ROI to its nearest joint, so that it is possible to correctly
isolate the two hands and the body from the rest of the scene (see Figure 7). Then, we slightly modify
the approach, computing 3DHOF and GHOG descriptors on three different body parts (left/right
hand and whole body shape); the final frame representation becomes the concatenation of all the
part descriptors. For the experiments we have acquired two different sets of data (see Figure 7):
in the first one the lexicon is composed of numbers performed with fingers, in the second one we
replicate the lexicon of batch devel 3 of the ChaLearn Gesture Data Set, the one where we obtained the
poorest performance. In Figure 7 on the left the overall accuracy is shown; using sparse coding
descriptors computed only on the body shape we obtain a Levenshtein Distance around 30%. By
concatenating descriptors extracted from the hands the system achieves 10% for features enhanced
with sparse coding and 20% for normal descriptors.
We compared our method with two previously mentioned techniques: a Template Matching
algorithm and an implementation of the Dynamic Time Warping approach (Sakoe and Chiba, 1978).
The resulting Levenshtein Distances are 52.47% and 42.36%, respectively.
Figure 7: On the right and bottom the two vocabularies used in Section 5.2; these gestures are
difficult to model without a proper body tracker, since most of the contribution to the
GHOG comes from the body shape rather than the hand. On the left the Levenshtein
Distance.
1. The leaderboard website is: https://www.kaggle.com.
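The nearest-joint assignment described in this section can be sketched as follows. This is a minimal illustration rather than the authors' code: the joint names, the `part_of_joint` mapping and the function names are hypothetical, and the joint positions are assumed to be already available from a body tracker as 3D coordinates.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_joint(point: Point, joints: Dict[str, Point]) -> str:
    """Name of the joint with the smallest squared Euclidean distance to the point."""
    return min(joints,
               key=lambda name: sum((p - q) ** 2 for p, q in zip(point, joints[name])))


def split_by_part(points: Sequence[Point], joints: Dict[str, Point],
                  part_of_joint: Dict[str, str]) -> Dict[str, List[Point]]:
    """Group ROI points by body part; joints not listed in the mapping count as 'body'."""
    parts: Dict[str, List[Point]] = {}
    for p in points:
        part = part_of_joint.get(nearest_joint(p, joints), "body")
        parts.setdefault(part, []).append(p)
    return parts
```

Descriptors would then be computed separately on the points of each part (left hand, right hand, body) and concatenated, as stated in the text.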
5.3 Human-Robot Interaction
The action recognition system has been implemented and tested on the iCub, a 53 degrees of freedom
humanoid robot developed by the RobotCub Consortium (Metta et al., 2008). The robot
is equipped with force sensors and gyroscopes, and it resembles a 3-year-old child. It mounts
two Dragonfly cameras, providing the basis for 3D vision, thus after an offline camera calibration
procedure we can rely on a full stereo vision system; here the depth map is computed following
Hirschmuller (2008). In this setting the action recognition system can be used for more general purposes
such as Human-Robot Interaction (HRI) or learning by imitation tasks. In particular our goal
is to teach iCub how to perform simple manipulation tasks, such as moving or grasping an object. In this
sense, we are interested in recognizing actions related to the arm-hand movements of the robot. We
define 8 actions, as shown in Figure 8, bottom row, according to the robot manipulation capabilities.
Each action is modeled using only the motion component (3DHOF), since we want the descriptor
to be independent of the particular object shape used.
In Figure 8 we show the accuracy based on the Levenshtein Distance; this measure has been
calculated on more than 100 actions composed of sequences of 1 to 6 actions. Notably the error
is less than 10%; these good results were expected due to the high discriminative power of the
3DHOFs (Figure 3) on the chosen lexicon, which leads to a linearly separable set.
Figure 8: Accuracy for action sequences (see bottom row). We evaluated the performance on more
than 100 actions composed of sequences of 1 to 6 actions.
6. All Gestures You Can: a Real Application
As pointed out in the previous sections, our approach was designed for real applications where real-time
requirements need to be fulfilled. We developed and implemented a “game” against a humanoid
robot, showing the effectiveness of our system in a real HRI setting: “All Gestures You Can” (Gori
et al., 2012), a game aiming at improving memory skills, visual association and concentration.
Our game takes inspiration from the classic “Simon” game; nevertheless, since the original version
has often been described as “visually boring”, we developed a revisited version, based on gesture
recognition, which involves a “less boring” opponent: the iCub (Metta et al., 2008). The human
and the robot take turns and perform the longest possible sequence of gestures by adding
one gesture at each turn: one player starts by performing a gesture, the opponent has to recognize the
gesture, imitate it and add another gesture to the sequence. The game continues until one of
the two players loses: the human player can lose because of limited memory skills, whereas the
robot can lose because the gesture recognition system fails. As described in the previous sections,
the system has been designed for one-shot learning; however, Kinect does not provide information
about finger configuration, therefore a direct mapping between human fingers and those of the iCub
is not immediate. Thus we set a predefined pool of 8 gestures (see Figure 9, on the left). The
typical game setting is shown in Figure 10: the player stays in front of the robot while performing
gestures that are recognized with Kinect. Importantly, hand gestures cannot be learned exploiting
the Skeleton Data of Kinect: the body tracker detects only the position of the hand, which is not enough
to discriminate more complicated actions; for example, see gesture classes 1 and 5 or 2 and 6 in
Figure 9.
Figure 9: On the left the hand gestures. The vision system has been trained using 8 different actors
performing each gesture class 3 times. On the right the game architecture. There are
three main modules that take care of recognizing the action sequence, defining the game
rules and making the robot gestures.
The system is simple and modular, as it is organized in three components (see Figure 9)
based on the iCub middleware, YARP (Metta et al., 2006), which manages the communication between
sensors, processors, and modules. The efficiency of the proposed implementation is assured
by its multithreading architecture, which also contributes to real-time performance. The software
presented in this section is available in the iCub repository (see footnote 2).
The proposed game has been played by more than 30 different players during the ChaLearn
Kinect Demonstration Competition at CVPR 2012 (see footnote 3). Most of them were completely naive, without
prior knowledge of the gestures. They were asked to play using a lexicon that had been trained
specifically for the competition (Figure 9). After 50 matches we had 75% of robot victories. This
result indicates that the recognition system is robust also to different players performing variable
gestures at various speeds. 15% of the matches were won by humans and usually finished
during the first 3-4 turns of the game; this always occurred when players performed gestures very
different from the trained ones. A few players (10% of matches) succeeded in playing
more than 8 turns, and they won due to recognition errors. “All Gestures You Can” ranked 2nd in
the ChaLearn Kinect Demonstration Competition.
Figure 10: The first two turns of a match. Left: the human player performs the first gesture of the
sequence. Center: iCub recognizes the gesture and imitates it. Right: iCub adds a new
random gesture to the sequence.
2. Code available at https://svn.code.sf.net/p/robotcub/code/trunk/iCub/contrib/src/demoGestureRecognition.
3. The competition website is http://gesture.chalearn.org/. A YouTube video of our game is available at
http://youtu.be/U_JLoe_fT3I.
7. Discussion
This paper presented the design and implementation of a complete action recognition system to be
used in real world applications such as HMI. We designed each step of the recognition pipeline to
function in real time while maximizing the overall accuracy. We showed how a sparse action representation
could be effectively used for one-shot learning of actions in combination with conventional
machine learning algorithms (i.e., SVM), even if the latter would normally require a larger set of
training data. The comprehensive evaluation of the proposed approach showed that we achieve a
good trade-off between accuracy and computation time. The main strengths of our learning and
recognition pipeline can be summarized as follows:
1. One-Shot Learning: one example is sufficient to teach a new action to the system; this is
mainly due to the effective per-frame representation.
2. Sparse Frame Representation: starting from a simple and computationally inexpensive description
that combines global motion (3DHOF) and appearance (GHOG) information over a
ROI, subsequently filtered through sparse coding, we obtained a sparse representation at each
frame. We showed that these global descriptors are appropriate to model actions of the upper
body of a person.
3. Online Video Segmentation: we propose a new online video segmentation
algorithm that achieved a 5% error rate on action detection on a set of 2000 actions
grouped in sequences of 1 to 5 gestures. This segmentation procedure works concurrently
with the recognition process, thus a sequence of actions is simultaneously segmented and
recognized.
4. Real-time Performance: the proposed system can be used in real-time applications, as it
requires neither complex feature processing nor computationally expensive training
and testing phases. From the computational point of view the proposed approach scales well
even for large vocabularies of actions.
5. Effectiveness in Real Scenarios: our method achieves good performance in a Human-Robot
Interaction setting, where the RGBD images are obtained through binocular vision and disparity
estimation. For testing purposes, we proposed a memory game, called “All Gestures
You Can”, where a person can challenge the iCub robot on action recognition and sequencing.
The system ranked 2nd at the Kinect Demonstration Competition (see footnote 4).
We stress here the simplicity of the learning and recognition pipeline: each stage is easy to implement
and fast to compute. It is shown to be adequate to solve the problem of gesture recognition; we
obtained high-quality results while fulfilling real-time requirements. The approach is competitive
against many of the state-of-the-art methods for action recognition.
We are currently working on a more precise appearance description at frame level, still under the
severe constraint of real-time performance; this would enable the use of more complex actions even
when the body tracker is not available.
4. The competition website is http://gesture.chalearn.org/.
Acknowledgments
This work was supported by the European FP7 ICT projects N. 270490 (EFAA) and N. 270273
(Xperience).
References
J.K. Aggarwal and M.S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 2011.
A. Ali and J.K. Aggarwal. Segmentation and recognition of continuous human activity. IEEE
Workshop on Detection and Recognition of Events in Video, 2001.
J. Alon, V. Athitsos, Y. Quan, and S. Sclaroff. A unified framework for gesture recognition and
spatiotemporal gesture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
A. Bisio, N. Stucchi, M. Jacono, L. Fadiga, and T. Pozzo. Automatic versus voluntary motor
imitation: Effect of visual context and stimulus velocity. PLoS ONE, 2010.
A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2001.
M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds of space-time interest points.
IEEE Conference on Computer Vision and Pattern Recognition, 2009.
M. J. Burden and D. B. Mitchell. Implicit memory development in school-aged children with
attention deficit hyperactivity disorder (ADHD): Conceptual priming deficit? In Developmental
Neurophysiology, 2005.
J. Cech, J. Sanchez-Riera, and R. Horaud. Scene flow estimation by growing correspondence seeds.
In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
ChaLearn Gesture Dataset (CGD2011). http://gesture.chalearn.org/data, 2011.
S.P. Chatzis, D. Kosmopoulos, and P. Doliotis. A conditional random field-based model for joint
sequence segmentation and classification. In Pattern Recognition, 2013.
C. Comoldi, A. Barbieri, C. Gaiani, and S. Zocchi. Strategic memory deficits in attention deficit
disorder with hyperactivity participants: The role of executive processes. In Developmental
Neurophysiology, 1999.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference
on Computer Vision and Pattern Recognition, 2005.
A. Destrero, C. De Mol, F. Odone, and A. Verri. A sparsity-enforcing method for learning face
features. IEEE Transactions on Image Processing, 18:188–201, 2009.
A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In International
Conference on Computer Vision, 2003.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned
dictionaries. IEEE Transactions on Image Processing, 2006.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large
linear classification. Journal of Machine Learning Research, 9, 2008.
S. R. Fanello, I. Gori, and F. Pirri. Arm-hand behaviours modelling: from attention to imitation. In
International Symposium on Visual Computing, 2010.
G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian
Conference on Image Analysis, 2003.
J. Feng, B. Ni, Q. Tian, and S. Yan. Geometric lp-norm feature pooling for image classification. In
IEEE Conference on Computer Vision and Pattern Recognition, 2011.
M. A. Giese and T. Poggio. Neural mechanisms for the recognition of biological movements. Nature
Reviews Neuroscience, 2003.
L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 29, 2007.
I. Gori, S. R. Fanello, F. Odone, and G. Metta. All gestures you can: a memory game against a
humanoid robot. IEEE-RAS International Conference on Humanoid Robots, 2012.
R.D. Green and L. Guan. Continuous human activity recognition. Control, Automation, Robotics
and Vision Conference, 2004.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. International Journal
of Machine Learning Research, 3:1157–1182, 2003.
I. Guyon, V. Athitsos, P. Jangyodsuk, B. Hammer, and H. J. E. Balderas. Chalearn gesture challenge:
Design and ﬁrst results. In Computer Vision and Pattern Recognition Workshops, 2012.
H. O. Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings
of the Cambridge Philosophical Society, 1935.
H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2008.
B. K. P. Horn and B. G. Shunk. Determining optical flow. Journal of Artificial Intelligence, 1981.
F. Huguet and F. Devernay. A variational method for scene flow estimation from stereo sequences.
In International Conference on Computer Vision, 2007.
I. Laptev and T. Lindeberg. Space-time interest points. In IEEE International Conference on
Computer Vision, 2003.
I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from
movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efﬁcient sparse coding algorithms. In Conference on
Neural Information Processing Systems, 2007.
V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet
Physics  Doklady, 1966.
W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3d points. In Computer Vision
and Pattern Recognition Workshops, 2010.
H.-Y. M. Liao, D.-Y. Chen, and S.-W. Shih. Continuous human action segmentation and recognition
using a spatiotemporal probabilistic framework. IEEE International Symposium on Multimedia,
2006.
Y. M. Lui. A least squares regression framework on manifolds and its application to gesture
recognition. In Computer Vision and Pattern Recognition Workshops, 2012.
F. Lv and R. Nevatia. Single view human action recognition using key pose matching and viterbi
path searching. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
U. Mahbub, H. Imtiaz, T. Roy, S. Rahman, and A. R. Ahad. Action recognition from one example.
Pattern Recognition Letters, 2011.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for
local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2008a.
J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE
Transactions on Image Processing, pages 53–69, 2008b.
M. R. Malgireddy, I. Nwogu, and V. Govindaraju. A temporal Bayesian model for classifying,
detecting and localizing activities in video sequences. Computer Vision and Pattern Recognition
Workshops, 2012.
G. Metta, P. Fitzpatrick, and L. Natale. YARP: Yet Another Robot Platform. International Journal
of Advanced Robotic Systems, 2006.
G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori. The iCub humanoid robot: an open platform
for research in embodied cognition. In Workshop on Performance Metrics for Intelligent Systems,
2008.
D. Minnen, T. Westeyn, and T. Starner. Performance metrics and evaluation issues for continuous
activity recognition. In Performance Metrics for Intelligent Systems Workshop, 2006.
D. L. Mumme. Early social cognition: understanding others in the ﬁrst months of life. Journal of
Infant and Child Development, 2001.
P. Natarajan and R. Nevatia. Coupled hidden semi Markov models for activity recognition. In
Workshop Motion and Video Computing, 2007.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed
by V1? Vision Research, 1997.
N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems,
Man and Cybernetics, 1979.
N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert. Highly accurate optic ﬂow computation
with theoretically justiﬁed warping. International Journal of Computer Vision, 2006.
R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 2010.
H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition.
IEEE International Conference on Acoustics, Speech and Signal Processing, 1978.
H. J. Seo and P. Milanfar. A template matching approach of one-shot-learning gesture recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
J. W. Schneider and P. Borlund. Matrix comparison, part 1: Motivation and important issues for
measuring the resemblance between proximity measures or ordination results. In Journal of the
American Society for Information Science and Technology, 2007.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake.
Real-time human pose recognition in parts from a single depth image. In IEEE Conference on
Computer Vision and Pattern Recognition, 2011.
C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking.
IEEE Conference on Computer Vision and Pattern Recognition, 1999.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.
M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In IEEE
International Conference on Computer Vision, 2007.
P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision,
57:137–154, 2004.
J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image
classiﬁcation. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3d action recognition with random
occupancy patterns. European Conference on Computer Vision, 2012.
A. Wedel, T. Brox, T. Vaudrey, C. Rabe, U. Franke, and D. Cremers. Stereoscopic scene ﬂow
computation for 3d motion understanding. International Journal of Computer Vision, 2010.
G. Willems, T. Tuytelaars, and L. Van Gool. An efﬁcient dense and scale-invariant spatio-temporal
interest point detector. European Conference on Computer Vision, 2008.
D. Wu, F. Zhu, and L. Shao. One shot learning gesture recognition from RGB-D images. In Computer
Vision and Pattern Recognition Workshops, 2012.
J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for
image classiﬁcation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.