Conference Paper · July 2016
Conference: 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, NY
3D Action Recognition Using Multi-temporal
Depth Motion Maps and Fisher Vector
Chen Chen1, Mengyuan Liu2, Baochang Zhang3, Jungong Han4, Junjun Jiang5, Hong Liu2
1Department of Electrical Engineering, University of Texas at Dallas
2Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University
3School of Automation Science and Electrical Engineering, Beihang University
4Department of Computer Science and Digital Technologies, Northumbria University
5School of Computer Science, China University of Geosciences
{chenchen870713, jungonghan77}@gmail.com, {liumengyuan, hongliu}@pku.edu.cn, bczhang@buaa.edu.cn, junjun0595@163.com
Abstract
This paper presents an effective local spatio-temporal descriptor for action recognition from depth video sequences. The unique property of our descriptor is that it takes shape discrimination and action speed variations into account, aiming to jointly solve the problems of distinguishing different pose shapes and identifying actions performed at different speeds. The entire algorithm is carried out in three stages. In the first stage, a depth sequence is divided into temporally overlapping depth segments which are used to generate three depth motion maps (DMMs), capturing the shape and motion cues. To cope with speed variations in actions, multiple frame lengths of depth segments are utilized, leading to a multi-temporal DMMs representation. In the second stage, all the DMMs are first partitioned into dense patches. Then, the local binary patterns (LBP) descriptor is exploited to characterize local rotation-invariant texture information in those patches. In the third stage, the Fisher kernel is employed to encode the patch descriptors into a compact feature representation, which is fed into a kernel-based extreme learning machine classifier. Extensive experiments on the public MSRAction3D, MSRGesture3D and DHA datasets show that our proposed method outperforms state-of-the-art approaches for depth-based action recognition.
1 Introduction
Action recognition plays a significant role in a number of computer vision applications such as context-based video retrieval, human-computer interaction and intelligent surveillance systems, e.g., [Chen et al., 2014a; 2014b; Bloom et al., 2012]. Previous works focus on recognizing actions captured by conventional RGB video cameras, e.g., [Wang and Schmid, 2013]. Based on compact local descriptors, state-of-the-art results have been achieved on benchmark RGB action datasets. However, these works suffer from several common problems such as varying lighting conditions and cluttered backgrounds, due to the limitations of conventional RGB video cameras.

* Equal contribution by Chen Chen (chenchen870713@gmail.com) and Mengyuan Liu (liumengyuan@pku.edu.cn). Correspondence to Baochang Zhang (bczhang@buaa.edu.cn). This research was supported by the National Natural Science Foundation of China (Grant No. 61272052 and 61473086), the Program for New Century Excellent Talents of the University of Ministry of Education of China, and the Royal Society Newton Mobility Grant (IE150997).
Recent progress has witnessed a shift of action recognition from conventional RGB cameras to depth cameras. Compared with RGB cameras, depth cameras have several advantages: 1) depth data is more robust to changes in lighting conditions, and depth cameras can even work in dark environments; 2) color and texture are ignored in depth images,
which makes the tasks of human detection and foreground
extraction from cluttered backgrounds easier [Yang and Tian,
2014]; 3) depth cameras provide depth images with appropri-
ate resolution and accuracy, which capture the 3D structure
information of subjects/objects in the scene [Ni et al., 2011];
4) human skeleton information (e.g., 3D joints positions and
rotation angles) can be efficiently estimated from depth im-
ages providing additional information for action recognition
[Shotton et al., 2011].
Since the release of cost-effective depth cameras (in partic-
ular Microsoft Kinect), more recent works on action recog-
nition have been conducted using depth images. Various
representations of depth sequences have been explored in-
cluding bag of 3D points [Li et al., 2010], spatio-temporal
depth cuboid [Xia and Aggarwal, 2013], depth motion maps
(DMMs) [Yang et al., 2012; Chen et al., 2013; 2015], sur-
face normals [Oreifej and Liu, 2013; Yang and Tian, 2014]
and skeleton joints [Vemulapalli et al., 2014]. Among those,
DMMs-based representations effectively transform the action
recognition problem from 3D to 2D and have been success-
fully applied to depth-based action recognition. Specifically,
DMMs [Yang et al., 2012] are obtained by projecting the depth frames onto three orthogonal Cartesian planes and accumulating the difference between projected maps over the
entire sequence. They can be used to describe the shape and
motion cues of a depth action sequence.
Figure 1: (a) An example depth action sequence (high wave). (b) The DMM of the front view projection generated using all the depth frames (60 frames) in the sequence. (c) Six DMMs of the front view projection generated using six different subsets of depth frames (e.g., frames 1-10, 11-20, 21-30, etc.) of the same sequence. The detailed motion of the hand waving action (e.g., raising the hand over the head and waving) is visible in the DMMs generated from subsets of depth frames; in other words, the waving motion in these DMMs is more obvious and clear than in the DMM generated from the entire depth sequence (all frames).
Motivation and contributions. However, DMMs based on
an entire depth sequence may not be able to capture detailed
temporal motion in a subset of depth images. Old motion his-
tory may get overwritten when a more recent action occurs at
the same point. We provide an example in Fig. 1 to illustrate
this limitation of DMMs in capturing detailed motion cues.
In addition, action speed variations may result in large intra-
class variations in DMMs. To this end, in this paper we de-
velop a novel local spatio-temporal descriptor which takes the
shape discrimination and action speed variations into account.
More specifically, we propose to divide a depth sequence into
overlapping segments and generate multiple sets of DMMs in
order to preserve more detailed motion cues that might be lost
in DMMs based on an entire sequence. To cope with speed
variations in actions, we employ different temporal lengths of
the depth segments, leading to a multi-temporal DMMs repre-
sentation. A set of local patch descriptors are then extracted
by partitioning all the DMMs into dense patches and using
the local binary patterns (LBP) [Ojala et al., 2002] descriptor
to characterize local rotation invariant texture information in
those patches. To build a compact representation, the Fisher
kernel [Perronnin et al., 2010] is adopted to encode the patch
descriptors. Our proposed approach is validated on several
benchmark depth datasets for human action recognition and
demonstrates superior performances over other state-of-the-
art approaches.
2 Related Work
In this section, we briefly review recent methods on action
recognition using depth information, which can be broadly
categorized into depth images-based, skeleton-based, and
depth and skeleton fusion-based methods. A comprehensive
review on action recognition from 3D data is provided in [Ag-
garwal and Lu, 2014].
In the field of 3D object retrieval, surface normal vectors
can efficiently reflect local shapes of 3D objects [Tang et al.,
2013]. By extending the same concept to the temporal dimension, [Oreifej and Liu, 2013] described a depth sequence by
a Histogram of Oriented Normal vectors in the 4D space of
depth, spatial coordinates and time (HON4D). To increase the
descriptive power of HON4D, [Rahmani et al., 2014a] characterized each 3D point by encoding a Histogram of Oriented
Principal Components (HOPC) within a volume around that
point, which is more informative than HON4D as it captures
the spread of data in three principal directions. To alleviate
the loss of information in the quantization step of constructing HON4D, [Kong et al., 2015] adopted the concept of surface
normal and proposed kernel descriptors to convert pixel-level
3D gradient into patch-level features. Rather than describing
a depth sequence by using surface normal vectors, [Rahmani
et al., 2015] divided a depth sequence into equal-sized spatio-temporal cells, which were represented by a Histogram of
Oriented 3D Gradients (HOG3D) and encoded by locality-constrained linear coding. [Lu et al., 2014] developed a binary descriptor by conducting a τ test to encode relative depth relationships among pairwise 3D points. [Chen et al., 2016]
presented a weighted fusion framework of combining 2D and
3D auto-correlation of gradients features from depth images
for action recognition. [Zhang and Tian, 2015] proposed an
effective descriptor, the Histogram of 3D Facets (H3DF), to
explicitly encode the 3D shape and structures of various depth
images by coding and pooling 3D Facets from depth images.
Estimating skeleton joints from depth images [Shotton et
al., 2011] provides a more intuitive way to perceive human
actions. Existing skeleton-based approaches can be broadly
grouped into joint-based and body part-based approaches.
[Wang et al., 2014] selected an informative subset of joints
for one specific action type, and extracted pairwise rela-
tive position features to represent each selected joint. In-
stead of using joint locations as features, [Vemulapalli et
al., 2014] represented skeletons as points in the Lie group $SE(3) \times \cdots \times SE(3)$, which explicitly models the 3D geometric relationships among human body parts.
Skeleton joints only reflect the state of the human body; therefore, skeleton-based methods achieve limited recognition rates in human-object interaction scenarios. To
improve the recognition performance using skeleton joints,
[Wang et al., 2014] proposed an ensemble model which associates local occupancy pattern features from depth images with skeleton joints. [Ohn-Bar and M. Trivedi, 2013] utilized pairwise similarities of joint angles to represent skeletons and
extracted HOG features involving depth information around
joints. These two can be considered as representatives of
combining skeleton joints and depth information. Although
multimodal fusion methods generally achieve good recogni-
tion performance, running a depth descriptor on top of a com-
plicated skeleton tracker makes such algorithms computation-
ally expensive, limiting their use in real-time applications.
3 Proposed Depth Video Representation
3.1 Multi-temporal Depth Motion Maps
According to [Chen et al., 2013], the DMMs of a depth sequence with $N$ frames are computed as follows:

$$DMM_{\{f,s,t\}} = \sum_{i=2}^{N} \left| map^{i}_{\{f,s,t\}} - map^{i-1}_{\{f,s,t\}} \right| \qquad (1)$$

where $map^{i}_{f}$, $map^{i}_{s}$ and $map^{i}_{t}$ indicate the three projected maps of the $i$th depth frame on the three orthogonal Cartesian planes corresponding to the front view ($f$), side view ($s$) and top view ($t$). As mentioned before, the DMMs based on the entire
depth sequence may not be able to capture the detailed mo-
tion cues. Therefore, to overcome this shortcoming, we di-
vide a depth sequence into a set of overlapping 3D depth seg-
ments with equal number of frames (i.e., same frame length
for each depth segment) and compute three DMMs for each
depth segment. Since different people may perform an action
in different speeds, we further employ multiple frame lengths
to represent multiple temporal resolutions to cope with action
speed variations. The proposed multi-temporal DMMs repre-
sentation framework is shown in Fig. 2. Take this figure as an
example, generating DMMs using the entire depth sequence
(i.e., all the frames in the sequence) is considered as a default
level of temporal resolution (denoted by Level 0 in Fig. 2). In
the second level (Level 1 in Fig. 2), the frame length (L1) of
a depth segment is set to 5 (i.e., 5 frames in a depth segment).
In the third level (Level 2 in Fig. 2), the frame length (L2)
of a depth segment is set to 10. Note that L1 and L2 can be
changed. Obviously, the computational complexity increases
with the increase of temporal levels. Therefore, we limit the
maximum number of levels to be 3 including a default level,
i.e., Level 0, which considers all the frames. The frame interval (R, with R < L1 and R < L2) in Fig. 2 is the number of frames between the starting frames of two neighboring depth segments, indicating how much the two segments overlap. For simplicity, we use the same R in Level 1 and Level 2.
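The multi-temporal construction above can be sketched in a few lines of numpy. This is an illustrative simplification, not the authors' code: it works on a single projection view (e.g., the front-view maps), treats the projected maps as given 2D arrays, and uses the paper's default frame lengths L1 = 7, L2 = 14 and interval R = 3.

```python
import numpy as np

def segment_starts(n_frames, seg_len, interval):
    """Starting indices of overlapping segments of length seg_len,
    shifted by `interval` frames (R < L in the paper)."""
    return list(range(0, n_frames - seg_len + 1, interval))

def dmm(maps):
    """Eq. (1): accumulate |map_i - map_{i-1}| over a segment."""
    m = np.asarray(maps, dtype=float)
    return np.abs(np.diff(m, axis=0)).sum(axis=0)

def multi_temporal_dmms(maps, lengths=(7, 14), interval=3):
    """Level 0 uses the whole sequence; each additional level slides
    segments of the given frame length with step `interval`."""
    out = [dmm(maps)]  # Level 0: entire sequence
    for L in lengths:
        for s in segment_starts(len(maps), L, interval):
            out.append(dmm(maps[s:s + L]))
    return out
```

In the full method, the same segmentation would be applied to the side-view and top-view projections as well, yielding three sets of multi-temporal DMMs.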
3.2 Patch-based LBP Features
DMMs can effectively capture the shape and motion cues of
a depth sequence. However, DMMs are pixel-level features.
To enhance the discriminative power of DMMs, we adopt the
patch-based LBP feature extraction approach in [Chen et al., 2015] to characterize the rich texture information (e.g., edges,
contours, etc.) in the LBP coded DMMs. Fig. 3 shows the
process of patch-based LBP feature extraction. The overlap
between two patches is controlled by the pixel shift (ps) illus-
trated in Fig. 3. Under each projection view, a set of patch-
based LBP histogram features are generated to describe the
corresponding multi-temporal DMMs. Therefore, three feature matrices Hf, Hs and Ht are generated, associated with the front view DMMs, side view DMMs and top view DMMs,
respectively. Each column of the feature matrix (e.g., Hf) is
a histogram feature vector of a local patch.
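A minimal sketch of this patch-based extraction follows. For brevity it assumes a plain 8-neighbour, 256-bin LBP rather than the rotation-invariant uniform variant of [Ojala et al., 2002] actually used in the paper; the dense-patch layout follows the pixel-shift idea (patch size and ps here are illustrative defaults).

```python
import numpy as np

def lbp8(img):
    """Basic 8-neighbour LBP code for each interior pixel: each bit is
    set when the neighbour is >= the centre pixel."""
    c = img[1:-1, 1:-1]
    neighbours = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:],
                  img[1:-1, 2:], img[2:, 2:], img[2:, 1:-1],
                  img[2:, :-2], img[1:-1, :-2]]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        codes |= (n >= c).astype(np.uint8) << bit
    return codes

def patch_histograms(codes, patch=8, shift=5, bins=256):
    """Dense overlapping patches (the pixel shift ps controls overlap);
    one normalised histogram per patch, collected column-wise."""
    feats = []
    h, w = codes.shape
    for y in range(0, h - patch + 1, shift):
        for x in range(0, w - patch + 1, shift):
            hist, _ = np.histogram(codes[y:y + patch, x:x + patch],
                                   bins=bins, range=(0, bins))
            feats.append(hist / hist.sum())
    return np.array(feats).T  # each column = one patch descriptor
```

Running this over the front-, side- and top-view multi-temporal DMMs would produce the three feature matrices Hf, Hs and Ht described above.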
3.3 A Fisher Kernel Representation
Fisher kernel representation [Perronnin et al., 2010] is an effective patch aggregation mechanism to characterize a set of
low-level features, which shows superior performance over
the popular Bag-of-Visual-Words (BoVW) model. Therefore,
we employ the Fisher kernel to build a compact and descrip-
tive representation of the patch-based LBP features.
Figure 2: Proposed multi-temporal DMMs representation of a depth sequence.
Figure 3: Patch-based LBP feature extraction.
Let $H = \{h_i \in \mathbb{R}^D, 1 \le i \le M\}$ be a set of $M$ $D$-dimensional patch-based LBP feature vectors extracted from the multi-temporal DMMs of a particular projection view (e.g., front view) for a depth sequence. By assuming statistical independence, $H$ can be modeled by a $K$-component Gaussian mixture model (GMM):
$$p(H|\theta) = \prod_{i=1}^{M} \sum_{k=1}^{K} \omega_k \, \mathcal{N}(h_i \,|\, \mu_k, \Sigma_k), \qquad (2)$$

where $\theta = \{\omega_k, \mu_k, \Sigma_k\},\ k = 1, \ldots, K$ is the parameter set with mixing weights $\omega_k$, means $\mu_k$ and diagonal covariance matrices $\Sigma_k$ with variance vectors $\sigma_k^2$. These GMM parameters can be estimated by the Expectation-Maximization (EM) algorithm on a training feature set.
Two $D$-dimensional gradients with respect to the mean vector $\mu_k$ and standard deviation $\sigma_k$ of the $k$th Gaussian component are defined as

$$\rho_k = \frac{1}{M\sqrt{\omega_k}} \sum_{i=1}^{M} \gamma_{k,i} \, \frac{h_i - \mu_k}{\sigma_k}, \qquad
\tau_k = \frac{1}{M\sqrt{2\omega_k}} \sum_{i=1}^{M} \gamma_{k,i} \left[ \left( \frac{h_i - \mu_k}{\sigma_k} \right)^2 - 1 \right], \qquad (3)$$

where $\gamma_{k,i}$ is the posterior probability that $h_i$ belongs to the $k$th Gaussian component. The Fisher vector (FV) of $H$ is represented as $\Phi(H) = (\rho_1^T, \tau_1^T, \ldots, \rho_K^T, \tau_K^T)^T$. The dimensionality of the FV is $2KD$.

A power-normalization step introduced in [Perronnin et al., 2010], i.e., signed square rooting (SSR) followed by $\ell_2$ normalization, is applied to counter the sparseness of the FV:

$$T(\Phi(H)) = \mathrm{sgn}(\Phi(H)) \cdot |\Phi(H)|^{\alpha}, \quad 0 < \alpha \le 1. \qquad (4)$$
Let Hf, Hs and Ht denote the three sets of patch-based LBP feature vectors from the three projection views; each depth sequence is then represented by concatenating the three FVs [Φ(Hf); Φ(Hs); Φ(Ht)] as the final feature representation.
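Assuming the GMM parameters (ω_k, μ_k, σ_k) have already been fitted by EM, the FV encoding of Eqs. (2)-(4) can be sketched as follows (α = 0.5, i.e. signed square rooting; the helper name and interface are illustrative):

```python
import numpy as np

def fisher_vector(H, w, mu, sigma, alpha=0.5):
    """Fisher vector of patch descriptors H (M x D) under a diagonal
    GMM with weights w (K,), means mu (K x D) and std sigma (K x D),
    followed by power (SSR) and l2 normalisation (Eqs. 3-4)."""
    M, D = H.shape
    # log-density of each descriptor under each diagonal Gaussian + log weight
    log_p = (-0.5 * (((H[:, None, :] - mu) / sigma) ** 2
                     + np.log(2 * np.pi * sigma ** 2)).sum(-1)
             + np.log(w))                                  # (M, K)
    g = np.exp(log_p - log_p.max(1, keepdims=True))
    g /= g.sum(1, keepdims=True)                           # posteriors gamma_{k,i}
    z = (H[:, None, :] - mu) / sigma                       # (M, K, D)
    rho = (g[..., None] * z).sum(0) / (M * np.sqrt(w))[:, None]
    tau = (g[..., None] * (z ** 2 - 1)).sum(0) / (M * np.sqrt(2 * w))[:, None]
    fv = np.stack([rho, tau], axis=1).reshape(-1)          # (rho_1, tau_1, ..., rho_K, tau_K)
    fv = np.sign(fv) * np.abs(fv) ** alpha                 # SSR, Eq. (4)
    return fv / (np.linalg.norm(fv) + 1e-12)               # l2 normalisation
```

The three per-view FVs computed this way would then be concatenated into the final representation.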
4 Experiments
In this section we extensively evaluate our proposed method
on three public benchmark datasets: MSRAction3D [Li et al.,
2010], MSRGesture3D [Wang et al., 2012] and DHA [Lin et al., 2012]. We employ kernel-based extreme learning machine (KELM) [Huang et al., 2006] with a radial basis function (RBF) kernel as the classifier due to its generally good classification performance and efficient computation.
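The kernel-ELM formulation admits a closed-form training solution: with training kernel matrix Ω and one-hot targets T, the output weights are β = (I/C + Ω)⁻¹ T, and a test sample is assigned the class with the largest entry of k(x)ᵀβ. A minimal numpy sketch (the regularization constant C and RBF width γ here are illustrative, not the paper's tuned values):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_fit(X, y, C=100.0, gamma=1.0):
    """Closed-form KELM training: beta = (I/C + Omega)^-1 T,
    where Omega is the training kernel matrix and T one-hot labels."""
    n = len(X)
    T = np.eye(y.max() + 1)[y]                  # one-hot targets
    omega = rbf(X, X, gamma)
    return np.linalg.solve(np.eye(n) / C + omega, T)

def kelm_predict(X_train, beta, X_test, gamma=1.0):
    """Class = argmax of the kernel expansion k(x)^T beta."""
    return rbf(X_test, X_train, gamma).dot(beta).argmax(1)
```

In our pipeline, X would be the concatenated Fisher vectors of the training sequences and y their action labels.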
4.1 Datasets
MSRAction3D dataset [Li et al., 2010] is one of the most
popular depth datasets for action recognition as reported in
the literature. It contains 20 actions: “high arm wave”,
“horizontal arm wave”, “hammer”, “hand catch”, “forward
punch”, “high throw”, “draw x”, “draw tick”, “draw circle”,
“hand clap”, “two hand wave”, “sideboxing”, “bend”, “for-
ward kick”, “side kick”, “jogging”, “tennis swing”, “tennis
serve”, “golf swing”, “pick up & throw”. Each action is per-
formed 2 or 3 times by 10 subjects facing the depth camera. It
is a challenging dataset due to similarity of actions and large
speed variations in actions.
MSRGesture3D dataset [Wang et al., 2012] is a bench-
mark dataset for depth-based hand gesture recognition. It
consists of 12 gestures defined by American Sign Language:
“bathroom”, “blue”, “finish”, “green”, “hungry”, “milk”,
“past”, “pig”, “store”, “where”, “j”, “z”. Each action is per-
formed 2 or 3 times by each subject, resulting in 336 depth
sequences.
The DHA dataset was proposed in [Lin et al., 2012]; its action types are extended from the Weizmann dataset [Gorelick et al., 2007], which is widely used in action recognition
Figure 4: Actions “drawTick” (left) and “drawX” (right) in the MSRAction3D dataset.
Figure 5: Actions “milk” (left) and “hungry” (right) in the MSRGesture3D dataset.
Figure 6: Action snaps in the DHA dataset.
from RGB sequences. It contains 23 action categories: “arm-
curl”, “arm-swing”, “bend”, “front-box”, “front-clap”, “golf-
swing”, “jack”, “jump”, “kick”, “leg-curl”, “leg-kick”, “one-
hand-wave”, “pitch”, “pjump”, “rod-swing”, “run”, “skip”,
“side”, “side-box”, “side-clap”, “tai-chi”, “two-hand-wave”,
“walk”. Each action is performed by 21 subjects (12 males
and 9 females), resulting in 483 depth sequences.
4.2 Experimental Settings
Several action snaps from the three datasets are shown in
Figs. 4-6, where inter-class similarity among different types of actions can be observed. In the MSRAction3D dataset, actions such
as “drawX” and “drawTick” are similar except for a slight dif-
ference in the movement of one hand. In the MSRGesture3D
dataset, actions such as “milk” and “hungry” are alike, since both involve the motion of bending the palm. Moreover, self-occlusion is also a challenge for this dataset. In
the DHA dataset, “golf-swing” and “rod-swing” actions share
similar motions by moving hands from one side up to the
other side. More similar pairs can be found in “leg-curl” and
“leg-kick”, “run” and “walk”, etc.
In order to keep the reported results consistent with other
works, we follow the same evaluation protocols as in [Wang et al., 2014], [Wang et al., 2012] and [Lin et al., 2012], respectively, for the three datasets.
We adopt the same parameter values as in [Chen et al., 2015] for the patch sizes and the LBP operator parameters in our
method. The other parameters are determined empirically.
The overall accuracies on three datasets with different param-
eters are shown in Figure 7, where frame length L1, frame
length L2, frame interval R, pixel shift ps and the number of
Gaussians (K) respectively change from 3 to 11, 10 to 18, 1
to 5, 3 to 7 and 20 to 100 at equal intervals. Experiments are conducted by varying one parameter at a time while keeping the others at their default values: L1 = 7, L2 = 14, R = 3, ps = 5 and K = 60. From Figure 7, we can see that accuracies above 90% are achieved across different parameter settings, which reflects the robustness of our method to parameter choices. Since the default parameters work well for all three datasets, the following experiments use these default values.
Figure 7: Recognition accuracies with changing parameters.
Table 1: Recognition accuracy (%) and average feature computation
time (s) of our method with different numbers of temporal levels on
the MSRAction3D dataset.
Temporal levels Accuracy Time/sequence (s)
1 level (Level 0) 89.95% 0.35
2 levels (Level 0, Level 1) 93.34% 2.51
3 levels (Level 0, Level 1, Level 2) 95.97% 4.49
In our method, we use three levels for the multi-temporal
DMMs representation. We test the algorithm on the MSRAc-
tion3D dataset using different numbers of temporal levels.
The recognition accuracy and average feature computation
time are reported in Table 1. It is worth mentioning that our
algorithm is implemented in MATLAB and executed on CPU
platform with an Intel(R) Core(TM) i7 CPU @ 2.60GHz and 8GB of RAM. Efficiency can easily be improved by converting the code to C++ and computing the multi-temporal DMMs representation in parallel.
4.3 Results on the MSRAction3D Dataset
In Figure 8 (a), we show the confusion matrix on the MSRAction3D dataset, with an accuracy of 95.97%. It is observed
that large ambiguities exist between similar action pairs, for
example “handCatch” and “highThrow”, and “drawX” and
“drawTick”, due to the similarities of their DMMs. We
also compare our method with the state-of-the-art methods in Table 2. “Moving Pose” [Zanfir et al., 2013], “Skeletons in a Lie group” [Vemulapalli et al., 2014] and “Skeletal Quads” [Evangelidis et al., 2014] are skeleton-based features, and “Actionlet Ensemble” [Wang et al., 2014] is based on fused skeleton and depth features. Our method outperforms these methods for two reasons: first, skeleton
Table 2: Recognition accuracy (%) comparison on the MSRAc-
tion3D dataset.
Method Accuracy
Bag of 3D Points [Li et al., 2010] 74.70%
Random Occupancy Pattern [Wang et al., 2012] 86.50%
Actionlet Ensemble [Wang et al., 2014] 88.20%
Depth Motion Maps [Yang et al., 2012] 88.73%
HON4D [Oreifej and Liu, 2013] 88.89%
DSTIP [Xia and Aggarwal, 2013] 89.30%
H3DF [Zhang and Tian, 2015] 89.45%
Skeletons in a Lie group [Vemulapalli et al., 2014] 89.48%
Skeletal Quads [Evangelidis et al., 2014] 89.86%
HOG3D+LLC [Rahmani et al., 2015] 90.90%
Moving Pose [Zanfir et al., 2013] 91.70%
Hierarchical 3D Kernel [Kong et al., 2015] 92.73%
DMM-LBP-DF [Chen et al., 2015] 93.00%
Super Normal Vector [Yang and Tian, 2014] 93.09%
Hierarchical RNN [Du et al., 2015] 94.49%
Range-Sample [Lu et al., 2014] 95.62%
Our Method 95.97%
joints used by these methods contain considerable noise, which introduces ambiguity when distinguishing similar actions; second, our method directly uses DMMs, which provide richer motion information. Our result is also better than recent depth-based features such as Super Normal Vector [Yang and Tian, 2014] and Range-Sample [Lu et al., 2014], which demonstrates the superior discriminatory power of our multi-temporal DMMs representation.
Table 3: Recognition accuracy (%) comparison on the MSRGes-
ture3D dataset.
Method Accuracy
Random Occupancy Pattern [Wang et al., 2012] 88.50%
HON4D [Oreifej and Liu, 2013] 92.45%
HOG3D+LLC [Rahmani et al., 2015] 94.10%
DMM-LBP-DF [Chen et al., 2015] 94.60%
Super Normal Vector [Yang and Tian, 2014] 94.74%
H3DF [Zhang and Tian, 2015] 95.00%
Depth Gradients+RDF [Rahmani et al., 2014b] 95.29%
Hierarchical 3D Kernel [Kong et al., 2015] 95.66%
HOPC [Rahmani et al., 2014a] 96.23%
Our Method 98.19%
4.4 Results on the MSRGesture3D Dataset
In Figure 8 (b), we show the confusion matrix on the MSRGesture3D dataset, with an accuracy of 98.19%. It is observed
that similar action pairs like “milk” and “hungry” can be dis-
tinguished with high accuracy. We compare our method with
several existing methods in Table 3. As can be seen from this
table, our method outperforms Histogram of Oriented Principal Components (HOPC) [Rahmani et al., 2014a] by 1.96%,
leading to a new state-of-the-art result.
4.5 Results on the DHA Dataset
In Figure 8 (c), we present the confusion matrix of our
method on the DHA dataset, with an accuracy of 95.44%.
The DHA dataset was originally collected by [Lin et al., 2012] and contained only 17 action categories. We use an extended version of the DHA dataset in which 6 extra action cate-
Figure 8: Confusion matrices of our method on the (a) MSRAction3D, (b) MSRGesture3D and (c) DHA datasets.
Table 4: Recognition accuracy (%) comparison on the DHA dataset.
Method Accuracy
D-STV/ASM [Lin et al., 2012] 86.80%
SDM-BSM [Liu et al., 2015] 89.50%
D-DMHI-PHOG [Gao et al., 2015] 92.40%
DMPP-PHOG [Gao et al., 2015] 95.00%
Our Method 95.44%
gories are included. [Lin et al., 2012] split depth sequences into space-time volumes and constructed 3-bit binary patterns as depth features, achieving an accuracy of 86.80% on
the original dataset. By incorporating multi-temporal infor-
mation to the DMMs, our proposed method achieves higher
accuracy even on the extended DHA dataset. In Table 4, we
observe that our method outperforms D-DMHI-PHOG [Gao et al., 2015] by 3.04% and DMPP-PHOG [Gao et al., 2015] by 0.44%. These improvements show that applying LBP to multi-temporal DMMs produces more informative features than applying PHOG to the depth difference motion history image (D-MHI).
4.6 Execution Rate and Frame Rate Invariance
Regarding execution rate invariance, we calculated statistics for the MSRAction3D dataset, which contains actions executed by different subjects at different rates. To be precise, there are 20 actions, each executed by 10 subjects 2 or 3 times. The standard deviation of the sequence lengths (numbers of frames) across the actions is 9.21 frames (max: 13.30 frames; min: 4.86 frames), which means that the execution rate differences are actually quite large. In view of the achieved 95.97% recognition rate, we would say that our algorithm is resistant to execution rate variations.
To test the effect of frame rate differences, we carry out an experiment on the MSRAction3D dataset. Specifically, we use the sequences performed by subjects 1, 3, 5, 7, 9 (the original action samples) as training data. We then keep only the odd-numbered frames (frames 1, 3, 5, ...) of the sequences performed by subjects 2, 4, 6, 8, 10 to form a set of new testing samples at 1/2 of the original frame rate. Our method still achieves a recognition accuracy of 93.27%. Therefore, our proposed algorithm is capable of dealing with frame rate changes, considering that halving the frame rate is already a drastic reduction unlikely to occur in practice.
5 Conclusion
This paper presents an effective feature representation for
action recognition from depth sequences. A multi-temporal
DMMs representation is proposed to capture more temporal
motion information in depth sequences for better distinguish-
ing similar actions. Multiple temporal resolutions in the pro-
posed representation can also cope with the speed variations
in actions. Patch-based LBP features are extracted from dense
patches in the DMMs and the Fisher kernel representation is
utilized to aggregate local patch features into a compact and
discriminative representation. The proposed method is extensively evaluated on three benchmark datasets. Experimental results show that our method outperforms state-of-the-art methods on all three datasets.
References
[Aggarwal and Lu, 2014] J. K. Aggarwal and Xia Lu. Human activity recognition from 3D data: A review. PRL, 48(1):70–80, 2014.
[Bloom et al., 2012] Victoria Bloom, Dimitrios Makris, and Vasileios Argyriou. G3D: A gaming action dataset and real time action recognition evaluation framework. In CVPRW, pages 7–12, 2012.
[Chen et al., 2013] Chen Chen, Kui Liu, and Nasser Kehtarnavaz. Real-time human action recognition based on depth motion maps. Journal of Real-Time Image Processing, pages 1–9, 2013.
[Chen et al., 2014a] Chen Chen, Nasser Kehtarnavaz, and Roozbeh Jafari. A medication adherence monitoring system for pill bottles based on a wearable inertial sensor. In EMBC, pages 4983–4986, 2014.
[Chen et al., 2014b] Chen Chen, Kui Liu, Roozbeh Jafari, and Nasser Kehtarnavaz. Home-based senior fitness test measurement system using collaborative inertial and depth sensors. In EMBC, pages 4135–4138, 2014.
[Chen et al., 2015] Chen Chen, R. Jafari, and N. Kehtarnavaz. Action recognition from depth sequences using depth motion maps-based local binary patterns. In WACV, pages 1092–1099, 2015.
[Chen et al., 2016] Chen Chen, Baochang Zhang, Zhenjie Hou, Junjun Jiang, Mengyuan Liu, and Yun Yang. Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications, pages 1–19, 2016.
[Du et al., 2015] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, pages 1110–1118, 2015.
[Evangelidis et al., 2014] Georgios Evangelidis, Gurkirt Singh, and Radu Horaud. Skeletal quads: Human action recognition using joint quadruples. In ICPR, pages 4513–4518, 2014.
[Gao et al., 2015] Z. Gao, H. Zhang, G. P. Xu, and Y. B. Xue. Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition. Neurocomputing, 151:554–564, 2015.
[Gorelick et al., 2007] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. TPAMI, 29(12):2247–2253, 2007.
[Huang et al., 2006] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1):489–501, 2006.
[Kong et al., 2015] Yu Kong, B. Satarboroujeni, and Yun Fu. Hierarchical 3D kernel descriptors for action recognition using depth sequences. In FG, pages 1–6, 2015.
[Li et al., 2010] Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3D points. In CVPRW, pages 9–14, 2010.
[Lin et al., 2012] Yan-Ching Lin, Min-Chun Hu, Wen-Huang Cheng, Yung-Huan Hsieh, and Hong-Ming Chen. Human action recognition and retrieval using sole depth information. In ACM MM, pages 1053–1056, 2012.
[Liu et al., 2015] Hong Liu, Lu Tian, Mengyuan Liu, and Hao Tang. SDM-BSM: A fusing depth scheme for human action recognition. In ICIP, pages 4674–4678, 2015.
[Lu et al., 2014] Cewu Lu, Jiaya Jia, and Chi Keung Tang. Range-sample depth feature for action recognition. In CVPR, pages 772–779, 2014.
[Ni et al., 2011] Bingbing Ni, Gang Wang, and P. Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In ICCVW, pages 1147–1153, 2011.
[Ohn-Bar and M. Trivedi, 2013] Eshed Ohn-Bar and Mohan M. Trivedi. Joint angles similarities and HOG2 for action recognition. In CVPRW, pages 465–470, 2013.
[Ojala et al., 2002] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI, 24(7):971–987, 2002.
[Oreifej and Liu, 2013] O. Oreifej and Zicheng Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In CVPR, pages 716–723, 2013.
[Perronnin et al., 2010] Florent Perronnin, Jorge Sanchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156, 2010.
[Rahmani et al., 2014a] Hossein Rahmani, Arif Mahmood, Du Q. Huynh, and Ajmal Mian. HOPC: Histogram of Oriented Principal Components of 3D pointclouds for action recognition. Springer International Publishing, 2014.
[Rahmani et al., 2014b] Hossein Rahmani, Arif Mahmood, Du Q. Huynh, and Ajmal Mian. Real time action recognition using histograms of depth gradients and random decision forests. In WACV, pages 626–633, 2014.
[Rahmani et al., 2015] Hossein Rahmani, Du Q. Huynh, Arif Mahmood, and Ajmal Mian. Discriminative human action classification using locality-constrained linear coding. PRL, 2015.
[Shotton et al., 2011] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, pages 1297–1304, 2011.
[Tang et al., 2013] Shuai Tang, Xiaoyu Wang, Xutao Lv, Tony X. Han, James Keller, Zhihai He, Marjorie Skubic, and Shihong Lao. Histogram of Oriented Normal Vectors for object recognition with a depth sensor. Springer Berlin Heidelberg, 2013.
[Vemulapalli et al., 2014] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3D human skeletons as points in a Lie group.
In CVPR, pages 588–595, 2014.
[Wang and Schmid, 2013]Heng Wang and Cordelia
Schmid. Action recognition with improved trajectories.
In ICCV, pages 3551–3558, 2013.
[Wang et al., 2012]Jiang Wang, Zicheng Liu, Jan
Chorowski, Zhuoyuan Chen, and Ying Wu. Robust
3d action recognition with random occupancy patterns. In
ECCV, pages 872–885, 2012.
[Wang et al., 2014]Jiang Wang, Zicheng Liu, Ying Wu, and
Junsong Yuan. Learning actionlet ensemble for 3D human
action recognition. TPAMI, 36(5):914–927, 2014.
[Xia and Aggarwal, 2013]Lu Xia and J. K. Aggarwal.
Spatio-temporal depth cuboid similarity feature for activ-
ity recognition using depth camera. In CVPR, pages 2834–
2841, 2013.
[Yang and Tian, 2014]Xiaodong Yang and YingLi Tian. Su-
per normal vector for activity recognition using depth se-
quences. In CVPR, pages 804–811, 2014.
[Yang et al., 2012]Xiaodong Yang, Chenyang Zhang, and
Ying Li Tian. Recognizing actions using depth motion
maps-based histograms of oriented gradients. ACM MM,
pages 1057–1060, 2012.
[Zanfir et al., 2013]Mihai Zanfir, Marius Leordeanu, and
Cristian Sminchisescu. The moving pose: An efficient
3d kinematics descriptor for low-latency action recogni-
tion and detection. In ICCV, pages 2752–2759, 2013.
[Zhang and Tian, 2015]Chenyang Zhang and Yingli Tian.
Histogram of 3D facets: A depth descriptor for human ac-
tion and hand gesture recognition. CVIU, 139, 2015.