International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
Automatic Learning of Articulated Skeletons based on Mean of 3D
Joints for Efficient Action Recognition
Abdelouahid BEN TAMOU, Lahoucine BALLIHI and Driss ABOUTAJDINE
LRIT-CNRST URAC 29, Mohammed V University in Rabat,
Faculty of Sciences, Rabat, Morocco
abdelouahid.bentamou@gmail.com
ballihi@fsr.ac.ma
aboutaj@fsr.ac.ma
In this paper, we present a new approach for human action recognition using 3D skeleton joints recovered from RGB-D cameras. We propose a descriptor based on differences of skeleton joints. This descriptor combines two characteristics, static posture and overall dynamics, which encode the spatial and temporal aspects. We then apply the mean function to these characteristics in order to form the feature vector, used as input to a Random Forest classifier for action classification. Experimental results on two datasets, the MSR Action 3D dataset and the MSR Daily Activity 3D dataset, demonstrate that our approach is efficient and gives promising results compared to state-of-the-art approaches.
Keywords: Action recognition; RGB-D camera; depth image; skeleton; Random Forest.
1. Introduction
Human action and activity recognition is one of the most heavily studied topics in computer vision. It groups the techniques used to capture information characterizing an action and to recognize unknown actions in a query video based on a collection of annotated action videos. Action recognition has become an interesting subject due to its applications in surveillance environments [12], entertainment environments [7], sign language recognition [35] and healthcare systems [19, 23].
Initial research in this domain mainly focused on learning and recognizing actions from image sequences taken by RGB cameras [24, 25, 5]. However, these 2D cameras have several limitations: they are sensitive to color and illumination changes, background clutter, occlusions and the presence of noise. The main works based on RGB images are summarized in the surveys of Aggarwal et al. [1], Weinland et al. [29] and Poppe [17].
With the advent of depth sensors, new data have appeared. These sensors produce three types of data:
• RGB images: come from the RGB camera, which works like any other 2D camera.
• Depth maps: give the distance between the objects present in the scene and the depth camera.
• Estimation of the human skeleton in 3D: thanks to the work of Shotton et al. [22], who proposed a real-time approach for estimating the 3D positions of body joints using extensive training on synthetic and real depth streams.
Figure 1 shows sample frames from each stream type (RGB, depth map, human skeleton estimation) produced by the Microsoft Kinect camera.
Fig. 1. Video streams produced by the Kinect: RGB image, depth image and skeleton given in a frame.
Generally, the skeleton tracker provided by the Microsoft Kinect tracks 20 joint positions, as illustrated in figure 2. For each joint, the Kinect captures its three coordinates (x, y, z).
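For concreteness, one Kinect frame can be stored as a 20 × 3 array of joint coordinates. The following minimal NumPy sketch only illustrates this data layout (it is not the authors' C++/OpenCV implementation, and the joint ordering is assumed):

import numpy as np

N_JOINTS = 20                 # joints tracked by the Kinect skeleton tracker

# One frame of skeleton data: row i holds the coordinates (x_i, y_i, z_i) of joint i.
frame = np.zeros((N_JOINTS, 3), dtype=np.float64)

# An action sequence is then a T x 20 x 3 array (here, e.g., a 45-frame clip).
sequence = np.zeros((45, N_JOINTS, 3), dtype=np.float64)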
In this paper, we propose a new feature descriptor for action recognition based on the mean of differences of skeleton joints, and we use the Random Forest classifier to classify actions. The rest of this paper is organized as follows: in section 2, we review the related work. In section 3, we present our proposed approach. In section 4, we evaluate our approach on two datasets, the MSR Action 3D dataset [11] and the MSR Daily Activity 3D dataset [28], and discuss the results. Finally, we conclude and present future work in section 5.
2. Related Works
As mentioned earlier, the depth camera output consists of color, depth and skeleton streams. Here we differentiate between approaches that rely on depth information, approaches that use skeleton data, and those that take both as input.
2.1. Approaches based on depth information
The first approaches used for action recognition from depth sequences tended to extrapolate techniques already developed for color sequences.
Ni et al. [14] combine color and depth maps to extract Spatio-Temporal Interest Points (STIP) and encode a Motion History Image (MHI). Xia et al. [30] present
an approach to extract STIPs from depth sequences (DSTIP); then, around these interest points, they build a depth cuboid similarity feature as a descriptor for each action.
In Li et al. [11], depth maps are projected onto each Cartesian plane, and the contours of the silhouettes are extracted from these depth map projections and sampled to reduce complexity. The sampled points are used as a bag of points to characterize a set of salient postures that correspond to the nodes of an action graph, used to explicitly model the dynamics of the actions. One limitation of this approach is its sensitivity to noise and occlusions in the depth maps.
Vieira et al. [26] represent each depth map sequence as a 4D grid by dividing the space and time axes into multiple segments in order to extract Spatio-Temporal Occupancy Pattern (STOP) features. Wang et al. [27] present the Random Occupancy Pattern (ROP) approach, where they consider the depth sequence as a 4D shape and extract 4D sub-volumes randomly, with different sizes and at different locations.
Yang et al. [33] represent an action sequence using Histograms of Oriented Gradients (HOG) features computed from Depth Motion Maps (DMM). They project each depth map onto each Cartesian plane; each projected map is normalized, and a binary map is generated by computing and thresholding the difference between two consecutive frames. The binary maps are then summed up to obtain the DMM. HOG is then applied to the DMM to extract features from each view. Oreifej et al. [16] present the Histogram of Oriented 4D Normals (HON4D) approach, a 4D histogram computed over depth, spatial coordinates and time, capturing the distribution of surface normal orientations.
Fig. 2. Skeleton joint positions captured by Microsoft Kinect sensor.
2.2. Approaches based on skeleton information
As mentioned earlier, thanks to the work of Shotton et al. [22], skeleton-based methods have become popular, and many approaches in the literature propose to model the dynamics of an action using these features.
Yang et al. [32] apply Principal Component Analysis (PCA) to three features extracted from joint sequences to obtain the EigenJoints descriptor. These features include posture and motion features, which encode the spatial and temporal aspects, and offset features, which represent the difference of a pose with respect to the initial pose. Zanfir et al. [36] propose the Moving Pose descriptor, using 3D skeleton joints and kinematic features computed on discriminative key frames for low-latency action recognition.
Ofli et al. [15] propose the Most Informative Joints descriptor, based on selecting the 3D joints carrying the most information: the quantity of information associated with each joint is computed, the joints are ordered by decreasing quantity of information, and finally the k most informative joints are selected. Hussein et al. [8] propose a descriptor based on the covariance matrix, called Covariance of 3D Joints (Cov3DJ); in practice, the proposed descriptor is the covariance matrix of the set of all joint coordinates.
Reyes et al. [18] apply Dynamic Time Warping (DTW) to a feature vector defined by 15 joints of a 3D human skeleton obtained using PrimeSense NiTE. Similarly, Sempena et al. [21] compute quaternions from the 3D human skeleton model to form a feature vector of 60 elements. When the 3D joints are estimated from depth maps, DTW does not give good recognition rates because of the noisy nature of the skeleton joint positions.
Xia et al. [31] propose Histograms of 3D Joints (HOJ3D), which mainly encode the spatial occupancy of the joints relative to the center of the silhouette (the hip). The joints are projected into a spherical coordinate system partitioned into n bins, and a probabilistic voting scheme determines the fractional occupancy. Then, the HOJ3D descriptors are projected using LDA and clustered into k posture visual words which represent the prototypical poses of actions. The temporal evolution of these visual words is modeled by discrete hidden Markov models (HMMs).
Seidenari et al. [20] propose a Bag-of-Poses (BOP) approach based on the Bag-of-Words model originating from text retrieval. The main idea is to use joint positions to align multiple parts of the human body using a bag-of-poses solution applied in a nearest-neighbor framework. Keceli et al. [9] use, as features, histograms of angles between some important joints and the displacement of some joints in 3D coordinate space. They construct two models to classify actions, using the SVM and Random Forest algorithms.
Recently, Lu et al. [13] proposed local position offsets of 3D skeletal body joints, which involve two main steps: 1) computation of position offsets, by computing the time-differentiated offset of each joint from the current frame t to frame t−∆t; 2) video expression by bag-of-words, where all offset vectors of the training video sequences are collected and grouped by the K-means algorithm
to generate the codewords. Finally, each video sequence is expressed by a set of histograms of codewords of body joints.
Chen et al. [6] propose a novel two-level hierarchical framework using 3D skeleton joints. In the first level, they introduce a part-based 5D feature vector to explore the most relevant joints of body parts in each action sequence and to cluster action sequences. In the second level, they propose two modules: motion feature extraction, which reduces the computational cost and adapts to variable movement speeds, and action graphs, which exploit the extracted motion features.
Ben Amor et al. [3] used trajectories on Kendall's shape manifold to model the evolution of human skeleton shapes, and used a parametrization-invariant metric for aligning, comparing and modeling skeleton joint trajectories, which can deal with the noise caused by the large variability of execution rates within and across subjects. However, such a method is much more time-consuming. Cai et al. [4] propose a novel skeleton representation for low-latency human action recognition. They encode each limb into a state using a Markov random field in terms of relative position, speed and acceleration.
2.3. Approaches based on hybrid information
Some works propose hybrid approaches combining both depth information and skeleton data features in order to improve recognition performance.
Wang et al. [28] use both skeleton and point cloud information. They combine joint location features with Local Occupancy Pattern (LOP) features and employ a Fourier Temporal Pyramid (FTP) to represent the temporal dynamics of the actions. Althloothi et al. [2] present 3D shape features, based on a spherical harmonics representation, together with 3D motion features based on the kinematic structure of the skeleton. Then, they use a multiple kernel learning method to merge both features.
For human-object interactions, Koppula et al. [10] define a Markov Random Field (MRF) over the spatio-temporal sequence, where nodes represent objects and sub-activities, and edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. Yu et al. [34] present a novel mid-level representation for skeleton and depth information, called the orderlet, which captures the ordinal pattern among a group of low-level features. For skeletons, it encodes inter-joint coordination; for depth maps, it encodes object shape information.
3. Mean 3D Joints approach
The proposed framework for action recognition using skeleton joint positions recovered from depth images is illustrated in figure 3. To recognize an action, we compute two characteristics in each frame: a static posture feature f_p and an overall dynamics feature f_d. We then concatenate both features to obtain the characteristic f. To construct the characteristics matrix M_c of an action, we concatenate all the f characteristics computed from each frame. Then, we apply the mean function to each
row of M_c in order to form the feature vector used as input to the Random Forest classifier for action classification.
The Mean 3D Joints approach (Mean3DJ) is an approach based on skeleton information. It uses 3D position differences of skeleton joints and the mean function to characterize an action sequence.
First, in each current frame c, we compute two features describing 3D position differences of skeleton joints: a static posture feature f_p encoding the spatial aspect, and a global dynamics feature f_d encoding the temporal aspect. Both features are inspired by Yang et al. [32]. Then, we construct a characteristics matrix M_c by concatenating these features extracted from several frames, as illustrated in figure 4. Finally, we apply the mean function to each row of M_c to obtain a means vector characterizing the action.
Each frame gives a set of 3D coordinates of skeleton joints X obtained from the depth maps [22], where X = {P_1, P_2, ..., P_N}, X ∈ R^{3N}, N denotes the number of joints and P_i = (x_i, y_i, z_i) are the 3D coordinates of joint i.
To characterize the spatial aspect, represented by the static posture feature of the current frame c, we compute the pairwise 3D position differences of the skeleton joints within the current frame:

f_p = \{P_i - P_j \mid i = 1, 2, \ldots, N; \; j > i\}    (1)
To characterize the temporal aspect, represented by the global dynamics feature of the current frame c with respect to the initial frame i, we compute the 3D position differences of skeleton joints between frame c and frame i:

f_d = \{P^c_j - P^i_k \mid P^c_j \in X^c; \; P^i_k \in X^i\}    (2)
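As a minimal illustration of equations (1) and (2), the per-frame features can be computed as in the following NumPy sketch (the function names are ours and the code is only a sketch of the definitions above, not the authors' implementation):

import numpy as np

def posture_feature(X_c):
    # f_p: pairwise differences P_i - P_j for j > i within the current frame, equation (1).
    N = len(X_c)
    return np.concatenate([X_c[i] - X_c[j] for i in range(N) for j in range(i + 1, N)])

def dynamics_feature(X_c, X_i):
    # f_d: differences between every joint of frame c and every joint of the initial frame i, equation (2).
    return np.concatenate([p_c - p_i for p_c in X_c for p_i in X_i])

def frame_feature(X_c, X_i):
    # f = [f_p ; f_d]: preliminary representation of one frame.
    return np.concatenate([posture_feature(X_c), dynamics_feature(X_c, X_i)])

# With the N = 20 Kinect joints, len(frame_feature(...)) = 3 * (20*19/2 + 20**2) = 1770.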
The combination of both features forms the characteristic f = [f_p; f_d], which is the preliminary feature representation of each frame.
We note that the depth sensor provides, in each frame, the 3D positions of N skeleton joints. The dimension of f_p is N(N−1)/2 and that of f_d is N^2; since each position contains three coordinates (x, y, z), the final dimension of f for each frame is 3[N(N−1)/2 + N^2]. For example, if we use the Kinect camera, which extracts 20 skeleton joints in each frame, the dimension of f is 1770.
Fig. 3. Overview of the proposed approach.
We also note that we can use a few frames to recognize an action instead of the entire video sequence. Each action sequence V can be represented as a set of N_s selected frames, V = {F_1, F_2, ..., F_{N_s}}, from which we extract the characteristic f.
Then, we concatenate the characteristics f_c computed from each selected frame in order to form the characteristics matrix M_c. Finally, we calculate the means vector V_m as follows:
V_m(i) = \frac{1}{N_s} \sum_{j=0}^{N_s - 1} M_c(i, j), \quad i = 0, \ldots, p - 1    (3)

where N_s is the number of selected frames and p = 3[N(N−1)/2 + N^2] is the number of rows of the means vector V_m.
The vector V_m is the final action descriptor. From equation (3), we deduce that its size is always 1 × p. Algorithm 1 presents the algorithm of the proposed approach.
Fig. 4. Different steps to compute the characteristics matrix Mc.
Algorithm 1 Mean3DJ algorithm
Input: X = {P_1, P_2, ..., P_N}: set of 3D coordinates of skeleton joints.
Input: N_s: number of selected frames.
Output: V_m: means vector.
for all sequences of the training set do
    Take N_s frames from the skeleton sequence, with N_Jump = NumTotalFrameSeq / (N_s + 1) frames between two consecutive selected frames, where NumTotalFrameSeq is the number of frames in the sequence.
    for frameIndex from 1 to NumTotalFrameSeq do
        if this frame is among the selected frames then
            Compute the static posture feature f_p as in equation (1);
            Compute the overall dynamics feature f_d as in equation (2);
            f ← concatenate(f_p, f_d)
            M_c ← concatenate(M_c, f)
        end if
    end for
    Compute the means vector V_m as in equation (3);
end for
return V_m
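Following Algorithm 1, the whole descriptor can be sketched in a few lines of NumPy (frame selection with N_Jump, per-frame features, the characteristics matrix M_c and its row-wise mean). This is only our reading of the pseudocode, reusing frame_feature from the earlier sketch; it is not the authors' code:

import numpy as np

def mean3dj(sequence, n_selected):
    # sequence: array of shape (T, N, 3) holding the 3D joints of each frame.
    # n_selected: N_s, the number of frames kept from the sequence.
    T = len(sequence)
    jump = max(T // (n_selected + 1), 1)                        # N_Jump between selected frames
    selected = [sequence[min((k + 1) * jump, T - 1)] for k in range(n_selected)]

    X_init = sequence[0]                                        # initial frame used by f_d
    columns = [frame_feature(X_c, X_init) for X_c in selected]  # one column f per selected frame
    M_c = np.stack(columns, axis=1)                             # characteristics matrix, p x N_s
    return M_c.mean(axis=1)                                     # means vector V_m, equation (3)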
4. Experimental results
We evaluated the proposed approach on two datasets: the MSR Action 3D dataset [11] and the MSR Daily Activity 3D dataset [28]. The results obtained by our approach on these datasets were also compared against those of state-of-the-art solutions. Note that we implemented our approach in C++, using the OpenCV 2.4.10 library and the Microsoft SDK 1.8, and evaluated it on an Intel Core i7 at 2.70 GHz with 8 GB of RAM. We used the Random Forest classifier with 400 trees, trained with a maximum depth of 25 and a forest accuracy of 0.001.
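Outside the original C++/OpenCV setup, a comparable classifier can be configured as in the following scikit-learn sketch; note that the 0.001 forest-accuracy termination criterion is specific to OpenCV's random trees and has no direct counterpart here, so this is only an approximate equivalent:

from sklearn.ensemble import RandomForestClassifier

# 400 trees with a maximum depth of 25, as in the experiments.
clf = RandomForestClassifier(n_estimators=400, max_depth=25, random_state=0)

# X_train: one Mean3DJ descriptor (length p, e.g. 1770) per training sequence;
# y_train: the corresponding action labels.
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)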
4.1. MSR Action 3D dataset
The MSR Action 3D dataset [11] is a public action dataset that contains 20 actions performed by 10 subjects, captured by a depth camera. Each action was performed 2 or 3 times by each subject. The 20 actions are: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up & throw. There are 567 sequences in total, each recorded as a sequence of depth maps and a sequence of skeletal joint locations. The main challenge of this dataset is data corruption.
Following Li et al. [11], we divide the action set into three subsets, each one containing eight actions, as listed in Table 1.
Table 1. The three subsets of actions used for the MSR Action 3D dataset.
Action set 1 (AS1) Action set 2 (AS2) Action set 3 (AS3)
Horizontal arm wave (HoW) High arm wave (HiW) High throw (HT)
Hammer (H) Hand catch (HC) Forward kick (FK)
Forward punch (FP) Draw x (DX) Side kick (SK)
High throw (HT) Draw tick (DT) Jogging (J)
Hand clap (HC) Draw circle (DC) Tennis swing (TSw)
Bend (B) Two hand wave (HW) Tennis serve (TSr)
Tennis serve (TSr) Forward kick (FK) Golf swing (GS)
Pickup & throw (PT) Side boxing (SB) Pickup & throw (PT)
Table 2. Recognition accuracies of our approach on the MSR Action 3D dataset.
Test One Test Two Cross Subject Test
AS1 68.75% 80.25% 73.62%
AS2 70.23% 75.91% 76.25%
AS3 90.51% 97.5% 98.17%
Overall 76.5% 84.55% 82.68%
AS1 and AS2 group actions requiring similar movements, while AS3 groups more complex actions. For each subset, there are three different tests: Test One, Test Two and the Cross Subject Test. In Test One, 1/3 of the data is used as a training set and the rest as a testing set. In Test Two, 2/3 of the data is used as a training set and the rest as a testing set. In the Cross Subject Test, half of the subjects are used for training and the rest for testing. Each test was repeated 10 times. The average results are reported in Table 2. The recognition accuracies of Mean3DJ in each test under the various subsets are shown in figure 5.
We observe from Table 2 and figure 5 that our approach performs well on the subset AS3. This is probably because actions in AS1 and AS2 share similar motions, whereas AS3 groups actions that are complex but fairly distinct. We also notice that the performance in Test Two is promising, since Test Two trains on 2/3 of the data, more than Test One and the Cross Subject Test.
Figure 6 illustrates the confusion matrix of our approach under the Cross Subject Test only. From the confusion matrix, in AS1, the action "Hammer" gets the lowest accuracy; it is confused with several actions, especially "Horizontal arm wave" and "Forward punch". In AS2, "Draw X" is repeatedly confused with other actions such as "Draw circle" and "High arm wave". These confusions occur in both subsets because they contain highly similar motions. In AS3, all actions are well recognized because, although the actions are complex, they have significant differences.
Table 3 shows the comparison of our approach with state-of-the-art approaches on the MSR Action 3D dataset [11].
Fig. 5. Recognition accuracies in each iteration under various tests.
Fig. 6. Confusion matrix of Mean3DJ on the different action sets under the Cross Subject Test. Each row corresponds to the ground truth label and each column denotes the recognition result.
From this comparison, we can observe that our approach gives promising results compared to the state-of-the-art methods. However, it does not outperform several methods, for two reasons:
(1) similar motions: in this case, the joint positions of the actions are practically identical, and therefore the descriptor vectors are almost the same;
(2) corrupted skeletons: some skeleton sequences are very noisy due to failures of the skeleton tracking algorithm.
Table 3. Comparison of accuracy on the MSR Action 3D dataset.

Approach    Method                    Accuracy
Depth       Bag of 3D Points [11]     74.7%
            DMM-HOG [33]              81.63%
            STOP [26]                 84.8%
            ROP [27]                  86.5%
            HON4D [16]                88.89%
            DSTIP [30]                89.3%
Skeleton    HOJ3D [31]                78.97%
            Mean3DJ                   82.68%
            EigenJoints [32]          83.3%
            Keceli et al. [9]         88.2%
            Chen et al. [6]           88.7%
            Ben Amor et al. [3]       89%
            Cov3DJ [8]                90.53%
            Cai et al. [4]            91.01%
            Moving Pose [36]          91.7%
Hybrid      Althloothi et al. [2]     79.7%
            Actionlet Ensemble [28]   88.2%
In some action sequences, the skeleton reduces to a single point, as shown in Figure 7.
Fig. 7. Some samples of corrupted skeletons from the MSR Action 3D dataset.
4.2. MSR Daily Activity 3D dataset
The MSR Daily Activity 3D dataset [28] contains 16 activities performed in a living room: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk,
play guitar, stand up, sit down. Each activity is performed by 10 subjects, and each subject performs each activity twice, once in a "standing" position and once in a "sitting on sofa" position. The total number of activity samples is 16 × 10 × 2 = 320. This dataset was captured by a Kinect device. Figure 8 illustrates some examples of activities from this dataset in color, depth and skeleton. The actions in this dataset are more complex than those of the MSR Action 3D dataset [11]; they require interactions with objects in the scene.
Fig. 8. Some examples of activities from MSR Daily Activity 3D dataset 28.
We evaluated our approach on the MSR Daily Activity 3D dataset [28] using the Leave-One-Out Cross-Validation (LOOCV) technique, where in each iteration one actor is left out; we call this Leave-One-Actor-Out Cross-Validation (LOAOCV). Figure 9 shows the recognition accuracies of the Mean3DJ approach in each iteration using LOAOCV. Figure 10 illustrates the average accuracy of each action, and the confusion matrix is shown in figure 11.
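The leave-one-actor-out protocol itself is a simple loop over subjects. A sketch (assuming the descriptors, labels and actor indices are stored in parallel NumPy arrays, which is our bookkeeping rather than anything provided by the dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leave_one_actor_out(descriptors, labels, actors):
    # Train on the sequences of 9 subjects and test on the left-out subject, for each subject in turn.
    accuracies = []
    for held_out in np.unique(actors):
        train, test = actors != held_out, actors == held_out
        clf = RandomForestClassifier(n_estimators=400, max_depth=25)
        clf.fit(descriptors[train], labels[train])
        accuracies.append(clf.score(descriptors[test], labels[test]))
    return float(np.mean(accuracies)), accuracies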
From Figure 9, we observe that our algorithm gives promising results: we achieve an average accuracy of 73.75%. From the confusion matrix and Figure 10, we deduce that the most difficult actions to recognize correspond to cases where subjects interact with external objects. The proposed approach was also compared with state-of-the-art approaches applied to the MSR Daily Activity 3D dataset [28]; this comparative analysis is reported in Table 4.
Fig. 9. Recognition accuracies for each iteration.
Fig. 10. Recognition accuracies for the 16 action classes of the MSR Daily Activity 3D dataset.
Table 4 demonstrates that our approach gives one of the best results compared to state-of-the-art approaches, especially among approaches based on skeleton information. We observe that skeleton-based approaches applied to MSR Daily Activity 3D are not as accurate as approaches based on depth or hybrid features, because most activities in this dataset involve human-object interactions, which are difficult to model or detect from skeletal data.
Fig. 11. Confusion matrix of Mean3DJ. Each row corresponds to the ground truth label and each column denotes the recognition result.
Table 4. Comparison of accuracy on the MSR Daily Activity 3D dataset.

Approach    Method                                    Accuracy
Depth       Only LOP features [28]                    42.5%
            Local HON4D [16]                          80%
            DSTIP (DCSF) [30]                         83.6%
Skeleton    NBNN [20]                                 53%
            NBNN + time [20]                          60%
            NBNN + parts [20]                         60%
            Only Joint Position features [28]         68%
            NBNN + parts + time [20]                  70%
            Ben Amor et al. [3]                       70%
            Mean3DJ                                   73.75%
            Moving Pose [36]                          73.8%
            Orderlets (only skeleton features) [34]   73.8%
            SVM on FTP features [28]                  78%
            Cai et al. [4]                            78.52%
Hybrid      DSTIP (DCSF + Joints) [30]                88.2%
            Actionlet Ensemble [28]                   85.75%
            Althloothi et al. [2]                     93.1%
4.3. Result analysis
The analysis of some misclassified examples, given in figure 12, shows that there are two major reasons for this misclassification. The first is the poor quality of the depth information, such as occlusions in the main regions, which affects the quality of the skeleton information. The second lies in the fact that, based only on the skeleton information, some actions remain ambiguous and cannot be recognized correctly. One solution to this problem, and a way to improve the proposed approach, is to introduce texture and depth information, which carry complementary cues that can consolidate the classifier decision.
Fig. 12. Some examples of actions misclassified by our approach.
5. Conclusion and Future works
In this work, we proposed a new human action recognition approach using skeleton joints extracted from depth cameras. Our approach is based on differences of the 3D coordinates of skeleton joints. It first computes a static posture feature and an overall
dynamics feature between the current frame and the initial frame. Then, it applies the mean function to these features in order to form the Mean3DJ descriptor. Our approach has been evaluated on both datasets and gives promising results.
The proposed approach can be extended into a hybrid approach in order to improve the performance and recognition accuracy. Texture and depth information could be associated with the skeleton information to consolidate the classifier decision. In fact, the texture and depth channels provide additional information which is complementary to the skeleton data.
References
1. J. K. Aggarwal and M. S. Ryoo, Human activity analysis: A review, ACM Computing
Surveys (CSUR) 43(3) (2011) p. 16.
2. S. Althloothi, M. H. Mahoor, X. Zhang and R. M. Voyles, Human activity recognition
using multi-features and multiple kernel learning, Pattern Recognition 47(5) (2014)
1800–1812.
3. B. B. Amor, J. Su and A. Srivastava, Action recognition using rate-invariant anal-
ysis of skeletal shape trajectories, Pattern Analysis and Machine Intelligence, IEEE
Transactions on 38(1) (2016) 1–13.
4. X. Cai, W. Zhou, L. Wu, J. Luo and H. Li, Effective active skeleton representation
for low latency human action recognition (2013).
5. S. Calderara, A. Prati and R. Cucchiara, Markerless body part tracking for action
recognition, International Journal of Multimedia Intelligence and Security 1(1) (2010)
76–89.
6. H. Chen, G. Wang, J.-H. Xue and L. He, A novel hierarchical framework for human
action recognition, Pattern Recognition 55 (2016) 148–159.
7. S. R. Fanello, I. Gori, G. Metta and F. Odone, Keep it simple and sparse: Real-time
action recognition, The Journal of Machine Learning Research 14(1) (2013) 2617–
2640.
8. M. E. Hussein, M. Torki, M. A. Gowayyed and M. El-Saban, Human action recog-
nition using a temporal hierarchy of covariance descriptors on 3d joint locations, in
Proceedings of the Twenty-Third international joint conference on Artificial Intelli-
gence (2013) pp. 2466–2472.
9. A. S. Keceli and A. B. Can, Recognition of basic human actions using depth informa-
tion, International Journal of Pattern Recognition and Artificial Intelligence 28(02)
(2014) p. 1450004.
10. H. S. Koppula, R. Gupta and A. Saxena, Learning human activities and object af-
fordances from rgb-d videos, The International Journal of Robotics Research 32(8)
(2013) 951–970.
11. W. Li, Z. Zhang and Z. Liu, Action recognition based on a bag of 3d points, in Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer
Society Conference on (2010) pp. 9–14.
12. W. Lin, M. T. Sun, R. Poovandran and Z. Zhang, Human activity recognition for
video surveillance, in Circuits and Systems, 2008. ISCAS 2008. IEEE International
Symposium on (2008) pp. 2737–2740.
13. G. Lu, Y. Zhou, X. Li and M. Kudo, Efficient action recognition via local position
offset of 3d skeletal body joints, Multimedia Tools and Applications (2015) 1–16.
14. B. Ni, G. Wang and P. Moulin, Rgbd-hudaact: A color-depth video database for
human daily activity recognition, in Consumer Depth Cameras for Computer Vision
(Springer, 2013) pp. 193–208.
15. F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal and R. Bajcsy, Sequence of the most
informative joints (smij): A new representation for human skeletal action recognition,
Journal of Visual Communication and Image Representation 25(1) (2014) 24–38.
16. O. Oreifej and Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition
from depth sequences, in Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on (2013) pp. 716–723.
17. R. Poppe, A survey on vision-based human action recognition, Image and vision
computing 28(6) (2010) 976–990.
18. M. Reyes, G. Domínguez and S. Escalera, Feature weighting in dynamic time warping
for gesture recognition in depth data, in Computer Vision Workshops (ICCV Work-
shops), 2011 IEEE International Conference on (2011) pp. 1182–1188.
19. D. Sanchez, M. Tentori and J. Favela, Activity recognition for the smart hospital,
Intelligent Systems, IEEE 23(2) (2008) 50–57.
20. L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo and P. Pala, Recognizing actions
from depth cameras as weakly aligned multi-part bag-of-poses, in Computer Vision
and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on (2013) pp.
479–485.
21. S. Sempena, N. U. Maulidevi and P. R. Aryan, Human action recognition using dy-
namic time warping, in Electrical Engineering and Informatics (ICEEI), 2011 Inter-
national Conference on (2011) pp. 1–5.
22. J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook
and R. Moore, Real-time human pose recognition in parts from single depth images,
Communications of the ACM 56(1) (2013) 116–124.
23. E. E. Stone and M. Skubic, Fall detection in homes of older adults using the microsoft
kinect, Biomedical and Health Informatics, IEEE Journal of 19(1) (2015) 290–301.
24. R. Vezzani, D. Baltieri and R. Cucchiara, Hmm based action recognition with pro-
jection histogram features, in Recognizing Patterns in Signals, Speech, Images and
Videos 2010 pp. 286–293.
25. R. Vezzani, M. Piccardi and R. Cucchiara, An efficient bayesian framework for on-
line action recognition, in Image Processing (ICIP), 2009 16th IEEE International
Conference on (2009) pp. 3553–3556.
26. A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu and M. F. Campos, Stop:
Space-time occupancy patterns for 3d action recognition from depth map sequences, in
Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
2012 pp. 252–259.
27. J. Wang, Z. Liu, J. Chorowski, Z. Chen and Y. Wu, Robust 3d action recognition
with random occupancy patterns, in Computer vision–ECCV 2012 (Springer, 2012)
pp. 872–885.
28. J. Wang, Z. Liu, Y. Wu and J. Yuan, Mining actionlet ensemble for action recognition
with depth cameras, in Computer Vision and Pattern Recognition (CVPR) (2012) pp.
1290–1297.
29. D. Weinland, R. Ronfard and E. Boyer, A survey of vision-based methods for action
representation, segmentation and recognition, Computer Vision and Image Under-
standing 115(2) (2011) 224–241.
30. L. Xia and J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activ-
ity recognition using depth camera, in Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on (2013) pp. 2834–2841.
31. L. Xia, C.-C. Chen and J. Aggarwal, View invariant human action recognition us-
ing histograms of 3d joints, in Computer Vision and Pattern Recognition Workshops
(CVPRW), 2012 IEEE Computer Society Conference on (2012) pp. 20–27.
32. X. Yang and Y. Tian, Eigenjoints-based action recognition using naive-bayes-nearest-
neighbor, in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012
IEEE Computer Society Conference on (2012) pp. 14–19.
33. X. Yang, C. Zhang and Y. Tian, Recognizing actions using depth motion maps-based
histograms of oriented gradients, in Proceedings of the 20th ACM international con-
ference on Multimedia (2012) pp. 1057–1060.
34. G. Yu, Z. Liu and J. Yuan, Discriminative orderlet mining for real-time recognition
of human-object interaction, in Computer Vision–ACCV 2014 (Springer, 2014) pp.
50–65.
35. Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton and P. Presti, American sign lan-
guage recognition with the kinect, in Proceedings of the 13th international conference
on multimodal interfaces (2011) pp. 279–286.
36. M. Zanfir, M. Leordeanu and C. Sminchisescu, The moving pose: An efficient 3d
kinematics descriptor for low-latency action recognition and detection, in Proceedings
of the IEEE International Conference on Computer Vision (2013) pp. 2752–2759.