International Journal of Pattern Recognition and Artificial Intelligence
World Scientific Publishing Company
Automatic Learning of Articulated Skeletons based on Mean of 3D
Joints for Efficient Action Recognition
Abdelouahid BEN TAMOU, Lahoucine BALLIHI and Driss ABOUTAJDINE
LRIT-CNRST URAC 29, Mohammed V University In Rabat,
Faculty of Sciences Rabat, Morocco
In this paper, we present a new approach for human action recognition using 3D
skeleton joints recovered from RGB-D cameras. We propose a descriptor based on dif-
ferences of skeleton joints. This descriptor combines two characteristics including static
posture and overall dynamics that encode spatial and temporal aspects. Then, we apply
the mean function on these characteristics in order to form the feature vector, used as an
input to Random Forest classifier for action classification. The experimental results on
both datasets: MSR Action 3D dataset and MSR Daily Activity 3D dataset demonstrate
that our approach is efficient and gives promising results compared to state-of-the-art approaches.
Keywords: Action recognition; RGB-D camera; depth image; skeleton; Random Forest.
1. Introduction
Human action and activity recognition is one of the most heavily studied topics in computer vision. It aims to capture information characterizing an action and to recognize unknown actions in a query video based on a collection of annotated action videos. Action recognition has become an interesting subject due to its applications in surveillance environments12, entertainment environments7, sign language recognition35 and healthcare systems19,23.
Initial research in this domain mainly focused on learning and recognizing actions from image sequences taken by RGB cameras24,25,5. However, these 2D cameras have several limitations: they are sensitive to color and illumination changes, background clutter, occlusions and the presence of noise. The main works based on RGB images are summarized in the surveys of Aggarwal et al.1, Weinland et al.29 and Poppe17.
With the advent of depth sensors, new data have appeared. These sensors produce three types of data:
RGB images: come from the RGB camera, which works like any other 2D camera.
Depth maps: give the distance between objects present in the scene and the depth camera.
Estimation of the human skeleton in 3D: made possible by the work of Shotton et al.22, who proposed a real-time approach for estimating the 3D positions of body joints using extensive training on synthetic and real depth streams.
Figure 1 shows sample frames from each stream type (RGB, depth map, human skeleton estimation) produced by the Microsoft Kinect camera.
Fig. 1. Video streams produced by the Kinect: RGB image, depth image and skeleton given in a frame.
Generally, the skeleton tracker provided by the Microsoft Kinect tracks 20 joint positions, as illustrated in figure 2. For each joint, the Kinect captures its three coordinates (x, y, z).
In this paper, we propose a new feature descriptor for action recognition based on the mean of differences of skeleton joints; we then use the Random Forest classifier to classify actions. The rest of this paper is organized as follows: in section 2, we
review the related work. In section 3, we present our proposed approach. In section
4, we test our approach on both datasets: MSR Action 3D dataset11 and MSR
Daily Activity 3D dataset28, then we discuss the results. Finally, we conclude and
present future works in section 5.
2. Related Works
As mentioned earlier, depth camera output consists of a stream of color, depth
and skeleton. Here we differentiate approaches that rely on depth information,
approaches that take skeleton and those who take both as inputs.
2.1. Approaches based on depth information
The first approaches used for action recognition from depth sequences tended to extrapolate techniques already developed for color sequences.
Ni et al.14 combine color and depth maps to extract Spatio-Temporal Interest Points (STIP) and encode a Motion History Image (MHI). Xia et al.30 present
an approach to extract STIPs from depth sequences (DSTIP); then, around these interest points, they build a depth cuboid similarity feature as a descriptor for each one.
In Li et al.11, depth maps are projected onto each Cartesian plane, and the contours of the silhouette are extracted from these depth map projections and sampled to reduce complexity. The sampled points are used as a bag of points to characterize a set of salient postures that correspond to the nodes of an action graph used to explicitly model the dynamics of the actions. One limitation of this approach is due to noise and occlusions in the depth maps.
Vieira et al.26 represent each depth map sequence as a 4D grid by dividing the space and time axes into multiple segments in order to extract Spatio-Temporal Occupancy Pattern (STOP) features. Wang et al.27 present the Random Occupancy Pattern (ROP) approach, where they consider the depth sequence as a 4D shape and randomly extract 4D sub-volumes with different sizes at different locations.
Yang et al.33 represent an action sequence using Histograms of Oriented Gradients (HOG) features computed from Depth Motion Maps (DMM): they project each depth map onto each Cartesian plane; each projected map is normalized, and a binary map is generated by computing and thresholding the difference between two consecutive frames. The binary maps are then summed up to obtain the DMM. HOG is then applied to the DMM to extract the features from each view. Oreifej et al.16 present the Histogram of Oriented 4D Normals (HON4D) approach: a 4D histogram computed over depth, spatial coordinates and time, capturing the distribution of surface normal orientations.
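As a rough illustration of the DMM construction just described, the following is a minimal numpy sketch for a single projection view (the threshold value is an assumption for illustration, not a parameter taken from Yang et al.33):

```python
import numpy as np

def depth_motion_map(projected_maps, threshold=10):
    """Accumulate thresholded differences of consecutive projected depth
    maps into a single Depth Motion Map for one Cartesian view."""
    dmm = np.zeros_like(projected_maps[0], dtype=np.int32)
    for prev, curr in zip(projected_maps[:-1], projected_maps[1:]):
        # Binary map: 1 where the frame-to-frame change exceeds the threshold
        diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
        dmm += (diff > threshold).astype(np.int32)
    return dmm
```

In the full method this accumulation is done for each of the three Cartesian views before HOG is applied.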
Fig. 2. Skeleton joint positions captured by Microsoft Kinect sensor.
2.2. Approaches based on skeleton information
As mentioned earlier, thanks to the work of Shotton et al.22, skeleton-based methods have become popular, and many approaches in the literature propose to model the dynamics of actions using these features.
Yang et al.32 apply Principal Component Analysis (PCA) on three features extracted from joint sequences to obtain the Eigen Joints descriptor. These features include posture and motion features, which encode spatial and temporal aspects, and offset features, which represent the difference of a pose with the initial pose. Zanfir et al.36 propose the Moving Pose descriptor, using 3D skeleton joints and kinematic features computed on discriminative key-frames for low-latency action recognition.
Ofli et al.15 propose the Most Informative Joints descriptor, based on selecting the 3D joints carrying the most information: they compute the quantity of information associated with each joint, order the joints by decreasing quantity of information, and finally select the k most informative joints. Hussein et al.8 propose a descriptor based on the covariance matrix, called Covariance of 3D Joints (Cov3DJ); in practice, the proposed descriptor is the covariance matrix of the set of all joint coordinates.
Reyes et al.18 apply Dynamic Time Warping (DTW) on a feature vector defined by 15 joints of a 3D human skeleton obtained using PrimeSense NiTE. Similarly, Sempena et al.21 compute quaternions from the 3D human skeleton model to form a feature vector of 60 elements. When the 3D joints are estimated from depth maps, DTW does not give good recognition rates because of the noisy nature of skeleton joint positions.
Xia et al.31 propose Histograms of 3D Joints (HOJ3D), which mainly encode the spatial occupancy of the joints relative to the center of the silhouette (the hip). The joints are projected into a spherical coordinate system partitioned into n bins, and a probabilistic vote determines the fractional occupancy. Then, the HOJ3D are projected using LDA and clustered into k posture visual words which represent the prototypical poses of actions. The temporal evolution of these visual words is modeled by discrete hidden Markov models (HMMs).
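For intuition, the spherical occupancy idea can be sketched as follows. This is a simplified illustration of the binning step only, using hard assignment instead of Xia et al.'s probabilistic voting, and it omits the LDA, clustering and HMM stages; the bin counts are assumptions:

```python
import numpy as np

def spherical_bins(joints, center, n_azimuth=8, n_elevation=4):
    """Histogram joint positions in spherical bins around a reference center
    (hard-assignment simplification of HOJ3D's fractional occupancy)."""
    rel = joints - center
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])            # in [-pi, pi]
    r = np.linalg.norm(rel, axis=1)
    elevation = np.arcsin(np.clip(rel[:, 2] / np.maximum(r, 1e-9), -1, 1))
    # Map angles to bin indices, clamping the upper boundary
    a_idx = np.minimum(((azimuth + np.pi) / (2 * np.pi) * n_azimuth).astype(int),
                       n_azimuth - 1)
    e_idx = np.minimum(((elevation + np.pi / 2) / np.pi * n_elevation).astype(int),
                       n_elevation - 1)
    hist = np.zeros((n_azimuth, n_elevation))
    for a, e in zip(a_idx, e_idx):
        hist[a, e] += 1
    return hist
```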
Seidenari et al.20 propose a Bag-of-Poses (BOP) approach, based on the bag-of-words approach originating from text retrieval. The main idea is to use joint positions to align multiple parts of the human body using a bag-of-poses solution applied in a nearest-neighbor framework. Keceli et al.9 use histograms of angles between some important joints and the displacement of some joints in 3D coordinate space as features. They construct two models to classify actions, using the SVM and RF algorithms.
Recently, Lu et al.13 proposed the local position offset of 3D skeletal body joints, which involves two main steps: 1) computation of position offsets, by computing the time-differentiated offset of each joint between the current tth frame and the (t − ∆t)th frame; 2) video expression by bag-of-words: they collect all offset vectors of the training video sequences and group them with the K-means algorithm to generate the codewords. Finally, each video sequence is expressed by a set of histograms of codewords of body joints.
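The bag-of-words step in such pipelines can be sketched as follows (an illustrative sketch of the general codebook idea, not Lu et al.'s exact settings; the codebook size and the random data are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 3D offset vectors pooled from all training sequences
offsets = rng.normal(size=(500, 3))

# Cluster the offsets into a codebook; each cluster center is a codeword
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(offsets)

# A sequence is then expressed as a histogram of codeword assignments
one_sequence = rng.normal(size=(40, 3))
words = codebook.predict(one_sequence)
hist = np.bincount(words, minlength=16)
```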
Chen et al.6 propose a novel two-level hierarchical framework using 3D skeleton joints. In the first level, they introduce a part-based 5D feature vector to explore the most relevant joints of body parts in each action sequence and to cluster action sequences. In the second level, they propose two modules: motion feature extraction, to reduce computational costs and adapt to variable movement speeds, and action graphs, where they exploit the result of motion feature extraction to build action graphs.
Ben Amor et al.3 used trajectories on Kendall's shape manifold to model the evolution of human skeleton shapes, and used a parametrization-invariant metric for aligning, comparing and modeling skeleton joint trajectories, which can deal with the noise caused by the large variability of execution rates within and across humans. However, such a method is much more time-consuming. Cai et al.4 propose a novel skeleton representation for low-latency human action recognition. They encode each limb into a state using a Markov random field in terms of relative position, speed and
2.3. Approaches based on hybrid information
Some works propose hybrid approaches, combining both depth information and skeleton data features in order to improve recognition performance.
Wang et al.28 use both skeleton and point-cloud information. They combine joint location features with Local Occupancy Pattern (LOP) features and employ a Fourier Temporal Pyramid (FTP) to represent the temporal dynamics of the actions. Althloothi et al.2 present 3D shape features based on 3D motion features using the kinematic structure of the skeleton and a spherical harmonics representation. Then, they use a multiple kernel learning method to merge these two features.
For human-object interactions, Koppula et al.10 define a Markov Random Field (MRF) over the spatio-temporal sequence, where nodes represent objects and sub-activities, and edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. Yu et al.34 present a novel representation for skeleton and depth information, called the orderlet, which is a middle-level feature that captures the ordinal pattern among a group of low-level features. For skeletons, it encodes inter-joint coordination; for depth maps, it encodes objects' shape information.
3. Mean 3D Joints approach
The proposed framework for action recognition using skeleton joint positions recovered from depth images is illustrated in figure 3. To recognize an action, we compute two characteristics in each frame: the static posture feature fp and the overall dynamics feature fd. We then concatenate both features to obtain the characteristic f. To construct the characteristics matrix Mc of an action, we concatenate all the characteristics f computed from the frames. Then, we apply the mean function on each row of Mc in order to form the feature vector used as input to the Random Forest classifier for action classification.
The Mean 3D Joints approach (Mean3DJ) is based on skeleton information. It uses 3D position differences of skeleton joints and the mean function to characterize an action sequence.
First, we compute, in each current frame c, two features describing 3D position differences of skeleton joints: the static posture feature fp, encoding the spatial aspect, and the global dynamics feature fd, encoding the temporal aspect. Both features are inspired by Yang et al.32. Then, we construct a characteristics matrix Mc by concatenating these features extracted from several frames, as illustrated in figure 4. Finally, we apply the mean function on each row of Mc to obtain a means vector characterizing the action.
Each frame gives a set X of 3D coordinates of skeleton joints obtained from depth maps22, where X = {P1, P2, ..., PN}, X ∈ R^(3N), N denotes the number of joints and Pi = (xi, yi, zi) are the 3D coordinates of joint i.
To characterize the spatial aspect, represented by the static posture feature of the current frame c, we compute pairwise 3D position differences of skeleton joints within the current frame:

fp = {Pi − Pj | i = 1, 2, ..., N ; j > i}   (1)
To characterize the temporal aspect, represented by the global dynamics feature of the current frame c with respect to the initial frame, we compute the 3D position differences of skeleton joints between frame c and the initial frame:

fd = {Pi^c − Pj^init | i = 1, 2, ..., N ; j = 1, 2, ..., N}   (2)

where Pi^c is the position of joint i in the current frame and Pj^init is the position of joint j in the initial frame. The combination of both features constructs the characteristic f = [fp; fd], which is the preliminary feature representation of each frame.
We note that the depth sensor provides N 3D skeleton joint positions in each frame. The dimension of fp is N(N − 1)/2 and that of fd is N², and each position contains three coordinates (x, y, z), so the final dimension of f for each frame is 3[N(N − 1)/2 + N²]. For example, with the Kinect camera, which extracts 20 skeleton joints in each frame, the dimension of f is 1770 rows.
Fig. 3. Overview of the proposed approach.
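Under these definitions, the per-frame characteristic can be sketched in a few lines of numpy (an illustrative sketch, not the authors' C++/OpenCV implementation; the helper name is hypothetical):

```python
import numpy as np

def frame_characteristic(joints_c, joints_init):
    """Build f = [fp; fd] for one frame.
    joints_c, joints_init: (N, 3) arrays of 3D joint positions for the
    current and the initial frame."""
    n = joints_c.shape[0]
    # fp: pairwise differences Pi - Pj within the current frame (j > i)
    fp = [joints_c[i] - joints_c[j] for i in range(n) for j in range(i + 1, n)]
    # fd: differences between every joint of the current frame and
    # every joint of the initial frame
    fd = [joints_c[i] - joints_init[j] for i in range(n) for j in range(n)]
    return np.concatenate([np.ravel(fp), np.ravel(fd)])

# For the Kinect's N = 20 joints: 3 * (20*19/2 + 20**2) = 1770 components
```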
We also note that we can use a few frames to recognize an action instead of the entire video sequence. Each action sequence V can be represented as a set of Ns selected frames, V = {F1, F2, ..., FNs}, from which we extract the characteristic f.
Then, we concatenate the characteristics fc calculated from the selected frames to form the characteristics matrix Mc. Finally, we calculate the means vector Vm as follows:

Vm(i) = (1/Ns) Σ_{j=1}^{Ns} Mc(i, j),   i = 0, ..., p − 1   (3)

where Ns is the number of selected frames and p = 3[N(N − 1)/2 + N²] is the number of rows of the means vector Vm.
The vector Vm is the final action descriptor; from equation 3, we deduce that its size is always 1 × p. Algorithm 1 presents the algorithm of the proposed approach.
Fig. 4. Different steps to compute the characteristics matrix Mc.
Algorithm 1 Mean3DJ algorithm
Input: X = {P1, P2, ..., PN}: set of 3D coordinates of skeleton joints.
Input: Ns: number of selected frames.
Output: Vm: means vector.
for all sequences of the training set do
  Take Ns frames from the skeleton sequence, leaving NJump = NumTotalFrameSeq/(Ns + 1) frames between two consecutive selected frames, where NumTotalFrameSeq is the number of frames in the sequence.
  for frameIndex from 1 to NumTotalFrameSeq do
    if this frame is among the selected frames then
      Compute the static posture feature fp as in equation 1
      Compute the overall dynamics feature fd as in equation 2
      f ← concatenate(fp, fd)
      Mc ← concatenate(Mc, f)
    end if
  end for
  Compute the means vector Vm as in equation 3
end for
return Vm
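The steps above can be sketched end-to-end as follows (an illustrative numpy sketch under the paper's definitions, not the authors' C++ code; it assumes the sequence has more than Ns frames, and it takes the initial frame of equation 2 to be the first frame of the sequence, which the text does not fully pin down):

```python
import numpy as np

def frame_feature(joints_c, joints_init):
    """Per-frame characteristic f = [fp; fd] (equations 1 and 2)."""
    n = joints_c.shape[0]
    fp = [joints_c[i] - joints_c[j] for i in range(n) for j in range(i + 1, n)]
    fd = [joints_c[i] - joints_init[j] for i in range(n) for j in range(n)]
    return np.concatenate([np.ravel(fp), np.ravel(fd)])

def mean3dj(sequence, n_selected):
    """Mean3DJ descriptor of one skeleton sequence.
    sequence: (T, N, 3) array of joint positions; n_selected: Ns (< T)."""
    total = sequence.shape[0]
    jump = total // (n_selected + 1)          # NJump spacing between selections
    selected = [sequence[(k + 1) * jump] for k in range(n_selected)]
    init = sequence[0]                        # assumed initial frame
    # Columns of Mc are the per-frame characteristics f
    mc = np.stack([frame_feature(fr, init) for fr in selected], axis=1)
    return mc.mean(axis=1)                    # means vector Vm, length p
```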
4. Experimental results
We evaluated the proposed approach on two datasets: MSR Action 3D dataset11 and
MSR Daily Activity 3D dataset28. Results scored by our approach on these datasets
were also compared against those obtained by state of the art solutions. Note that
we coded our approach in C++, using OpenCV library 2.4.10 and Microsoft SDK
1.8, evaluated on an Intel Core i7, 2.70 GHz with 8.0 Go RAM. We used the Random
Forest classifier with 400 trees trained with a max depth of 25 and 0.001 as forest
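A minimal classification sketch with these hyperparameters might look as follows (illustrative only, using scikit-learn rather than the paper's OpenCV setup, and hypothetical random data in place of real Mean3DJ descriptors):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: one 1770-dimensional Mean3DJ descriptor per sequence
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 1770))
y_train = rng.integers(0, 4, size=40)

# 400 trees and a maximum depth of 25, mirroring the paper's settings;
# OpenCV's 0.001 forest-accuracy stopping criterion has no direct
# scikit-learn equivalent and is omitted here
clf = RandomForestClassifier(n_estimators=400, max_depth=25, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_train[:1])
```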
4.1. MSR Action 3D dataset
The MSR Action 3D dataset11 is a public action dataset containing 20 actions performed by 10 subjects, captured by a depth camera. Each action was performed 2 or 3 times by each subject. The 20 actions are: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up & throw. There are 567 sequences in total; each was recorded as a sequence of depth maps and a sequence of skeletal joint locations. The main challenge of this dataset is data corruption.
As in Li et al.11, we divide the action set into 3 action subsets, each one
Table 1. The three subsets of actions used for the MSR Action 3D dataset.
Action set 1 (AS1) | Action set 2 (AS2) | Action set 3 (AS3)
Horizontal arm wave (HoW) | High arm wave (HiW) | High throw (HT)
Hammer (H) | Hand catch (HC) | Forward kick (FK)
Forward punch (FP) | Draw x (DX) | Side kick (SK)
High throw (HT) | Draw tick (DT) | Jogging (J)
Hand clap (HC) | Draw circle (DC) | Tennis swing (TSw)
Bend (B) | Two hand wave (HW) | Tennis serve (TSr)
Tennis serve (TSr) | Forward kick (FK) | Golf swing (GS)
Pickup & throw (PT) | Side boxing (SB) | Pickup & throw (PT)
Table 2. Recognition accuracies of our approach on the MSR Action 3D dataset.
 | Test One | Test Two | Cross Subject Test
AS1 | 68.75% | 80.25% | 73.62%
AS2 | 70.23% | 75.91% | 76.25%
AS3 | 90.51% | 97.5% | 98.17%
Overall | 76.5% | 84.55% | 82.68%
containing 8 actions, as listed in Table 1. AS1 and AS2 group actions requiring similar movements, while AS3 groups more complex actions. For each subset, there are three different tests: Test One, Test Two and the Cross Subject Test. In Test One, 1/3 of the data is used as the training set and the rest as the testing set. In Test Two, 2/3 of the data is used for training and the rest for testing. In the Cross Subject Test, half of the subjects are used for training and the rest for testing. Each test was repeated 10 times; the average results are reported in Table 2. The recognition accuracies of Mean3DJ in each test under the various subsets are shown in figure 5.
We observe from table 2 and figure 5 that our approach performs well on subset AS3. This is probably because actions in AS1 and AS2 have similar motions, while AS3 groups complex but fairly distinct actions. We also notice that the performance in Test Two is promising, because Test Two trains on 2/3 of the data, more than Test One and the Cross Subject Test.
Figure 6 illustrates the confusion matrix of our approach under the Cross Subject Test only. From the confusion matrix, in AS1 the action "Hammer" gets the lowest accuracy; it is confused with several actions, especially "Horizontal arm wave" and "Forward punch". In AS2, "Draw x" is repeatedly confused with other actions such as "Draw circle" and "High arm wave". These confusions occur in both subsets because they contain highly similar motions. In AS3, all actions are well recognized because, although the actions are complex, they have significant differences.
Table 3 shows the comparison of our approach with state-of-the-art approaches
Fig. 5. Recognition accuracies in each iteration under various tests.
Fig. 6. Confusion matrix of Mean3DJ in different action sets under the Cross Subject Test. Each row corresponds to the ground-truth label and each column denotes the recognition result.
on the MSR Action 3D dataset11, from which we can observe that our approach gives promising results when compared to the state-of-the-art methods. However, it cannot outperform several methods, for two reasons:
(1) Similar motions: in this case, the joint positions of different actions are practically identical, so the descriptor vectors are almost the same.
(2) Corrupted skeletons: some skeleton sequences are very noisy due to failures of
Table 3. Comparison of accuracy on the MSR Action 3D dataset.
Approach | Method | Accuracy
Depth | Bag of 3D Points11 | 74.7 %
Depth | DMM-HOG33 | 81.63 %
Depth | STOP26 | 84.8 %
Depth | ROP27 | 86.5 %
Depth | HON4D16 | 88.89 %
Depth | DSTIP30 | 89.3 %
Skeleton | HOJ3D31 | 78.97 %
Skeleton | Mean3DJ | 82.68 %
Skeleton | Eigen Joints32 | 83.3 %
Skeleton | Keceli et al.9 | 88.2 %
Skeleton | Chen et al.6 | 88.7 %
Skeleton | Ben Amor et al.3 | 89 %
Skeleton | Cov3DJ8 | 90.53 %
Skeleton | Cai et al.4 | 91.01 %
Skeleton | Moving Pose36 | 91.7 %
Hybrid | Althloothi et al.2 | 79.7 %
Hybrid | Actionlet Ensemble28 | 88.2 %
the skeleton tracking algorithm. In some action sequences the skeleton reduces to a point, as shown in Figure 7.
Fig. 7. Some samples of corrupted skeleton from MSR Action 3D dataset.
4.2. MSR Daily Activity 3D dataset
The MSR Daily Activity 3D dataset28 contains 16 activities performed in a living room: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk, play guitar, stand up, sit down. Each activity is performed by 10 subjects; each subject performs each activity twice, once in a "standing" position and once in a "sitting on sofa" position. The total number of activity samples is 16 × 10 × 2 = 320. This dataset was captured by a Kinect device. Figure 8 illustrates some examples of activities from this dataset in color, depth and skeleton. The actions in this dataset are more complex than those in the MSR Action 3D dataset11; they require interactions with objects in the scene.
Fig. 8. Some examples of activities from MSR Daily Activity 3D dataset 28.
We evaluated our approach on the MSR Daily Activity 3D dataset28 using the Leave-One-Out Cross-Validation (LOOCV) technique, where in each iteration we leave one actor out; we can call this Leave-One-Actor-Out Cross-Validation (LOAOCV). Figure 9 shows the recognition accuracies of the Mean3DJ approach in each iteration using LOAOCV. Figure 10 illustrates the average accuracy of each action, and the confusion matrix is shown in figure 11.
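A minimal sketch of this leave-one-actor-out protocol (illustrative; the actor layout mirrors the dataset's 10 subjects × 16 activities × 2 repetitions, and any classifier could stand in for the Random Forest):

```python
import numpy as np

def leave_one_actor_out_splits(actor_ids):
    """Yield (train_idx, test_idx) pairs, holding out one actor per fold."""
    actor_ids = np.asarray(actor_ids)
    for actor in np.unique(actor_ids):
        test = np.where(actor_ids == actor)[0]
        train = np.where(actor_ids != actor)[0]
        yield train, test

# 320 samples from 10 actors (16 activities x 2 repetitions per actor)
actors = np.repeat(np.arange(10), 32)
folds = list(leave_one_actor_out_splits(actors))
# 10 folds; each test fold holds the 32 samples of one actor
```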
From Figure 9, we observe that our algorithm gives promising results: we achieved an average accuracy of 73.75%. From the confusion matrix and Figure 10, we deduce that the most difficult actions to recognize correspond to the cases where subjects interact with external objects. The proposed approach has also been compared with state-of-the-art approaches applied on the MSR Daily Activity 3D dataset28; this comparative analysis is reported in table 4.
Fig. 9. Recognition accuracies for each iteration.
Fig. 10. Recognition accuracies for the 16 action classes of the MSR Daily Activity 3D dataset.
Table 4 demonstrates that our approach gives one of the best results when compared to state-of-the-art approaches, especially those based on skeleton information. We observe that approaches based on skeleton features applied on MSR Daily Activity 3D are not as accurate as approaches based on depth features
Fig. 11. Confusion matrix of Mean3DJ. Each row corresponds to the ground-truth label and each column denotes the recognition result.
Table 4. Comparison of accuracy on the MSR Daily Activity 3D dataset.
Approach | Method | Accuracy
Depth | Only LOP features28 | 42.5 %
Depth | HON4D16 | 80 %
Depth | DSTIP (DCSF)30 | 83.6 %
Skeleton | NBNN20 | 53 %
Skeleton | NBNN + time20 | 60 %
Skeleton | NBNN + parts20 | 60 %
Skeleton | Only Joint Position features28 | 68 %
Skeleton | NBNN + parts + time20 | 70 %
Skeleton | Ben Amor et al.3 | 70 %
Skeleton | Mean3DJ | 73.75 %
Skeleton | Moving Pose36 | 73.8 %
Skeleton | Orderlets (only skeleton features)34 | 73.8 %
Skeleton | SVM on FTP features28 | 78 %
Skeleton | Cai et al.4 | 78.52 %
Hybrid | DSTIP (DCSF + Joints)30 | 88.2 %
Hybrid | Actionlet Ensemble28 | 85.75 %
Hybrid | Althloothi et al.2 | 93.1 %
or based on hybrid features, because most activities in this dataset involve human-object interactions, which are difficult to model or detect from skeletal data.
4.3. Result analysis
The analysis of some misclassified examples, given in figure 12, shows that there are two major reasons for this misclassification. The first is the bad quality of the depth information, such as occlusions in the main regions, which affects the quality of the skeleton information. The second lies in the fact that, based on the skeleton information alone, some actions performed by the person remain confusable. One solution to this problem, and a way to improve the proposed approach, is to introduce texture and depth information, which contain complementary cues that can consolidate the classifier decision.
Fig. 12. Some examples of actions misclassified by our approach.
5. Conclusion and Future works
In this work, we proposed a new human action recognition approach using skeleton joints extracted from depth cameras. Our approach is based on differences of the 3D coordinates of skeleton joints. It first computes the static posture feature and the overall dynamics feature between the current frame and the initial frame. Then, it applies the mean function on these features in order to form the Mean3DJ descriptor. Our approach has been applied on both datasets and gives promising results.
The proposed approach can be extended to a hybrid approach in order to improve performance and recognition accuracy. Texture and depth information could be associated with the skeleton information to consolidate the classifier decision. In fact, the texture and depth channels provide additional information which could be complementary to the skeleton.
References
1. J. K. Aggarwal and M. S. Ryoo, Human activity analysis: a review, ACM Computing Surveys (CSUR) 43(3) (2011) p. 16.
2. S. Althloothi, M. H. Mahoor, X. Zhang and R. M. Voyles, Human activity recognition
using multi-features and multiple kernel learning, Pattern Recognition 47(5) (2014)
3. B. B. Amor, J. Su and A. Srivastava, Action recognition using rate-invariant anal-
ysis of skeletal shape trajectories, Pattern Analysis and Machine Intelligence, IEEE
Transactions on 38(1) (2016) 1–13.
4. X. Cai, W. Zhou, L. Wu, J. Luo and H. Li, Effective active skeleton representation
for low latency human action recognition (2013).
5. S. Calderara, A. Prati and R. Cucchiara, Markerless body part tracking for action
recognition, International Journal of Multimedia Intelligence and Security 1(1) (2010)
6. H. Chen, G. Wang, J.-H. Xue and L. He, A novel hierarchical framework for human
action recognition, Pattern Recognition 55 (2016) 148–159.
7. S. R. Fanello, I. Gori, G. Metta and F. Odone, Keep it simple and sparse: Real-time
action recognition, The Journal of Machine Learning Research 14(1) (2013) 2617–
8. M. E. Hussein, M. Torki, M. A. Gowayyed and M. El-Saban, Human action recog-
nition using a temporal hierarchy of covariance descriptors on 3d joint locations, in
Proceedings of the Twenty-Third international joint conference on Artificial Intelli-
gence (2013) pp. 2466–2472.
9. A. S. Keceli and A. B. Can, Recognition of basic human actions using depth informa-
tion, International Journal of Pattern Recognition and Artificial Intelligence 28(02)
(2014) p. 1450004.
10. H. S. Koppula, R. Gupta and A. Saxena, Learning human activities and object af-
fordances from rgb-d videos, The International Journal of Robotics Research 32(8)
(2013) 951–970.
11. W. Li, Z. Zhang and Z. Liu, Action recognition based on a bag of 3d points, in Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer
Society Conference on (2010) pp. 9–14.
12. W. Lin, M. T. Sun, R. Poovandran and Z. Zhang, Human activity recognition for
video surveillance, in Circuits and Systems, 2008. ISCAS 2008. IEEE International
Symposium on (2008) pp. 2737–2740.
13. G. Lu, Y. Zhou, X. Li and M. Kudo, Efficient action recognition via local position
offset of 3d skeletal body joints, Multimedia Tools and Applications (2015) 1–16.
14. B. Ni, G. Wang and P. Moulin, Rgbd-hudaact: A color-depth video database for
human daily activity recognition, in Consumer Depth Cameras for Computer Vision
(Springer, 2013) pp. 193–208.
15. F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal and R. Bajcsy, Sequence of the most
informative joints (smij): A new representation for human skeletal action recognition,
Journal of Visual Communication and Image Representation 25(1) (2014) 24–38.
16. O. Oreifej and Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition
from depth sequences, in Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on (2013) pp. 716–723.
17. R. Poppe, A survey on vision-based human action recognition, Image and vision
computing 28(6) (2010) 976–990.
18. M. Reyes, G. Domínguez and S. Escalera, Feature weighting in dynamic time warping
for gesture recognition in depth data, in Computer Vision Workshops (ICCV Work-
shops), 2011 IEEE International Conference on (2011) pp. 1182–1188.
19. D. Sanchez, M. Tentori and J. Favela, Activity recognition for the smart hospital,
Intelligent Systems, IEEE 23(2) (2008) 50–57.
20. L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo and P. Pala, Recognizing actions
from depth cameras as weakly aligned multi-part bag-of-poses, in Computer Vision
and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on (2013) pp.
21. S. Sempena, N. U. Maulidevi and P. R. Aryan, Human action recognition using dy-
namic time warping, in Electrical Engineering and Informatics (ICEEI), 2011 Inter-
national Conference on (2011) pp. 1–5.
22. J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook
and R. Moore, Real-time human pose recognition in parts from single depth images,
Communications of the ACM 56(1) (2013) 116–124.
23. E. E. Stone and M. Skubic, Fall detection in homes of older adults using the microsoft
kinect, Biomedical and Health Informatics, IEEE Journal of 19(1) (2015) 290–301.
24. R. Vezzani, D. Baltieri and R. Cucchiara, HMM based action recognition with projection histogram features, in Recognizing Patterns in Signals, Speech, Images and Videos (2010) pp. 286–293.
25. R. Vezzani, M. Piccardi and R. Cucchiara, An efficient bayesian framework for on-
line action recognition, in Image Processing (ICIP), 2009 16th IEEE International
Conference on (2009) pp. 3553–3556.
26. A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu and M. F. Campos, Stop:
Space-time occupancy patterns for 3d action recognition from depth map sequences, in
Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
(2012) pp. 252–259.
27. J. Wang, Z. Liu, J. Chorowski, Z. Chen and Y. Wu, Robust 3D action recognition
with random occupancy patterns, in Computer Vision – ECCV 2012 (Springer, 2012)
pp. 872–885.
28. J. Wang, Z. Liu, Y. Wu and J. Yuan, Mining actionlet ensemble for action recognition
with depth cameras, in Computer Vision and Pattern Recognition (CVPR) (2012) pp.
29. D. Weinland, R. Ronfard and E. Boyer, A survey of vision-based methods for action
representation, segmentation and recognition, Computer Vision and Image Under-
standing 115(2) (2011) 224–241.
30. L. Xia and J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity
recognition using depth camera, in Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on (2013) pp. 2834–2841.
31. L. Xia, C.-C. Chen and J. Aggarwal, View invariant human action recognition using
histograms of 3D joints, in Computer Vision and Pattern Recognition Workshops
(CVPRW), 2012 IEEE Computer Society Conference on (2012) pp. 20–27.
32. X. Yang and Y. Tian, EigenJoints-based action recognition using naive-Bayes-nearest-
neighbor, in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012
IEEE Computer Society Conference on (2012) pp. 14–19.
33. X. Yang, C. Zhang and Y. Tian, Recognizing actions using depth motion maps-based
histograms of oriented gradients, in Proceedings of the 20th ACM International
Conference on Multimedia (2012) pp. 1057–1060.
34. G. Yu, Z. Liu and J. Yuan, Discriminative orderlet mining for real-time recognition
of human-object interaction, in Computer Vision–ACCV 2014 (Springer, 2014) pp.
35. Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton and P. Presti, American Sign
Language recognition with the Kinect, in Proceedings of the 13th International
Conference on Multimodal Interfaces (2011) pp. 279–286.
36. M. Zanfir, M. Leordeanu and C. Sminchisescu, The moving pose: An efficient 3D
kinematics descriptor for low-latency action recognition and detection, in Proceedings
of the IEEE International Conference on Computer Vision (2013) pp. 2752–2759.
This paper presents a method for recognising human actions by tracking body parts without using artificial markers. A sophisticated appearance-based tracking able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on mixture of Gaussians (MoG) is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The evolution of the mixture in time is analysed by transforming it in a sequence of symbols (corresponding to a MoG). The similarity between actions is computed by applying global alignment and dynamic programming techniques to the corresponding sequences and using a variational approximation of the Kullback-Leibler divergence to measure the dissimilarity between two MoGs. Experiments on publicly available datasets and comparison with existing methods are provided.