Lecture Notes in Computer Science


Sequential Deep Learning
for Human Action Recognition
Moez Baccouche1,2, Franck Mamalet1,
Christian Wolf2, Christophe Garcia2, and Atilla Baskurt2
1Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sévigné, France
2LIRIS, UMR 5205 CNRS, INSA-Lyon, F-69621, France
Abstract. We propose in this paper a fully automated deep model,
which learns to classify human actions without using any prior knowl-
edge. The first step of our scheme, based on the extension of Convo-
lutional Neural Networks to 3D, automatically learns spatio-temporal
features. A Recurrent Neural Network is then trained to classify each
sequence considering the temporal evolution of the learned features for
each timestep. Experimental results on the KTH dataset show that the
proposed approach outperforms existing deep models, and gives compa-
rable results with the best related works.
Keywords: Human action recognition, deep models, 3D convolutional
neural networks, long short-term memory, KTH human actions dataset.
1 Introduction and Related Work
Automatic understanding of human behaviour and of its interaction with the
environment has been an active research area in recent years due to its potential
application in a variety of domains. To achieve such a challenging task, several
research fields focus on modeling human behaviour under its multiple facets
(emotions, relational attitudes, actions, etc.). In this context, recognizing the
behaviour of a person appears to be crucial when interpreting complex actions.
Thus, great interest has been given to human action recognition, especially
in real-world environments.
Among the most popular state-of-the-art methods for human action recognition,
we can mention those proposed by Laptev et al. [13], Dollar et al. [3] and
others [12,17,2,4], which all use engineered motion and texture descriptors
calculated around manually engineered spatio-temporal interest points.
The Harris-3D detector [13] and the Cuboid detector [3] are likely the most used
space-time salient point detectors in the literature. Nevertheless, even if their
extraction process is fully automated, these so-called hand-crafted features are
specifically designed to be optimal for a given task. Thus, despite their high
performance, the main drawback of these approaches is that they are highly problem
dependent.
A.A. Salah and B. Lepri (Eds.): HBU 2011, LNCS 7065, pp. 29–39, 2011.
Springer-Verlag Berlin Heidelberg 2011
In recent years, there has been a growing interest in approaches, so-called deep
models, that can learn multiple layers of feature hierarchies and automatically
build high-level representations of the raw input. They are thereby more generic
since the feature construction process is fully automated. One of the most used
deep models is the Convolutional Neural Network architecture [14,15], hereafter
ConvNets, which is a bio-inspired hierarchical multilayered neural network able to
learn visual patterns directly from the image pixels without any pre-processing
step. Although ConvNets have been shown to yield very competitive performance in
many image processing tasks, their extension to the video case is still an open
issue, and, so far, the few attempts either make no use of the motion information
[20], or operate on hand-crafted inputs (spatio-temporal outer boundaries volume in
[11] or hand-wired combination of multiple input channels in [10]). In addition,
since these models take as input a small number of consecutive frames (typically
fewer than 15), they are trained to assign a vector of features (and a label) to
short sub-sequences and not to the entire sequence. Thus, even if the learned
features, taken individually, contain temporal information, their evolution over
time is completely ignored. However, we have shown in our previous work [1] that
such information does help to discriminate between actions, and is particularly
usable by a category of learning machines adapted to sequential data, namely
Long Short-Term Memory recurrent neural networks (LSTM) [6].
In this paper, we propose a two-step neural-based deep model for human
action recognition. The first part of the model, based on the extension of
ConvNets to the 3D case, automatically learns spatio-temporal features. The second
step then uses these learned features to train a recurrent neural network
model in order to classify the entire sequence. We evaluate the performance on
the KTH dataset [24], taking particular care to follow the evaluation protocol
recommendations discussed in [4]. We show that, without using the LSTM
classifier, we obtain results comparable with other approaches based on deep
models [9,26,10]. We also demonstrate that the introduction of the LSTM
classification leads to a significant performance improvement, reaching average
accuracies among the best related results.
The rest of the paper is organized as follows. Section 2 outlines some Conv-
Nets fundamentals and the feature learning process. We present in Section 3
the recurrent neural scheme for entire sequence labelling. Finally, experimental
results, carried out on the KTH dataset, will be presented in Section 4.
2 Deep Learning of Spatio-Temporal Features
In this section, we describe the first part of our neural recognition scheme. We
first present some fundamentals of 2D-ConvNets, and then discuss their exten-
sion in 3D and describe the proposed architecture.
2.1 Convolutional Neural Networks (ConvNets)
Despite their generic nature, deep models were not used in many applications
until the late nineties because of their inability to treat “real world” data.
Indeed, early deep architectures dealt only with 1-D data or small 2D-patches.
The main problem was that the input was “fully connected” to the model, and
thus the number of free parameters was directly related to the input dimension,
making these approaches inappropriate for handling “pictorial” inputs (natural
images, videos, etc.).
Therefore, the convolutional architecture was introduced by LeCun et al.
[14,15] to alleviate this problem. ConvNets are the adaptation of multilayered
neural deep architectures to deal with real world data. This is done by the use of
local receptive fields whose parameters are forced to be identical at all
possible locations, a principle called weight sharing. Schematically, LeCun’s
ConvNet architecture [14,15] is a succession of layers alternating 2D-convolutions
(to capture salient information) and sub-samplings (to reduce dimension), both with
trainable weights. Jarrett et al. [8] have recommended the use of rectification
layers (which simply apply the absolute value to their input) after each
convolution, which was shown to significantly improve performance when the input
data is normalized.
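To make the effect of weight sharing concrete, the following Python sketch compares the number of free parameters of a fully connected layer with that of a single shared-weight feature map producing the same number of output units. The 34 × 54 input and the 7 × 7 receptive field are illustrative values chosen here, and the function names are hypothetical; this is not a description of any particular published network.

```python
def fully_connected_params(in_h, in_w, n_units):
    """Each unit has one weight per input pixel plus one bias."""
    return n_units * (in_h * in_w + 1)


def shared_weight_params(kernel_h, kernel_w, n_maps):
    """Each feature map reuses one kernel (plus one bias) at every location."""
    return n_maps * (kernel_h * kernel_w + 1)


in_h, in_w = 34, 54   # illustrative input size (assumption)
k = 7                 # illustrative receptive field size (assumption)

# Number of positions of a k x k window inside the input (no padding, stride 1).
out_h, out_w = in_h - k + 1, in_w - k + 1

fc = fully_connected_params(in_h, in_w, n_units=out_h * out_w)
conv = shared_weight_params(k, k, n_maps=1)
print(f"fully connected: {fc} parameters, shared weights: {conv} parameters")
```

With shared weights, the parameter count depends only on the kernel size and the number of feature maps, and no longer on the input dimension.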
In the next sub-section, we examine the adaptation of ConvNets to video
processing, and describe the 3D-ConvNets architecture that we used in our ex-
periments on the KTH dataset.
2.2 Automated Space-Time Feature Construction with 3D-ConvNets
The extension from 2D to 3D in terms of architecture is straightforward since
2D convolutions are simply replaced by 3D ones, to handle video inputs. Our
proposed architecture, illustrated in Figure 1, also uses 3D convolutions, but
differs from [11] and [10] in that it uses only raw inputs.
Fig. 1. Our 3D-ConvNet architecture for spatio-temporal features construction
This architecture consists of 10 layers including the input. There are two
alternating convolutional, rectification and sub-sampling stages (C1, R1, S1 and
C2, R2, S2), followed by a third convolution layer C3 and two neuron layers
N1 and N2. The size of the 3D input layer is 34 × 54 × 9, corresponding to
9 successive frames of 34 × 54 pixels each. Layer C1 is composed of 7 feature
maps of size 28 × 48 × 5 pixels. Each unit in each feature map is connected to
a 3D 7 × 7 × 5 neighborhood in the input retina. Layer R1 is composed of
7 feature maps, each connected to one feature map in C1, and simply applies
the absolute value to its input. Layer S1 is composed of 7 feature maps of size
14 × 24 × 5, each connected to one feature map in R1. S1 performs sub-sampling
by a factor of 2 in the spatial domain, aiming to build robustness to small spatial
distortions. The connection scheme between layers S1 and C2 follows the same
principle described in [5], so that the C2 layer has 35 feature maps performing
5 × 5 × 3 convolutions. Layers R2 and S2 follow the same principle described
above for R1 and S1. Finally, layer C3 consists of 5 feature maps fully connected
to S2 and performing 3 × 3 × 3 convolutions. At this stage, each C3 feature
map contains 3 × 8 × 1 values, and thus the input information is encoded in a
vector of size 120. This vector can be interpreted as a descriptor of the salient
spatio-temporal information extracted from the input. Finally, layers N1 and N2
form a classical multilayer perceptron with one neuron per action in the output
layer. This architecture corresponds to a total of 17,169 trainable parameters
(which is about 15 times fewer than the architecture used in [10]). To train this
model, we used the algorithm proposed in [14], which is the standard online
Backpropagation with momentum algorithm, adapted to weight sharing.
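The layer sizes quoted above can be checked with a few lines of Python. The sketch below assumes convolutions without padding and with stride 1, and sub-sampling by a factor of 2 in the two spatial dimensions only; the C2 and S2 sizes are not stated in the text and are derived here under these assumptions. The helper names are hypothetical.

```python
def conv_valid(shape, kernel):
    """Output size of a convolution with no padding and stride 1."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))


def subsample_spatial(shape, factor):
    """Sub-sample height and width only; the temporal size is unchanged."""
    height, width, frames = shape
    return (height // factor, width // factor, frames)


# Rectification layers R1 and R2 keep the size of their input, so they are omitted.
layers = [
    ("C1", conv_valid, (7, 7, 5)),
    ("S1", subsample_spatial, 2),
    ("C2", conv_valid, (5, 5, 3)),
    ("S2", subsample_spatial, 2),
    ("C3", conv_valid, (3, 3, 3)),
]

shape = (34, 54, 9)   # input: 9 frames of 34 x 54 pixels
shapes = {"input": shape}
for name, operation, argument in layers:
    shape = operation(shape, argument)
    shapes[name] = shape

n_c3_maps = 5
descriptor_size = n_c3_maps * shape[0] * shape[1] * shape[2]

for name, size in shapes.items():
    print(name, size)
print("descriptor size:", descriptor_size)
```

Under these assumptions the trace reproduces the 28 × 48 × 5, 14 × 24 × 5 and 3 × 8 × 1 sizes given in the text, and the 5 C3 feature maps yield the 120-value descriptor.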
Fig. 2. A subset of 3 automatically constructed C1 feature maps (of 7 total), each
one corresponding, from left to right, to the actions walking, boxing, hand-clapping
and hand-waving from the KTH dataset
Once the 3D-ConvNet is trained on KTH actions, and since the spatio-temporal
feature construction process is fully automated, it is interesting to examine
whether the learned features are visually interpretable. We report in Figure 2
a subset of learned C1 feature maps, each corresponding to some actions from
the KTH dataset. Even if finding a direct link with engineered features is not
straightforward (and not necessarily required), the learned feature maps seem to
capture visually relevant information (person/background segmentation, limbs
involved during the action, edge information, etc.).
Fig. 3. An overview of our two-step neural recognition scheme
In the next section, we describe how these features are used to feed a recurrent
neural network classifier, which is trained to recognize the actions based on the
temporal evolution of features.
3 Sequence Labelling Considering the Temporal
Evolution of Learned Features
Once the features are automatically constructed with the 3D-ConvNet architecture
as described in Section 2, we propose to learn to label the entire sequence
based on the accumulation of several individual decisions, each corresponding to
a small temporal neighbourhood which was involved during the 3D-ConvNets
learning process (see Figure 3). This makes it possible to take advantage of the
temporal evolution of the features, in contrast with a majority voting process on
the individual decisions.
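For reference, the majority voting baseline mentioned above can be sketched as follows. This is a minimal illustration with hypothetical names and made-up decisions, not the implementation used in the experiments. It shows that the vote depends only on label counts, so the order of the sub-sequence decisions (and hence their temporal evolution) plays no role.

```python
from collections import Counter


def majority_vote(clip_labels):
    """Return the most frequent label among per-sub-sequence decisions."""
    if not clip_labels:
        raise ValueError("at least one decision is required")
    # On ties, Counter.most_common keeps the label that was encountered first.
    return Counter(clip_labels).most_common(1)[0][0]


# Hypothetical decisions for five consecutive sub-sequences of one video.
decisions = ["walking", "jogging", "walking", "running", "walking"]
sequence_label = majority_vote(decisions)
reversed_label = majority_vote(list(reversed(decisions)))
print(sequence_label, reversed_label)
```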
Among state-of-the-art learning machines, Recurrent Neural Networks (RNN)
are one of the most used for temporal analysis of data, because of their ability to
take into account the context using recurrent connections in the hidden layers.
It has been demonstrated in [6] that although RNNs are able to learn tasks which
involve short time lags between inputs and corresponding teacher signals, this
short-term memory becomes insufficient when dealing with “real world” sequence
processing, e.g. video sequences. In order to alleviate this problem, Schmidhuber
et al. [6] have proposed a specific recurrent architecture, namely Long Short-Term
Memory (LSTM). These networks use a special node called Constant Error Carousel
(CEC), which allows for constant error signal propagation through time. The
second key idea in LSTM is the use of multiplicative gates to control the access to
the CEC. We have shown in our previous work [1] that LSTMs are effective at
labelling sequences of descriptors corresponding to hand-crafted features.
Fig. 4. A sample of actions/scenarios from the KTH dataset [24]
In order to classify the action sequences, we propose to use a Recurrent Neural
Network architecture with one hidden layer of LSTM cells. The input layer of
this RNN consists of the 120 C3 output values per time step. LSTM cells are fully
connected to these inputs and also have recurrent connections with all the LSTM
cells. The output layer consists of neurons connected to the LSTM outputs at each
time step. We have tested several network configurations, varying the number of
hidden LSTM cells. A configuration of 50 LSTM cells was found to be a good
compromise for this classification task. This architecture corresponds to about
25,000 trainable parameters. The network was trained with online backpropagation
through time with momentum [6].
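The data flow through such a network can be sketched in plain Python as follows. The sizes (120 inputs, 50 LSTM cells, 6 outputs) come from the text. Everything else is an assumption made for illustration: a common LSTM formulation with input, forget and output gates and no peephole connections, a softmax output layer, and random untrained weights. The exact cell variant and output nonlinearity used in the experiments are not specified here, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_rows, n_cols, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]
    return weights, [0.0] * n_rows


def init_params(n_in=120, n_hidden=50, n_out=6, seed=0):
    rng = random.Random(seed)
    params = {gate: init_layer(n_hidden, n_in + n_hidden, rng)
              for gate in ("input", "forget", "output", "cell")}
    params["readout"] = init_layer(n_out, n_hidden, rng)
    return params


def lstm_forward(features, params):
    """features: list of T vectors of length n_in; returns T score vectors."""
    n_hidden = len(params["input"][1])
    h = [0.0] * n_hidden   # LSTM outputs, fed back through recurrent connections
    c = [0.0] * n_hidden   # cell states (the contents of the CECs)
    scores = []
    for x in features:
        z = list(x) + h    # each cell sees the inputs and all previous LSTM outputs
        i = [sigmoid(a) for a in affine(*params["input"], z)]    # input gates
        f = [sigmoid(a) for a in affine(*params["forget"], z)]   # forget gates
        o = [sigmoid(a) for a in affine(*params["output"], z)]   # output gates
        g = [math.tanh(a) for a in affine(*params["cell"], z)]   # candidate inputs
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        scores.append(softmax(affine(*params["readout"], h)))
    return scores


rng = random.Random(1)
# Stand-in for 20 time steps of 120-value C3 descriptors (random numbers here).
sequence = [[rng.gauss(0.0, 1.0) for _ in range(120)] for _ in range(20)]
scores = lstm_forward(sequence, init_params())
print(len(scores), len(scores[0]))
```

The multiplicative gates `i`, `f` and `o` control what is written to, kept in and read from the cell state `c`, which is the role attributed to the gates around the CEC in the description above.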
4 Experiments on KTH Dataset
The KTH dataset was provided by Schuldt et al. [24] in 2004 and is the most
commonly used public human actions dataset. It contains 6 types of actions
(walking, jogging, running, boxing, hand-waving and hand-clapping) performed
by 25 subjects in 4 different scenarios including indoor, outdoor, changes in
clothing and variations in scale (see Figure 4). The image size is 160 × 120
pixels, and the temporal resolution is 25 frames per second. There are considerable
variations in duration and viewpoint. All sequences were taken over homogeneous
backgrounds, but hard shadows are present.
As in [4], we name the KTH dataset in two ways: the first one (the original
one), where each person performs the same action 3 or 4 times in the same video,
is named KTH1 and contains 599 long sequences (with a length between 8 and
59 seconds) with several “empty” frames between action iterations. The second,
named KTH2, is obtained by splitting the videos into smaller ones in which a person
performs an action only once, and contains 2391 sequences (with a length between 1
and 14 seconds).
4.1 Evaluation Protocol
In [4], Gao et al. presented a comprehensive study on the influence of the
evaluation protocol on the final results. It was shown that the use of differ-
ent experimental configurations can lead to performance differences up to 9%.
Furthermore, the authors demonstrated that the same method, when evaluated on
KTH1 or KTH2, can show performance deviations of over 5.85%. Action recognition
methods are usually directly compared although they use different testing
protocols and/or datasets (KTH1 or KTH2), which distorts the conclusions. In
this paper, we choose to evaluate our method using cross-validation, in which
16 randomly selected persons are used for training, and the other 9 for testing.
Recognition performance corresponds to the average across 5 trials. Evaluations
are performed on both KTH1 and KTH2.
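The person-wise splitting described above can be sketched as follows. This is a minimal illustration: the function name, the numbering of subjects from 1 to 25 and the random seed are assumptions made here, and the actual splits used in the experiments are not reproduced.

```python
import random


def subject_splits(n_subjects=25, n_train=16, n_trials=5, seed=0):
    """Yield (train_ids, test_ids) pairs with disjoint sets of persons."""
    rng = random.Random(seed)
    subjects = list(range(1, n_subjects + 1))
    for _ in range(n_trials):
        shuffled = rng.sample(subjects, k=n_subjects)   # shuffled copy
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


splits = list(subject_splits())
for train_ids, test_ids in splits:
    print(len(train_ids), "training persons,", len(test_ids), "test persons")
```

Splitting by person rather than by sequence ensures that no subject appears in both the training and the test set of a trial; the reported accuracy would then be the mean of the per-trial accuracies.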
4.2 Experimental Results
The two-step model was trained as described above. Original videos underwent
the following steps: spatial down-sampling by a factor of 2 horizontally
and vertically to reduce the memory requirement, extracting the person-centred
bounding box as in [9,10], and applying 3D Local Contrast Normalization on
a 7 × 7 × 7 neighbourhood, as recommended in [8]. Note that we do not use
any complex pre-processing (optical flow, gradients, motion history, etc.). We also
generated vertically flipped and mirrored versions of each training sample to
increase the number of examples. In our experiments, we observed no overtraining,
for both the 3D-ConvNets and the LSTM, even though no validation sequence was
used and training was stopped when performance on the training set no longer rose.
The results obtained, corresponding to 5 randomly selected training/test
configurations, are reported in Table 1.
Table 1. Summary of experimental results using 5 randomly selected configurations
from KTH1 and KTH2

Dataset  Method                 Config.1  Config.2  Config.3  Config.4  Config.5  Average
KTH1     3D-ConvNet + Voting    90.79     90.24     91.42     91.17     91.62     91.04
KTH1     3D-ConvNet + LSTM      92.69     96.55     94.25     93.55     94.93     94.39
KTH2     3D-ConvNet + Voting    89.14     88.55     89.89     89.45     89.97     89.40
KTH2     3D-ConvNet + LSTM      91.50     94.64     90.47     91.31     92.97     92.17
KTH2     Harris-3D [13] + LSTM  84.87     90.64     88.32     90.12     84.95     87.78
The 3D-ConvNet, combined with majority voting on short sub-sequences, gives
results (91.04%) comparable to other deep model based approaches [9,10,26]. We
especially note that results with this simple non-sequential approach are almost
the same as those obtained in [10], with a 15 times smaller 3D-ConvNet model,
and without using either gradients or optical flow as input. We also notice that
the first step of our model gives relatively stable results on the 5 configurations,
compared to the fluctuations generally observed for the other methods [4]. The
LSTM contribution is quite important, increasing performance by about 3%.
The improvement on KTH1 (+3.35%) is higher than on KTH2, which confirms that
LSTMs are better suited to long sequences.
In order to point out the benefit of using automatically learned features, we
also evaluated the combination of the LSTM classifier with common engineered
space-time salient points. This was done by applying the Harris-3D [13] detector
to each video sequence, and calculating the HOF descriptor (as recommended in
[27] for KTH) around each detected point. We used the original implementation
available online¹ and standard parameter settings. An LSTM classifier was then
trained taking as input a temporally ordered succession of descriptors. The results
obtained, reported in Table 1, show that our learned 3D-ConvNet features, in
addition to their generic nature, perform better on KTH2 than hand-crafted
ones, with a performance improvement of 4.39%.
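The averages and gains quoted in this section can be recomputed from the per-configuration accuracies of Table 1, as in the sketch below. The dictionary layout and variable names are choices made here. The recomputed means can differ from the reported ones by one unit in the last digit (for instance 91.048 against the reported 91.04), which would be consistent with truncation rather than rounding, although this is an assumption.

```python
from statistics import mean

# Per-configuration accuracies (%) copied from Table 1.
table1 = {
    ("KTH1", "3D-ConvNet + Voting"): [90.79, 90.24, 91.42, 91.17, 91.62],
    ("KTH1", "3D-ConvNet + LSTM"): [92.69, 96.55, 94.25, 93.55, 94.93],
    ("KTH2", "3D-ConvNet + Voting"): [89.14, 88.55, 89.89, 89.45, 89.97],
    ("KTH2", "3D-ConvNet + LSTM"): [91.50, 94.64, 90.47, 91.31, 92.97],
    ("KTH2", "Harris-3D + LSTM"): [84.87, 90.64, 88.32, 90.12, 84.95],
}

averages = {key: mean(values) for key, values in table1.items()}

# Gain brought by the LSTM over majority voting, per dataset.
lstm_gain_kth1 = (averages[("KTH1", "3D-ConvNet + LSTM")]
                  - averages[("KTH1", "3D-ConvNet + Voting")])
lstm_gain_kth2 = (averages[("KTH2", "3D-ConvNet + LSTM")]
                  - averages[("KTH2", "3D-ConvNet + Voting")])

# Gain of learned features over Harris-3D + HOF, both classified with LSTM.
learned_gain_kth2 = (averages[("KTH2", "3D-ConvNet + LSTM")]
                     - averages[("KTH2", "Harris-3D + LSTM")])

for (dataset, method), value in averages.items():
    print(f"{dataset} {method}: {value:.3f}")
print(f"LSTM gain: KTH1 {lstm_gain_kth1:.2f}, KTH2 {lstm_gain_kth2:.2f}")
print(f"Learned vs hand-crafted features on KTH2: {learned_gain_kth2:.2f}")
```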
To conclude, our two-step sequence labelling scheme achieves an overall
accuracy of 94.39% on KTH1 and 92.17% on KTH2. These results, and others
among the best performing of related work on the KTH dataset, are reported in
Table 2.
Table 2. Obtained results and comparison with the state of the art on the KTH
dataset: methods marked with * correspond to deep model approaches, and the
others to those using hand-crafted features

Dataset  Evaluation protocol           Method                   Accuracy
KTH1     Cross validation with 5 runs  Our method *             94.39
KTH1     Cross validation with 5 runs  Jhuang et al. [9] *      91.70
KTH1     Cross validation with 5 runs  Gao et al. [4]           95.04
KTH1     Cross validation with 5 runs  Schindler and Gool [23]  92.70
KTH1     Leave-one-out                 Gao et al. [4]           96.33
KTH1     Leave-one-out                 Chen and Hauptmann [2]   95.83
KTH1     Leave-one-out                 Liu and Shah [17]        94.20
KTH1     Leave-one-out                 Sun et al. [25]          94.0
KTH1     Leave-one-out                 Niebles et al. [19]      81.50
KTH2     Cross validation with 5 runs  Our method *             92.17
KTH2     Cross validation with 5 runs  Ji et al. [10] *         90.20
KTH2     Cross validation with 5 runs  Gao et al. [4]           93.57
KTH2     Cross validation with 5 runs  Taylor et al. [26] *     90.00
KTH2     Other protocols               Kim et al. [12]          95.33
KTH2     Other protocols               Ikizler et al. [7]       94.00
KTH2     Other protocols               Laptev et al. [13]       91.80
KTH2     Other protocols               Dollar et al. [3]        81.20
Table 2 shows that our approach outperforms all related deep model works
[9,10,26], both on KTH1 and KTH2. One can notice that our recognition scheme
outperforms the HMAX model proposed by Jhuang et al. [9], although the latter is
of hybrid nature, since its low- and mid-level features are engineered and the
learned ones are constructed automatically only at the very last stage.
1Available at
For each dataset, Table 2 is divided into two groups: the first group consists of
the methods which can be directly compared with ours, i.e. those using the same
evaluation protocol (which is cross validation with 5 randomly selected splits of
the dataset into training and test). The second one includes the methods that
use different protocols, and for which the comparison is therefore only
indicative. Among the methods of the first group, to our knowledge, our method
obtained the second best accuracy, both on KTH1 and KTH2, the best score
being obtained by Gao et al. [4]. Note that the results in [4] correspond to
the average on the 5 best runs over 30 total, and that these classification rates
decrease to 90.93% for KTH1 and 88.49% for KTH2 if averaging on the 5 worst
runs.
More generally, our method gives results comparable with the best related
work on the KTH dataset, even with methods relying on engineered features, and
those evaluated using protocols which were shown to markedly increase
performance (e.g. leave-one-out). This is a very promising result considering the
fact that all the steps of our scheme are based on automatic learning, without
the use of any prior knowledge.
5 Conclusion and Discussion
In this paper, we have presented a neural-based deep model to classify sequences
of human actions, without a priori modeling, relying only on automatic learning
from training examples. Our two-step scheme automatically learns spatio-temporal
features and uses them to classify the entire sequences. Despite its
fully automated nature, experimental results on the KTH dataset show that the
proposed model gives competitive results, among the best of related work, both
on KTH1 (94.39%) and KTH2 (92.17%).
As future work, we will investigate the possibility of using a single-step model,
in which the 3D-ConvNet architecture described in this paper is directly con-
nected to the LSTM sequence classifier. This could considerably reduce com-
putation time, since the complete model is trained once. The main difficulty
will be the adaptation of the training algorithm, especially when calculating the
retro-propagated error.
Furthermore, even if KTH remains the most widely used dataset for human
action recognition, recent works are increasingly interested in other more
challenging datasets, which contain complex actions and realistic scenarios.
Therefore, we plan to verify the genericity of our approach by testing it on recent
challenging datasets, e.g. the Hollywood-2 dataset [18], the UCF sports action
dataset [21], the YouTube action dataset [16], the UT-Interaction dataset [22] or
the LIRIS human activities dataset². This will allow us to confirm the benefit of
the learning-based feature extraction process, since we expect to obtain stable
performance on these datasets despite their high diversity, which is not the case
for approaches based on hand-crafted features.
2Available at
References
1. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Action Classifica-
tion in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks.
In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010. LNCS, vol. 6353,
pp. 154–159. Springer, Heidelberg (2010)
2. Chen, M.y., Hauptmann, A.: MoSIFT: Recognizing human actions in surveillance
videos. Tech. Rep. CMU-CS-09-161, Carnegie Mellon University (2009)
3. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse
spatio-temporal features. In: Joint IEEE International Workshop on Visual Surveil-
lance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
4. Gao, Z., Chen, M.-y., Hauptmann, A.G., Cai, A.: Comparing Evaluation Protocols
on the KTH Dataset. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.)
HBU 2010. LNCS, vol. 6219, pp. 88–100. Springer, Heidelberg (2010)
5. Garcia, C., Delakis, M.: Convolutional face finder: a neural architecture for fast
and robust face detection. IEEE Transactions on Pattern Analysis and Machine
Intelligence 26(11), 1408–1423 (2004)
6. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with
LSTM recurrent networks. Journal of Machine Learning Research 3, 115–143 (2003)
7. Ikizler, N., Cinbis, R., Duygulu, P.: Human action recognition with line and flow
histograms. In: International Conference on Pattern Recognition, pp. 1–4 (2008)
8. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-
stage architecture for object recognition? In: International Conference on Computer
Vision, pp. 2146–2153 (2009)
9. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action
recognition. In: International Conference on Computer Vision, pp. 1–8 (2007)
10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human
action recognition. In: International Conference on Machine Learning, pp. 495–502
(2010)
11. Kim, H.J., Lee, J., Yang, H.S.: Human Action Recognition Using a Modified Con-
volutional Neural Network. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.)
ISNN 2007. LNCS, vol. 4492, pp. 715–723. Springer, Heidelberg (2007)
12. Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis for ac-
tion classification. In: International Conference on Computer Vision and Pattern
Recognition, pp. 1–8 (2007)
13. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: International Conference on Computer Vision and Pattern
Recognition, pp. 1–8 (2008)
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
15. LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications
in vision. In: IEEE International Symposium on Circuits and Systems, pp. 253–256
(2010)
16. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild.
In: International Conference on Computer Vision and Pattern Recognition, pp.
1996–2003 (2009)
17. Liu, J., Shah, M.: Learning human actions via information maximization. In: Inter-
national Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
18. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: International Con-
ference on Computer Vision and Pattern Recognition, pp. 2929–2936 (2009)
19. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories
using spatial-temporal words. International Journal of Computer Vision 79, 299–
318 (2008)
20. Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., Barbano, P.E.: Toward
automatic phenotyping of developing embryos from videos. IEEE Transactions on
Image Processing 14(9), 1360–1371 (2005)
21. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum
average correlation height filter for action recognition. In: Computer Vision and
Pattern Recognition, pp. 1–8 (2008)
22. Ryoo, M., Aggarwal, J.: Spatio-temporal relationship match: Video structure com-
parison for recognition of complex human activities. In: International Conference
on Computer Vision, pp. 1593–1600 (2009)
23. Schindler, K., van Gool, L.: Action snippets: How many frames does human action
recognition require? In: International Conference on Computer Vision and Pattern
Recognition, pp. 1–8 (2008)
24. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM
approach. In: International Conference on Pattern Recognition, vol. 3, pp. 32–36
(2004)
25. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors and
holistic features. In: International Conference on Computer Vision and Pattern
Recognition Workshops, pp. 58–65 (2009)
26. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional Learning of Spatio-
temporal Features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010.
LNCS, vol. 6316, pp. 140–153. Springer, Heidelberg (2010)
27. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local
spatio-temporal features for action recognition. In: British Machine Vision Confer-
ence (2009)
... A summary of certain methods which have been presented before and utilized DNN approaches features for the tasks of the human action recognition. Baccouche et al. [81] for capturing the nature of the video data, the 3D-CNN was suggested. The network is trained in order to assign a limited number of succeeding frames a vector of spatiotemporal features, and after that uses the feature vectors for classifying the complete sequences. ...
Full-text available
Depending on the context of interest, an anomaly is defined differently. In the case when a video event isn't expected to take place in the video, it is seen as anomaly. It can be difficult to describe uncommon events in complicated scenes, but this problem is frequently resolved by using high-dimensional features as well as descriptors. There is a difficulty in creating reliable model to be trained with these descriptors because it needs a huge number of training samples and is computationally complex. Spatiotemporal changes or trajectories are typically represented by features that are extracted. The presented work presents numerous investigations to address the issue of abnormal video detection from crowded video and its methodology. Through the use of low-level features, like global features, local features, and feature features. For the most accurate detection and identification of anomalous behavior in videos, and attempting to compare the various techniques, this work uses a more crowded and difficult dataset and require light weight for diagnosing anomalies in objects through recording and tracking movements as well as extracting features; thus, these features should be strong and differentiate objects. After reviewing previous works, this work noticed that there is more need for accuracy in video modeling and decreased time, and since attempted to work on real-time and outdoor scenes.
... It has experienced a rapid rate of adoption in a variety of sectors. Examples include support for autonomous cars to navigate safely in traffic [4][5][6][7], detection of abusive behavior [8,9], facial detection [10,11], human behavior analysis [12,13], and medical imaging such as cancer detection [14,15], robotics [16,17], general image processing techniques such as cropping, orientation detection, and contrast enhancement [18][19][20][21], remote sensing applications [214][215][216], and many other use cases [217][218][219]. Regarding future use cases for object detection, the possibilities are endless. ...
Full-text available
Detecting objects remains one of computer vision and image understanding applications’ most fundamental and challenging aspects. Significant advances in object detection have been achieved through improved object representation and the use of deep neural network models. This paper examines more closely how object detection has evolved in the era of deep learning over the past years. We present a literature review on various state-of-the-art object detection algorithms and the underlying concepts behind these methods. We classify these methods into three main groups: anchor-based, anchor-free, and transformer-based detectors. Those approaches are distinct in the way they identify objects in the image. We discuss the insights behind these algorithms and experimental analyses to compare quality metrics, speed/accuracy tradeoffs, and training methodologies. The survey compares the major convolutional neural networks for object detection. It also covers the strengths and limitations of each object detector model and draws significant conclusions. We provide simple graphical illustrations summarising the development of object detection methods under deep learning. Finally, we identify where future research will be conducted.
A multilayer perceptron (MLP), a feedforward neural network with one hidden layer, is called reducible if a hidden unit can be removed without changing the input–output function. In the MLP’s search space, the MLP is reducible in some regions, and some regions in the reducible regions have zero gradients, where learning stagnates. Nonetheless, some methods have been proposed to leverage such regions. A reducible region of an MLP with J hidden units can be generated from an MLP with \(J-1\) hidden units. To begin learning from a reducible region guarantees a monotonically decreasing training error as the number of hidden units increases. The evaluation experiments reveal that the methods using reducible regions stably determine better solutions than existing methods. In addition, methods using reducible regions can stably obtain high-quality solutions not only for MLPs, but also for complex-valued MLPs, and radial basis function networks. In this study, we show that the search space of a recurrent neural network also has reducible regions. In addition, we propose a method that utilizes reducible regions to obtain higher quality solutions in a stable manner, as compared to existing methods with randomly set initial weights.
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making it difficult to be deployed on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zeros, resulting in large activation sparsity. By exploiting such sparsity in CNN models, we propose a software-hardware co-design BitSET, that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based bit-level early termination technique that terminates the ineffectual computation of negative outputs early. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages the bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss due to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average of 1.6× speedup and 1.4× energy efficiency improvement.
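The activation sparsity that the abstract above relies on can be illustrated with a minimal sketch. It only shows how ReLU clamps negative outputs to zero and how the resulting fraction of zeros can be measured; it does not reproduce BitSET's prediction or early-termination logic, and the input values are hypothetical.

```python
def relu(values):
    """Clamp negative values to zero, as a ReLU activation layer does."""
    return [max(0.0, v) for v in values]


def sparsity(values):
    """Return the fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activation outputs of a convolutional layer.
pre_activations = [-1.5, 0.3, -0.2, 2.0, -0.7, 0.0, 1.1, -3.0]
activations = relu(pre_activations)

print(sparsity(pre_activations))  # fraction of zeros before ReLU
print(sparsity(activations))      # fraction of zeros after ReLU
```

Every negative pre-activation becomes a zero after ReLU, so any work spent computing its exact value is wasted; this is the kind of ineffectual computation the abstract describes terminating early.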
Human activity recognition is essential in many domains, including the medical and smart home sectors. We conduct a comprehensive survey of the current state and future directions of deep learning-based human activity recognition (HAR). Key contributions of deep learning to the advancement of HAR, including sensor and video modalities, are the focus of this review. A wide range of databases and performance metrics used in the implementation of HAR methodologies are described in depth. This paper explores the wide range of HAR’s potential uses, from healthcare, emotion calculation and assisted living to security and education. The paper provides an in-depth analysis of the most significant works that employ deep learning techniques for a variety of HAR downstream tasks across both the video and sensor domains, including the most recent advances. Finally, it addresses problems and limitations in the current state of HAR research and proposes future research avenues for advancing the field.
Biometric schemes recognize human beings on the basis of discernible or corporeal attributes. In the computer vision and pattern recognition domain, face recognition is the subject of ongoing study. Given the constant development of imaging sensors, a host of novel problems has arisen. The chief issue remains how to discover the focus region more precisely for multi-focus face detection. Several studies have addressed face discernment, spotting, and protection acknowledgment; the key remaining problem is handling images that have “disparate dimensions” and “disparate aspect ratio” in a single frame, which, along with factors such as noise in face pictures, difficult lighting conditions and posture ratio, hinders progress toward attaining or surpassing human-level accuracy.
Detection and localization of activities in a human-centric manufacturing assembly operation will help improve manufacturing process optimization. Through the human-in-loop approach, the step time and cycle time of the manufacturing assemblies can be continuously monitored, thereby identifying bottlenecks and updating lead times instantaneously. Autonomous and continuous monitoring can also enable the detection of any anomalies in the assembly operation as they occur. Several studies have been conducted that aim to detect and localize human actions, but they mostly exist in the domain of healthcare, video understanding, etc. The work on detection and localization of actions in a manufacturing assembly operation is limited. Hence, in this work, we aim to review the process of human action detection and localization in the context of manufacturing assemblies. We aim to provide a holistic review that covers the current state-of-the-art approaches in human activity detection across different problem domains and explore the prospect of applying them to manufacturing assemblies. Additionally, we aim to provide a complete review of the current state of research in human-centric assembly operation monitoring and explore prospective future research directions.
Convolutional neural networks are designed to work with grid-structured inputs, which have strong spatial dependencies in local regions of the grid. The most obvious example of grid-structured data is a 2-dimensional image.
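The local spatial dependency described above can be made concrete with a small sketch: each output value depends only on a small window of neighbouring grid cells. The function below is a plain "valid" sliding-window operation (cross-correlation, which is what CNN layers typically compute); the image and kernel values are hypothetical and chosen only to make the effect visible.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map.

    Each output cell is the weighted sum of one local window of the grid.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 3x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 1x2 kernel computing a horizontal difference.
kernel = [[-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only where the window straddles the edge, which shows how a small shared kernel picks up local structure anywhere on the grid.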
With the convenience brought by the rapid development of information, teaching reform has also gained momentum. The emergence of Internet+, multimedia technology, virtual reality technology and artificial intelligence has driven education toward informatization, and mobile learning has become a way for students to learn. This paper provides a reference for the reform of exercise classes in colleges and universities, enriches the teaching methods, improves the learning quality of students, and makes the action image processing technology better applied to physical education classrooms. In this paper, according to the characteristics of the human motion system, the training actions of sports are collected. After the 3D action data is properly calibrated and normalized, the motion vector is used as the training input. This method conforms to the laws of the human motion system. In terms of temporal feature extraction, this paper proposes a multi-level long short-term memory neural network structure for action recognition. The action image processing technology improves the accuracy of movements more significantly, and proves that it has a certain promotion effect on the self-editing ability and innovation ability of students. The final results of the research show that the accuracy of action recognition has been improved to a certain extent after image data processing. The accuracy of action recognition in images has increased from 67.13% to 91.23%, and the accuracy rate has remained above 90% after action image processing, which proves the effectiveness of the neural network algorithm processing method. Keywords: Virtual reality, Artificial intelligence, Image processing technology, Temporal features
In recent years, many researchers have studied the HAR (Human Activity Recognition) system. HAR using smart home sensor is based on computing in smart environment, and intelligent surveillance system conducts intensive research on peripheral support life. The previous system studied in some of the activities is a fixed motion and the methodology is less accurate. In this paper, vision-based studies using thermal imaging cameras improve the accuracy of motion recognition in intelligent surveillance systems. We use one of the deep learning architectures widely used in image recognition systems called Convolutional Neural Networks (CNN). Therefore, we use CNN and thermal cameras to provide accuracy and many features through the proposed method.
Conference Paper
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild”. Such unconstrained videos are abundant in personal collections as well as on the Web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.
Conference Paper
This paper exploits the context of natural dynamic scenes for human action recognition in video. Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors. The contribution of this paper is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We use movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in text and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action SVM-based classifier. We report experimental results and validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.
Conference Paper
We present a compact representation for human action recognition in videos using line and optical flow histograms. We introduce a new shape descriptor based on the distribution of lines which are fitted to boundaries of human figures. By using an entropy-based approach, we apply feature selection to densify our feature representation, thus, minimizing classification time without degrading accuracy. We also use a compact representation of optical flow for motion information. Using line and flow histograms together with global velocity information, we show that high-accuracy action recognition is possible, even in challenging recording conditions.
In this paper we propose a unified action recognition framework fusing local descriptors and holistic features. The motivation is that the local descriptors and holistic features emphasize different aspects of actions and are suitable for the different types of action databases. The proposed unified framework is based on frame differencing, bag-of-words and feature fusion. We extract two kinds of local descriptors, i.e. 2D and 3D SIFT feature descriptors, both based on 2D SIFT interest points. We apply Zernike moments to extract two kinds of holistic features, one is based on single frames and the other is based on motion energy image. We perform action recognition experiments on the KTH and Weizmann databases, using Support Vector Machines. We apply the leave-one-out and pseudo leave-N-out setups, and compare our proposed approach with state-of-the-art results. Experiments show that our proposed approach is effective. Compared with other approaches our approach is more robust, more versatile, easier to compute and simpler to understand.
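As a rough illustration of two building blocks named in the abstract above, the sketch below computes thresholded frame differences and accumulates them into a binary motion energy image (the union of all pixels where motion was detected). It is a generic textbook-style sketch, not the authors' implementation; the frames, the threshold value, and the function names are hypothetical.

```python
def frame_difference(prev_frame, curr_frame, threshold):
    """Return a binary mask: 1 where the absolute intensity change exceeds threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def motion_energy_image(frames, threshold):
    """Accumulate the binary frame differences of a sequence with a logical OR."""
    height, width = len(frames[0]), len(frames[0][0])
    mei = [[0] * width for _ in range(height)]
    for prev_frame, curr_frame in zip(frames, frames[1:]):
        diff = frame_difference(prev_frame, curr_frame, threshold)
        for y in range(height):
            for x in range(width):
                mei[y][x] = mei[y][x] | diff[y][x]
    return mei


# Hypothetical 2x3 grayscale sequence: a bright pixel moves along the top row.
frames = [
    [[9, 0, 0], [0, 0, 0]],
    [[0, 9, 0], [0, 0, 0]],
    [[0, 0, 9], [0, 0, 0]],
]

mei = motion_energy_image(frames, threshold=5)
print(mei)
```

The resulting image marks where motion occurred over the whole sequence, which is the kind of single-image summary from which a holistic descriptor could then be extracted.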
Conference Paper
In this paper, a human action recognition method using a hybrid neural network is presented. The method consists of three stages: preprocessing, feature extraction, and pattern classification. For feature extraction, we propose a modified convolutional neural network (CNN) which has a three-dimensional receptive field. The CNN generates a set of feature maps from the action descriptors which are derived from a spatiotemporal volume. A weighted fuzzy min-max (WFMM) neural network is used for the pattern classification stage. We introduce a feature selection technique using the WFMM model to reduce the dimensionality of the feature space. Two kinds of relevance factors between features and pattern classes are defined to analyze the salient features.
Conference Paper
Human action recognition has become a hot research topic, and a lot of algorithms have been proposed. Most researchers evaluated their performances on the KTH dataset, but there is no unified standard for evaluating algorithms on this dataset. Different researchers have employed different test setups, so the comparison is not accurate, fair or complete. In order to know how much difference there is when different experimental setups are used, we take our own spatio-temporal MoSIFT feature as an example to assess its performance on the KTH dataset using different test scenarios and different partitioning of the data. In all experiments, a support vector machine (SVM) with a chi-square kernel is adopted. First, we evaluate performance changes resulting from differing vocabulary sizes of the codebook, and then decide on a suitable vocabulary size of codebook. Then, we train the models using different training dataset partitions, and test the performances on the corresponding held-out test sets. Experiments show that the best performance of MoSIFT can reach 96.33% on the KTH dataset. When different n-fold cross-validation methods are used, there can be up to 10.67% difference in the result. And when different dataset segmentations are used (such as KTH1 and KTH2), the difference in results can be up to 5.8% absolute. In addition, the performance changes dramatically when different scenarios are used in the training and test dataset. When training on KTH1 S1+S2+S3+S4 and testing on KTH1 S1 and S3 scenarios, the performance can reach 97.33% and 89.33% respectively. This paper shows how different test configurations can skew results, even on a standard data set. The recommendation is to use a simple leave-one-out as the most easily replicable clear-cut partitioning. Keywords: Action recognition, training/test data sets, partitioning, experimental methods
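The leave-one-out partitioning recommended in the abstract above can be sketched as follows. The sketch assumes each video clip carries a label for the unit being held out (here a hypothetical subject id, which the abstract does not specify); with a unique label per clip it reduces to classic leave-one-sample-out. The function name and the example labels are illustrative only.

```python
def leave_one_out_splits(unit_labels):
    """Yield (held_out, train_indices, test_indices), holding out one unit at a time.

    `unit_labels[i]` names the unit (e.g. a subject) that sample i belongs to.
    """
    for held_out in sorted(set(unit_labels)):
        train = [i for i, label in enumerate(unit_labels) if label != held_out]
        test = [i for i, label in enumerate(unit_labels) if label == held_out]
        yield held_out, train, test


# Hypothetical subject id for each of six clips.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
splits = list(leave_one_out_splits(subjects))

for held_out, train, test in splits:
    print(held_out, train, test)
```

Because the partitioning is fully determined by the labels, two groups using the same scheme evaluate on identical splits, which is the replicability argument the abstract makes.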
Conference Paper
Human activity recognition is a challenging task, especially when its background is unknown or changing, and when scale or illumination differs in each video. Approaches utilizing spatio-temporal local features have proved that they are able to cope with such difficulties, but they mainly focused on classifying short videos of simple periodic actions. In this paper, we present a new activity recognition methodology that overcomes the limitations of the previous approaches using local features. We introduce a novel matching, spatio-temporal relationship match, which is designed to measure structural similarity between sets of features extracted from two videos. Our match hierarchically considers spatio-temporal relationships among feature points, thereby enabling detection and localization of complex non-periodic activities. In contrast to previous approaches to 'classify' videos, our approach is designed to 'detect and localize' all occurring activities from continuous videos where multiple actors and pedestrians are present. We implement and test our methodology on a newly-introduced dataset containing videos of multiple interacting persons and individual pedestrians. The results confirm that our system is able to recognize complex non-periodic activities (e.g. 'push' and 'hug') from sets of spatio-temporal features even when multiple activities are present in the scene.