To cite this article: Jiahui An et al 2020 J. Phys.: Conf. Ser. 1607 012116
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
Journal of Physics: Conference Series 1607 (2020) 012116
IOP Publishing
Summary of continuous action recognition
Jiahui An1, Xinrong Cheng1*, Qing Wang1, Hong Chen1, Jiayue Li1 and Shiji Li1
1 Department of Computer Science and Technology, China Agricultural University,
Beijing 100083, China
Abstract. In the field of human-computer interaction, it is very important for computers to understand human behavior, so human action recognition is of great significance. However, most current action recognition work targets pre-segmented action data; by comparison, continuous action recognition has received much less research attention. This paper therefore surveys research on continuous action recognition, covering action feature extraction, action classification, and continuous action segmentation.
1. Introduction
Human-computer interaction refers to the exchange of information between human and computer in ways that are efficient and natural. People can interact with the computer through different channels such as vision, touch, speech, gestures, expressions and eye movements.
People hope that computers will become increasingly intelligent, able to "see" and "listen to" the world as humans do. Computer vision technology, in particular, lets computers "see" the world. Within computer vision, action recognition holds an important position: it enables a computer to understand human movements and thus interact with people more naturally. It has already been applied in video surveillance, gaming, medicine and virtual reality.
Most action recognition research still targets single actions; that is, it assumes the input actions are independent of each other or have already been segmented, so the start and end points of each action sequence must be selected manually [1]. In practice, however, non-core actions occur before and after an action of interest. To recognize the core action, one must manually mark the start and end points when calibrating the action sequence. This obviously increases the workload and introduces human factors, making the action data less accurate. It also disrupts the interaction between the user and the system, reducing its naturalness. Continuous action segmentation and recognition are therefore required.
Traditional action recognition methods divide the task into several stages: feature extraction, feature integration, and feature classification. In recent years, deep learning methods have also appeared in the field of action recognition. The following sections introduce action feature extraction, action feature classification, and continuous action segmentation.
2. Research on action feature extraction
In human action recognition, extracting features that effectively express the action is critical, because the features directly determine the final recognition result. It is therefore necessary to analyze the specific situation: select different types of features according to video quality and application scenario, and choose different extraction methods according to the feature types selected. Commonly used features fall into the following categories: static features, dynamic features, spatio-temporal mixed features and descriptive features [2].
Static features, as the name implies, mainly describe the state of the human body when it is relatively still, such as size, contour, shape and edges, conveying the overall information of the body. For static features, pose estimation methods can be used. Carlsson et al. [3] performed action recognition by shape matching between key frames extracted from the action video and saved action prototypes; the shape information consists of edge data detected with a Canny edge detector. Shotton et al. [4] estimated the 3-dimensional spatio-temporal positions of human joints from depth images. Compared with RGB image features, joint points have the advantage of being insensitive to illumination.
Dynamic features include movement speed, movement direction and trajectory. Commonly used extraction methods include low-level tracking and optical flow computation, but the former is extremely error-prone, especially in complex scenes. In view of this, Efros et al. [5] divided the optical flow field into horizontal and vertical channels, further split each into positive and negative components, smoothed the resulting four channels with a Gaussian filter, and finally normalized them, realizing action recognition with this optical flow descriptor.
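The channel construction above can be sketched in a few lines. This is a minimal illustration assuming a precomputed optical flow field (fx, fy); the Gaussian smoothing is implemented as a separable convolution in plain NumPy rather than with any particular library routine.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(size) - size // 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth2d(img, kernel):
    """Separable 2-D smoothing: convolve rows, then columns ('same' size)."""
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, tmp)

def flow_channels(fx, fy, sigma=1.0):
    """Split a flow field (fx, fy) into four half-wave rectified,
    Gaussian-smoothed channels: rightward, leftward, downward, upward motion."""
    k = gaussian_kernel(sigma=sigma)
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    return [smooth2d(c, k) for c in channels]
```

Half-wave rectification keeps each channel nonnegative, so opposite motion directions do not cancel under smoothing.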
Spatio-temporal features are commonly extracted with image processing methods. Davis et al. [6] connect static frames into an action sequence and then subtract the background from the dynamic sequence to form a static image characterizing a certain type of action. The static image can be a motion energy image or a motion history image, from which action recognition is finally performed.
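A minimal sketch of these two static images, assuming a list of binary motion masks (e.g. from frame differencing) as input; the linear decay of the history image follows the standard Davis-Bobick formulation.

```python
import numpy as np

def motion_history_image(masks, tau):
    """Motion History Image: pixels set to tau where the current mask is on,
    otherwise decayed by 1 per frame (floored at 0). Recent motion is brightest."""
    h = np.zeros_like(masks[0], dtype=float)
    for m in masks:
        h = np.where(m > 0, float(tau), np.maximum(h - 1.0, 0.0))
    return h

def motion_energy_image(masks):
    """Motion Energy Image: the union of all motion masks (where motion occurred)."""
    return (np.sum(masks, axis=0) > 0).astype(float)
```

The MHI encodes when motion happened (via intensity), the MEI only where, so the pair characterizes both the spatial extent and the temporal order of an action.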
Descriptive features mainly use machine learning methods. Yao et al. [7] took atomic actions,
objects and postures as descriptive features.
In addition, automatic feature extraction using neural networks has also appeared. Hinton et al. [8]
proposed a deep belief network consisting of multiple layers of restricted Boltzmann machines. The
RBM model learns the expression of features in an unsupervised manner, and the learning and training
process is very efficient. Inspired by Hinton [9], Lei Jun [10] designed a convolution restricted Boltzmann
machine to express the statistical structure of the samples in the tracking scene. In the CRBM model,
several filters are learned from the data. These filters are actually equivalent to local feature detectors.
When the amount of training data is small, the CRBM model can still be effectively trained and
produce discriminative features.
3. Research on action recognition method
Commonly used action recognition methods include template-based methods, probability-statistics-based methods, and grammar-based methods [2]. The template-based method is intuitive and simple: it judges the action category by comparing the similarity between the target to be recognized and a template, but for that reason it lacks robustness. Ji et al. [11] used dynamic time warping (DTW) to compute the similarity between the action to be recognized and the actions in an action library.
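Dynamic time warping of this kind can be sketched as follows; this is the generic textbook DTW with a Euclidean frame cost, not the exact formulation of [11].

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature
    vectors a (n x d) and b (m x d); 1-D sequences are treated as d = 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path may stretch or compress either sequence, two executions of the same action at different speeds still yield a small distance.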
Probabilistic statistical models represent an action as a continuous sequence of states, with the transitions between states described by a time transfer function. Shi et al. [12] adopted a semi-Markov model and proposed a Viterbi-like dynamic programming algorithm to segment and recognize continuous actions simultaneously. Liu Fen [13] used Kinect sensors to generate human action depth maps, built a three-dimensional human model, used the angles and modulus ratios of action vectors as feature vectors, and classified human actions with an SVM classifier. The classification diagram of a linear support vector machine is shown in figure 1.
Figure 1. Classification chart of linear support vector machine.
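The angle and modulus-ratio features described above, together with the decision side of a trained linear SVM (cf. figure 1), can be sketched as follows. The joint layout (shoulder, elbow, wrist) and the weights passed to the decision function are hypothetical, for illustration only.

```python
import numpy as np

def limb_features(shoulder, elbow, wrist):
    """Joint-angle and modulus-ratio features for one arm, in the spirit of
    using angles and length ratios of 'action vectors' between joints.
    Inputs are 3-D joint coordinates (hypothetical joint layout)."""
    upper = np.asarray(elbow, float) - np.asarray(shoulder, float)  # shoulder -> elbow
    fore = np.asarray(wrist, float) - np.asarray(elbow, float)      # elbow -> wrist
    cos_angle = np.dot(upper, fore) / (np.linalg.norm(upper) * np.linalg.norm(fore))
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))    # elbow angle
    ratio = np.linalg.norm(fore) / np.linalg.norm(upper)            # modulus ratio
    return np.array([angle, ratio])

def linear_svm_decision(x, w, b):
    """Decision side of an already-trained linear SVM: sign of w.x + b."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```

In practice the feature vector concatenates such angle/ratio pairs over many joints and frames before classification.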
The grammar-based method describes a human action as a series of symbols, each representing an atomic-level action. Recognition proceeds by first identifying the atomic actions and then combining them according to grammar rules.
In recent years, with the continuous development of deep learning, it has gradually entered the field of action recognition. Yu Hua [14] extracts features with an improved DPM algorithm and classifies actions with a CNN model trained by gradient optimization. Li Ting [15] used an improved L-K optical flow method with convolution kernels to extract features for continuous action recognition, and recognized actions with a mixed 3D CNN and SVM model. Article [16] proposed a new action recognition method that processes video data with a convolutional neural network (CNN) and a deep bidirectional LSTM (DB-LSTM) network. Deep features are extracted from every sixth frame of the video, and the DB-LSTM network learns the sequence information between frame features. In the forward and backward passes of the DB-LSTM, multiple layers are stacked to increase its depth. The model framework is shown in figure 2.
Figure 2. Model framework.
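The forward/backward idea behind DB-LSTM can be illustrated with a plain RNN cell (an LSTM cell is substituted for brevity); this sketch shows only how the two directional passes are run and their hidden states concatenated per time step, not the full model of [16].

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """One directional vanilla-RNN pass over a sequence xs (T x d_in);
    returns the hidden state at every time step (T x d_hidden)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        out.append(h)
    return np.stack(out)

def bidirectional_layer(xs, params_f, params_b):
    """Bidirectional layer: run the sequence forward and backward with
    separate parameters and concatenate the two hidden states per step."""
    hf = rnn_pass(xs, *params_f)
    hb = rnn_pass(xs[::-1], *params_b)[::-1]  # backward pass, re-aligned in time
    return np.concatenate([hf, hb], axis=1)
```

The concatenated output at each frame thus sees both past and future context; stacking several such layers is what gives the network its depth.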
4. Research on continuous action segmentation
In recent years, domestic and foreign research on the recognition of single actions has made important progress. In most application scenarios, however, actions are not manually collected, labeled and segmented; more often they are complex continuous actions, so recognizing them is becoming increasingly important. For continuous action recognition, segmenting the collected action time series is an essential step, and the quality of the segmentation directly affects the final recognition result.
4.1. Segmentation method
According to the order of segmentation and recognition, methods can be divided into direct and indirect segmentation. The direct segmentation method segments the actions first and then recognizes them. For example, Bai Dongtian [17] splits continuous actions into sequences using prior knowledge, extracts skeleton information as action features for each sequence, and recognizes individual human actions with a hidden Markov model, obtaining the best recognition results through a dynamic programming algorithm and a threshold model. One drawback of this approach is that segmenting the continuous sequence with prior knowledge can be inaccurate, which reduces recognition accuracy.
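A direct segmentation step can be sketched with a simple prior-knowledge rule: split the sequence wherever per-frame motion energy drops below a threshold (a rest pose). This stand-in rule is illustrative only, not the exact criterion used in [17].

```python
import numpy as np

def segment_by_rest(energy, thresh, min_len=2):
    """Split a continuous sequence into action segments at frames whose
    motion energy falls below `thresh`. Returns (start, end) index pairs;
    segments shorter than `min_len` frames are dropped as noise."""
    active = np.asarray(energy) >= thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # action begins
        elif not a and start is not None:  # action ends at a rest frame
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments
```

Each returned segment can then be passed to a single-action recognizer, which is exactly the segment-then-recognize pipeline of the direct methods.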
The indirect segmentation method recognizes actions while segmenting them. For example, Mao Yijie [18] studied a segmentation and recognition scheme based on the classification confidence of a support vector machine: a method for computing the classification confidence of a multi-class SVM is proposed, and a sliding window is used to obtain the action starting point and category.
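The sliding-window scheme can be sketched generically as follows; the `classify` callable is a user-supplied stub standing in for the SVM confidence computation of [18].

```python
def sliding_window_recognize(seq, classify, win=4, step=2, conf_thresh=0.8):
    """Indirect segmentation: classify each sliding window over the
    continuous sequence and keep the windows whose classifier confidence
    exceeds a threshold. `classify(window)` returns (label, confidence)."""
    detections = []
    for start in range(0, len(seq) - win + 1, step):
        label, conf = classify(seq[start:start + win])
        if conf >= conf_thresh:
            detections.append((start, start + win, label))
    return detections
```

Segmentation here is a by-product of recognition: the accepted windows mark both where an action occurs and what it is, so no separate segmentation pass is needed.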
4.2. Segmentation model
Action segmentation is a kind of sequence analysis, which is often solved with probabilistic graphical models, among which the HMM is the most widely used. The HMM was first applied in the field of speech recognition and has recently been applied to action recognition as well. HMMs can effectively handle spatial and temporal variation, but they require large training sets [1].
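The core HMM decoding step, recovering the most likely hidden state sequence (e.g. per-frame action labels) from observations, is the standard Viterbi algorithm, sketched here for a discrete-observation HMM:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding for a discrete HMM.
    pi: initial state probs (S,), A: transitions (S x S),
    B: emissions (S x V), obs: list of observation indices.
    Returns the most likely hidden state path."""
    S, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])      # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)             # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)        # scores[i, j]: best path i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], range(S)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                  # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

When the hidden states are action labels, the decoded path directly yields a joint segmentation and recognition of the continuous sequence.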
Therefore, combining HMMs with today's popular deep learning techniques, many researchers use HMMs and neural networks together for continuous action segmentation and recognition. Luo Xiaoyu [19] detects initial segmentation points with a sliding-window method, recognizes individual actions within continuous actions using deep belief networks and hidden Markov models, and optimizes the initial segmentation points with dynamic programming to achieve continuous action segmentation and recognition. Lei Jun [10] combines a CNN with an HMM, uniting the CNN's feature learning ability with the HMM's sequence modeling ability while enabling training on weakly labeled samples: for weakly labeled continuous action videos, the HMM estimates the implicit sequence of action categories, which serves as label information to train the CNN.
In addition, the conditional random field model is also widely used. Lei Jun [20] proposed a convolutional neural network and latent-dynamic conditional random field (LDCRF) model for continuous action recognition in video. The method designs a three-dimensional CNN that learns action features automatically from the raw video data, and models the continuous action with the LDCRF, completing both continuous action recognition and segmentation. The model framework is shown in figure 3.
Figure 3. CNN-LDCRF model framework.
5. Conclusion
Continuous action recognition is a challenging subject. With the development of computer technology, many achievements have been made. The current abundance of video data lays the foundation for the large datasets that deep learning requires, so more research on continuous action recognition using deep learning techniques can be expected in the future.
Acknowledgments
Thanks to my mentors Cheng Xinrong, Wang Qing and Chen Hong for their valuable suggestions on this article.
References
[1] Huang, Y. H., Ye, S. Z. (2008) Summary of Continuous Action Segmentation. In: The 14th
National Conference on Image Graphics. Fuzhou.
[2] Hu, Q., Qin, L., Huang, Q.M. (2013) Overview of human action recognition based on vision.
Chinese Journal of Computers.
[3] Carlsson, C., Carlsson, S., Sullivan, S. (2001) Action recognition by shape matching to key
frames. In: Proceedings of the Workshop on Models Versus Exemplars in Computer Vision.
Colorado. pp. 1-8.
[4] Shotton, J., Fitzgibbon, A., Sharp, T., et al. (2011) Real-time human pose recognition in parts from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs. pp. 1297-1304.
[5] Efros, A. A., Berg, A. C., Mori, G., Malik, J. (2003) Recognizing action at a distance. In: Proceedings of the 9th IEEE International Conference on Computer Vision. Nice. pp. 726-
[6] Davis, J. W., Bobick, A. F. (1997) The representation and recognition of action using temporal
templates. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. San Juan. pp. 928-934.
[7] Yao, B., Jiang, X., Khosla, A., et al. (2011) Human action recognition by learning bases of
action attributes and parts. In: Proceedings of the IEEE International Conference on
Computer Vision. Barcelona. pp. 1331-1338.
[8] Hinton, G. E., Osindero, S., Teh, Y. W. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18 (7): 1527-1554.
[9] Hinton, G. E., Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, 313 (5786): 504-507.
[10] Lei, J. (2017) Research on continuous action recognition method combining deep network and
probability graph. National University of Defense Technology.
[11] Ji, R., Yao, H., Sun, X. (2011) Actor-independent action search using spatial temporal
vocabulary with appearance hashing. Pattern Recognition, 44 (3): 624-638.
[12] Shi, Q., Cheng, L., Wang, L., et al. (2011) Human action segmentation and recognition using
discriminative semi-Markov models. International Journal of Computer Vision, 93 (1): 22-32.
[13] Liu, F., Wu, Z. P. (2019) A Kind of Human Action Recognition Algorithm Based on Kinect and
SVM. Modern Computer, 18: 55-58.
[14] Yu, H., Zhi, M. (2019) Human action recognition based on convolutional neural network.
Computer Engineering and Design, 40 (04): 1161-1166.
[15] Li, T. (2018) Human body continuous action recognition based on 3D CNN. Harbin Institute of Technology.
[16] Ullah, A., Ahmad, J., Muhammad, K., et al. (2017) Action Recognition in Video Sequences
using Deep Bi-Directional LSTM with CNN Features. IEEE Access,6: 1155-1166.
[17] Bai, D.T. (2016) Static gesture and continuous action recognition of upper limbs based on
KINECT. Beijing Institute of Technology.
[18] Mao, Y. J. (2017) Research on continuous action recognition of human body based on kinect.
University of Electronic Science and Technology of China.
[19] Luo, X. Y. (2018) Human action recognition based on DBN-HMM. Xi'an University of
[20] Lei, J., Li, G., Li, S., Tu, D., Guo, Q. (2016) Continuous action recognition based on hybrid
CNN-LDCRF model. In: 2016 International Conference on Image, Vision and Computing.
Portsmouth. pp.63-69.
We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.