Evaluation of Local Spatio-temporal Salient Feature Detectors
for Human Action Recognition
Amir H. Shabani (1,2), David A. Clausi (1), John S. Zelek (2)
Vision and Image Processing Lab. (1), Intelligent Systems Lab. (2)
University of Waterloo, ON, Canada N2L 3G1
{hshabani,dclausi,jzelek}@uwaterloo.ca
Abstract
Local spatio-temporal salient features are used for a
sparse and compact representation of video contents in
many computer vision tasks such as human action recog-
nition. To localize these features (i.e., key point detection),
existing methods perform either symmetric or asymmetric
multi-resolution temporal filtering and use a structural or
a motion saliency criterion. In a common discriminative
framework for action classification, different saliency cri-
teria of the structured-based detectors and different tempo-
ral filters of the motion-based detectors are compared. We
have two main observations. (1) The motion-based detec-
tors localize features which are more effective than those
of structured-based detectors. (2) The salient motion fea-
tures detected using an asymmetric temporal filtering per-
form better than all other sparse salient detectors and dense
sampling. Based on these two observations, we recommend
the use of asymmetric motion features for effective sparse
video content representation and action recognition.
1 Introduction
Local spatio-temporal salient features have been widely
used for sparse and compact representations of video con-
tent in many computer vision applications such as human
action recognition [1, 2, 3, 4], video super-resolution [5],
unusual event detection [6], human-computer interac-
tion [7], and content-based video retrieval [8]. These fea-
tures are typically localized in spatio-temporal key points
where a sudden change in both space and time occurs. For
example, 3D Harris corners occur when a spatially salient
structure such as a corner changes its motion direction. This
detector thus localizes the start/stop of local motion events
in the video.
The salient features in a video represent the local video
events which occur at different spatial and/or temporal
scales. The spatial scale refers to the size of the body limbs
or the subject as a whole, which might vary across individuals and with distance from the camera. The temporal scale refers to the fact that different people perform a given (sub-)action at different speeds. In the absence of any knowledge about these scales, a multi-scale analysis of the video
signal is required to detect the features at different spatio-
temporal scales. Effective feature detection is important for
compact video content representation and consequently, ac-
tion recognition, for example.
Detection of multi-scale salient features consists of three
main steps: (1) spatio-temporal scale-space representation
of the video signal, (2) saliency map construction, and
(3) non-maxima suppression. Existing spatio-temporal fea-
ture detectors can be divided into two main categories:
structured-based or motion-based. The structured-based
feature detectors such as 3D Harris [9, 1] and 3D Hes-
sian [3] are more selective of salient structures for which
they incorporate different saliency criteria, but are limited
to using just a symmetric 3D Gaussian filtering for a scale-
space representation. The motion-based feature detectors
such as Cuboids [4] and asymmetric motion features [2] lo-
calize the salient motion events in a video by treating the
time domain differently from space and hence, they are more
consistent with human motion perception [10, 11, 12].
This paper evaluates the performance of both structured-
based and motion-based feature detectors in a common
framework for action recognition. To perform a fair com-
parison, we employ the standard discriminative bag-of-
words action recognition framework [1] in which these fea-
tures are utilized to learn the set of action prototypes and
represent the action contents. Our objective is to find out
which feature detector is the most effective method for
video representation in an action classification application.
The rest of this paper is organized as follows. Section 2
reviews the existing salient spatio-temporal feature detec-
tors in video. Section 3 categorizes the existing detectors
into structural-based and motion-based detectors. In this
section, we briefly explain several examples from each cat-
egory. Section 4 presents the human action recognition
datasets, the evaluation framework, and the experimental re-
sults for performance evaluation of the different detectors.
Finally, Section 5 summarizes the results.
2 Related work
Drawing inspiration from the usefulness of local multi-
scale salient features for object recognition [13, 14], an im-
mediate extension has been developed for spatio-temporal
feature extraction for action recognition and for video anal-
ysis in general.
To extend 2D salient features to video, most existing methods consider the sequence of images (2D+t) as a 3D object. As the 2D feature detectors mainly select salient structures in a still image, their extensions to 3D are considered structured-based feature detectors. For example,
Laptev et al. [9] extended the 2D Harris corner detector to
3D by performing 3D Gaussian filtering and computing the
cornerness saliency criteria for a 3D autocorrelation matrix.
In contrast, a few methods treat the time domain differently from space and detect motion-based salient features. For example, Dollar et al. [4] performed symmetric
temporal Gabor filtering to detect salient motions referred
to as Cuboids. Shabani et al. [12] used the difference of
Poisson as the time-causal filter to detect opponent-based
motion features. Recently, Shabani et al. [2] proposed a
novel asymmetric multi-resolution temporal filtering to de-
tect asymmetric motion features.
Our objective in this paper is to provide a fair and
complete comparison of both structured-based and motion-
based feature detectors for action classification. There are
two very close publications to our evaluations in this paper.
(1) Wang et al. [15] compared the performance of different
sparse spatio-temporal key point detectors and dense sam-
pling for human action classification. The dense samples
detected at regular 3D points performed better on more re-
alistic datasets such as UCF sports [16] and HOHA [17]
which are collected from Youtube and Hollywood movies,
respectively. However, it did not perform the best cat-
egorization of choreographed atomic actions in the KTH
dataset [18]. The authors then concluded that the dense
sampling method performs better on the real-world videos,
but not on simple videos. (2) Shabani et al. [12, 2] evalu-
ated the importance of temporal filtering in salient motion-
based feature detection. They showed that the asymmetric
temporal filters result in detection of motion features with
higher precision rate and higher robustness under geomet-
ric changes such as camera view change or affine transfor-
mation [2]. Moreover, the asymmetric motion features [2]
provide higher classification accuracy compared to features
detected using a symmetric temporal filter such as Gabor
(i.e, Cuboids [4]).
This paper can be considered as an extension to the
existing spatio-temporal feature detectors evaluation pa-
pers. More specifically, in addition to motion-based fea-
tures in [2], we also include the structured-based features
in the comparison. Moreover, we set the number of spatio-
temporal scales fixed for all the detectors. This is in con-
trast with the evaluation in [15] in which different feature
types are detected at different sets of spatio-temporal scales
(e.g., Cuboids at one scale, 3D Harris at twelve scales) and
as a result, the comparison is not consistent. Performing
this complete evaluation determines the most effective fea-
ture detection method for action recognition. To this end,
we use the standard discriminative bag-of-words recogni-
tion approach for the action classification. In this frame-
work, the action primitives are learnt using the salient fea-
tures of all the samples in the training set. An action is then
represented globally as the frequency histogram of the ap-
pearance of the local features in the whole video.
3 Salient Feature Detectors
Existing spatio-temporal salient feature detectors can be
categorized into two sets depending on whether detection of
a salient structure is of interest or the detection of a salient
motion is relevant. The differences lie in the type of video filtering and the saliency criterion they use. Video filtering at different spatio-temporal scales provides a multi-resolution representation of the video content from which features at different scales are detected. The saliency criterion determines which type of features will be chosen within their local spatio-temporal neighborhood. The salient features are localized at key points which are detected by performing non-maxima suppression in a (3 x 3 x 3) search window [2].
In this section, we briefly explain different examples of
both structured-based and motion-based feature detectors.
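As a concrete illustration of the last step above, the following minimal sketch (in Python, assuming a precomputed saliency volume indexed (t, y, x) and a free threshold parameter) performs the 3 x 3 x 3 non-maxima suppression shared by all the detectors below:

import numpy as np
from scipy.ndimage import maximum_filter

def local_maxima_3d(saliency, threshold=0.0):
    """Key points as local maxima of a (t, y, x) saliency volume within a
    3x3x3 neighborhood (non-maxima suppression); `threshold` is an assumed
    free parameter, not a value from the paper."""
    # A voxel survives if it equals the maximum of its 3x3x3 window
    # and exceeds the threshold.
    window_max = maximum_filter(saliency, size=(3, 3, 3), mode="nearest")
    keep = (saliency == window_max) & (saliency > threshold)
    return np.argwhere(keep)   # array of (t, y, x) key-point coordinates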
3.1 Structured-based features
To detect spatio-temporal structured-based features, ex-
isting methods treat the time domain as the third dimension
of space and hence, they apply the same scale-space filter
in space and time directions. That is, similar to the spa-
tial Gaussian filtering, a temporal Gaussian is applied in the
time direction [1, 3].
In this section, we briefly explain the extension of Harris
corners and Hessian blobs to 3D which have been already
used in action recognition literature [1, 3]. We also intro-
duce the extension of 2D KLT (Kanade-Lucas-Tomasi) [19]
features to 3D for action recognition.
Figure 1. Standard discriminative bag-of-words framework for action recognition. The focus of this
paper is to compare different salient feature detectors. These features encode the local video events
and are used to learn the set of action prototypes (i.e., visual words or action primitives) during
training. An action is represented by encoding its salient features over the prototypes. Finally, a
classifier such as SVM determines the label of an unknown action.
3.1.1 3D Harris
Laptev et al. [9, 1] extended the Harris corner criterion from 2D images to 3D to extract corresponding points in a video sequence. To this end, the original video signal I(x, y, t) is smoothed with a spatial Gaussian G_σ and a temporal Gaussian kernel G_τ using the convolution L = G_σ ∗ G_τ ∗ I. The autocorrelation matrix A = L_d^T × L_d is then computed from the spatio-temporal derivative vector L_d = [L_x, L_y, L_t]. To compare each pixel to its neighborhood, a spatio-temporal Gaussian weighting G_2σ G_2τ is then applied:

M = G_2σ G_2τ ∗ A = G_2σ G_2τ ∗ [ L_x^2     L_x L_y   L_x L_t
                                  L_y L_x   L_y^2     L_y L_t
                                  L_t L_x   L_t L_y   L_t^2  ]        (1)

The autocorrelation matrix M defines the second-moment approximation of the local distribution of the gradients within a spatio-temporal neighborhood. Using the eigenvalues λ_1, λ_2, λ_3 of the Harris matrix M, one can compute the spatio-temporal corner map C in which the corners are magnified and the rest are weakened (k = 0.0005):

C = det(M) − k trace^3(M) = λ_1 λ_2 λ_3 − k (λ_1 + λ_2 + λ_3)^3        (2)
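The following is a minimal sketch of the 3D Harris saliency computation of Eqs. (1)-(2); the video is assumed to be a float array indexed (t, y, x), and the use of separable Gaussian filtering and finite differences for the derivatives is an implementation choice, not the authors' code:

import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_saliency(video, sigma=2.0, tau=2.0, k=0.0005):
    """Sketch of the 3D Harris cornerness map C (Eqs. 1-2).
    `video` is a float array indexed (t, y, x); sigma/tau are the
    spatial/temporal scales in pixels/frames."""
    # Scale-space smoothing L = G_sigma * G_tau * I
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)

    # Second-moment (autocorrelation) entries, integrated with G_{2sigma} G_{2tau}
    w = (2 * tau, 2 * sigma, 2 * sigma)
    Mxx = gaussian_filter(Lx * Lx, w); Myy = gaussian_filter(Ly * Ly, w)
    Mtt = gaussian_filter(Lt * Lt, w); Mxy = gaussian_filter(Lx * Ly, w)
    Mxt = gaussian_filter(Lx * Lt, w); Myt = gaussian_filter(Ly * Lt, w)

    det_M = (Mxx * (Myy * Mtt - Myt ** 2)
             - Mxy * (Mxy * Mtt - Myt * Mxt)
             + Mxt * (Mxy * Myt - Myy * Mxt))
    trace_M = Mxx + Myy + Mtt
    return det_M - k * trace_M ** 3      # cornerness map C

Key points would then be obtained by applying the non-maxima suppression sketched in Section 3.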
3.1.2 3D Hessian
Willems et al. [3] extended 2D Hessian features to 3D by applying (an approximation of) 3D Gaussian filtering and used the determinant of the Hessian matrix (3) as the saliency criterion. The points with a high-valued determinant (S = |det(H)|) represent the centers of the ellipsoids (3D blob-like structures) in the video.

H = [ L_xx   L_xy   L_xt
      L_yx   L_yy   L_yt
      L_tx   L_ty   L_tt ]        (3)
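A corresponding sketch of the 3D Hessian saliency S = |det(H)| (Eq. 3) is shown below; exact Gaussian filtering and finite-difference second derivatives are used here in place of the box-filter approximation of Willems et al. [3]:

import numpy as np
from scipy.ndimage import gaussian_filter

def hessian3d_saliency(video, sigma=2.0, tau=2.0):
    """Sketch of the 3D Hessian blob saliency S = |det(H)| (Eq. 3)
    for a (t, y, x) video volume."""
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Second derivatives of the smoothed video (symmetry L_xy = L_yx is assumed)
    Ltt, Lty, Ltx = np.gradient(Lt)
    _, Lyy, Lyx = np.gradient(Ly)
    _, _, Lxx = np.gradient(Lx)
    det_H = (Lxx * (Lyy * Ltt - Lty ** 2)
             - Lyx * (Lyx * Ltt - Lty * Ltx)
             + Ltx * (Lyx * Lty - Lyy * Ltx))
    return np.abs(det_H)               # blob saliency map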
3.1.3 3D KLT
2D KLT features [19] have been widely used in many com-
puter vision tasks such as tracking and structure from mo-
tion [20]. The 3D KLT [21] extends its 2D counterpart and can be detected at multiple spatial and temporal scales. To this end, a family of scale-space representations of the video is obtained by performing 2D spatial Gaussian filtering G_σ and temporal Gaussian filtering G_τ. At each scale, the 3D KLT saliency criterion is applied to the 3D autocorrelation matrix A to keep the points whose minimum eigenvalue is above a threshold (i.e., min(λ_1, λ_2, λ_3) > α). The 3D KLT features are then localized at points with maximum saliency value in their spatio-temporal neighborhood.
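A minimal sketch of the 3D KLT saliency follows; the Gaussian integration window (2σ, 2τ), mirroring the 3D Harris sketch above, is an assumption, as the exact weighting is not specified here:

import numpy as np
from scipy.ndimage import gaussian_filter

def klt3d_saliency(video, sigma=2.0, tau=2.0):
    """Sketch of the 3D KLT saliency: the smallest eigenvalue of the 3x3
    autocorrelation matrix A at every voxel (points with this value above
    a threshold alpha are kept)."""
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    w = (2 * tau, 2 * sigma, 2 * sigma)   # assumed integration window
    # Smoothed entries of the symmetric structure tensor
    Axx = gaussian_filter(Lx * Lx, w); Ayy = gaussian_filter(Ly * Ly, w)
    Att = gaussian_filter(Lt * Lt, w); Axy = gaussian_filter(Lx * Ly, w)
    Axt = gaussian_filter(Lx * Lt, w); Ayt = gaussian_filter(Ly * Lt, w)
    # Assemble a (..., 3, 3) tensor and take the minimum eigenvalue per voxel
    A = np.stack([np.stack([Axx, Axy, Axt], -1),
                  np.stack([Axy, Ayy, Ayt], -1),
                  np.stack([Axt, Ayt, Att], -1)], -2)
    return np.linalg.eigvalsh(A)[..., 0]   # min(lambda_1, lambda_2, lambda_3)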
3.2 Motion-based features
The motion-based feature detectors perform the
biologically-consistent Gaussian filtering for space, but
they might use different temporal filters. The temporal
filter might be symmetric or asymmetric. More specifically,
consistency with the human’s motion perception [10, 11]
requires that the response to a periodic motion be mapped
to a constant value. Moreover, the mapped representation
should have the same value for two stimuli with different
phases, but same motion patterns (i.e., it should be phase
insensitive [11]). The filtering response should also be
contrast-polarity insensitive [10, 11] to make sure that this
representation is not sensitive to the polarity of the moving
stimuli versus background. For these phase and contrast-
polarity insensitivity requirements, an energy model which
induces quadrature-pair temporal filtering (i.e., two filters
with 90 degree phase difference) is required [11]. The
summation of the squared responses of the quadrature
filters induces the energy map from which the salient
motion features are detected.
In this section, we briefly explain different motion-based
feature detectors which use a symmetric or an asymmetric
temporal filter.
3.2.1 Cuboids (symmetric) motion features
Dollar et al. [4] used the energy field of temporal Gabor filtering on the spatially Gaussian-smoothed video, R = (G_σ ∗ F^even ∗ I)^2 + (G_σ ∗ F^odd ∗ I)^2, to extract the Cuboids centered at the spatio-temporal key points in the energy map. To detect the Cuboids at multiple scales, we performed the video filtering at different spatial (σ) and temporal (τ) scales, which will be introduced in Section 4.2. Note that the even component (4) and odd component (5) of the complex Gabor filter are 90 degrees out of phase and are essential to obtain a phase-insensitive motion map [12].

F^even_τ(t) = cos(ω_0 t) e^(−t^2 / (2τ^2))        (4)

F^odd_τ(t) = sin(ω_0 t) e^(−t^2 / (2τ^2))        (5)
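The sketch below computes the Cuboids energy map of Eqs. (4)-(5) with a temporal Gabor quadrature pair; the centre frequency ω_0 = 4/τ and the kernel truncation at +/- 3τ are assumed defaults, not values taken from this paper:

import numpy as np
from scipy.ndimage import gaussian_filter1d, convolve1d

def cuboid_energy(video, sigma=2.0, tau=2.0, omega0=None):
    """Sketch of the Cuboids motion-energy map R: spatial Gaussian
    smoothing followed by a temporal quadrature pair of Gabor filters."""
    if omega0 is None:
        omega0 = 4.0 / tau                  # assumed coupling of frequency and scale
    # Spatial smoothing only (y and x axes); the time axis is untouched here
    S = gaussian_filter1d(gaussian_filter1d(video, sigma, axis=2), sigma, axis=1)
    # 1D temporal Gabor quadrature pair, truncated at +/- 3*tau
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    env = np.exp(-t ** 2 / (2.0 * tau ** 2))
    f_even = np.cos(omega0 * t) * env
    f_odd = np.sin(omega0 * t) * env
    even = convolve1d(S, f_even, axis=0, mode="nearest")
    odd = convolve1d(S, f_odd, axis=0, mode="nearest")
    return even ** 2 + odd ** 2             # phase-insensitive energy map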
3.2.2 Asymmetric motion features
Both the Gaussian and the Gabor are biologically consistent for spatial image filtering, but they are symmetric and non-causal, which makes them inconsistent with the temporal sensitivity of the human visual system [10, 11] and implausible given the physiology of V1 cells [22]. With this motivation from biological vision, Shabani et al. [23, 12, 2] advocate the use of time-causal video filtering for salient feature detection. Extending the spatial scale-space filtering to time, but with the time-causality constraint, the authors developed a new time-causal multi-resolution temporal filter based on RC circuit theory [2]. The resulting filter is an asymmetric sinc filter K(t; τ) with a quadrature pair obtained from its convolution with the Hilbert transform (i.e., K_h(t; τ) = K(t; τ) ⋆ h(t), in which h(t) = 1/(πt) [10]).

K(t; τ) = sinc(tτ) S(t)        (6)

where S(t) denotes the Heaviside step function (i.e., S(t) = 1 for t ≥ 0 and zero otherwise). Note that the shape of
the asymmetric sinc kernel changes as a function of the tem-
poral scale τ. More specifically, at finer scales, the kernel is
more skewed towards the times before the peak of the ker-
nel. The shape change of the asymmetric sinc filter with the
scale increase results in detection of a wide range of motion
features from asymmetric to symmetric motions. The per-
formance comparison of the asymmetric sinc filtering with
two other asymmetric filters of truncated exponential and
Poisson [24] and with the symmetric Gabor [4] shows its
higher efficiency [2]. In fact, the features detected using
asymmetric sinc show higher precision rate, higher repro-
ducibility under different geometric variations, and higher
action classification rate [2]. From now on, we refer to these features as asymmetric motion features.
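A hedged sketch of the asymmetric motion-energy computation is given below. It assumes the unnormalized sinc convention sin(tτ)/(tτ) for Eq. (6), a finite causal support, and a simple amplitude normalization; the exact kernel construction and its scale behaviour follow [2], and the quadrature component is obtained with a discrete Hilbert transform:

import numpy as np
from scipy.signal import hilbert
from scipy.ndimage import gaussian_filter1d, convolve1d

def asymmetric_motion_energy(video, sigma=2.0, tau=2.0, support=32):
    """Sketch of the asymmetric (time-causal) motion-energy map of [2]:
    spatial Gaussian smoothing, then a quadrature pair built from the
    asymmetric sinc kernel K(t; tau) = sinc(t*tau) S(t) (Eq. 6) and its
    Hilbert transform.  Kernel support, normalization, and the sinc
    convention are assumptions of this sketch."""
    # Spatial smoothing (y, x axes only)
    S = gaussian_filter1d(gaussian_filter1d(video, sigma, axis=2), sigma, axis=1)
    # Time-causal asymmetric sinc kernel on t >= 0 (Heaviside step)
    t = np.arange(support, dtype=float)
    K = np.sinc(t * tau / np.pi)          # unnormalized sinc: sin(t*tau)/(t*tau)
    K /= np.abs(K).sum()                  # assumed amplitude normalization
    Kh = np.imag(hilbert(K))              # quadrature pair K * h, h(t) = 1/(pi t)
    # Note: a fully time-causal implementation would shift the kernel origin
    # so that only current and past frames contribute.
    even = convolve1d(S, K, axis=0, mode="nearest")
    odd = convolve1d(S, Kh, axis=0, mode="nearest")
    return even ** 2 + odd ** 2           # phase/polarity-insensitive energy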
We consider the performance comparison of three
structured-based features (3D Harris, 3D Hessian, and 3D
KLT) and two motion-based features (Cuboids and asym-
metric motion features) for action recognition.
4 Experimental setup and results
This section presents our common action classification
framework for feature evaluation, the data sets, and the ac-
tion classification results using different salient features.
4.1 Action classification framework
For action classification, we incorporated the standard
discriminative bag-of-words (BOW) setting [4, 15, 12] as
shown in Fig. 1. In this framework, salient features are
detected at multiple spatial and temporal scales (σ, τ) to localize video events of different scales. Action prototypes/primitives are then learnt using the standard vector quantization procedure, by clustering the features of all training samples into a fixed number of groups [15, 2]. The clustering is performed by running the K-means algorithm 10 times with random seed initialization and keeping the result with the lowest error [15]. The clusters represent action primitives and are referred to as visual words in the BOW framework. Most existing methods select the number of clusters experimentally [15, 2], typically in the order of 1000. To obtain better statistics, we vary the number of clusters from 500 to 1500 in steps of 100 and report the average classification results.
By assigning each salient feature to its closest cluster (i.e., visual word), a global representation of an action is the frequency histogram of the appearance of the features in the whole video clip. The L1-normalized frequency histogram is finally considered as the compact signature of the action. Finally, a nonlinear SVM (support vector machine) with a radial basis function (RBF) kernel is utilized for the matching of action signatures (S_i, S_j). The parameter γ of the RBF kernel (7) is learnt through cross-validation using the LibSVM toolbox [25].

K_RBF(S_i, S_j) = e^(−γ |S_i − S_j|^2)        (7)
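A minimal sketch of this BOW pipeline is shown below, assuming the local descriptors have already been extracted per clip; scikit-learn's KMeans and SVC are used here as stand-ins for the K-means/LibSVM setup described above, and the cross-validation of γ is omitted for brevity:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_signature(descriptors, kmeans):
    """L1-normalized frequency histogram of visual-word assignments for
    one video clip (its compact action signature)."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_action_classifier(train_descs, train_labels, n_words=1000):
    """train_descs: list of (n_i x d) descriptor arrays, one per training clip;
    train_labels: list of integer action labels (assumed inputs)."""
    # Vocabulary: k-means restarted 10 times, best run kept (lowest inertia)
    kmeans = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(train_descs))
    X = np.array([bow_signature(d, kmeans) for d in train_descs])
    # RBF kernel K(S_i, S_j) = exp(-gamma |S_i - S_j|^2); gamma would be
    # chosen by cross-validation in the full setup.
    svm = SVC(kernel="rbf", gamma="scale").fit(X, train_labels)
    return kmeans, svm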
4.2 Spatio-temporal scales
To have a fair comparison, all the feature detectors use the same spatial scales (σ_x = σ_y = σ_i, in pixels) and temporal scales (τ_i, in frames) in their scale-space video filtering. We consider the combination of three spatial scales and three temporal scales, set according to the formula 2(√2)^i, i ∈ {0, 1, 2} [14, 2]. Note that the minimum spatial scale σ_0 = 2 pixels determines the maximum spatial frequency of 0.5 cycles/pixel (i.e., the highest spatial resolution is one cycle every two pixels). With 25 frames per second, the maximum temporal frequency of 12.5 cycles/sec is obtained at τ = 2. After video filtering at each spatio-temporal scale, the saliency map is computed using the corresponding saliency criterion of each feature detector (e.g., the cornerness (2) for the 3D Harris). To detect the key points and hence localize the salient features, we perform non-maxima suppression in a spatio-temporal search window of (3 x 3 x 3). To describe the motion and appearance of each detected feature, we use the 3D SIFT descriptor [26], which has shown promising performance in encoding the spatio-temporal histogram of oriented gradients over the feature's extent (6σ x 6σ x 6τ) [26].
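For reference, the scale set and descriptor supports implied by this section can be enumerated as follows (assuming the 2(√2)^i progression reconstructed above):

import numpy as np

# Three spatial and three temporal scales: 2*(sqrt(2))^i for i = 0, 1, 2
scales = 2.0 * np.sqrt(2.0) ** np.arange(3)        # approx. [2.0, 2.83, 4.0]
spatio_temporal_scales = [(s, t) for s in scales for t in scales]   # 9 (sigma, tau) pairs
# Each detector filters the video at every (sigma, tau) pair, builds its saliency
# map, and localizes key points by 3x3x3 non-maxima suppression; a 3D SIFT
# descriptor is then computed over a (6*sigma x 6*sigma x 6*tau) support.
descriptor_supports = [(6 * s, 6 * s, 6 * t) for (s, t) in spatio_temporal_scales]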
4.3 Datasets
Three benchmark human action recognition datasets
have been used for the performance evaluation of different
detectors.
The KTH data set [18] consists of six actions (running,
boxing, walking, jogging, hand waving, and hand clapping)
with 600 choreographed video samples. Twenty-five differ-
ent subjects perform each action in four different scenarios:
indoors, outdoors, outdoors with scale change (fast zoom-
ing in/out) and outdoors with different clothes. According
to the initial citation [18], the video samples are divided into
a test set (9 subjects: 2,3,5,6,7,8,9,10, and 22) and a training
set (the remaining 16 subjects).
The UCF Sports dataset [16] includes actions such as
diving, golf swing, kicking, lifting, riding horse, run, skate
boarding, swing baseball, and walk with 150 video sam-
ples collected from the Youtube website. This dataset is
more challenging due to diverse ranges of views and scene
changes with moving camera, clutter, and partial occlusion.
A horizontally flipped version of each video is also used
during training to increase the number of data samples [15]. Two versions of this dataset have been used in the literature. The original authors [16] categorized this dataset into 9 classes, but recent publications [27, 28, 29, 15] split the samples of the "swing" category into two categories of "swing on the pommel horse" and "swing at the high bar". We use a leave-one-out protocol (without considering the flipped samples for testing) and report our results for both class splits.
The Hollywood dataset [30] consists of eight human ac-
tions (answer phone, get out of the car, hand shake, hug a per-
son, kiss, sit down, sit up, and stand up) from 32 Holly-
wood movies. The dataset is divided into a test set obtained
from 20 movies and the (clean) training set obtained from
12 movies different from the test set. There are 219 sample
videos in the training set and 211 samples in the test set.
4.4 Action recognition results
Figure 2 shows the 2D projection of different spatio-temporal salient features on sample frames of a "diving" action from the UCF sports dataset [16]. In this video, the camera follows the athlete, which is a challenge for any local salient feature detector. In fact, the salient feature detectors do not perform any background subtraction for video segmentation and hence, as a result, they find some features in the background (i.e., false positives). In this sample video, as can be seen, among the structured-based features, the 3D Hessian has fewer false positive detections than the 3D Harris and the 3D KLT detectors. In contrast, the motion-based detectors detect most of the features from the moving limbs. Among all the detectors, the asymmetric motion features are more localized on the moving limbs with far fewer false positives from the background. Note that one could perform camera motion compensation to reduce the false detections, but standard velocity-adaptation approaches such as the Galilean transformation are computationally expensive and have not been successful in the feature detection literature [1].
Table 1 presents the average classification accuracy
of using different detectors on three different benchmark
datasets. As can be seen, in all three datasets, the motion-
based salient features perform better than the structured-
based salient features. Among the structured-based salient
features, the 3D Harris [9] and 3D KLT perform better than
3D Hessian [3]. Among the motion-based features, the
asymmetric motion features [2] provide higher classifica-
tion accuracy than the symmetric Cuboids [4]. These results
support the importance of using motion-based features for
video content representation and more specifically, the use
of asymmetric temporal filtering to extract a wider range of
motions from asymmetric to symmetric at different scales.
4.5 Comparison with other methods
Table 2 presents the classification rate of using asymmet-
ric motion features and other published methods on three
different datasets. As can be seen, the asymmetric fea-
tures provide the highest accuracy on both the UCF and
HOHA datasets. On the KTH dataset, our 93.7% accuracy is comparable with the 94.2% accuracy [29] obtained using joint dense trajectories and dense sampling, which require much more computation time and memory compared to our sparse features. In a comparable setting with [15], the asymmetric motion features perform better than the other salient features and dense sampling.

Figure 2. 2D projection of different multi-scale local salient features on sample frames of a "diving" action from the UCF sports dataset [16]. From top row to bottom row, the features are (a) 3D Harris, (b) 3D Hessian, (c) 3D KLT, (d) Cuboids, and (e) asymmetric motion features. Among all the detectors, the asymmetric motion features are more localized on the moving body limbs with far fewer false positives from the background.
5 Conclusion
In a common discriminative framework for action clas-
sification, we compared different salient structured-based
and motion-based feature detectors. For performance eval-
uation, we used three benchmark human action recogni-
tion datasets of the KTH, UCF sports, and Hollywood. In
all three datasets, the motion-based features provide higher
classification accuracy than the structured-based features.
More specifically, among all of these sparse feature detec-
tors, the asymmetric motion features perform the best as
they capture a wide range of motions from asymmetric to
symmetric. With much less computation time and memory
usage, these sparse features also provide higher classification accuracy than dense sampling. Based on our experimental results, we recommend the use of asymmetric motion filtering for effective salient feature detection, sparse video content representation, and consequently, action classification.
Table 1. Average classification accuracy on different datasets using the features detected by different methods. The accuracy variation is on the order of 0.01 and is not reported here. Note that motion-based detectors perform better than structured-based detectors on all three datasets. Moreover, the asymmetric features provide higher classification accuracy than the symmetric Cuboids features.

Dataset                        | Structured-based features          | Motion-based features
                               | 3D Harris | 3D Hessian | 3D KLT    | Cuboids | Asymmetric
KTH [18]                       | 63.5 %    | 67.5 %     | 68.2 %    | 89.5 %  | 93.7 %
UCF sports [16], 9 classes     | 72.8 %    | 70.6 %     | 72.6 %    | 73.3 %  | 91.7 %
UCF sports [16], 10 classes    | 73.9 %    | 70.2 %     | 72.5 %    | 76.7 %  | 92.3 %
HOHA [30]                      | 58.1 %    | 57.3 %     | 58.9 %    | 60.5 %  | 62 %
Table 2. Comparison of different published methods for human action classification on the KTH [18], the UCF sports [16], and the HOHA [30] datasets. In this table, the bold italic items show the original protocol of the dataset introduced by their corresponding authors.

Method                  | KTH     | UCF sports (9 classes) | UCF sports (10 classes) | HOHA
Laptev et al. [30]      | -       | -                      | -                       | 38.4 %
Schuldt et al. [18]     | 71.7 %  | -                      | -                       | -
Rodriguez et al. [16]   | 86.7 %  | 69.2 %                 | -                       | -
Shabani et al. [2]      | 93.3 %  | 91.5 %                 | -                       | -
Wang et al. [29]        | 94.2 %  | -                      | 88.2 %                  | -
Wang et al. [15]        | 92.1 %  | -                      | 85.60 %                 | -
Willems et al. [3]      | 88.3 %  | -                      | 85.60 %                 | -
Asymmetric motions      | 93.7 %  | 91.7 %                 | 92.3 %                  | 62 %
Acknowledgment
The authors would like to thank both GEOIDE (Geo-
matics for Informed Decisions), supported by the Natural
Science and Engineering Research Council (NSERC) of
Canada, and the Ontario Centres of Excellence (OCE) for
financial support of this project.
References
[1] I. Laptev, B. Caputo, C. Schuldt, and T. Linde-
berg. Local velocity-adapted motion events for spatio-
temporal recognition. Computer Vision and Image
Understanding, pages 207–229, 2007.
[2] A. H. Shabani, D. A. Clausi, and J. S. Zelek. Im-
proved spatio-temporal salient feature detection for
action recognition. British Machine Vision Confer-
ence, Dundee, UK, Sep. 2011.
[3] G. Willems, T. Tuytelaars, and L. Van Gool. An effi-
cient dense and scale-invariant spatio-temporal inter-
est point detector. European Conference on Computer
Vision, Marseille, France, pages 650–663, Oct. 2008.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie.
Behavior recognition via sparse spatio-temporal fil-
ters. IEEE International Workshop VS-PETS, Beijing,
China, pages 65–72, Aug. 2005.
[5] O. Shahar, A. Faktor, and M. Irani. Space-time super-
resolution from a single video. IEEE Conference on
Computer Vision and Pattern Recognition, Colorado
Springs, CO, pages 3353–3360, June 2011.
[6] T. Zhang, S. Liu, C. Xu, and H. Lu. Boosted
multi-class semi-supervised learning for human action
recognition. Pattern Recognition, 44:2334–2342, 2011.
[7] R. Poppe. A survey on vision-based human action
recognition. Image and Vision Computing, 28:976–
990, 2010.
[8] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A
survey on visual content-based video indexing and re-
trieval. IEEE Transactions on Systems, Man, and Cy-
bernetics, 41(6):797–819, 2011.
[9] I. Laptev. On space-time interest points. International
Journal of Computer Vision, 64:107–123, 2005.
[10] A. B. Watson and A. J. Ahumada. Model of human
visual-motion sensing. Journal of Optical Society of
America, 2(2):322–342, 1985.
[11] E. H. Adelson and J.R. Bergen. Spatio-temporal en-
ergy models for the perception of motion. Optical So-
ciety of America, pages 284–299, 1985.
[12] A. H. Shabani, J.S. Zelek, and D.A. Clausi. Human
action recognition using salient opponent-based mo-
tion features. IEEE Canadian Conference on Com-
puter and Robot Vision, Ottawa, Canada, pages 362 –
369, May 2010.
[13] C. Harris and M.J. Stephens. A combined corner and
edge detector. Alvey Vision Conference, pages 147–
152, 1988.
[14] D. G. Lowe. Distinctive image features from scale-
invariant key points. International Journal of Com-
puter Vision, 60:91–110, 2004.
[15] H. Wang, M.M. Ullah, A. Klaser, I. Laptev, and
C. Schmid. Evaluation of local spatio-temporal fea-
tures for action recognition. British Machine Vision
Conference, London, UK, Sep. 2009.
[16] M.D. Rodriguez, J. Ahmed, and M. Shah. Action
mach: a spatio-temporal maximum average correla-
tion height filter for action recognition. IEEE Con-
ference on Computer Vision and Pattern Recognition,
Alaska, pages 1–8, June 2008.
[17] M. Marszalek, I. Laptev, and C. Schmid. Actions in
context. IEEE Conference on Computer Vision and
Pattern Recognition, Miami, Florida, pages 2929–
2936, June 2009.
[18] C. Schuldt, I. Laptev, and B. Caputo. Recognizing
human actions: a local svm approach. IEEE Inter-
national Conference on Pattern Recognition, Cam-
bridge, UK, 3:32–36, Aug. 2004.
[19] J. Shi and C. Tomasi. Good features to track. IEEE
Conference on Computer Vision and Pattern Recogni-
tion, Seattle, WA, pages 593–600, 1994.
[20] N. Govender. Evaluation of feature detection al-
gorithms for structure from motion. 3rd Robotics
and Mechatronics Symposium (ROBMECH), Pretoria,
South Africa, page 4, Nov. 2009.
[21] Y. Kubota, K. Aoki, H. Nagahashi, and S.I. Minohara.
Pulmonary motion tracking from 4D-CT images us-
ing a 3D-KLT tracker. IEEE Nuclear Science Sym-
posium Conference Record,Orlando, FL, pages 3475–
3479, Oct. 2009.
[22] B.M. ter Haar Romeny, L.M.J. Florack, and
M. Nielsen. Scale-time kernels and models. Scale-
Space and Morphology in Computer Vision, Vancou-
ver, Canada, pages 255–263, July 2001.
[23] H. Shabani, D.A. Clausi, and J.S. Zelek. Towards a
robust spatio-temporal interest point detection for hu-
man action recognition. IEEE Canadian Conference
on Computer and Robot Vision, Kelowna, BC, pages
237–243, May 2009.
[24] T. Lindeberg and D. Fagerstrom. Scale-space with
causal time direction. European Conference on Com-
puter Vision, Cambridge, UK, pages 229–240, April
1996.
[25] C.C. Chang and C.J. Lin. LIBSVM: a library for sup-
port vector machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[26] P. Scovanner, S. Ali, and M. Shah. A 3-Dimensional
SIFT descriptor and its application to action recogni-
tion. ACM Multimedia, Augsburg, Germany, pages
357–360, Sep. 2007. Code available at http://www.cs.ucf.edu/~pscovann/.
[27] A. Gaidon, Z. Harchaoui, and C. Schmid. A time se-
ries kernel for action recognition. British Machine Vi-
sion Conference, Dundee, UK, Sep. 2011.
[28] Q.V. Le, W.Y. Zou, S.Y. Yeung, and A.Y. Ng. Learn-
ing hierarchical invariant spatio-temporal features for
action recognition with independent subspace analy-
sis. IEEE Conference on Computer Vision and Pat-
tern Recognition, Colorado Spring, pages 3361–3368,
June 2011.
[29] H. Wang, A. Klaser, C. Schmid, and C.L. Liu. Action
recognition by dense trajectories. IEEE Conference on
Computer Vision and Pattern Recognition, Colorado
Spring, pages 3169–3176, June 2011.
[30] I. Laptev, M. Marszalek, C. Schmid, and B. Rozen-
feld. Learning realistic human actions from movies.
IEEE Conference on Computer Vision and Pattern
Recognition, Anchorage, Alaska, pages 1–8, 2008.