TTPP: Temporal Transformer with Progressive
Prediction for Efficient Action Anticipation
Wen Wang, Xiaojiang Peng, Yanzhou Su, Yu Qiao, Jian Cheng
Abstract—Video action anticipation aims to predict future
action categories from observed frames. Current state-of-the-
art approaches mainly resort to recurrent neural networks
to encode history information into hidden states, and predict
future actions from the hidden representations. It is well known
that the recurrent pipeline is inefficient in capturing long-term
information which may limit its performance in predication task.
To address this problem, this paper proposes a simple yet effi-
cient Temporal Transformer with Progressive Prediction (TTPP)
framework, which repurposes a Transformer-style architecture
to aggregate observed features, and then leverages a light-weight
network to progressively predict future features and actions.
Specifically, predicted features along with predicted probabilities
are accumulated into the inputs of subsequent prediction. We
evaluate our approach on three action datasets, namely TVSeries,
THUMOS-14, and TV-Human-Interaction. Additionally, we conduct a comprehensive study of several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but also is more efficient.
Index Terms—Action anticipation, Transformer, Encoder-
Decoder.
I. INTRODUCTION
Human action anticipation, also known as early action prediction, which aims to predict future unseen actions, is one of the main
topics in video understanding with wide applications in secu-
rity, visual surveillance and human-computer interaction, etc.
In contrast to the well-studied action recognition, which infers
the action label after observing the entire action execution,
action anticipation aims to predict human actions early, without observing the future action execution. It is a very challenging task because the input videos are temporally incomplete with a wide variety of irrelevant background content, and decisions must be made based on such incomplete action executions. In short, ac-
tion anticipation needs to overcome all the difficulties of action
recognition and capture sufficient historical and contextual
information to make future predictions in untrimmed video
streams.
Generally, most of the action anticipation approaches can
be divided into two key phases, namely observed information
aggregation and future prediction, as shown in Figure 1. Early
Wen Wang, Yanzhou Su and Jian Cheng are with the School of Informa-
tion and Communication Engineering, University of Electronic Science and
Technology of China, Chengdu, Sichuan, China, 611731.
Xiaojiang Peng and Yu Qiao are with ShenZhen Key Lab of Computer
Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Insti-
tutes of Advanced Technology, Chinese Academy of Sciences; SIAT Branch,
Shenzhen Institute of Artificial Intelligence and Robotics for Society.
This work was done when Wen Wang was intern at Shenzhen Institutes of
Advanced Technology, Chinese Academy of Sciences.
Corresponding author: Xiaojiang Peng (xj.peng@siat.ac.cn)
Fig. 1: The summarized generic flowchart for action an-
ticipation, which mainly consists of observed information
aggregation and future prediction.
works of action anticipation focus on trimmed action videos and mainly devote effort to extracting discriminative features from partial videos, i.e. observed information aggregation, for early action prediction [1], [28], [30], [42], [57]. In the deep
learning era, recent works turn to predict future actions in
practical untrimmed video streams [11], [13], [56], [62], and
mainly repurpose sequential models from the natural lan-
guage processing (NLP) domain, like long short-term memory
(LSTM) [15] and gated recurrent neural networks [6]. For
instance, Gao et al. [13] propose a Reinforced Encoder-
Decoder network, which utilizes an encoder-decoder LSTM
network [36], [59] to aggregate historical features and predict
future features or actions. Xu et al. [62] propose a LSTM-
based temporal recurrent network to predict future features for
both online action detection and action anticipation. Though the encoder-decoder recurrent networks can be easily transferred from the NLP domain to temporal action anticipation, their inherent sequential nature precludes parallelization within training examples and limits the memory power for longer sequence lengths [52]. Moreover, they are known to yield limited improvements in other action understanding tasks [?], [60].
In this paper, we address the two issues of action antici-
pation via a simple yet efficient Temporal Transformer with
Progressive Prediction (TTPP) framework. TTPP repurposes
a Transformer-style module to aggregate observed information
and leverages a light-weight network to progressively predict
future features and actions. Specifically, TTPP contains a Tem-
poral Transformer Module (TTM) and an elaborately-designed
Progressive Prediction Module (PPM). Given historical and
current features, the TTM aggregates the historical features
based solely on self-attention mechanisms with the current
feature as query, which is inspired by the Transformer in
machine translation [52]. The aggregated historical feature, along with the current feature, is then fed into the PPM. The
PPM is comprised of an initial prediction block and a shared-
parameter progressive prediction block, each of which is built
with two fully-connected (FC) layers, a ReLU activation [38]
and a layer normalization (LN) [2]. With the output feature of
TTM, the initial prediction block of PPM predicts the immedi-
ately following clip feature and action probabilities. The pro-
gressive prediction block accumulates the former predictions
and the output of TTM, and further predicts a few subsequent
future features and actions. The whole TTPP model can be
jointly trained in an end-to-end manner with supervision from
ground-truth future features and action labels. Compared to
previous encoder-decoder methods, the benefits of our TTPP
are two-fold. First, the temporal Transformer is more efficient
than recurrent methods in capturing historical context by self-
attention. Second, the progressive prediction module with skip
connections to aggregated historical features can efficiently
deliver temporal information and help long-term anticipation.
We evaluate our approach on three widely-used action an-
ticipation datasets, namely TVSeries [7], THUMOS-14 [20],
and TV-Human-Interaction [39]. Additionally, we conduct a comprehensive study of several popular aggregation and
prediction strategies, including temporal convolution, LSTM
and single-shot prediction, etc. Extensive results show that
TTPP is more efficient than the state-of-the-art methods in both training and inference, and outperforms them by a large margin.
The main contributions of this work can be summarized as follows.
• We propose a simple yet efficient TTPP framework for action anticipation, which leverages a Transformer-style architecture to aggregate information and a light-weight module to predict future actions.
• We elaborately design a progressive prediction module for predicting future features and actions, and achieve state-of-the-art performance on TVSeries, THUMOS-14, and TV-Human-Interaction.
• We conduct a comprehensive study of several popular aggregation and prediction strategies, including aggregation methods such as temporal convolution and Encoder-LSTM, and prediction methods such as Decoder-LSTM and single-shot prediction.
The rest of this paper is organized as follows. We first review related works in Section II. Section III describes the proposed framework, with the TTM to aggregate observed features and the PPM to progressively predict future actions. Afterwards, we show our experimental results on several datasets in Section IV and conclude the paper in Section V.
II. RELATED WORK
Action recognition. Action recognition is an important
branch of video related research areas and has been exten-
sively studied in the past decades. The existing methods are
mainly developed for extracting discriminative action features
from temporally complete action videos. These methods can
be roughly categorized into hand-crafted feature based ap-
proaches and deep learning based approaches. Early methods
such as Improved Dense Trajectory (IDT) mainly adopt hand-
crafted features, such as HOF [32], HOG [32] and MBH
[55]. Recent studies demonstrate that action features can
be learned by deep learning methods such as convolutional
neural networks (CNN) and recurrent neural networks (RNN).
Two-stream network [46], [57] learns appearance and motion
features based on RGB frame and optical flow field separately.
RNNs, such as long short-term memory (LSTM) [15] and
gated recurrent unit (GRU) [6], have been used to model long-
term temporal correlations and motion information in videos,
and generate video representation for action classification.
A CNN+LSTM model, which uses a CNN to extract frame
features and a LSTM to integrate features over time, is also
used to recognize activities in videos [9]. The C3D network [10] simultaneously captures appearance and motion features using a series of 3D convolutional layers. More recently, I3D [4] uses two-stream CNNs with inflated 3D convolutions on both dense RGB and optical flow sequences to achieve state-of-the-art performance on the Kinetics dataset [24].
Action anticipation. Many works have been proposed to
exploit the partially observed videos for early action predic-
tion or future action anticipation. Recently, Hoai et al. [17]
propose a max-margin framework with structured SVMs to
solve this problem. Ryoo et al. [42] develop an early action
prediction system by observing early evidence from the temporally accumulated features. Lan et al. [31] design a coarse-to-fine hierarchical representation to capture the discriminative
human movement at different levels, and use a max-margin
framework for final prediction. Cao et al. [64] formulate
the action prediction problem into a probabilistic framework,
which aims to maximize the posterior of activity given ob-
served frames. In their work, the likelihood is computed by
feature reconstruction error using sparse coding. However, it
suffers from high computational complexity as the inference
is performed on the entire training data. Vondrick et al. [54]
present a framework that uses large-scale unlabeled data to
predict a rich visual representation in the future, and apply
it towards anticipating both actions and objects. Kong et
al. [29] propose a combined CNN and LSTM along with a memory module to record “hard-to-predict” samples; they benchmark their results on the UCF101 [49] and Sports-1M [23] datasets. Gao et al. [13] propose a Reinforced Encoder-
Decoder (RED) network for action anticipation, which uses
reinforcement learning to encourage the model to make the
correct anticipations as early as possible. Ke et al. [27] propose
an attended temporal feature, which uses multi-scale temporal
convolutions to process the time-conditioned observation. In this work, we focus only on recent results on anticipating action labels; more details can be found in [13] and [62].
Online action detection. Online action detection is usually
solved as an online per-frame labelling task on streaming
videos, which requires correctly classifying every frame with-
out accessing future frames. De Geest et al. [7] first formulate the problem, introduce a realistic dataset, i.e. TVSeries, and benchmark the existing models. They show that a simple LSTM approach is not sufficient for online action detection, and even performs worse than the traditional pipeline of improved trajectories, Fisher vectors and SVMs. Their later
work [8] introduces a two-stream feedback network, where
one stream processes the input and the other one models
Fig. 2: The flowchart of our TTPP method for action anticipation. Given a continuous video sequence, an encoder network
is first used to map each video clip to clip features, and then a Temporal Transformer Module is proposed to aggregate
observed clip features, and finally a Progressive Prediction Module is designed for future action anticipation. Note that the
future information in the red dashed box is only used in the training stage and the classifier is applied to each prediction.
the temporal relations. Gao et al. [13] propose a Reinforced
Encoder-Decoder network for action anticipation and treat
online action detection as a special case of their framework.
Xu et al. [62] propose the Temporal Recurrent Network (TRN)
to model the temporal context by simultaneously performing
online action detection and anticipation. Besides, Shou et al.
[45] address the online detection of action start (ODAS) by
encouraging a classification network to learn the representation
of action start windows.
Attention for video understanding. The attention mecha-
nism which directly models long-term interactions with self-
attention has led to state-of-the-art models for action under-
standing tasks, such as video-based and skeleton-based action
recognition [14], [34], [35], [44], [48]. Our work is related
to the recent Video Action Transformer Network [14], which
uses the Transformer architecture as the “head” of a detection
framework. Specifically, it uses the ROI-pooled I3D feature of
a target region as query and aggregates contextual information
from the spatial-temporal feature maps of an input video clip.
Our work differs from it in the following aspects: (1) The
problem is different from spatial-temporal action detection. To
the best of our knowledge, we are the first to use Transformer
architecture for action anticipation. (2) We have task-specific
considerations. For instance, our Transformer unit takes the
current frame feature as query and the historical frame features
as memory. (3) We elaborately design a light-weight progres-
sive prediction module for efficient action anticipation.
III. OUR APPROACH
In this section, we present our temporal Transformer with
progressive prediction for the action anticipation task. We
propose two modules: a temporal Transformer module (TTM) to aggregate observed information and a progressive prediction module (PPM) to anticipate future actions.
A. Problem Formulation
The action anticipation task aims to predict the action class $y$ for each future frame from an observed action video $V$. More formally, let $V_1^L = [I_1, I_2, ..., I_L]$ be a video with $L$ frames. Given the first $t$ frames $V_1^t = [I_1, I_2, ..., I_t]$, the task is to predict the actions happening from frame $t+1$ to $L$. That is, we aim to assign action labels $y_{t+1}^L = [y_{t+1}, y_{t+2}, ..., y_L]$ to each of the unobserved frames.
B. Overall Framework
Two crucial issues of action anticipation are i) how to
aggregate observed information and ii) how to predict future
actions. We address these two issues with a simple yet efficient
framework, termed as Temporal Transformer with Progres-
sive Prediction (TTPP). As illustrated in Figure 2, a long video is first segmented into multiple non-overlapping chunks $\langle I'_1, I'_2, ..., I'_t \rangle$, with each chunk containing an equal number of consecutive frames. Then, a network, i.e. $g_{enc}$, maps each video chunk into a representation $f_t = g_{enc}(I'_t)$. More details on video pre-processing and feature extraction are presented in Section IV-C.
Subsequently, a Temporal Transformer Module (TTM), i.e. $g_{ttm}$, temporally aggregates the $t$ consecutive chunk representations into a historical representation $S_t = g_{ttm}(f_1, f_2, ..., f_t)$. Finally, a Progressive Prediction Module (PPM) progressively predicts future features and actions. The PPM is comprised of an initial prediction block, i.e. $g^0_{pred}$, and a shared-parameter progressive prediction block, i.e. $g_{pred}$. $g^0_{pred}$ takes $S_t$ as input and predicts the immediately following clip feature and action probability. $g_{pred}$ accumulates the former predictions and $S_t$, and further predicts a few subsequent future features and actions.
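To make the data flow concrete, the sketch below traces one TTPP forward pass as we read Figure 2. It is a shape-level illustration only: the feature dimension, the number of classes, and the placeholder modules (nn.Identity and plain linear layers standing in for $g_{ttm}$, $g^0_{pred}$ and $g_{pred}$) are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Shape-level sketch of one TTPP forward pass (our reading of Fig. 2).
# The modules below are dummy stand-ins so that the snippet runs;
# Sections III-C and III-D describe the real TTM and PPM.
d_m, C, t, l = 512, 30, 8, 8                 # feature dim, action classes, observed / predicted steps

feats = torch.randn(t, d_m)                  # chunk features f_1 ... f_t from g_enc
ttm = nn.Identity()                          # placeholder for g_ttm (attends over history, f_t as query)
classifier = nn.Linear(d_m, C)               # shared action classifier W_c
g0_pred = nn.Linear(2 * d_m + C, d_m)        # placeholder for the initial prediction block
g_pred = nn.Linear(2 * d_m + C, d_m)         # placeholder for the shared progressive block

S_t = ttm(feats[-1])                         # aggregated historical representation S_t
p_t = classifier(feats[-1]).softmax(-1)      # current action probability p_t
f_hat = g0_pred(torch.cat([S_t, feats[-1], p_t]))   # predicted feature for step t+1
p_hats = [classifier(f_hat).softmax(-1)]
for _ in range(l - 1):                       # steps t+2 ... t+l reuse the shared block
    f_hat = g_pred(torch.cat([S_t, f_hat, p_hats[-1]]))
    p_hats.append(classifier(f_hat).softmax(-1))
print(len(p_hats), p_hats[0].shape)          # l predictions, each with C class probabilities
```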
C. Temporal Transformer Module (TTM)
Transformer revisited. The Transformer was originally proposed to replace traditional recurrent models for machine translation [52]. Its core idea is to model the correlation between contextual signals with an attention mechanism. Specifically, it encodes the input sequence into a higher-level representation by modeling the relationship between queries ($Q$) and memory (keys ($K$) and values ($V$)):
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\big(\frac{QK^\top}{\sqrt{d_m}}\big)V$,  (1)
where $Q \in \mathbb{R}^{L_q \times d_m}$, $K \in \mathbb{R}^{L_k \times d_m}$ and $V \in \mathbb{R}^{L_k \times d_v}$. This architecture becomes “self-attention” with $Q = K = V = \{f_1, f_2, \cdots, f_T\}$, which is also known as the non-local network [58]. A self-attention module maps the sequence to a higher-level representation like an RNN, but without recurrence.
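For reference, Eq. (1) is the standard scaled dot-product attention; a minimal sketch with arbitrary dimensions is shown below (the function name and sizes are ours).

```python
import torch

def attention(Q, K, V, d_m):
    # Eq. (1): Softmax(Q K^T / sqrt(d_m)) V
    scores = Q @ K.transpose(-2, -1) / d_m ** 0.5   # (L_q, L_k) similarity scores
    return scores.softmax(dim=-1) @ V               # (L_q, d_v) attended output

Q = torch.randn(1, 64)    # a single query (e.g., the current chunk feature)
K = torch.randn(7, 64)    # memory keys (historical chunk features)
V = torch.randn(7, 64)    # memory values
print(attention(Q, K, V, d_m=64).shape)             # torch.Size([1, 64])
```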
Temporal Transformer. To efficiently aggregate observed information, our TTPP framework resorts to a Transformer-style architecture, termed the Temporal Transformer Module (TTM). The TTM takes the video chunk features as input and maps them into a query feature and memory features. For online action anticipation, considering that the last observed feature $f_t$ would be the most relevant one to the future actions, we use $f_t$ as the query of the TTM. The memory of the TTM is intuitively set to the historical features $[f_1, f_2, ..., f_{t-1}]$. Formally, the query and memory are
$Q = f_t, \quad K = V = [f_1, f_2, ..., f_{t-1}]$.  (2)
Since temporal information is lost in the attention operation, we add the positional encoding [52] to the input representations. Given the sequence features $f_{in} = [f_1, f_2, ..., f_T] \in \mathbb{R}^{T \times d_m}$, the $i$-th value of the positional vector at temporal position $pos$ is defined as
$PE_{(pos,i)} = \begin{cases} \sin(pos/10000^{i/d_m}) & \text{if } i \text{ is even} \\ \cos(pos/10000^{i/d_m}) & \text{otherwise}. \end{cases}$  (3)
The original feature vector $f_{pos}$ is then updated by $f_{pos} = f_{pos} + PE_{(pos,:)}$, which provides information about the temporal position of each clip feature.
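A small sketch of Eq. (3) is given below; it follows the paper's exponent $i/d_m$ (the original Transformer uses $2i/d_m$), and the function name and dimensions are our own.

```python
import torch

def positional_encoding(T, d_m):
    # Eq. (3): sin(pos / 10000^{i/d_m}) for even channels i, cos(...) for odd ones.
    pos = torch.arange(T).float().unsqueeze(1)                 # (T, 1) temporal positions
    i = torch.arange(d_m).float().unsqueeze(0)                 # (1, d_m) channel indices
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_m)   # (T, d_m)
    return torch.where(i.long() % 2 == 0, torch.sin(angles), torch.cos(angles))

f_in = torch.randn(8, 512)                  # T = 8 observed chunk features
f_in = f_in + positional_encoding(8, 512)   # inject temporal position information
```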
To model complicated action videos, our TTM further leverages the multi-head attention mechanism:
$A_t = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, ..., h_n)W^O$, where $h_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$,  (4)
where $n$ is the number of attention heads, $W_i^Q \in \mathbb{R}^{d_m \times d_q}$, $W_i^K \in \mathbb{R}^{d_m \times d_k}$ and $W_i^V \in \mathbb{R}^{d_m \times d_v}$ are the linear projection parameters of the $i$-th attention head, and $W^O \in \mathbb{R}^{nd_v \times d_m}$ is the projection matrix that reduces the dimension of the concatenated attention vector.
Fig. 3: The illustration of our PPM. It consists of an initial
prediction block highlighted in blue and a shared-parameter
progressive prediction block highlighted in yellow, where each
block is built with two fully-connected layers, followed by a
ReLU activation. In addition, we use layer normalization and
dropout to improve regularization.
For each head, we use $d_k = d_q = d_v = d_m/n$. Considering the importance of $f_t$ for anticipation, we view $A_t$ as extra information and add it to the original query feature via a shortcut connection. The final output feature of the TTM is $S_t = A_t + f_t$, with dimension $d_m$.
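A compact sketch of the TTM is shown below. It relies on torch.nn.MultiheadAttention, which bundles the per-head projections of Eq. (4) and applies its own per-head scaling; the class name and $d_m = 512$ are our placeholders, and the positional encoding is assumed to be added beforehand.

```python
import torch
import torch.nn as nn

class TemporalTransformerModule(nn.Module):
    """Sketch of the TTM: current feature as query, history as memory (Eqs. 2 and 4)."""
    def __init__(self, d_m=512, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_m, n_heads, batch_first=True)

    def forward(self, feats):                # feats: (B, T, d_m), positional encoding already added
        query = feats[:, -1:, :]             # f_t as the query Q
        memory = feats[:, :-1, :]            # f_1 ... f_{t-1} as keys and values
        A_t, _ = self.attn(query, memory, memory)
        return (A_t + query).squeeze(1)      # shortcut: S_t = A_t + f_t, shape (B, d_m)

ttm = TemporalTransformerModule()
S_t = ttm(torch.randn(2, 8, 512))            # a batch of 2 observed sequences of length 8
print(S_t.shape)                             # torch.Size([2, 512])
```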
D. Progressive Prediction Module (PPM)
Partially inspired by WaveNet [51], we design a Progressive
Prediction Module (PPM) to better exploit the aggregated
historical knowledge for future prediction. As illustrated in
Figure 3, the PPM is comprised of an initial prediction block
and a shared-parameter progressive prediction block, where
each block is built with two fully-connected (FC) layers, a
ReLU activation [38] and a layer normalization (LN) [2].
Assume we predict $l$ steps into the future, from time $t+1$ to $t+l$. At the first time step $t+1$, the initial prediction block takes as input the aggregated historical representation $S_t \in \mathbb{R}^{d_m}$ and predicts the feature $f'_{t+1} \in \mathbb{R}^{d_m}$ and action probability $p'_{t+1} \in \mathbb{R}^{C}$. Formally, this block is defined as
$p_t = \mathrm{Softmax}(W_c f_t)$,  (5)
$f'_{t+1} = g^0_{pred}(S_t \oplus f_t \oplus p_t)$,  (6)
$p'_{t+1} = \mathrm{Softmax}(W_c f'_{t+1})$,  (7)
where $W_c$ is the multi-class ($C$ action classes) action classifier. At any other time step $t+i$ ($i > 1$), the previously predicted embedding $f'_{t+i-1}$ and action probability $p'_{t+i-1}$ are first concatenated with $S_t$ channel-wise, and then fed into the progressive prediction block. Formally, this block is defined as
$f'_{t+i} = g_{pred}(S_t \oplus f'_{t+i-1} \oplus p'_{t+i-1})$,  (8)
$p'_{t+i} = \mathrm{Softmax}(W_c f'_{t+i})$,  (9)
where $\oplus$ denotes the concatenation operation. Due to the concatenation, the input dimension of the progressive prediction block is $2d_m + C$. For both blocks, we use two fully-connected (FC) layers, with the first FC reducing the input dimension to $d_m/2$ and the second FC generating an output vector of dimension $d_m$. It is worth noting that different steps of the progressive prediction block share parameters. Thus, the whole PPM is a light-weight network.
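A sketch of the PPM consistent with Fig. 3 follows: an initial block and a parameter-shared progressive block, each made of two FC layers ($d_m/2$ then $d_m$) with ReLU, layer normalization and dropout, plus the shared classifier $W_c$. The exact ordering of ReLU, LayerNorm and dropout inside a block, the dropout rate, and all dimensions are our assumptions.

```python
import torch
import torch.nn as nn

def make_block(in_dim, d_m, p_drop=0.5):
    # Two FC layers (in_dim -> d_m/2 -> d_m) with ReLU, LayerNorm and Dropout (assumed ordering).
    return nn.Sequential(nn.Linear(in_dim, d_m // 2), nn.ReLU(),
                         nn.Linear(d_m // 2, d_m), nn.LayerNorm(d_m), nn.Dropout(p_drop))

class ProgressivePredictionModule(nn.Module):
    def __init__(self, d_m=512, num_classes=30, steps=8):
        super().__init__()
        self.steps = steps
        self.init_block = make_block(2 * d_m + num_classes, d_m)   # g^0_pred
        self.prog_block = make_block(2 * d_m + num_classes, d_m)   # g_pred, shared across steps
        self.classifier = nn.Linear(d_m, num_classes)              # W_c

    def forward(self, S_t, f_t):
        p = self.classifier(f_t).softmax(-1)                       # Eq. (5)
        f = self.init_block(torch.cat([S_t, f_t, p], dim=-1))      # Eq. (6)
        feats, probs = [f], [self.classifier(f).softmax(-1)]       # Eq. (7)
        for _ in range(self.steps - 1):                            # Eqs. (8)-(9) with shared parameters
            f = self.prog_block(torch.cat([S_t, feats[-1], probs[-1]], dim=-1))
            feats.append(f)
            probs.append(self.classifier(f).softmax(-1))
        return torch.stack(feats, 1), torch.stack(probs, 1)        # (B, l, d_m), (B, l, C)

ppm = ProgressivePredictionModule()
f_hat, p_hat = ppm(torch.randn(2, 512), torch.randn(2, 512))
print(f_hat.shape, p_hat.shape)   # torch.Size([2, 8, 512]) torch.Size([2, 8, 30])
```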
Training. Our TTPP framework is trained in an end-to-end manner with supervision on the PPM module. Specifically, we use two types of loss functions, namely a feature reconstruction loss $L_r$ and a classification loss $L_c$. $L_r$ is the mean squared error (MSE) between the predicted features and the ground-truth features, defined as
$L_r = \sum_{i=1}^{l} \|f'_{t+i} - f_{t+i}\|^2$.  (10)
$L_c$ is the sum of the cross-entropy loss (CE) over all prediction steps, defined as
$L_c = -\sum_{i=1}^{l} \sum_{j=1}^{C} y_{(t+i,j)} \log p'_{(t+i,j)}$,  (11)
where $y_{(t+i,:)}$ is the one-hot ground-truth vector at time $t+i$. The total loss is formulated as
$L = L_c + \lambda L_r$,  (12)
where $\lambda$ is a trade-off weight for the feature reconstruction loss. We experimentally find that the final performance is not sensitive to this weight, so we set $\lambda = 1$ for simplicity in our experiments.
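Given the stacked PPM outputs, the combined objective of Eqs. (10)-(12) can be computed roughly as below; the averaging over the batch is our choice, and in practice one would feed the logits of $W_c$ to a cross-entropy loss directly rather than re-logging probabilities.

```python
import torch
import torch.nn.functional as F

def ttpp_loss(f_pred, f_true, p_pred, y_true, lam=1.0):
    # f_pred, f_true: (B, l, d_m); p_pred: (B, l, C) probabilities; y_true: (B, l) class indices.
    L_r = ((f_pred - f_true) ** 2).sum(-1).mean()             # Eq. (10): MSE over the l predicted features
    L_c = F.nll_loss((p_pred + 1e-8).log().flatten(0, 1),     # Eq. (11): cross-entropy at every step
                     y_true.flatten())
    return L_c + lam * L_r                                    # Eq. (12): total loss with trade-off weight

loss = ttpp_loss(torch.randn(2, 8, 512), torch.randn(2, 8, 512),
                 torch.rand(2, 8, 30).softmax(-1), torch.randint(0, 30, (2, 8)))
print(loss.item())
```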
IV. EXPERIMENTS
The proposed method was evaluated on three datasets, i.e. TVSeries [7], THUMOS-14 [20] and TV-Human-Interaction
[39]. We choose these datasets because they include videos
from diverse perspectives and applications: TVSeries was
recorded from television and contains a variety of everyday
activities, THUMOS-14 is a popular dataset of sports-related
actions, and TV-Human-Interaction contains human interaction actions collected from TV shows. In this section, we report
experimental results and detailed analysis.
A. Datasets
TVSeries [7] was originally proposed for online action detection and consists of 27 episodes of 6 popular TV series, namely Breaking Bad (3 episodes), How I Met Your Mother (8), Mad Men (3), Modern Family (6), Sons of Anarchy (3), and Twenty-four (4). It contains 16 hours of video in total. The dataset is temporally annotated at the frame level with 30 realistic, everyday actions (e.g., pick up, open door, drink, etc.). It is challenging due to its diverse actions, multiple actors, unconstrained viewpoints, heavy occlusions, and a large proportion of non-action frames.
THUMOS-14 [20] is a popular benchmark for temporal
action detection. It contains over 20 hours of sport videos
annotated with 20 actions. The training set (i.e. UCF101 [49])
contains only trimmed videos that cannot be used to train
temporal action detection models. Following prior works [13],
[62], we train our model on the validation set (including 3K action instances in 200 untrimmed videos) and evaluate on the test set (including 3.3K action instances in 213 untrimmed videos).
TV-Human-Interaction (TV-HI) [39]. We also evaluate our
method on TV-Human-Interaction which is also used in [13].
The dataset contains 300 trimmed video clips extracted from
23 different TV shows. It is annotated with four interaction
classes, namely hand shake, high five, hug, and kissing. It also contains a negative class of 100 videos that contain none
of the listed interactions. We use the suggested experimental
setup of two train/test splits.
B. Evaluation Protocols
For each class on TVSeries, we use the per-frame calibrated average precision (cAP) proposed in [7]:
$cAP = \frac{\sum_k cPrec(k)\,I(k)}{P}$,  (13)
where the calibrated precision is $cPrec = \frac{TP}{TP + FP/w}$, $I(k)$ is an indicator function that equals 1 if the cut-off frame $k$ is a true positive, $P$ denotes the total number of true positives, and $w$ is the ratio between negative and positive frames. The mean cAP over all classes is reported as the final performance. The advantage of cAP is that it is fair under class imbalance. For THUMOS-14, we report per-frame
mean Average Precision (mAP) performance. For TV-Human-
Interaction, we report classification accuracy (ACC).
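As a rough sketch (not the official TVSeries evaluation code), per-class cAP can be computed by sorting frames by confidence and accumulating calibrated precision at every true positive:

```python
import numpy as np

def calibrated_ap(scores, labels):
    # Per-class calibrated AP of Eq. (13): scores are per-frame confidences for one class,
    # labels are the binary per-frame ground truth for that class.
    order = np.argsort(-scores)
    labels = labels[order]
    num_pos = max(int((labels == 1).sum()), 1)
    w = (labels == 0).sum() / num_pos            # ratio of negative to positive frames
    tp = np.cumsum(labels == 1)
    fp = np.cumsum(labels == 0)
    cprec = tp / (tp + fp / w + 1e-8)            # calibrated precision at every cut-off k
    return float((cprec * (labels == 1)).sum() / num_pos)

scores = np.random.rand(1000)
labels = (np.random.rand(1000) > 0.9).astype(int)   # heavily imbalanced, as in TVSeries
print(calibrated_ap(scores, labels))
```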
C. Implementation Details
To make fair comparisons with state-of-the-art methods
[7], [13], [62], we follow their experimental settings on each
dataset.
Chunk-level feature extraction. We extract frames from all videos at 24 Frames Per Second (FPS). The video chunk size is set to 6 frames, i.e. 0.25 seconds. We use three different feature extractors as the visual encoder $g_{enc}$: a VGG-16 [47] network pre-trained on UCF101 [49], a two-stream (TS) [61] network¹ pre-trained on ActivityNet-1.3 [3], and an inflated 3D ConvNet² (I3D) [5] pre-trained on Kinetics [25]. VGG-16 features (4096-D) are extracted at the fc6 layer for the central frame of each chunk. For the two-stream features of each chunk, the appearance CNN feature is extracted on the central frame as the output of the Flatten 673 layer of ResNet-200 [16], and the motion feature is extracted on the 6 optical flow frames of each chunk as the output of the global pool layer of a pre-trained BN-Inception model [19]. The motion feature and appearance feature are then concatenated into a TS feature
(4096-D) for each chunk. Different from prior works [13], [62], we also use recent I3D features. The I3D model is originally trained with 64-frame video snippets and thus may not be ideal for per-frame action anticipation. Nevertheless, we input the 6 frames of each chunk to I3D and extract the output (1024-D) of the last global average pooling layer as the I3D-based feature.
¹ https://github.com/yjxiong/anet2016-cuhk
² https://github.com/piergiaj/pytorch-i3d

Time predicted into the future (seconds)
Method Inputs 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
ED [13] VGG 71.0 70.6 69.9 68.8 68.0 67.4 67.0 66.7 68.7
RED [13] VGG 71.2 71.0 70.6 70.2 69.2 68.5 67.5 66.8 69.4
Ours VGG 72.7 72.3 71.9 71.6 71.3 70.9 69.9 69.3 71.3
ED [13] TS 78.5 78.0 76.3 74.6 73.7 72.7 71.7 71.0 74.5
RED [13] TS 79.2 78.7 77.1 75.5 74.2 73.0 72.0 71.2 75.1
TRN [62] TS 79.9 78.4 77.1 75.9 74.9 73.9 73.0 72.3 75.7
Ours TS 81.2 80.3 79.3 77.6 76.9 76.7 76.0 74.9 77.9
TABLE I: Comparison with the state-of-the-art methods on TVSeries in terms of mean cAP (%).

Time predicted into the future (seconds)
Method Inputs 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
ED [13] TS 43.8 40.9 38.7 36.8 34.6 33.9 32.5 31.6 36.6
RED [13] TS 45.3 42.1 39.6 37.5 35.8 34.4 33.2 32.1 37.5
TRN [62] TS 45.1 42.4 40.7 39.1 37.7 36.4 35.3 34.3 38.9
Ours TS 45.9 43.7 42.4 41.0 39.9 39.4 37.9 37.3 40.9
Ours I3D 46.8 45.5 44.6 43.6 41.9 41.1 40.4 38.7 42.8
TABLE II: Comparison with the state-of-the-art methods on THUMOS-14 in terms of mAP (%).
Hyperparameter setting. We implement our proposed
method in PyTorch and perform all experiments on a system
with 8 Nvidia TITAN X graphics cards. We use the SGD optimizer with a learning rate of 0.001, a momentum of 0.9, and a batch size of 32. The input sequence length is set to 8 by default, corresponding to 2 seconds. We use a single-layer multi-head setting for our TTM, and the number of heads is set to 4 by default.
D. Popular Baselines
Here we present several advanced baselines for temporal
information aggregation and future prediction.
Temporal convolution (i.e. Conv1D) aggregates temporal features with 1-D convolution operations along the temporal axis. We apply 3 Conv1D layers with kernel size 3 and stride 2 on two-stream features for this baseline.
LSTM takes sequence features as input and recurrently
updates its hidden states over time. The Encoder-LSTM
summarizes historical information into the final hidden state
for information aggregation. The Decoder-LSTM recurrently
decodes information into hidden states as predicted features.
We use a single-layer LSTM architecture with 4096 hidden
units for this baseline.
Single-shot prediction (SSP). We implement a single-shot prediction method similar to [13], [54]. With the aggregated historical feature, this method uses two FC layers to anticipate the single future feature at $T_a$, where $T_a \in \{t+1, t+2, ..., t+l\}$. This prediction method is equivalent to our PPM without the progressive process.
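For reference, the SSP baseline can be sketched as a direct two-FC regressor from the aggregated feature to the single target step $T_a$, with no progressive accumulation; the layer widths below are our own choice.

```python
import torch
import torch.nn as nn

# Single-shot prediction: jump straight to the feature at one chosen step T_a.
d_m, C = 512, 30
ssp = nn.Sequential(nn.Linear(d_m, d_m // 2), nn.ReLU(), nn.Linear(d_m // 2, d_m))
classifier = nn.Linear(d_m, C)

S_t = torch.randn(2, d_m)              # aggregated historical feature (from the TTM or an LSTM encoder)
f_Ta = ssp(S_t)                        # predicted feature at the chosen step T_a
p_Ta = classifier(f_Ta).softmax(-1)    # anticipated action probabilities at T_a
print(p_Ta.shape)                      # torch.Size([2, 30])
```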
E. Comparison with State of the Art
We compare our proposed TTPP method to several state-of-
the-art methods on TVSeries, THUMOS-14, and TV-HI. The
Method Vondrick et al. [54] RED-VGG [13] RED-TS [13] Ours-TS
ACC (%) 43.6 47.5 50.2 53.5
TABLE III: Anticipation results on TV-Human-Interaction at $T_a = 1$s in terms of ACC (%).
Time predicted into the future (seconds)
Method 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
Conv1D† 77.9 77.1 75.9 74.3 73.6 72.7 71.9 71.0 74.3
Conv1D 79.4 77.9 76.6 75.4 73.9 73.6 72.8 72.4 75.3
LSTM† 79.3 78.2 76.9 74.6 73.0 72.3 71.6 69.8 74.5
LSTM 78.9 77.9 76.3 75.5 74.3 73.8 72.8 72.0 75.2
TTM† 79.1 78.6 77.9 77.0 76.4 75.6 75.1 74.1 76.7
TTM (Ours) 81.2 80.3 79.3 77.6 76.9 76.7 76.0 74.9 77.9
(a) TVSeries
Time predicted into the future (seconds)
Method 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
Conv1D 41.8 40.9 39.6 38.1 37.2 36.7 35.9 35.5 38.2
LSTM 41.6 40.5 39.1 37.9 35.6 34.9 34.3 33.3 37.0
TTM (Ours) 45.9 43.7 42.4 41.0 39.9 39.4 37.9 37.3 40.9
(b) THUMOS-14
TABLE IV: Evaluation of temporal aggregation methods with two-stream features on TVSeries and THUMOS-14. For fair comparison, PPM is used for future prediction. “†” denotes that the aggregated feature is used directly for prediction, without the shortcut connection to the current feature.
results are presented in Table I, Table II, and Table III, respec-
tively. Our method consistently outperforms all these methods
in all the predicted steps. With two-stream features, our TTPP achieves 77.9% (mean cAP), 40.9% (mAP), and 53.5% (ACC) on these datasets, outperforming the recent advanced methods by 2.2%, 2.0%, and 3.3%, respectively.
On both TVSeries and THUMOS-14, the improvements
over other methods are more evident on long-term predictions.
For instance, with two-stream features, our TTPP outperforms ED (Encoder-Decoder LSTM) [13] by 2.1% at $T_a = 0.25$s and 5.7% at $T_a = 2.0$s on THUMOS-14, and these numbers are 2.7% and 3.9% on TVSeries. With VGG features, our method improves over the Reinforced Encoder-Decoder (RED) by 2.6% in average cAP on TVSeries. Since the VGG and TS features are relatively old, we also test I3D features, which set a new state of the art on THUMOS-14 with 42.8% average mAP over time.
F. Ablation Study of TTM and PPM
To further investigate the effectiveness of our proposed
TTPP, we conduct extensive evaluations for both TTM and
PPM by comparing them to recent temporal aggregation and
prediction methods, respectively.
For temporal aggregation, we compare our TTM to Conv1D
and Encoder-LSTM on both THUMOS-14 and TVSeries with
the PPM as prediction phase. Since we use a shortcut con-
nection in TTM to highlight the current frame information,
we also evaluate the benefits of this design for all the aggre-
gation methods. The results are shown in Table IV. Several
observations can be made. First, the proposed TTM is superior to both Conv1D and LSTM regardless of the shortcut connection. Specifically, TTM outperforms Conv1D and LSTM by 2.6% (2.7%) and 2.7% (3.9%) with the shortcut connection on TVSeries (THUMOS-14), respectively. Second, the shortcut connection to the current feature significantly improves all methods on TVSeries. For instance, our TTM degrades from 77.9% to 76.7% after removing the shortcut connection, which demonstrates the importance of the current feature and the superiority of our design. Last but not least, the improvements of our TTM over the other methods are similar across different time steps, which suggests that TTM provides better aggregated features via attention than the others.

Time predicted into the future (seconds)
Method (A-P) 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
LSTM-LSTM (ED [13]) 78.5 78.0 76.3 74.6 73.7 72.7 71.7 71.0 74.5
LSTM-SSP (EFC [13]) 78.4 76.9 74.2 72.7 70.7 70.2 69.0 67.9 72.5
LSTM-PPM 78.6 77.5 76.0 75.0 73.5 73.8 72.8 71.8 74.9
TTM-LSTM 79.3 77.5 76.2 75.1 73.3 72.8 71.6 69.9 74.5
TTM-SSP 80.1 77.1 75.3 73.6 72.3 71.7 70.0 68.9 73.6
TTM-PPM (Ours) 81.2 80.3 79.3 77.6 76.9 76.7 76.0 74.9 77.9
(a) TVSeries
Time predicted into the future (seconds)
Method (A-P) 0.25s 0.5s 0.75s 1.0s 1.25s 1.5s 1.75s 2.0s Avg
LSTM-LSTM (ED [13]) 43.8 40.9 38.7 36.8 34.6 33.9 32.5 31.6 36.6
LSTM-SSP (EFC [13]) 40.6 39.3 37.2 34.9 33.2 31.5 30.4 28.5 34.4
LSTM-PPM 41.3 40.3 38.9 37.6 35.4 34.6 33.9 33.0 36.8
TTM-LSTM 44.4 43.1 41.3 40.2 38.7 37.9 37.2 36.6 39.9
TTM-SSP 43.9 41.0 38.5 37.1 35.5 32.7 32.2 30.5 36.4
TTM-PPM (Ours) 45.9 43.7 42.4 41.0 39.9 39.4 37.9 37.3 40.9
(b) THUMOS-14
TABLE V: Evaluation of future prediction methods with two-stream features on TVSeries and THUMOS-14. Both TTM and LSTM are evaluated for temporal aggregation. (A-P) denotes the temporal aggregation and future prediction methods, respectively.
For future prediction, we compare our PPM to Decoder-
LSTM and SSP with either Encoder-LSTM or TTM as aggre-
gation method. The results are presented in Table V. Several findings can be summarized as follows. First, with both aggregation methods, our PPM consistently outperforms Decoder-LSTM and SSP on both datasets, which shows the effectiveness of PPM. Second, our PPM obtains larger improvements with our aggregation method TTM than with Encoder-LSTM. For instance, TTM-PPM outperforms TTM-LSTM by 3.4%, while LSTM-PPM only outperforms LSTM-LSTM by 0.4%. Third, with both aggregation methods, our PPM is significantly superior to SSP on both datasets, especially at long-term prediction steps, which demonstrates that the progressive design of our PPM is important.
G. Importance of Feature Prediction
In order to evaluate the influence of feature prediction on the final action anticipation, we remove the predicted features (w/o FP) and use only the concatenation of the action probability and the aggregated historical representation in the PPM. The results are shown in Figure 4. Without the predicted features, the performance of the model (w/o FP) degrades dramatically. This indicates that relying only on the action probability to predict future actions is not enough; the predicted feature representations are closely related to the action itself and thus provide useful information.
Fig. 4: Evaluation of feature prediction for action anticipation on TVSeries (cAP %) and THUMOS-14 (mAP %) with two-stream features.
Fig. 5: Evaluation of the trade-off weight λ on TVSeries (cAP %) and THUMOS-14 (mAP %) with two-stream features.
H. Evaluation of Sequence Length and Parameters
In the above experiments, we use a fixed historical length of 8 for aggregation, 4 parallel attention heads, and a trade-off loss weight λ = 1.0 for training by default. To investigate their impact on the proposed TTPP framework, we evaluate them on both THUMOS-14 and TVSeries.
The impact of λ. λ is the weight of the feature reconstruction loss in training. Figure 5 shows the results of varying λ on THUMOS-14 and TVSeries. Removing the feature reconstruction loss, i.e. λ = 0, degrades performance dramatically on both datasets, which suggests the necessity of feature prediction. Increasing the weight from 0 to 1 improves performance, which then saturates or slightly degrades beyond 1. This may be explained by the fact that overemphasizing feature reconstruction can hurt the discriminability of the predicted features.
Fig. 6: Evaluation of input sequence length for temporal
aggregation on TVSeries (cAP %) and THUMOS-14 dataset
(mAP %) with two-stream features.
Number of heads n. We also study performance variations with different numbers of heads in the temporal Transformer. The average prediction performance of our TTPP network with n ∈ {1, 2, 4, 8, 16} is shown in Table VI. The results indicate that our method is not sensitive to the parameter n: the largest performance variation is only 0.8% on TVSeries and 1.1% on THUMOS-14. On both datasets, we achieve the best performance with n = 4.
Input sequence length. The length of the observed sequence determines how much historical information can be used. Figure 6 illustrates the evaluation results on THUMOS-14 and TVSeries. On both datasets, we achieve the best performance with length 8. Decreasing the sequence length leads to insufficient context information, while increasing it introduces excessive background information; both are inferior to the default length.
I. Efficiency and Visualization
Table VII reports a comparison of parameters, memory
footprint, inference time and performance of different mod-
els on the TVSeries dataset. Compared to the popular Encoder-Decoder LSTM model, our TTPP has 64% fewer parameters, a 44% smaller memory footprint and less inference time, while achieving 4.6% higher performance. The efficiency of the
proposed TTPP owes to both the Transformer architecture
for sequence modeling and the efficient progressive prediction
module.
Figure 7 shows some examples of attention weights and ac-
tion anticipation on TVSeries, THUMOS-14 and TV-Human-
Interaction. We find that frames near the current frame usually receive higher weights than distant frames, since the current frame feature is used as the query. On TVSeries and THUMOS-14, multiple action instances and confusing background frames exist in the videos, which inevitably lead to incorrect anticipations.
Number of Heads n=1 n=2 n=4 n=8 n=16
TVSeries 77.1 77.7 77.9 77.5 77.2
THUMOS-14 40.1 40.4 40.9 40.2 39.8
TABLE VI: Comparison between different numbers of heads on TVSeries and THUMOS-14 with two-stream features.

Model Parameters (M) Memory (M) Inference (s) cAP (%)
ED [13] 277 6560 212 74.5
TTPP 101 3675 145 77.9
TABLE VII: Comparison of parameters, memory footprint and inference time on the TVSeries dataset with two-stream features.

V. CONCLUSION
In this paper, we propose a novel deep framework to boost action anticipation by adopting a Temporal Transformer with Progressive Prediction, where a TTM is used to aggregate observed information and a PPM to progressively predict future features and actions. Experimental results on TVSeries, THUMOS-14, and TV-Human-Interaction demonstrate that our framework significantly outperforms the state-of-the-art methods. Extensive ablation studies are conducted to show the effectiveness of each module of our method.
VI. ACKNOWLEDGEMENTS
This work is partially supported by the National
Key Research and Development Program of China
(No.2016YFC1400704), and National Natural Science Foun-
dation of China (U1813218, U1713208, 61671125), Shenzhen
Basic Research Program (JCYJ20170818164704758,
CXB201104220032A), the Joint Lab of CAS-HK, Shenzhen
Institute of Artificial Intelligence and Robotics for Society,
and the Sichuan Province Key Research and Development
Plan (2019YFS0427).
REFERENCES
[1] Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Mathieu Salz-
mann, Basura Fernando, and Lars Andersson. Encouraging lstms to
anticipate actions very early. In ICCV, 2017.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer
normalization. 2016.
[3] Fabian Caba, Victor Escorcia, Bernard Ghanem, and Juan Carlos
Niebles. Activitynet: A large-scale video benchmark for human activity
understanding. In CVPR, 2015.
[4] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A
new model and the kinetics dataset. CoRR, abs/1705.07750, 2017.
[5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a
new model and the kinetics dataset. In CVPR, 2017.
[6] Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Ben-
gio. Empirical evaluation of gated recurrent neural networks on sequence
modeling. CoRR, abs/1412.3555, 2014.
[7] Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees
Snoek, and Tinne Tuytelaars. Online action detection. In CVPR, 2016.
[8] Roeland De Geest and Tinne Tuytelaars. Modeling temporal structure
with lstm for online action detection. In WACV, pages 1549–1557, 2018.
[9] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus
Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.
Long-term recurrent convolutional networks for visual recognition and
description. CoRR, abs/1411.4389, 2014.
[10] Tran Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
Manohar Paluri. Learning spatiotemporal features with 3d convolutional
networks. In ICCV, 2015.
[11] Antonino Furnari and Giovanni Maria Farinella. What would you
expect? anticipating egocentric actions with rolling-unrolling lstms and
modality attention. In ICCV, 2019.
[12] Jiyang Gao and Ram Nevatia. Revisiting temporal modeling for video-
based person reid. In BMVC, 2018.
Fig. 7: Visualization of attention weights and action anticipation on TVSeries (1st row), THUMOS-14 (2nd row), and TV-
Human-Interaction (3rd row). Incorrect anticipation results are shown in red.
[13] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. RED: reinforced
encoder-decoder networks for action anticipation. In BMVC, 2017.
[14] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman.
Video action transformer network. In CVPR, 2019.
[15] Alex Graves. Long short-term memory. Neural Computation, 9(8):1735–
1780, 1997.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[17] Minh Hoai and Fernando De La Torre. Max-margin early event
detectors. In CVPR, 2012.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. 2015.
[20] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah,
and R. Sukthankar. THUMOS challenge: Action recognition with a large
number of classes. 2014.
[21] Xu Jing, Zhao Rui, Zhu Feng, Huaming Wang, and Wanli Ouyang.
Attention-aware compositional network for person re-identification. In
CVPR, 2018.
[22] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and
Yonghui Wu. Exploring the limits of language modeling. CoRR,
abs/1602.02410, 2016.
[23] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, and
Fei Fei Li. Large-scale video classification with convolutional neural
networks. In 2014 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2014.
[24] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier,
Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back,
Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics
human action video dataset. CoRR, abs/1705.06950, 2017.
[25] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, and Andrew
Zisserman. The kinetics human action video dataset. 2017.
[26] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Farid Boussaid,
and Ferdous Sohel. Human interaction prediction using deep temporal
features. In ECCV, 2016.
[27] Qiuhong Ke, Mario Fritz, and Bernt Schiele. Time-conditioned action
anticipation in one shot. In CVPR, 2019.
[28] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action
prediction. In CVPR, 2017.
[29] Yu Kong, Shangqian Gao, Bin Sun, and Yun Fu. Action prediction from
videos via memorizing hard-to-predict samples. In Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[30] Yu Kong, Dmitry Kit, and Yun Fu. A discriminative model with multiple
temporal scales for action prediction. In ECCV, 2014.
[31] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical
representation for future action prediction. In David Fleet, Tomas Pajdla,
Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV
2014, pages 689–704, Cham, 2014. Springer International Publishing.
[32] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozen-
feld. Learning realistic human actions from movies. In CVPR, 2008.
[33] Kang Li and Yun Fu. Prediction of human activity by discovering
temporal sequence patterns. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 36(8):1644–1657, 2014.
[34] Jun Liu, Gang Wang, Ling Yu Duan, Kamila Abdiyeva, and Alex C.
Kot. Skeleton based human action recognition with global context-
aware attention lstm networks. IEEE Transactions on Image Processing,
PP(99):1–1, 2018.
[35] Jun Liu, Gang Wang, Ping Hu, Ling Yu Duan, and Alex C. Kot. Global
context-aware attention lstm networks for 3d action recognition. In
CVPR, 2017.
[36] Minh Thang Luong, Hieu Pham, and Christopher D. Manning. Effective
approaches to attention-based neural machine translation. Computer
Science, 2015.
[37] Tahmida Mahmud, Mahmudul Hasan, and Amit K. Roy-Chowdhury.
Joint prediction of activity labels and starting times in untrimmed videos.
In CVPR, 2017.
[38] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve
restricted boltzmann machines. In ICML, 2010.
[39] Alonso Patron, Marcin Marszalek, Andrew Zisserman, and Ian Reid.
High five: Recognising human interactions in tv shows. In BMVC, 2010.
[40] S. Ren, K. He, R Girshick, and J. Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[41] Cristian Rodriguez, Basura Fernando, and Hongdong Li. Action antici-
pation by predicting future dynamic images.
[42] M. S. Ryoo. Human activity prediction: Early recognition of ongoing
activities from streaming videos. In CVPR, 2012.
[43] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recog-
nition using visual attention. In ICLR, 2015.
[44] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recog-
nition using visual attention. Computer Science, 2017.
[45] Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan
Mansour, Anthony Vetro, Xavier Giro-I-Nieto, and Shih Fu Chang.
Online detection of action start in untrimmed, streaming videos. In
CVPR, 2018.
[46] Karen Simonyan and Andrew Zisserman. Two-stream convolutional
networks for action recognition in videos. In NIPS, 2014.
[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. Computer Science, 2014.
[48] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying
Liu. An end-to-end spatio-temporal attention model for human action
recognition from skeleton data. In AAAI, 2016.
[49] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A
dataset of 101 human actions classes from videos in the wild. Computer
Science, 2012.
[50] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks
from overfitting. Journal of Machine Learning Research, 15(1):1929–
1958, 2014.
[51] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and
Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR,
abs/1609.03499, 2016.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
is all you need. CoRR, abs/1706.03762, 2017.
[53] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating
the future by watching unlabeled video. CoRR, abs/1504.08023, 2015.
[54] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating
visual representations from unlabeled video. In CVPR, 2016.
[55] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng Lin Liu.
Dense trajectories and motion boundary descriptors for action recogni-
tion. International Journal of Computer Vision, 103(1):60–79, 2013.
[56] Hongsong Wang and Jiashi Feng. Delving into 3d action anticipation
from streaming videos. 2019.
[57] Limin Wang, Yuanjun Xiong, Wang Zhe, and Qiao Yu. Towards good
practices for very deep two-stream convnets. Computer Science, 2015.
[58] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-
local neural networks. In CVPR, 2018.
[59] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad
Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, and
Klaus Macherey. Google’s neural machine translation system: Bridging
the gap between human and machine translation. 2016.
[60] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue.
Modeling spatial-temporal clues in a hybrid deep learning framework
for video classification. CoRR, abs/1504.01561, 2015.
[61] Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song,
Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. CUHK & ETHZ & SIAT submission to ActivityNet Challenge 2016.
[62] Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, and David J.
Crandall. Temporal recurrent networks for online action detection. In
ICCV, 2019.
[63] Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Xiaodong Xie, and Wen
Gao. Attention driven person re-identification. Pattern Recognition,
pages S0031320318303133–, 2018.
[64] Cao Yu, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, and
Wang Song. Recognize human activities from partially observed videos.
In CVPR, 2013.