Deep ConvNets have shown strong performance in image classification tasks.
However, learning deep video representations for action recognition remains an
open problem. The problem arises from two aspects: on one hand, current video
ConvNets are relatively shallow compared with image ConvNets, which limits
their capability to capture complex action information; on the other hand, the
temporal information of videos is not properly exploited to pool and encode
the video sequences. To address these issues, in this paper we utilize two
state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet) and the
temporal net from Two-Stream ConvNets, for action representation. Feature maps
from the convolutional layers and from a newly proposed layer, called the
frame-diff layer, are extracted and pooled with two temporal pooling
strategies: trajectory pooling and line pooling. The pooled local descriptors
are then encoded with VLAD to form the video representations.
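As a rough illustration of this pipeline (not the authors' implementation),
the sketch below computes a frame-diff over a stack of conv feature maps and
VLAD-encodes the resulting local descriptors with NumPy. The map sizes, the
codebook size, and the simple per-position flattening that stands in for
trajectory/line pooling are all assumptions made for the toy example.

```python
import numpy as np

np.random.seed(0)
# Assumed toy shapes: 8 frames of 7x7 conv maps with 16 channels (T, H, W, C).
maps = np.random.randn(8, 7, 7, 16).astype(np.float32)

def frame_diff(feature_maps):
    """Frame-diff layer as described: differences of consecutive frames' maps."""
    return feature_maps[1:] - feature_maps[:-1]

def vlad_encode(descriptors, centers):
    """VLAD: sum residuals to each descriptor's nearest codebook center."""
    # Hard-assign each descriptor to its nearest center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    K, D = centers.shape
    vlad = np.zeros((K, D), dtype=descriptors.dtype)
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    # Intra-normalization per center, then global L2 normalization.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = np.divide(vlad, norms, out=np.zeros_like(vlad), where=norms > 0)
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)

diffs = frame_diff(maps)                    # (7, 7, 7, 16) motion-like maps
descs = diffs.reshape(-1, diffs.shape[-1])  # each spatial position -> descriptor
centers = np.random.randn(32, 16).astype(np.float32)  # stand-in k-means codebook
video_repr = vlad_encode(descs, centers)    # fixed-length vector, (32*16,) = (512,)
print(video_repr.shape)
```

In practice the codebook would be learned with k-means on training descriptors,
and the pooling step would sample descriptors along tracked trajectories or
lines rather than flattening every spatial position.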
the video representations. In order to verify the effectiveness of the proposed
framework, we conduct experiments on UCF101 and HMDB51 datasets. It achieves
the accuracy of 93.78\% on UCF101 which is the state-of-the-art and the
accuracy of 65.62\% on HMDB51 which is comparable to the state-of-the-art.