Conference Paper

Selecting an Iconic Pose From an Action Video

Affiliation:
  • Rakuten Institute of Technology
References
Conference Paper
Full-text available
Multi-person articulated pose tracking in complex unconstrained videos is an important and challenging problem. In this paper, going along the road of top-down approaches, we propose a decent and efficient pose tracker based on pose flows. First, we design an online optimization framework to build associations of cross-frame poses and form pose flows. Second, a novel pose flow non-maximum suppression (NMS) is designed to robustly reduce redundant pose flows and re-link temporally disjoint pose flows. Extensive experiments show our method significantly outperforms the best reported results on two standard pose tracking datasets (the PoseTrack dataset and the PoseTrack Challenge dataset) by 13 mAP / 25 MOTA and 6 mAP / 3 MOTA, respectively. Moreover, when working on detected poses in individual frames, the extra computation of the proposed pose tracker is very minor, requiring only 0.01 second per frame.
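For intuition, a minimal sketch of the cross-frame association step that pose flows build on, assuming poses are given as (K, 2) keypoint arrays per frame; the greedy matching, the distance measure, and the max_dist threshold are simplifications of ours, not the paper's optimization framework.

```python
# Illustrative greedy cross-frame pose association (not the paper's exact
# optimization): link each pose in frame t to its nearest unused pose in
# frame t+1, forming segments of a pose flow.
import numpy as np

def mean_joint_distance(pose_a, pose_b):
    """Poses are (K, 2) arrays of keypoint coordinates."""
    return float(np.linalg.norm(pose_a - pose_b, axis=1).mean())

def link_poses(poses_t, poses_t1, max_dist=50.0):
    """Return a list of (index_in_t, index_in_t1) links between frames."""
    links, used = [], set()
    for i, p in enumerate(poses_t):
        best_j, best_d = None, max_dist
        for j, q in enumerate(poses_t1):
            if j in used:
                continue
            d = mean_joint_distance(p, q)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            links.append((i, best_j))
            used.add(best_j)
    return links
```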
Conference Paper
Full-text available
This paper addresses the problem of unsupervised video summarization, formulated as selecting a sparse subset of video frames that optimally represent the input video. Our key idea is to learn a deep summarizer network to minimize distance between training videos and a distribution of their summarizations, in an unsupervised way. Such a summarizer can then be applied on a new video for estimating its optimal summarization. For learning, we specify a novel generative adversarial framework, consisting of the summarizer and discriminator. The summarizer is the autoencoder long short-term memory network (LSTM) aimed at, first, selecting video frames, and then decoding the obtained summarization for reconstructing the input video. The discriminator is another LSTM aimed at distinguishing between the original video and its reconstruction from the summarizer. The summarizer LSTM is cast as an adversary of the discriminator, i.e., trained so as to maximally confuse the discriminator. This learning is also regularized for sparsity. Evaluation on four benchmark datasets, consisting of videos showing diverse events in first- and third-person views, demonstrates our competitive performance in comparison to fully supervised state-of-the-art approaches.
Conference Paper
Full-text available
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
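A minimal sketch of the sparse temporal sampling idea behind TSN, assuming frames are addressed by index: split the video into K segments and draw one snippet per segment. Function and argument names are ours.

```python
# Sparse temporal sampling: one snippet index per segment (random during
# training, the segment centre at test time).
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3, training=True):
    edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    indices = []
    for k in range(num_segments):
        lo, hi = edges[k], max(edges[k] + 1, edges[k + 1])
        if training:
            indices.append(int(np.random.randint(lo, hi)))
        else:
            indices.append((lo + hi - 1) // 2)
    return indices

# e.g. sample_snippet_indices(300, num_segments=3) -> one frame index per third
```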
Article
Full-text available
We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
Conference Paper
Full-text available
With nearly one billion online videos viewed everyday, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.
Article
Full-text available
Illustrating motion in still imagery for the purpose of summary, abstraction and motion description is important for a diverse spectrum of fields, ranging from arts to sciences. In this paper, we introduce a method that produces an action synopsis for presenting motion in still images. The method carefully selects key poses based on an analysis of a skeletal animation sequence, to facilitate expressing complex motions in a single image or a small number of concise views. Our approach is to embed the high-dimensional motion curve in a low-dimensional Euclidean space, where the main characteristics of the skeletal action are kept. The lower complexity of the embedded motion curve allows a simple iterative method which analyzes the curve and locates significant points, associated with the key poses of the original motion. We present methods for illustrating the selected poses in an image as a means to convey the action. We applied our methods to a variety of motions of human actions given either as 3D animation sequences or as video clips, and generated images that depict their synopsis.
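A loose illustration of the key-pose selection idea, assuming the motion is given as a (T, D) array of flattened skeletons per frame; PCA and the second-difference curvature proxy stand in for the paper's embedding and curve analysis.

```python
# Embed the high-dimensional pose sequence in a low-dimensional space and keep
# the frames where the embedded motion curve bends most sharply.
import numpy as np
from sklearn.decomposition import PCA

def key_pose_indices(pose_sequence, n_components=3, n_keys=5):
    """pose_sequence: (T, D) array, one flattened skeleton per frame."""
    curve = PCA(n_components=n_components).fit_transform(pose_sequence)
    # Discrete curvature proxy: magnitude of the second difference.
    bending = np.linalg.norm(np.diff(curve, n=2, axis=0), axis=1)
    # +1 aligns second-difference indices with the original frames.
    return np.sort(np.argsort(bending)[-n_keys:] + 1)
```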
Article
Full-text available
We propose a novel method for removing irrelevant frames from a video given user-provided frame-level labeling for a very small number of frames. We first hypothesize a number of windows which possibly contain the object of interest, and then determine which window(s) truly contain the object of interest. Our method enjoys several favorable properties. First, compared to approaches where a single descriptor is used to describe a whole frame, each window's feature descriptor has the chance of genuinely describing the object of interest; hence it is less affected by background clutter. Second, by considering the temporal continuity of a video instead of treating frames as independent, we can hypothesize the location of the windows more accurately. Third, by infusing prior knowledge into the patch-level model, we can precisely follow the trajectory of the object of interest. This allows us to largely reduce the number of windows and hence reduce the chance of overfitting the data during learning. We demonstrate the effectiveness of the method by comparing it to several other semi-supervised learning approaches on challenging video clips.
Chapter
This paper addresses the problem of video summarization. Given an input video, the goal is to select a subset of the frames to create a summary video that optimally captures the important information of the input video. With the large amount of videos available online, video summarization provides a useful tool that assists video search, retrieval, browsing, etc. In this paper, we formulate video summarization as a sequence labeling problem. Unlike existing approaches that use recurrent models, we propose fully convolutional sequence models to solve video summarization. We firstly establish a novel connection between semantic segmentation and video summarization, and then adapt popular semantic segmentation networks for video summarization. Extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our models.
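A toy fully convolutional sequence model in this spirit, assuming precomputed per-frame CNN features: 1-D convolutions over the feature sequence produce one importance score per frame. The layer sizes are arbitrary and not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """1-D fully convolutional network over a sequence of frame features."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, features):          # features: (batch, T, feat_dim)
        x = features.transpose(1, 2)      # -> (batch, feat_dim, T)
        return self.net(x).squeeze(1)     # -> (batch, T) frame scores

# scores = FrameScorer()(torch.randn(2, 120, 1024))  # one score per frame
```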
Conference Paper
Thumbnails play an important role in online videos. As the most representative snapshot, they capture the essence of a video and provide the first impression to the viewers; ultimately, a great thumbnail makes a video more attractive to click and watch. We present an automatic thumbnail selection system that exploits two important characteristics commonly associated with meaningful and attractive thumbnails: high relevance to video content and superior visual aesthetic quality. Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video. On the task of predicting thumbnails chosen by professional video editors, we demonstrate the effectiveness of our system against six baseline methods, using a real-world dataset of 1,118 videos collected from Yahoo Screen. In addition, we study what makes a frame a good thumbnail by analyzing the statistical relationship between thumbnail frames and non-thumbnail frames in terms of various image quality features. Our study suggests that the selection of a good thumbnail is highly correlated with objective visual quality metrics, such as the frame texture and sharpness, implying the possibility of building an automatic thumbnail selection system based on visual aesthetics.
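A rough sketch combining the two ingredients described above, assuming frames are available as BGR images: Laplacian-variance sharpness as a proxy for visual quality, and clustering of colour histograms as a proxy for representativeness. The metrics and cluster count are assumptions, not the paper's pipeline.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def pick_thumbnail(frames, n_clusters=5):
    """frames: list of BGR images; returns the index of the chosen frame."""
    sharpness, hists = [], []
    for f in frames:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        sharpness.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        h = cv2.calcHist([f], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hists.append(cv2.normalize(h, h).flatten())
    labels = KMeans(n_clusters=min(n_clusters, len(frames)),
                    n_init=10).fit_predict(np.array(hists))
    biggest = np.bincount(labels).argmax()        # most representative cluster
    candidates = np.where(labels == biggest)[0]
    return int(candidates[np.argmax(np.array(sharpness)[candidates])])
```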
Video summarization is a challenging problem with great application potential. Whereas prior approaches, largely unsupervised in nature, focus on sampling useful frames and assembling them as summaries, we consider video summarization as a supervised subset selection problem. Our idea is to teach the system to learn from human-created summaries how to select informative and diverse subsets, so as to best meet evaluation metrics derived from human-perceived quality. To this end, we propose the sequential determinantal point process (seqDPP), a probabilistic model for diverse sequential subset selection. Our novel seqDPP heeds the inherent sequential structures in video data, thus overcoming the deficiency of the standard DPP, which treats video frames as randomly permutable items. Meanwhile, seqDPP retains the power of modeling diverse subsets, essential for summarization. Our extensive results of summarizing videos from 3 datasets demonstrate the superior performance of our method, compared to not only existing unsupervised methods but also naive applications of the standard DPP model.
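For intuition, greedy MAP selection under a standard DPP kernel L, which trades off item quality against diversity via determinants; this is the vanilla DPP, not the sequential seqDPP model proposed in the paper.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedily pick k items maximising log det of the kernel submatrix."""
    selected = []
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best_i, best_gain = i, logdet
        if best_i is None:          # no item increases the determinant
            break
        selected.append(best_i)
    return selected
```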
Article
The explosive growth of video data in the modern era has set the stage for research in the field of video summarization, which attempts to abstract the salient frames in a video in order to provide an easily interpreted synopsis. Existing work on video summarization has primarily been static - that is, the algorithms require the summary length to be specified as an input parameter. However, video streams are inherently dynamic in nature: while some of them are relatively simple in terms of visual content, others are much more complex due to camera/object motion, changing illumination, cluttered scenes and low quality. This necessitates the development of adaptive summarization techniques, which adapt to the complexity of a video and generate a summary accordingly. In this paper, we propose a novel algorithm to address this problem. We pose summary selection as an optimization problem and derive an efficient technique that determines both the summary length and the specific frames to be selected through a single formulation. Our extensive empirical studies on a wide range of challenging, unconstrained videos demonstrate tremendous promise in using this method for real-world video summarization applications.
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
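The Adam update rule itself, written out in NumPy for reference; the default hyper-parameters follow the values recommended in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```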
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
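A schematic of the two-stream layout, assuming an RGB frame and a stack of optical-flow fields as inputs, with class scores fused by averaging; the tiny backbones here are placeholders, not the ConvNet architectures used in the paper.

```python
import torch
import torch.nn as nn

def small_backbone(in_channels, num_classes):
    """Placeholder classifier standing in for a full ConvNet backbone."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 7, stride=2, padding=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_stack=20):  # 10 flow frames x 2 channels
        super().__init__()
        self.spatial = small_backbone(3, num_classes)
        self.temporal = small_backbone(flow_stack, num_classes)

    def forward(self, rgb, flow):
        p_spatial = torch.softmax(self.spatial(rgb), dim=1)
        p_temporal = torch.softmax(self.temporal(flow), dim=1)
        return (p_spatial + p_temporal) / 2               # late fusion

# probs = TwoStream()(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```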
Article
This paper describes a new algorithm for identifying key frames in shots from video programs. We use optical flow computations to identify local minima of motion in a shot; stillness emphasizes the image for the viewer. This technique allows us to identify both gestures which are emphasized by momentary pauses and camera motion which links together several distinct images in a single shot. Results show that our algorithm can successfully select several key frames from a single complex shot which effectively summarize the shot.
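A direct sketch of this idea, assuming grayscale frames: compute a per-frame motion magnitude from dense optical flow and keep frames at local minima of that signal. Farneback flow and the simple three-point minimum test stand in for the paper's implementation.

```python
import cv2
import numpy as np

def motion_signal(gray_frames):
    """gray_frames: list of 8-bit grayscale images; returns mean flow magnitude per step."""
    mags = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=2).mean()))
    return np.array(mags)

def key_frame_indices(gray_frames):
    m = motion_signal(gray_frames)
    # Local minima of motion: stiller than both neighbouring steps.
    return [i for i in range(1, len(m) - 1) if m[i] < m[i - 1] and m[i] < m[i + 1]]
```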
Conference Paper
Existing techniques to automatically create image thumbnail(s) for videos are mostly based on low-level feature analysis of the video frames, such as color and motion information. However, these approaches do not contain semantic models of the underlying theme of the video, and as a result, the selected frames may not be semantically representative. To address this problem, we propose a theme-based keyframe selection algorithm that explicitly models the visual characteristics of the underlying video theme. This thematic model is constructed by finding the common features of relevant visual samples, which are obtained by querying a visual database with keywords associated with the video. Our initial testing on a set of videos shows promising results of our video thumbnail image selection method.
Conference Paper
With the fast rise of video sharing websites, online video has become an important medium for people to share messages, interests, ideas, beliefs, etc. In this paper, we propose a novel approach to dynamically generate web video thumbnails according to the user's query. Two issues are addressed: the video content representativeness of the selected video thumbnail, and the relationship between the selected video thumbnail and the user's query. For the first issue, a reinforcement-based algorithm is adopted to rank the frames in each video; for the second issue, a relevance-model-based method is employed to calculate the similarity between the video frames and the query keywords. The final video thumbnail is generated by a linear fusion of the two scores. Compared with existing web video thumbnails, which only reflect the preference of the video owner, the thumbnails generated by our approach not only consider the video content representativeness of the frame, but also reflect the intention of the video searcher. To show the effectiveness of the proposed method, experiments are conducted on videos selected from a video sharing website. Experimental results and subjective evaluations demonstrate that the proposed method is effective and can meet the user's intent.
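The fusion step reduced to its simplest form, assuming per-frame representativeness and query-relevance scores are already computed; the weight alpha is an assumed parameter.

```python
import numpy as np

def fuse_scores(representativeness, relevance, alpha=0.5):
    """Both inputs are length-T arrays in [0, 1]; returns the best frame index."""
    final = alpha * np.asarray(representativeness) + (1 - alpha) * np.asarray(relevance)
    return int(np.argmax(final))
```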
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
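Standard SIFT detection and ratio-test matching with OpenCV, as one concrete way to use the descriptor described above; requires opencv-python 4.4 or later, where SIFT is available as cv2.SIFT_create.

```python
import cv2

def sift_matches(img_a, img_b, ratio=0.75):
    """img_a, img_b: 8-bit grayscale images; returns keypoints and good matches."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher()
    good = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])        # Lowe's ratio test
    return kp_a, kp_b, good
```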
Article
The key frame is a simple yet effective form of summarizing a long video sequence. The number of key frames used to abstract a shot should be compliant with the visual content complexity within the shot, and the placement of key frames should represent the most salient visual content. Motion is the most salient feature in presenting actions or events in video and, thus, should be the feature used to determine key frames. We propose a triangle model of perceived motion energy (PME) to model motion patterns in video and a scheme to extract key frames based on this model. The frames at the turning points between motion acceleration and motion deceleration are selected as key frames. The key-frame selection process is threshold-free and fast, and the extracted key frames are representative.
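A simplified take on the triangle model, assuming a per-frame motion-energy signal is already available: select frames where motion switches from accelerating to decelerating (local peaks of the energy).

```python
import numpy as np

def turning_point_key_frames(motion_energy):
    """motion_energy: length-T array; returns indices of acceleration-to-deceleration turns."""
    e = np.asarray(motion_energy, dtype=float)
    d = np.diff(e)                              # frame-to-frame change in energy
    return [i for i in range(1, len(e) - 1) if d[i - 1] > 0 and d[i] <= 0]
```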
Video summarization using fully convolutional sequence networks
  • Mrigank Rochan
  • Linwei Ye
  • Yang Wang
RMPE: Regional multi-person pose estimation
  • Hao-Shu Fang
  • Shuqin Xie
  • Yu-Wing Tai
  • Cewu Lu
Deep keyframe detection in human action videos
  • Xiang Yan
  • Syed Zulqarnain Gilani
  • Hanlin Qin
  • Mingtao Feng
  • Liang Zhang
  • Ajmal Mian