About
54 Publications
13,811 Reads
8,386 Citations
Additional affiliations
December 2011 - present
August 2012 - present
September 2006 - July 2012
Publications (54)
This paper addresses the problem of recognizing human interactions from videos. We propose a novel approach that recognizes human interactions by the learned high-level descriptions, interactive phrases. Interactive phrases describe motion relationships between interacting people. These phrases naturally exploit human knowledge and allow us to cons...
This paper addresses the problem of recognizing human interactions with close physical contact from videos. Due to ambiguities in feature-to-person assignments and frequent occlusions in close interactions, it is difficult to accurately extract the interacting people. This degrades the recognition performance. We therefore propose a hierarchical mo...
This paper proposes a novel approach to action recognition from RGB-D cameras, in which depth features and RGB visual features are jointly used. Rich heterogeneous RGB and depth data are effectively compressed and projected to a learned shared space, in order to reduce noise and capture useful information for recognition. Knowledge from various sou...
In this paper, we present a novel approach for human interaction recognition from videos. We introduce high-level descriptions called interactive phrases to express binary semantic motion relationships between interacting people. Interactive phrases naturally exploit human knowledge to describe interactions and allow us to construct a more descript...
Zero-shot learning (ZSL) typically explores a shared semantic space in order to recognize novel categories in the absence of any labeled training data. However, traditional ZSL methods suffer from a serious domain shift problem in human action recognition. This is because: 1) existing ZSL methods are specifically designed for object recogn...
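For orientation, here is a minimal sketch of the shared-semantic-space setup that ZSL methods of this kind build on: visual features are mapped into an attribute space by a learned projection (the matrix `W` below is a hypothetical placeholder) and an unseen class is predicted by its nearest attribute prototype. This is the generic baseline the abstract contrasts with, not the authors' method.

```python
import numpy as np

def zsl_predict(x_visual, W, class_attributes):
    """Generic zero-shot prediction: project a visual feature into the
    shared semantic (attribute) space and pick the nearest class prototype.

    x_visual:          (d,) visual feature of a test sample
    W:                 (k, d) learned projection into the attribute space
    class_attributes:  dict {class_name: (k,) attribute vector} for unseen classes
    """
    z = W @ x_visual                                           # embed into the semantic space
    names = list(class_attributes)
    protos = np.stack([class_attributes[n] for n in names])    # (C, k) class prototypes
    # cosine similarity against every unseen-class prototype
    sims = (protos @ z) / (np.linalg.norm(protos, axis=1) * np.linalg.norm(z) + 1e-8)
    return names[int(np.argmax(sims))]
```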
Recently, the deep convolutional neural network (CNN) has achieved great success for image restoration (IR) and provided hierarchical features at the same time. However, most deep CNN based IR models do not make full use of the hierarchical features from the original low-quality images, thereby resulting in relatively low performance. In this work, we...
Microsoft’s Kinect sensors are receiving an increasing amount of interest from security researchers since they are cost-effective and can provide both visual and depth modality data at the same time. Unfortunately, depth or RGB modalities are unavailable in training or testing procedures in some realistic scenarios. Therefore, we explore a new proble...
The convolutional neural network (CNN) has recently achieved great success for image restoration (IR) and also offered hierarchical features. However, most deep CNN based IR models do not make full use of the hierarchical features from the original low-quality images, thereby achieving relatively low performance. In this paper, we propose a novel residual de...
Different from after-the-fact action recognition, the action prediction task requires action labels to be predicted from partially observed videos containing incomplete action executions. It is challenging because these partial videos have insufficient discriminative information, and their temporal structure is damaged. We study this problem in this pa...
Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition is to infer human actions (present state) based upon complete action execution...
A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep CNN based SR models do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively low performance. In this pap...
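As a rough illustration of the kind of building block such hierarchical-feature SR networks use, here is a hedged PyTorch sketch of one residual dense block (densely connected convolutions, local feature fusion with a 1x1 convolution, and a local residual connection). The channel count, growth rate, and layer count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """One residual dense block: densely connected 3x3 convs, local feature
    fusion (1x1 conv), and a local residual connection back to the input."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth                      # dense connectivity grows the input width
        self.fusion = nn.Conv2d(in_ch, channels, kernel_size=1)   # local feature fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))          # reuse all earlier features
        return x + self.fusion(torch.cat(feats, dim=1))           # local residual learning
```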
In this paper, we present a novel two-layer video representation for human action recognition employing hierarchical group sparse encoding technique and spatio-temporal structure. In the first layer, a new sparse encoding method named locally consistent group sparse coding (LCGSC) is proposed to make full use of motion and appearance information of...
This paper addresses the problem of geotagging images, i.e., assigning GPS coordinates (i.e., latitude, longitude) to images using image contents. Due to the huge appearance variability of visual features across the world, image content and their GPS coordinates may be inconsistent. This means images captured from geographically close areas may app...
In deep learning scenarios, a large number of labeled samples is needed to train the models. However, in practical application fields, since the objects to be recognized are complex and non-uniformly distributed, it is difficult to obtain enough labeled samples at one time. Active learning can actively improve the accuracy with fewer training labels, which is...
Visual tracking has achieved remarkable success in recent decades, but it remains a challenging problem due to appearance variations over time and complex cluttered background. In this paper, we adopt a tracking-by-verification scheme to overcome these challenges by determining the patch in the subsequent frame that is most similar to the target te...
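To make the verification idea concrete (find the patch in the next frame most similar to the target template), here is a toy OpenCV sketch based on plain normalized cross-correlation; the verifier in the paper itself is learned, so this is only a conceptual stand-in.

```python
import cv2

def verify_best_patch(next_frame_gray, template_gray):
    """Toy 'verification' step: slide the target template over the next frame
    and return the top-left corner of the most similar patch (NCC score)."""
    scores = cv2.matchTemplate(next_frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    return max_loc, max_val   # (x, y) of the best match and its similarity score
```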
Classifying human actions from varied views is challenging due to huge data variations in different views. The key to this problem is to learn discriminative view-invariant features robust to view variations. In this work, we address this problem by learning view-specific and view-shared features using novel deep models. View-specific features capt...
We propose a novel approach, max-margin heterogeneous information machine (MMHIM), for human action recognition from RGB-D videos. MMHIM fuses heterogeneous RGB visual features and depth features, and learns effective action classifiers using the fused features. Rich heterogeneous visual and depth data are effectively compressed and projected to a...
In this paper, we consider the problem of learning multiple related tasks simultaneously with the goal of improving the generalization performance of individual tasks. The key challenge is to effectively exploit the shared information across multiple tasks as well as preserve the discriminative information for each individual task. To address this,...
After fully observing the entire video, action recognition approaches will classify the video observation into one of the action categories. It should be noted that certain real-world applications (e.g., vehicle accident and criminal activity) do not allow the luxury of waiting for the entire action to be executed. Reactions must be performed in a...
Recognizing human activities is a fundamental problem in the computer vision community and is a key step toward the automatic understanding of scenes.
There are now billions of images stored on photo sharing websites. These images contain visual cues that reflect the geographical location of where the photograph was taken (e.g., New York City). Linking visual features in images to physical locations has many potential applications, such as tourism recommendation systems. However, the size and nat...
In this paper, we present our solution to the MS-Celeb-1M Challenge. This challenge aims to recognize 100k celebrities at the same time. The huge number of celebrities is the bottleneck for training a deep convolutional neural network of which the output is equal to the number of celebrities. To solve this problem, an independent softmax model is p...
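A hedged PyTorch sketch of the independent-softmax idea as the abstract frames it: the very wide output layer is split into several smaller heads, each trained on a disjoint subset of the identities. The number of heads and their sizes below are illustrative assumptions.

```python
import torch.nn as nn

class IndependentSoftmaxHeads(nn.Module):
    """Split a very large classification layer into several independent
    softmax heads, each responsible for a disjoint subset of the classes."""
    def __init__(self, feature_dim=512, classes_per_head=(25000, 25000, 25000, 25000)):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, n) for n in classes_per_head])

    def forward(self, features):
        # Each head produces its own logits; at test time the per-head
        # probabilities can be compared to pick the overall identity.
        return [head(features) for head in self.heads]
```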
Rooted in a basic hypothesis that a data matrix is strictly drawn from some independent subspaces, the low-rank representation (LRR) model and its variations have been successfully applied in various image classification tasks. However, this hypothesis is very restrictive for the LRR model, as it cannot always be guaranteed in real images. Moreover, the...
This paper addresses the problem of recognizing human actions from RGB-D videos. A discriminative relational feature learning method is proposed for fusing heterogeneous RGB and depth modalities, and classifying actions in RGB-D sequences. Our method factorizes the feature matrix of each modality, and enforces the same semantics for them in order t...
For human action recognition task, the traditional methods are based on the RGB data captured by webcam, while the popular RGB action databases\footnote{In this work, RGB action database means the one that is captured by a conventional RGB camera. Only RGB data are available in the RGB action database; depth data are not available. RGB-D action dat...
Human action recognition is an important and challenging task due to the intra-class variation and complexity of actions caused by diverse styles and durations of the performed actions. Previous works mostly concentrate on either depth or RGB data to build an understanding about the shape and movement cues in videos but fail to simultaneously utilize...
The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this p...
Action recognition is a challenging task due to intra-class motion variation caused by diverse styles and durations in performed action videos. Previous works on the action recognition task focus more on hand-crafted features, treat different sources of information independently, and simply combine them before classification. In this paper we study...
This paper proposes a method to compensate RGB-D images from the original target RGB images by transferring the depth knowledge of source data. Conventional RGB databases (e.g., UT-Interaction database) do not contain depth information since they are captured by the RGB cameras. Therefore, the methods designed for RGB databases cannot take advantag...
The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this p...
This paper addresses the problem of recognizing human interactions with close physical contact from videos. Different from conventional human interaction recognition, recognizing close interactions faces the problems of ambiguities in feature-to-person assignments and frequent occlusions. Therefore, it is infeasible to accurately extract the intera...
This study addresses the problem of recognising human interactions between two people. The main difficulties lie in the partial occlusion of body parts and the motion ambiguity in interactions. The authors observed that the interdependencies existing at both the action level and the body part level can greatly help disambiguate similar individual m...
Can a machine tell us if an image was taken in Beijing or New York? Automated identification of the geographical coordinates based on image content is of particular importance to data mining systems, because geolocation provides a large source of context for other useful features of an image. However, successful localization of unannotated images r...
In this paper, we address the problem of recognizing human actions from videos. Most of the existing approaches employ low-level features (e.g., local features and global features) to represent an action video. However, algorithms based on low-level features are not robust to complex environments such as cluttered background, camera movement and il...
This paper presents a novel random forest based method to build mid-level features describing spatial and temporal structure information for activity recognition. Our model consists of two separate parts, spatial part and temporal part, which are employed to capture the distinctive characteristics in spatial and temporal domains of activity analysi...
Recognizing human interactions is a challenging task due to partially occluded body parts and motion ambiguities in interactions. We observe that the interdependencies existing at both action level and body part level greatly help disambiguate similar individual movements and facilitate human interaction recognition. In this paper, we propose a nov...
This paper presents a novel random forest learning framework to construct a discriminative and informative mid-level feature from low-level features. Since a single low-level feature based representation is not enough to capture the variations of human appearance, multiple low-level features (i.e., optical flow and histogram of gradient 3D features...
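One common way to turn a trained random forest into a mid-level feature, sketched with scikit-learn: each low-level descriptor is passed through the forest and encoded by the leaves it reaches in every tree. This generic construction is offered only for intuition; it is not the exact formulation of the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

def forest_midlevel_features(X_low, y, X_query):
    """Map low-level descriptors (e.g. HOG3D / optical-flow features) to a
    mid-level representation using the leaf indices of a random forest."""
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_low, y)
    leaves_train = forest.apply(X_low)      # (n_samples, n_trees) leaf indices
    leaves_query = forest.apply(X_query)
    enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_train)
    return enc.transform(leaves_query)      # sparse binary mid-level feature
```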
In this paper, we address the problem of representing objects using contours for the purpose of recognition. We propose a novel segmentation method for integrating a new contour matching energy into level set based segmentation schemes. The contour matching energy is represented by major components of Elliptic Fourier shape descriptors and serves a...
In this paper, we address the problem of learning objects by contours. Toward this goal, we propose a novel curve evolution scheme which provides the classifier with more accurate contour representations. We detect edgelet feature to help localize objects in images so that the proposed evolution method can achieve more reliable contours. To capture...
In this paper, we address the problem of recognizing human interaction of two persons from videos. We fuse global and local features to build a more expressive and discriminative action representation. The representation based on multiple features is robust to motion ambiguity and partial occlusion in interactions. Moreover, action context informat...
Learning a compact and yet discriminative codebook is an important procedure for local feature-based action recognition. A common procedure involves two independent phases: reducing the dimensionality of local features and then performing clustering. Since the two phases are disconnected, dimensionality reduction does not necessarily capture the di...
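For reference, the conventional two-phase pipeline the abstract argues against can be sketched in a few lines of scikit-learn: PCA reduces the local-feature dimensionality, K-means then builds the codebook on the reduced features, and nothing ties the two steps together. The sizes below are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def two_phase_codebook(local_features, n_components=64, codebook_size=1000):
    """Baseline codebook construction: dimensionality reduction and clustering
    run as two disconnected phases (the setup the paper improves on)."""
    reduced = PCA(n_components=n_components).fit_transform(local_features)
    codebook = KMeans(n_clusters=codebook_size, n_init=10).fit(reduced)
    return codebook.cluster_centers_        # the visual words
```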
Visual codebook has been popular in object classification as well as action analysis. However, its performance is often sensitive to the codebook size that is usually predefined. Moreover, the codebook generated by unsupervised methods, e.g., K-means, often suffers from the problem of ambiguity and weak efficiency. In other words, the visual codebo...
Learning a compact and yet discriminative codebook for classifying human actions is a challenging problem. One difficulty lies in the fact that the learning procedure is split into two independent phases (dimension reduction and clustering), which results in the loss of discriminative information that clustering requires. Besides, the traditionally used princi...
This paper proposes an annealed particle swarm optimization based particle filter algorithm for articulated 3D human body tracking. In our algorithm, a sampling covariance and an annealing factor are incorporated into the velocity updating equation of particle swarm optimization (PSO). The sampling covariance and the annealing factor are initiated...
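To make the quantities named in the abstract concrete, here is a hedged numpy sketch of a PSO velocity update with an annealing factor and Gaussian sampling noise. The exact update rule, annealing schedule, weights, and covariance used in the paper are not reproduced; everything below is illustrative.

```python
import numpy as np

def annealed_pso_velocity(v, x, p_best, g_best, anneal, cov,
                          inertia=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO velocity update with an annealing factor and sampling noise.

    v, x, p_best, g_best : (d,) velocity, position, personal best, global best
    anneal               : scalar annealing factor in (0, 1]
    cov                  : (d, d) sampling covariance for the diffusion term
    """
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    v_new = (inertia * v
             + c1 * r1 * (p_best - x)       # cognitive pull toward the personal best
             + c2 * r2 * (g_best - x))      # social pull toward the global best
    noise = rng.multivariate_normal(np.zeros(len(v)), cov)
    return anneal * (v_new + noise)         # annealing damps the overall step
```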
Group action recognition is a challenging task in computer vision due to the large complexity induced by multiple motion patterns. This paper aims at analyzing group actions in video clips containing several activities. We combine the probability summation framework with the space-time (ST) interest points for this task. First, ST interest points a...
This paper proposes a local motion-based approach for recognizing group activities in soccer videos. Given the SIFT keypoint matches on two successive frames, we propose a simple but effective method to group these keypoints into the background point set and the foreground point set. The former one is used to estimate camera motion and the latter o...
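One standard way to realize the background/foreground split described here is to fit a homography to the SIFT matches with RANSAC and treat inliers as background (camera motion) and outliers as foreground (local motion). The OpenCV sketch below follows that generic recipe; the paper's own grouping rule may differ.

```python
import cv2
import numpy as np

def split_background_foreground(frame1_gray, frame2_gray):
    """Match SIFT keypoints between two successive frames and separate them
    into a background set (homography inliers, used for camera motion) and a
    foreground set (outliers, used for local/player motion)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame1_gray, None)
    kp2, des2 = sift.detectAndCompute(frame2_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]          # Lowe's ratio test

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)

    background = [m for m, keep in zip(matches, inlier_mask.ravel()) if keep]
    foreground = [m for m, keep in zip(matches, inlier_mask.ravel()) if not keep]
    return H, background, foreground
```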
Group action recognition in soccer videos is a challenging problem due to the difficulties of group action representation and camera motion estimation. This paper presents a novel approach for recognizing group action with a moving camera. In our approach, ego-motion is estimated by the Kanade-Lucas-Tomasi feature sets on successive frames....