Liang Wang

Chinese Academy of Sciences, Beijing, China

Publications (76) · 42.16 Total Impact Points

  • Ran He, Man Zhang, Liang Wang, Ye Ji, Qiyue Yin
    ABSTRACT: In multimedia applications, the text and image components in a web document form a pairwise constraint that potentially indicates the same semantic concept. This paper studies cross-modal learning via the pairwise constraint, and aims to find the common structure hidden in different modalities. We first propose a compound regularization framework to deal with the pairwise constraint, which can be used as a general platform for developing cross-modal algorithms. For unsupervised learning, we propose a cross-modal subspace clustering method to learn a common structure for different modalities. For supervised learning, to reduce the semantic gap and the outliers in pairwise constraints, we propose a cross-modal matching method based on compound ℓ2,1 regularization along with an iteratively reweighted algorithm to find the global optimum. Extensive experiments demonstrate the benefits of joint text and image modeling with semantically induced pairwise constraints, and show that the proposed cross-modal methods can further reduce the semantic gap between different modalities and improve the clustering/retrieval accuracy.
    11/2014;
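The compound ℓ2,1 regularizer mentioned above is built on the ℓ2,1 matrix norm. As a minimal illustration (not the authors' implementation), the norm sums the Euclidean norms of a matrix's rows, so penalizing it drives whole rows toward zero:

```python
import math

def l21_norm(M):
    """l2,1 norm of a matrix: the sum of the Euclidean norms of its rows.
    Penalizing it encourages row sparsity (entire rows become zero)."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)

M = [[3.0, 4.0],   # row norm 5
     [0.0, 0.0],   # row norm 0
     [1.0, 0.0]]   # row norm 1
print(l21_norm(M))  # 6.0
```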
  • ABSTRACT: The bag-of-words framework is one of the most successful models used in image classification, and within it, coding plays a very important role in the classification process. Many coding methods have been proposed to encode images in different ways. The relationship between different codewords has been studied, but the relationship among descriptors has not been fully explored. In this work, we aim to draw on the relationship between descriptors, and propose a new method that can be combined with other coding methods to improve performance. The basic idea is to encode each descriptor not only with its nearest codewords but also with the codewords of its nearest neighboring descriptors. Experiments on several benchmark datasets show that even this simple relationship between descriptors helps to improve coding methods.
    CCPR; 11/2014
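The baseline operation this line of work builds on, encoding a descriptor by its nearest codeword, can be sketched as plain vector quantization. This is a generic illustration, not the paper's neighbor-augmented method:

```python
import math

def hard_assign(descriptor, codebook):
    """Vector quantization: code a descriptor as a one-hot vector
    over its single nearest codeword (Euclidean distance)."""
    dists = [math.dist(descriptor, c) for c in codebook]
    code = [0.0] * len(codebook)
    code[dists.index(min(dists))] = 1.0
    return code

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
print(hard_assign([0.9, 1.1], codebook))  # [0.0, 1.0, 0.0]
```

The neighbor-based idea in the abstract would additionally combine this code with the codes of the descriptor's nearest neighboring descriptors.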
  • ABSTRACT: Feature coding and pooling are two critical stages in the widely used Bag-of-Features (BOF) framework for image classification. After coding, each local feature formulates its representation through the visual codewords. However, the two-dimensional feature-code layout is transformed into a one-dimensional codeword representation after pooling. The properties of individual local features are ignored and the whole representation is tightly coupled. To resolve this problem, we propose a hierarchical feature coding approach that regards each feature-code representation as a high-level feature. Codeword learning, coding and pooling are then applied to these new features, yielding a high-level representation of the image. Experiments on different datasets validate our analysis and demonstrate that the new representation is more discriminative than that of the previous BOF framework. Moreover, we show that various traditional feature coding algorithms can be easily embedded into our framework to achieve better performance.
    Neurocomputing 11/2014; 144:509–515. · 2.01 Impact Factor
  • ABSTRACT: Spatial information is an important cue for visual object analysis, and various studies in this field have been conducted. However, existing approaches are either too rigid or too fragile to efficiently utilize such information. In this paper, we propose to model the distribution of objects' local appearance patterns by using their co-occurrence at different spatial locations. To represent such a distribution, we propose a flexible framework called spatial feature co-pooling, with which the relations between patterns are discovered. As the final representation resulting from our framework is of high dimensionality, we propose a semi-greedy (SG) grafting algorithm to select the most discriminative features. Experimental results on the CIFAR-10, UIUC Sports and VOC 2007 datasets show that our method is effective and comparable with state-of-the-art algorithms.
    Neurocomputing 09/2014; 139:415–422. · 2.01 Impact Factor
  • ABSTRACT: We present a novel unsupervised fall detection system that employs acoustic signals (footstep sounds) collected from an elderly person's normal activities to construct a data description model that distinguishes falls from non-falls. The measured acoustic signals are first processed with a source separation (SS) technique to remove possible interference from other background sound sources. Mel-frequency cepstral coefficient (MFCC) features are then extracted from the processed signals and used to construct a data description model based on a one-class support vector machine (OCSVM), which is finally applied to distinguish fall from non-fall sounds. Experiments on a recorded dataset confirm that the proposed fall detection system achieves better performance than existing single-microphone methods, especially under a high level of interference from other sound sources.
    Signal Processing 08/2014; · 2.24 Impact Factor
  • Zhen Zhou, Li Zhong, Liang Wang
    ABSTRACT: Clustering methods are widely deployed in data mining and pattern recognition. Many of them require the number of clusters as input, which may not be practical when it is entirely unknown. Several existing visual methods for cluster tendency assessment can be used to estimate the number of clusters by displaying the pairwise dissimilarity matrix as an intensity image, in which objects are reordered to reveal the hidden data structure as dark blocks along the diagonal. A major limitation of these methods is that they are incapable of highlighting cluster structure with complex clusters. To address this problem, this paper proposes an effective approach based on Markov Random Fields, which updates each object with its local information dynamically and maximizes a global probability measure. The proposed method can determine the cluster tendency and partition the data simultaneously. Experimental results on synthetic and real-world datasets demonstrate its effectiveness.
    Neurocomputing 07/2014; 136:49–55. · 2.01 Impact Factor
  • ABSTRACT: Global spatial structure is an important factor in visual object recognition but has not attracted sufficient attention in recent studies. In particular, the problems of features' ambiguity and sensitivity to location change in the image space are not yet well solved. In this paper, we propose multiple spatial pooling (MSP) to address these problems. MSP models global spatial structure with multiple Gaussian distributions and then pools features according to the relations between features and these distributions. This process is further generalized into a unified framework, which formulates multiple pooling as a matrix operation with structured sparsity. Experiments on scene classification and object categorization demonstrate that MSP can enhance traditional algorithms at small extra computational cost.
    Neurocomputing 01/2014; 129:225–231. · 2.01 Impact Factor
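A minimal sketch of the pooling idea, assuming isotropic Gaussians and a hypothetical bandwidth: each feature code is weighted by its location's affinity to each Gaussian, and the per-Gaussian weighted sums are concatenated. This illustrates Gaussian-weighted pooling in general, not the paper's exact MSP formulation:

```python
import math

def gaussian_weight(loc, mean, sigma):
    """Unnormalized isotropic Gaussian weight of a 2-D location."""
    d2 = (loc[0] - mean[0]) ** 2 + (loc[1] - mean[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def multi_gaussian_pool(codes, locs, means, sigma=0.25):
    """Pool the feature codes once per Gaussian: each pooled vector is a
    weighted sum of codes, weighted by each location's affinity to that
    Gaussian. Returns the concatenation over all Gaussians."""
    pooled = []
    for m in means:
        w = [gaussian_weight(l, m, sigma) for l in locs]
        pooled.extend(sum(wi * c[k] for wi, c in zip(w, codes))
                      for k in range(len(codes[0])))
    return pooled

codes = [[1.0, 0.0], [0.0, 1.0]]   # one-hot codes of two features
locs = [(0.0, 0.0), (1.0, 1.0)]    # their normalized image locations
means = [(0.0, 0.0), (1.0, 1.0)]   # two Gaussian centers
print(multi_gaussian_pool(codes, locs, means))
```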
  • ABSTRACT: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. In the learning procedure, ℓ2,1-norm penalties are imposed on the two projection matrices separately, which selects relevant and discriminative features from the coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of connected data across modalities. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. Experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms state-of-the-art approaches.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
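The paper's own solver is based on half-quadratic minimization; as a generic illustration of how an ℓ2,1 penalty induces row sparsity, the standard proximal operator of t·ℓ2,1 shrinks each row's Euclidean norm by t:

```python
import math

def prox_l21(M, t):
    """Row-wise proximal operator of t * l2,1: shrink each row's
    Euclidean norm by t, zeroing rows whose norm falls below t."""
    out = []
    for row in M:
        n = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - t / n) if n > 0 else 0.0
        out.append([scale * x for x in row])
    return out

print(prox_l21([[3.0, 4.0], [0.1, 0.0]], 1.0))
# first row scaled by 4/5, second row zeroed
```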
  • ABSTRACT: In this paper, we propose a novel computer vision-based fall detection system for monitoring an elderly person in a home-care, assistive-living application. Initially, a single camera covering the full view of the room environment is used to record the elderly person's daily activities for a certain period. The recorded video is then manually segmented into short clips containing normal postures, which compose the normal dataset. We use the codebook background subtraction technique to extract human body silhouettes from the clips, and information from ellipse fitting and shape description, together with position information, provides the features describing the extracted posture silhouettes. An online one-class support vector machine (OCSVM) method is applied to these features to find the region in feature space that distinguishes normal daily postures from abnormal postures such as falls. The resulting OCSVM model can also be updated online to adapt to newly emerging normal postures, and certain rules are added to reduce the false alarm rate and thereby improve fall detection performance. Comprehensive experimental evaluations on datasets for 12 people confirm that our person-specific fall detection system can achieve excellent performance, with a 100% fall detection rate and only a 3% false detection rate under optimally tuned parameters. From a system perspective this is a semi-unsupervised fall detection system: although an unsupervised-type algorithm (OCSVM) is applied, human intervention is needed to segment and select the video clips containing normal postures. As such, our research represents a step toward a completely unsupervised fall detection system.
    IEEE Journal of Biomedical and Health Informatics 11/2013; 17(6):1002-14.
  • ABSTRACT: Recommender systems have become popular in recent years and are widely used in areas such as e-commerce and social networks. Many recommendation algorithms have been proposed, the most famous being collaborative filtering. Because collaborative filtering is vulnerable to profile injection attacks, some people attack the system for profit, promoting or demoting target items by injecting fake user profiles. Existing works mainly focus on detecting individual spammers, and few detect groups of spammers, even though such groups have a stronger influence on recommender systems than individuals. Moreover, none of them can simultaneously detect the items under attack. In this paper we propose a method that detects not only groups of spammers but also the correlated groups of items. The proposed method first finds candidate groups of attackers and candidate groups of target items. We then derive three features from the collusion attack phenomenon to detect groups of spammers and groups of target items. Finally, we use a rank aggregation technique to obtain the most likely group of spammers and the correlated group of target items. Experimental results demonstrate the effectiveness of the proposed method.
    Proceedings of the 2013 Fifth International Conference on Multimedia Information Networking and Security; 11/2013
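The abstract does not specify which rank aggregation technique is used; Borda count is one common choice, sketched here with hypothetical group identifiers:

```python
def borda_aggregate(rankings):
    """Combine several rankings of the same candidates with Borda count:
    each ranking awards n-1 points to its first item, n-2 to the second,
    and so on; candidates are returned sorted by total points."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda x: (-scores[x], x))

# Three feature-based rankings of suspect user groups (hypothetical ids)
print(borda_aggregate([["g1", "g2", "g3"],
                       ["g1", "g3", "g2"],
                       ["g2", "g1", "g3"]]))  # ['g1', 'g2', 'g3']
```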
  • ABSTRACT: Multivariate timeseries have become a popular data form for representing images and are used as suitable inputs to higher-level recognition processes. We present a novel cluster analysis based on timeseries structure to identify similar human motion sequences. To cluster the sequences, movement silhouettes from video are transformed into low-dimensional multivariate timeseries and then converted into vectors, based on their structure, in a finite-dimensional Euclidean space. The identification and selection of structural metrics for human motion sequences are highlighted to demonstrate that these statistical features are generic yet also problem-dependent. Various clustering algorithms are used to demonstrate the approach's effectiveness and simplicity on real data sets.
    Intelligent Data Analysis 11/2013; 17(6):1057-1074. · 0.47 Impact Factor
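The specific structural metrics are not listed in the abstract; as a hedged sketch, a univariate timeseries can be mapped to a fixed-length vector of typical structural statistics (mean, spread, linear trend), making sequences of different lengths comparable in Euclidean space:

```python
import math

def structural_features(ts):
    """Map a univariate timeseries to a small fixed-length vector of
    structural statistics: mean, standard deviation, and the
    least-squares slope against the time index 0..n-1."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ts) / n)
    t_mean = (n - 1) / 2.0
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(ts)) / denom
    return [mean, std, slope]

print(structural_features([0.0, 1.0, 2.0, 3.0]))  # mean 1.5, slope 1.0
```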
  • ABSTRACT: This paper investigates the problem of cross-modal retrieval, where users can search for results across various modalities by submitting a query in any modality. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different modalities of data remains a challenge. To address this problem, we propose a joint graph regularized multi-modal subspace learning (JGRMSL) algorithm, which integrates inter-modality and intra-modality similarities into a joint graph regularization to better explore the cross-modal correlation and the local manifold structure in each modality of data. To obtain good class separation, the idea of Linear Discriminant Analysis (LDA) is incorporated into the proposed method by maximizing the between-class covariance and minimizing the within-class covariance of all projected data. Experimental results on two public cross-modal datasets demonstrate the effectiveness of our algorithm.
    2013 2nd IAPR Asian Conference on Pattern Recognition (ACPR); 11/2013
  • Ran He, Tieniu Tan, Liang Wang
    ABSTRACT: Low-rank matrix recovery algorithms aim to recover a corrupted low-rank matrix with sparse errors. However, corruption may not be sparse in real-world problems, and the relationship between the L1 regularizer on noise and robust M-estimators has remained unclear. This paper proposes a general robust framework for low-rank matrix recovery via implicit regularizers of robust M-estimators, which are derived from convex conjugacy and can model arbitrarily corrupted errors. Based on the additive form of half-quadratic optimization, proximity operators of implicit regularizers are developed so that both the low-rank structure and the corrupted errors can be alternately recovered. In particular, the dual relationship between the absolute function in the L1 regularizer and the Huber M-estimator is studied, which links robust low-rank matrix recovery methods to M-estimator-based robust principal component analysis methods. Extensive experiments on synthetic and real-world datasets corroborate our claims and verify the robustness of the proposed framework.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 09/2013;
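The Huber M-estimator central to the duality result above is simple to state: quadratic near zero and linear in the tails, which is what ties it to the absolute-value (L1) regularizer. A minimal sketch:

```python
def huber(x, delta=1.0):
    """Huber M-estimator loss: quadratic for |x| <= delta,
    linear beyond, so large residuals are penalized less harshly
    than by a pure quadratic loss."""
    a = abs(x)
    if a <= delta:
        return 0.5 * x * x
    return delta * (a - 0.5 * delta)

print(huber(0.5))   # 0.125  (quadratic region)
print(huber(3.0))   # 2.5    (linear region)
```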
  • ABSTRACT: Describing videos efficiently is an important task for content-based web video retrieval. To solve this problem, we propose an unsupervised approach based on an undirected topic model to learn a compact topical descriptor upon the bag-of-words (BoW) video representation. In our method, words in a BoW are assumed to have different topic features, and the topical descriptor of an entire video is obtained by aggregating those features, which makes the descriptor contain information about the relative strength of topics. To improve the descriptor's interpretability, an L1 penalty is used to control the topical sparsity. Furthermore, efficient learning and inference algorithms are presented. We evaluate the proposed descriptor on the Columbia Consumer Video dataset. Experimental results demonstrate that compared with the BoW and other topical representations, the proposed compact descriptor has better performance in web video retrieval.
    2013 20th IEEE International Conference on Image Processing (ICIP); 09/2013
  • ABSTRACT: Sparse representation is one of the most influential frameworks for visual tracking. However, when applying this framework to real-world tracking applications, there are still many challenges such as appearance variations and background noise. In ...
    Pattern Recognition 07/2013; 46(7):1748–1749.
  • ABSTRACT: Image classification is a hot topic in computer vision and pattern recognition. Feature coding, a key component of image classification, has been widely studied over the past several years, and a number of coding algorithms have been proposed. However, there is no comprehensive study of the connections between different coding methods, especially of how they have evolved. In this paper, we first survey various feature coding methods, including their motivations and mathematical representations, and then explore their relations, based on which a taxonomy is proposed to reveal their evolution. Further, we summarize the main characteristics of current algorithms, each of which is shared by several coding strategies. Finally, we choose several representatives of the different kinds of coding approaches and empirically evaluate them with respect to codebook size and the number of training samples on several widely used databases (15-Scenes, Caltech-256, PASCAL VOC07, and SUN397). The experimental findings firmly justify our theoretical analysis, which is expected to benefit both practical applications and future research.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 06/2013;
  • ABSTRACT: Spatial information in images is considered to be of great importance in object recognition. Recent studies show that humans' classification accuracy may drop dramatically if the spatial information of an image is removed. The original bag-of-words (BoW) model is effectively a system simulating such a classification process with incomplete information. To handle spatial information, spatial pyramid matching (SPM) was proposed and has become the most widely used scheme for spatial modeling. Given an image, SPM divides it into a series of spatial blocks at several levels and concatenates the representations obtained separately within all the blocks. SPM greatly improves performance because it embeds spatial information into BoW. However, SPM ignores the relationships between the spatial blocks. To address this problem, we propose a new scheme based on a spatial graph, whose nodes correspond to the spatial blocks in SPM and whose edges correspond to the relationships between the blocks. Thorough experiments on several popular datasets verify the advantages of the proposed scheme.
    Proceedings of the 11th Asian conference on Computer Vision - Volume Part I; 11/2012
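The SPM block construction described above can be sketched as follows, assuming descriptor locations normalized to the unit square and hard codeword assignments already computed; this is a minimal illustration of the pyramid histograms, not the paper's graph extension:

```python
def spm_histograms(points, labels, k, levels=2):
    """Spatial-pyramid-style representation: for each pyramid level L,
    split the unit square into 2^L x 2^L blocks, build a k-bin codeword
    histogram per block, and concatenate everything.
    `points` are (x, y) in [0, 1); `labels` are codeword indices."""
    rep = []
    for level in range(levels):
        g = 2 ** level
        hists = [[0] * k for _ in range(g * g)]
        for (x, y), c in zip(points, labels):
            block = int(y * g) * g + int(x * g)  # row-major block index
            hists[block][c] += 1
        for h in hists:
            rep.extend(h)
    return rep

# Two descriptors, 2 codewords, levels 0 and 1 -> length 2 + 4*2 = 10
print(spm_histograms([(0.1, 0.1), (0.9, 0.9)], [0, 1], k=2))
# [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
```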
  • ABSTRACT: In the gait recognition field, template-based approaches such as the Gait Energy Image (GEI) and the Chrono-Gait Image (CGI) achieve good recognition performance at low computational cost, and CGI preserves temporal information better than GEI. However, both pay little attention to local shape features. To preserve temporal information and generate richer local shape features, we build multiple HOG templates by extracting Histograms of Oriented Gradients (HOG) from the GEI and CGI templates. Experiments show that, compared with several published approaches, the proposed multiple HOG templates achieve better gait recognition performance.
    Pattern Recognition (ICPR), 2012 21st International Conference on; 01/2012
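The core of a HOG-style template, a magnitude-weighted histogram of gradient orientations, can be sketched as below. This is a minimal illustration (no cells, blocks, or normalization), not the authors' full pipeline:

```python
import math

def orientation_histogram(img, bins=9):
    """Finite-difference gradients over a 2-D intensity grid, then a
    magnitude-weighted histogram of unsigned gradient orientations
    in [0, 180) degrees, as in the core of a HOG descriptor."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(ang / (180.0 / bins)), bins - 1)] += mag
    return hist

# A vertical edge produces purely horizontal gradients (orientation 0)
img = [[0, 0, 1, 1]] * 4
print(orientation_histogram(img))
```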
  • ABSTRACT: Recognizing objectionable content is drawing more and more attention given the rapid proliferation of images and videos on the Internet. Although there have been investigations into violent video detection and pornographic information filtering, very few existing methods touch on the problem of violence detection in still images. Given its potential use in violent webpage filtering, online public opinion monitoring and other applications, recognizing violence in still images is worth investigating in depth. To this end, we first establish a new database containing 500 violence images and 1500 non-violence images. We then use the Bag-of-Words (BoW) model, frequently adopted in image classification, to discriminate violence images from non-violence images. The effectiveness of four different feature representations is tested within the BoW framework. Finally, baseline results for violence image detection on our newly built database are reported.
    Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on; 01/2012
  • ABSTRACT: Hermeneutics, defined as the art of interpreting a message, centered its study for many centuries on theorizing the interpretation process in written texts, mainly biblical ones. The appearance of the first recordings of image sequences in the last period of the 19th century expanded this domain. The aesthetics of such videos interested great early 20th-century philosophers such as Ludwig Wittgenstein since, in that new format of communication, cinematographic texts became an interpretation game in which film language was articulated in a network of multiple readings. Hermeneutics in image sequences, or as we call it, Video-Hermeneutics (VH), involves explaining the subjective and social value of the human behavior observed in image sequences and, in general, of all multimedia content. So VH is not only concerned with what happens in a video, but also with understanding the meaning of what is being described, i.e. what message is being transmitted to us as human observers.
    Computer Vision and Image Understanding 01/2012; 116:305-306. · 1.23 Impact Factor

Publication Stats

510 Citations
42.16 Total Impact Points

Institutions

  • 2007–2014
    • Chinese Academy of Sciences
      • Institute of Automation
      • National Laboratory of Pattern Recognition
      Beijing, China
  • 2007–2013
    • Northeast Institute of Geography and Agroecology
      • Institute of Automation
      • National Laboratory of Pattern Recognition
      Beijing, China
  • 2007–2011
    • University of Melbourne
      • Department of Electrical and Electronic Engineering
      Melbourne, Victoria, Australia
  • 2010
    • Southeast University (China)
      • School of Computer Science and Engineering
      Nanjing, Jiangsu, China
    • Nanyang Technological University
      • School of Computer Engineering
      Singapore, Singapore
    • University of Bath
      • Department of Computer Science
      Bath, England, United Kingdom
  • 2009
    • Deakin University
      • School of Information Technology
      Geelong, Victoria, Australia
  • 2008
    • Victoria University Melbourne
      Melbourne, Victoria, Australia
  • 2006–2008
    • Monash University (Australia)
      • Department of Electrical and Computer Systems Engineering, Clayton
      • Intelligent Robotics Research Centre (IRRC)
      Melbourne, Victoria, Australia