Yan-Tao Zheng

Institute for Infocomm Research, Singapore


Publications (22) · 9.9 Total impact

  • ABSTRACT: Graph-based Semi-Supervised Learning (SSL) methods are widely used SSL methods due to their high accuracy. They meet the manifold assumption well, at high computational cost, but do not meet the cluster assumption. In this paper, we propose a Semi-supervised learning via SPArse (SSPA) model. Since SSPA uses sparse matrix multiplication to depict the adjacency relations among samples, SSPA can approximate the low-dimensional manifold structure of samples with lower computational complexity than these graph-based SSL methods. Each column of this sparse matrix corresponds to the sparse representation of one sample. The rationale is that the inner product of sparse representations can also be sparse under certain constraints. Since the dictionary in the SSPA model can depict the distribution of the entire sample set, the sparse representation of a sample encodes its spatial location information. Therefore, in the SSPA model the manifold structure of samples is computed via their locations in the intrinsic geometry of the distribution instead of their feature vectors. In order to meet the cluster assumption, we propose a structured dictionary learning algorithm to explicitly reveal the cluster structure of the dictionary. We develop SSPA algorithms with both the structured and the non-structured dictionary, and experiments show that our methods are efficient and outperform state-of-the-art graph-based SSL methods.
    Neurocomputing 01/2014; 131:124–131. · 1.63 Impact Factor
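The core SSPA idea above — pairwise affinities computed as inner products of sparse codes over a shared dictionary, which stay sparse because two samples interact only when they share dictionary atoms — can be sketched with a hypothetical toy snippet (not the paper's implementation; the dict-of-coefficients storage is an illustrative choice):

```python
# Toy sketch of affinity from sparse codes: each sample's sparse code over a
# shared dictionary is a dict {atom_index: coefficient}; W[i][j] = <s_i, s_j>
# is non-zero only when samples i and j share dictionary atoms.

def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b[k] for k, v in a.items() if k in b)

def adjacency(codes):
    """Pairwise affinity matrix W = S^T S from a list of sparse codes."""
    n = len(codes)
    return [[sparse_dot(codes[i], codes[j]) for j in range(n)] for i in range(n)]

codes = [{0: 1.0, 2: 0.5}, {0: 0.5}, {1: 1.0}]  # samples 0 and 2 share no atom
W = adjacency(codes)
```

Because only overlapping atoms contribute, most entries of W are exactly zero, which is what gives the method its lower complexity relative to dense graph construction.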
  • ABSTRACT: GPS devices have been widely used in automobiles to compute navigation routes to destinations. The generated driving route targets the minimal traveling distance, but neglects the sightseeing experience of the route. In this study, we propose an augmented GPS navigation system, GPSView, to incorporate a scenic factor into the routing. The goal of GPSView is to plan a driving route with scenery and sightseeing qualities, and therefore allow travelers to enjoy sightseeing on the drive. To do so, we first build a database of scenic roadways with vistas of landscapes and sights along the roadside. Specifically, we adapt an attention-based approach to exploit community-contributed GPS-tagged photos on the Internet to discover scenic roadways. The premise is: a multitude of photos taken along a roadway imply that this roadway is probably appealing and catches the public's attention. By analyzing the geospatial distribution of photos, the proposed approach discovers the roadside sight spots, or Points-Of-Interest (POIs), which have good scenic qualities and visibility to travelers on the roadway. Finally, we formulate scenic driving route planning as an optimization task towards the best trade-off between sightseeing experience and traveling distance. Testing in the northern California area shows that the proposed system can deliver promising results.
    ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP). 02/2013; 9(1).
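One simple way to realize a distance/scenery trade-off of the kind GPSView describes is a shortest-path search whose edge cost discounts distance by a weighted scenic score. The sketch below is purely illustrative (the cost function, the lam parameter, and the toy road graph are assumptions, not the paper's formulation):

```python
import heapq

# Illustrative sketch: scenic route planning as Dijkstra shortest path, where
# each road segment's cost = distance - lam * scenic_score, clipped positive
# so Dijkstra's non-negative-weight assumption still holds.

def scenic_route(graph, start, goal, lam=0.5):
    """graph: {node: [(neighbor, distance, scenic_score), ...]}"""
    pq = [(0.0, start, [start])]
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == goal:
            return path, cost
        if node in seen:
            continue
        seen.add(node)
        for nbr, dist, scenic in graph.get(node, []):
            edge_cost = max(dist - lam * scenic, 0.01)  # keep weights positive
            heapq.heappush(pq, (cost + edge_cost, nbr, path + [nbr]))
    return None, float("inf")

roads = {
    "A": [("B", 2.0, 0.0), ("C", 3.0, 4.0)],  # A->C is longer but scenic
    "B": [("D", 2.0, 0.0)],
    "C": [("D", 3.0, 4.0)],
}
path, cost = scenic_route(roads, "A", "D", lam=0.5)
```

With lam=0.5 the scenic detour A-C-D wins; with lam=0 the search degenerates to plain shortest distance (A-B-D), which matches the trade-off the abstract describes.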
  • ABSTRACT: Recent research has discovered that leveraging ontology is an effective way to facilitate semantic video concept detection. As an explicit knowledge representation, a formal ontology definition usually consists of a lexicon, properties, and relations. In this paper, we present a comprehensive representation scheme for video semantic ontology in which all three components are well studied. Specifically, we leverage LSCOM to construct the concept lexicon, describe concept properties as the weights of different modalities, obtained manually or by a data-driven approach, and model two types of concept relations (i.e., pairwise correlation and hierarchical relation). In contrast with most existing ontologies, which focus on only one or two components for domain-specific videos, the proposed ontology is more comprehensive and general. To validate the effectiveness of this ontology, we further apply it to video concept detection. The experiments on the TRECVID 2005 corpus have demonstrated a superior performance compared to existing key approaches to video concept detection.
    Neurocomputing. 10/2012; 95:29–39.
  • ABSTRACT: Domain-adaptive video concept detection and annotation has recently received significant attention, but in existing video adaptation processes, all features are treated as one modality, while multi-modality, a unique and important property of video data, is typically ignored. To fill this gap, we propose a novel approach, named multi-modality transfer based on multi-graph optimization (MMT-MGO), which leverages multi-modality knowledge generalized by auxiliary classifiers in the source domains to assist multi-graph optimization (a graph-based semi-supervised learning method) in the target domain for video concept annotation. To the best of our knowledge, this is the first work to introduce multi-modality transfer into the field of domain-adaptive video concept detection and annotation. Moreover, we propose an efficient incremental extension scheme to sequentially estimate a small batch of newly emerging data without modifying the structure of the multi-graph scheme. The proposed scheme achieves accuracy comparable to that of a brand-new round of optimization, which combines the new data with the data corpus for the next round of optimization, while greatly reducing estimation time. Extensive experiments over the TRECVID 2005-2007 data sets demonstrate the effectiveness of both the multi-modality transfer scheme and the incremental extension scheme.
    Neurocomputing. 10/2012; 95:11–21.
  • ABSTRACT: In recent years, bag-of-words (BoW) video representations have achieved promising results in human action recognition in videos. By vector quantizing local spatial-temporal (ST) features, the BoW video representation brings simplicity and efficiency, but also limitations. First, the discretization of the feature space in BoW inevitably results in ambiguity and information loss in the video representation. Second, there exists no universal codebook for the BoW representation; the codebook needs to be rebuilt when the video corpus changes. To tackle these issues, this paper explores a localized, continuous and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of an ensemble of local ST features of a video into a distribution estimated by a generative probabilistic model. Furthermore, the probabilistic video representation naturally gives rise to an information-theoretic distance metric between videos. This makes the representation readily applicable to most discriminative classifiers, such as nearest neighbor schemes and kernel-based classifiers. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach delivers promising results. Keywords: Human action recognition; Probabilistic video representation; Information-theoretic video matching
    Multimedia Tools and Applications 06/2012; · 1.01 Impact Factor
  • 03/2012; ISBN: 978-953-51-0216-8
  • Sheng Tang, Yan-Tao Zheng, Yu Wang, Tat-Seng Chua
    ABSTRACT: This work presents a novel sparse ensemble learning scheme for concept detection in videos. The proposed ensemble first exploits a sparse non-negative matrix factorization (NMF) process to represent data instances in parts and partition the data space into localities, and then coordinates the individual classifiers in each locality for final classification. In the sparse NMF, data exemplars are projected onto a set of locality bases, in which the non-negative superposition of basis images reconstructs the original exemplars. This additive combination ensures that each locality captures the characteristics of data exemplars in part, thus enabling the local classifiers to hold reasonable diversity in their own regions of expertise. More importantly, the sparse NMF ensures that an exemplar is projected onto only a few bases (localities) with non-zero coefficients. The resultant ensemble model is therefore sparse, in that only a small number of efficient classifiers in the ensemble fire on a testing sample. Extensive tests on the TRECVid 08 and 09 datasets show that the proposed ensemble learning achieves promising results and outperforms existing approaches. The proposed scheme is feature-independent and can be applied to many other large-scale pattern recognition problems besides visual concept detection.
    IEEE Transactions on Multimedia 01/2012; 14:43-54. · 1.75 Impact Factor
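The "sparse firing" behavior described above — only the classifiers of localities with non-zero projection coefficients are consulted, weighted by those coefficients — can be sketched as follows (a hypothetical toy, not the paper's code; the threshold classifiers and the coefficient-weighted averaging are illustrative assumptions):

```python
# Toy sketch of a sparse classifier ensemble: a test sample fires only the
# local classifiers of the few localities (NMF bases) onto which it projects
# with non-zero coefficients, and their outputs are coefficient-weighted.

def sparse_ensemble_score(coeffs, classifiers, x):
    """coeffs: sparse projection {locality: weight}; classifiers: {locality: fn}."""
    total = sum(coeffs.values())
    if total == 0:
        return 0.0
    return sum(w * classifiers[loc](x) for loc, w in coeffs.items() if w > 0) / total

# Toy local classifiers: each locality is an "expert" with its own threshold.
classifiers = {
    0: lambda x: 1.0 if x > 0.2 else 0.0,
    1: lambda x: 1.0 if x > 0.8 else 0.0,
    2: lambda x: 0.0,  # this locality is never consulted for the sample below
}
score = sparse_ensemble_score({0: 0.7, 1: 0.3}, classifiers, x=0.5)
```

Only localities 0 and 1 are evaluated here; locality 2's classifier never runs, which is the efficiency gain the abstract attributes to sparsity.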
  • ABSTRACT: One of the main challenges in interactive concept-based video search is the problem of insufficient relevant samples, especially for queries with complex semantics. In this paper, "related samples" are exploited to enhance interactive video search. The related samples refer to those video segments that are relevant to part of the query rather than the entire query. Compared to the relevant samples, which may be rare, the related samples are usually plentiful and easy to find in search results. Generally, the related samples are visually similar and temporally neighboring to the relevant samples. Based on these two characteristics, we develop a visual ranking model that simultaneously exploits the relevant, related, and irrelevant samples, as well as a temporal ranking model to leverage the temporal relationship between related and relevant samples. An adaptive fusion method is then proposed to optimally combine these two ranking models to generate search results. We conduct extensive experiments on two real-world video datasets: the TRECVID 2008 and YouTube datasets. As the experimental results show, our approach achieves at least 96% and 167% performance improvements over the state-of-the-art approaches on the TRECVID 2008 and YouTube datasets, respectively.
    IEEE Transactions on Multimedia 01/2012; · 1.75 Impact Factor
  • Yan-Tao Zheng, Zheng-Jun Zha, Tat-Seng Chua
    ABSTRACT: Recently, the phenomenal advent of photo-sharing services, such as Flickr and Panoramio, has led to voluminous community-contributed photos with text tags, timestamps, and geographic references on the Internet. The photos, together with their time- and geo-references, become the digital footprints of photo takers and implicitly document their spatiotemporal movements. This study aims to leverage the wealth of these enriched online photos to analyze people's travel patterns at the local level of a tour destination. Specifically, we focus our analysis on two aspects: (1) tourist movement patterns in relation to the regions of attraction (RoA), and (2) topological characteristics of travel routes by different tourists. To do so, we first build a statistically reliable database of travel paths from a noisy pool of community-contributed geotagged photos on the Internet. We then investigate the tourist traffic flow among different RoAs by exploiting the Markov chain model. Finally, the topological characteristics of travel routes are analyzed by performing sequence clustering on tour routes. Tests on four major cities demonstrate promising results of the proposed system.
    ACM Transactions on Intelligent Systems and Technology (TIST). 01/2012; 3(3).
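A first-order Markov chain over regions of attraction, of the kind used above to model tourist traffic flow, can be estimated from observed travel paths by simple frequency counting. The snippet below is a minimal sketch under that assumption (toy paths and RoA names are invented for illustration):

```python
from collections import Counter, defaultdict

# Minimal sketch: estimate first-order Markov transition probabilities
# between regions of attraction (RoAs) from observed tourist paths.

def transition_probs(paths):
    """paths: list of RoA sequences -> {roa: {next_roa: probability}}."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b in zip(path, path[1:]):  # consecutive RoA pairs
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

paths = [
    ["museum", "park", "harbour"],
    ["museum", "park", "tower"],
    ["park", "harbour"],
]
P = transition_probs(paths)
```

Each row of P sums to one, so P can be read directly as the chain's transition matrix; dominant entries indicate the popular "next stop" flows the study analyzes.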
  • ABSTRACT: The known-item search (KIS) task aims to retrieve a unique video or video clip from a video corpus. This paper presents a novel interactive video browsing system for the KIS task. Our system integrates visual content-based, text-based and concept-based search approaches, and allows users to flexibly choose among them. Moreover, two novel feedback schemes are employed: first, users can specify the temporal order of visual and conceptual inputs; second, users can label related samples with respect to visual, textual and conceptual features. Adopting these two feedback schemes greatly enhances search performance.
    Advances in Multimedia Modeling - 18th International Conference, MMM 2012, Klagenfurt, Austria, January 4-6, 2012. Proceedings; 01/2012
  • ABSTRACT: This paper introduces an effective interactive video retrieval system named VisionGo. It jointly exploits human and computer efforts to accomplish video retrieval with high effectiveness and efficiency. It assists the interactive video retrieval process in different aspects: (1) it maximizes the interaction efficiency between human and computer by providing a user interface that supports highly effective user annotation and an intuitive visualization of retrieval results; (2) it employs a multiple-feedback technique that assists users in choosing the proper method to enhance relevance feedback performance; and (3) it helps users assess the retrieval results of motion-related queries by using motion icons instead of static keyframes. Experimental results based on over 160 hours of news video demonstrate the effectiveness of the VisionGo system.
    Inf. Sci. 01/2011; 181:4197-4213.
  • ABSTRACT: Visual summarization of landmarks is an interesting and non-trivial task given the availability of gigantic community-contributed resources. In this work, we investigate ways to generate representative and distinctive views of landmarks by automatically discovering the underlying scenic themes (e.g. sunny, night view, snow, foggy views, etc.) via content-based analysis. The challenge is that the task suffers from the subjectivity of scenic theme understanding, and there is a lack of prior knowledge about scenic themes. In addition, the visual variations of scenic themes are the joint effects of factors including weather, time, and season. To tackle these issues, we exploit the Dirichlet Process Gaussian Mixture Model (DPGMM). The major advantage of the DPGMM is that it is fully unsupervised and does not require the number of components to be fixed beforehand, which avoids the difficulty of adjusting model complexity to prevent over-fitting. This work makes the first attempt towards generating representative views of landmarks via scenic theme mining. Tests on seven famous world landmarks show promising results.
    Advances in Multimedia Modeling - 17th International Multimedia Modeling Conference, MMM 2011, Taipei, Taiwan, January 5-7, 2011, Proceedings, Part I; 01/2011
  • ABSTRACT: Social video sharing websites allow users to annotate videos with descriptive keywords called tags, which greatly facilitate video search and browsing. However, many tags only describe part of the video content, without any temporal indication of when the tag actually appears. Currently, there is very little research on automatically assigning tags to shot-level segments of a video. In this paper, we leverage users' tags as a source to analyze the content within the video and develop a novel system named ShotTagger to assign tags at the shot level. There are two steps to locating tags at the shot level. The first is to estimate the distribution of tags within the video, based on a multiple instance learning framework. The second is to exploit the semantic correlation of a tag with other tags in a video in an optimization framework and impose temporal smoothness across adjacent video shots to refine the tagging results at the shot level. We present different applications to demonstrate the usefulness of the tag localization scheme in searching and browsing videos. A series of experiments conducted on a set of YouTube videos demonstrates the feasibility and effectiveness of our approach.
    Proceedings of the 1st International Conference on Multimedia Retrieval, ICMR 2011, Trento, Italy, April 18 - 20, 2011; 01/2011
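The temporal-smoothness refinement step described above — pulling each shot's tag relevance toward its neighboring shots — can be sketched with a simple neighbor-averaging update (an assumed stand-in for the paper's optimization; the alpha weight and the score vector are hypothetical):

```python
# Toy sketch of temporal smoothing for shot-level tag scores: each shot's
# score is blended with the mean of its adjacent shots' scores, so an
# isolated spike spreads to its temporal neighbors and noise is damped.

def smooth(scores, alpha=0.5):
    """alpha: hypothetical smoothing weight in [0, 1]; 0 = no smoothing."""
    out = []
    for i, s in enumerate(scores):
        nbrs = [scores[j] for j in (i - 1, i + 1) if 0 <= j < len(scores)]
        out.append((1 - alpha) * s + alpha * sum(nbrs) / len(nbrs))
    return out

# One confident detection in shot 1 leaks into adjacent shots after smoothing.
refined = smooth([0.0, 1.0, 0.0, 0.0], alpha=0.5)
```

In the real system this smoothing is coupled with inter-tag semantic correlation in a joint optimization; the sketch isolates only the temporal term.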
  • ABSTRACT: Realistic human action recognition in videos is a useful yet challenging task. Video shots of the same action may present huge intra-class variations in terms of visual appearance, kinetic patterns, video shooting, and editing styles. Heterogeneous feature representations of videos pose another challenge: how to effectively handle the redundancy, complementarity and disagreement among these features. This paper proposes a localized multiple kernel learning (L-MKL) algorithm to tackle these issues. L-MKL integrates localized classifier ensemble learning and multiple kernel learning in a unified framework to leverage the strengths of both. The basis of L-MKL is to build multiple kernel classifiers on diverse features at subspace localities of heterogeneous representations. L-MKL integrates the discriminability of complementary features locally and enables localized MKL classifiers to deliver better performance in their own regions of expertise. Specifically, L-MKL develops a locality gating model to partition the input space of heterogeneous representations into a set of localities of simpler data structure. Each locality then learns its own optimal combination of Mercer kernels of heterogeneous features. Finally, the gating model coordinates the localized multiple kernel classifiers globally to perform action recognition. Experiments on two datasets show that the proposed approach delivers promising performance. Index Terms: action recognition, localized classifier, multiple kernel learning.
    IEEE Transactions on Circuits and Systems for Video Technology 01/2011; 21:1193-1202. · 1.82 Impact Factor
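The gating-plus-local-kernel-mix structure of L-MKL can be illustrated with a deliberately tiny sketch: a hard gate assigns a sample to the nearest locality centre, and each locality applies its own learned weighting of base kernels. Everything here (1-D inputs, the two centres, the fixed weights, nearest-centre gating) is a hypothetical simplification of the paper's learned gating model:

```python
import math

# Toy sketch of localized multiple kernel learning: a gating rule routes a
# sample to a locality, and that locality evaluates its own convex mix of
# base kernels (here an RBF kernel and a linear kernel on scalars).

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def linear(x, y):
    return x * y

localities = [
    {"centre": 0.0, "weights": (0.9, 0.1)},  # locality 0 favours the RBF kernel
    {"centre": 5.0, "weights": (0.2, 0.8)},  # locality 1 favours the linear kernel
]

def local_kernel(x, y):
    """Evaluate the kernel mix of the locality whose centre is nearest to x."""
    loc = min(localities, key=lambda l: abs(x - l["centre"]))
    w_rbf, w_lin = loc["weights"]
    return w_rbf * rbf(x, y) + w_lin * linear(x, y)

k = local_kernel(0.0, 0.0)
```

In the full algorithm both the gate and the per-locality kernel weights are learned jointly; this sketch only shows why different regions of the input space can end up with different effective kernels.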
  • ABSTRACT: The phenomenal advances of photo-sharing services, such as Flickr, have led to voluminous community-contributed photos with socially generated textual, temporal and geographical metadata on the Internet. The photos, together with their time- and geo-references, implicitly document the photographers' spatiotemporal movement paths. This study aims to leverage the wealth of these enriched online photos to analyze people's travel patterns at the local level of a tour destination. First, from a noisy pool of GPS-tagged photos downloaded from the Internet, we build a statistically reliable database of travel paths and mine a list of regions of attraction (RoA). We then investigate the tourist traffic flow among different RoAs by exploiting a Markov chain model. Tests on four major cities demonstrate promising results of the proposed system.
    Advances in Multimedia Modeling - 17th International Multimedia Modeling Conference, MMM 2011, Taipei, Taiwan, January 5-7, 2011, Proceedings, Part I; 01/2011
  • Yan-Tao Zheng, Zheng-Jun Zha, Tat-Seng Chua
    ABSTRACT: In recent years, the emergence of georeferenced media, like geotagged photos, on the Internet has opened up a new world of possibilities for geography-related research and applications. Despite its short history, georeferenced media has been attracting attention from several major research communities, including Computer Vision, Multimedia, Digital Libraries and KDD. This paper provides a comprehensive survey of recent research and applications on online georeferenced media. Specifically, the survey focuses on four aspects: (1) organizing and browsing georeferenced media resources, (2) mining semantic/social knowledge from georeferenced media, (3) learning landmarks in the world, and (4) estimating the geographic location of a photo. Furthermore, based on current technical achievements, open research issues and challenges are identified, and directions that can lead to compelling applications are suggested.
    Multimedia Tools and Applications 01/2011; 51:77-98. · 1.01 Impact Factor
  • ABSTRACT: Content-Based Image Retrieval (CBIR) has attracted increasing attention from both academia and industry. Relevance feedback is one of the most effective techniques to bridge the semantic gap in CBIR. One of the key research problems related to relevance feedback is how to select the most informative images for users to label. In this paper, we propose a novel active learning algorithm, called Locally Regressive G-Optimal Design (LRGOD), for relevance feedback image retrieval. Our assumption is that for each image, its label can be well estimated from its neighbors via a locally regressive function. The LRGOD algorithm is developed based on a locally regressive least squares model which makes use of the labeled and unlabeled images and simultaneously exploits the local structure of each image. The images that can minimize the maximum prediction variance are selected as the most informative ones. We evaluated the proposed LRGOD approach on two real-world image corpora, the Corel and NUS-WIDE-OBJECT [5] datasets, and compared it to three state-of-the-art active learning methods. The experimental results demonstrate the effectiveness of the proposed approach.
    Proceedings of the 1st International Conference on Multimedia Retrieval, ICMR 2011, Trento, Italy, April 18 - 20, 2011; 01/2011
  • ABSTRACT: Most current research on human action recognition in videos uses bag-of-words (BoW) representations based on vector quantization of local spatial-temporal features, due to the simplicity and good performance of such representations. In contrast to BoW schemes, this paper explores a localized, continuous and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of an ensemble of local spatial-temporal (ST) features of a video into a distribution estimated by a generative probabilistic model such as the Gaussian Mixture Model. Furthermore, this probabilistic video representation naturally gives rise to an information-theoretic distance metric between videos. This makes the representation readily applicable as input to most discriminative classifiers, such as nearest neighbor schemes and kernel methods. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach delivers promising results.
    Multimedia and Expo (ICME), 2010 IEEE International Conference on; 08/2010
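An information-theoretic distance between probabilistic video models, as invoked above, can be illustrated with the simplest case: symmetrized KL divergence between two univariate Gaussians (a stand-in for the paper's mixture models, where KL has no closed form and is typically approximated). The closed-form single-Gaussian case below is only a sketch of the idea:

```python
import math

# Illustrative sketch: symmetrized KL divergence as a distance between two
# "videos" each summarized by a 1-D Gaussian N(mean, std^2). The distance is
# zero iff the two models coincide, and grows as the models diverge.

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def sym_kl(m1, s1, m2, s2):
    """Symmetrized KL, usable as a (non-metric) distance between models."""
    return kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)

d_same = sym_kl(0.0, 1.0, 0.0, 1.0)  # identical models -> distance 0
d_diff = sym_kl(0.0, 1.0, 3.0, 1.0)  # shifted mean -> positive distance
```

Such a divergence can then be plugged into a kernel (e.g. exp(-d)) or a nearest-neighbor rule, which is how a probabilistic representation feeds discriminative classifiers.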
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The semantic contextual information is shown to be an important resource for improving the scene and image recognition, but is seldom explored in the literature of previous distance metric learning (DML) for images. In this work, we present a novel Contextual Metric Learning (CML) method for learning a set of contextual distance metrics for real world multi-label images. The relationships between classes are formulated as contextual constraints for the optimization framework to leverage the learning performance. In the experiment, we apply the proposed method for automatic image annotation task. The experimental results show that our approach outperforms the start-of-the-art DML algorithms.
    Advances in Multimedia Information Processing - PCM 2010 - 11th Pacific Rim Conference on Multimedia, Shanghai, China, September 21-24, 2010, Proceedings, Part I; 01/2010
  • ABSTRACT: In this paper, we propose a location-based reminder system with image recognition technology. With this system, mobile phone users can actively capture pictures of their favorite product or event promotional materials. After the phone user sends the picture to a computer server, location-based reminders are downloaded to the phone. The mobile phone alerts the user when he/she is close to the place where the product is being sold or the event is happening. Kd-tree image matching and geometric validation are used to identify which product the user is interested in. A mobile client application is developed to take pictures, conduct GPS location tracking and pop up the reminder.
    Proceedings of the 18th International Conference on Multimedea 2010, Firenze, Italy, October 25-29, 2010; 01/2010