Jia Zeng

Soochow University (PRC), Suzhou, Jiangsu, China

Publications (47) · 62.76 Total Impact Points

  • Jia Zeng, Zhi-Qiang Liu
    ABSTRACT: This state-of-the-art book describes important advances in type-2 fuzzy systems made in the past decade for real-world pattern recognition problems, such as speech recognition, handwriting recognition, and topic modeling. The success of type-2 fuzzy sets has been largely attributed to their three-dimensional membership functions, which handle both randomness and fuzziness uncertainties in real-world problems. In pattern recognition, both features and models have uncertainties, such as nonstationary babble noise in speech signals, large variations in handwritten Chinese character shapes, uncertain meanings of words in topic modeling, and uncertain model parameters due to insufficient and noisy training data. All these uncertainties motivate us to integrate type-2 fuzzy sets with probabilistic graphical models to achieve better overall performance in terms of robustness, generalization ability, or recognition accuracy. For example, we integrate type-2 fuzzy sets with graphical models such as Gaussian mixture models, hidden Markov models, Markov random fields, and latent Dirichlet allocation-based topic models for pattern recognition. The type-2 fuzzy Gaussian mixture models can describe uncertain densities of observations. The type-2 fuzzy hidden Markov models incorporate the first-order Markov chain into the type-2 fuzzy Gaussian mixture models, which is suitable for modeling uncertain speech signals under babble noise. The type-2 fuzzy Markov random fields combine type-2 fuzzy sets with Markov random fields, which can handle large variations in structural patterns such as handwritten Chinese characters. The type-2 fuzzy topic models focus on the uncertain mixed membership of words in different topical clusters, which effectively partitions the observed (visual) words into semantically meaningful topical themes. In conclusion, these real-world pattern recognition applications demonstrate the effectiveness of type-2 fuzzy graphical models for handling uncertainties.
    01/2015; Springer-Verlag and Tsinghua University Press. ISBN: 978-3-662-44689-8
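    The book's central object is the interval type-2 membership function. Below is a minimal sketch, assuming a Gaussian primary membership function whose mean is uncertain within [m1, m2]; this is a standard textbook construction rather than an excerpt from the book, and all names (gauss, it2_gaussian_membership, m1, m2, s) are illustrative. The band between the upper and lower memberships is the footprint of uncertainty that supplies the third dimension mentioned above.

    ```python
    # A minimal sketch of an interval type-2 fuzzy set: a Gaussian primary
    # membership function with mean uncertain in [m1, m2], yielding upper and
    # lower memberships whose gap (footprint of uncertainty) encodes the
    # extra dimension. Assumed example, not taken from the book.
    import numpy as np

    def gauss(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2)

    def it2_gaussian_membership(x, m1, m2, s):
        """Upper/lower membership for a Gaussian MF with uncertain mean."""
        upper = np.where(x < m1, gauss(x, m1, s),
                 np.where(x > m2, gauss(x, m2, s), 1.0))
        lower = np.where(x <= 0.5 * (m1 + m2), gauss(x, m2, s), gauss(x, m1, s))
        return upper, lower

    x = np.linspace(-4, 6, 201)
    ub, lb = it2_gaussian_membership(x, m1=0.5, m2=1.5, s=1.0)
    assert np.all(lb <= ub)  # the band between lb and ub is the footprint
    ```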
  • Lei Xie, Jia Zeng, Zhi-Qiang Liu
    ABSTRACT: The past decade has seen rapid development of probabilistic topic models, notably probabilistic latent semantic indexing (PLSI) and latent Dirichlet allocation (LDA). Originally, topic modeling methods were used to find thematic word clusters, called topics, in a collection of documents. Since bag-of-words (BOW) representations have been widely extended to represent both images and videos, topic modeling techniques have found many important applications in the multimedia area. Typical examples include natural scene categorization, human action recognition, multi-label image annotation, part-of-speech annotation, topic identification, and spoken document segmentation. The advantage of topic models lies in their elegant graphical representations and efficient approximate inference algorithms. Meanwhile, many real-world systems use topic modeling methods to automate feature engineering. However, different applications require investigating different ...
    Soft Computing 12/2014; · 1.30 Impact Factor
  • Source
    ABSTRACT: To solve the big topic modeling problem, we need to reduce both the time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on multi-processor architectures have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors for better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on power law, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm, referred to as POBP, for big topic modeling tasks. Extensive empirical results confirm that POBP has the following advantages for solving the big topic modeling problem: 1) high accuracy, 2) communication efficiency, 3) fast speed, and 4) constant memory usage when compared with recent state-of-the-art parallel LDA algorithms on multi-processor architectures.
    11/2013;
  • Source
    ABSTRACT: For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents have used only one or two of these types of information and are not necessarily effective at integrating different types. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, a simple linear combination can be ineffective because of the limitations of the representation space for combining different types of information (similarities) with different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, for clustering over the LC similarities with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities, and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering, using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistical significance. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, whereas this number was only 1% for ML constraints.
    IEEE Transactions on Cybernetics 08/2013; 43(4):1265-1276.
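    As a hedged illustration of how must-link and cannot-link constraints can steer a spectral cut, here is a generic constraint-aware spectral clustering sketch; it does not reproduce the paper's exact SSNCut objective, and the names (constrained_spectral, ml_pairs, cl_pairs) are invented for the example. ML pairs are forced to maximal similarity and CL pairs to zero before a normalized-Laplacian embedding is clustered with k-means.

    ```python
    # Generic constrained spectral clustering sketch (not the SSNCut code):
    # fold ML/CL constraints into the affinity matrix, then cluster the
    # normalized-Laplacian embedding.
    import numpy as np
    from scipy.cluster.vq import kmeans2

    def constrained_spectral(S, ml_pairs, cl_pairs, k):
        S = S.copy()
        for i, j in ml_pairs:                 # must-link: maximal similarity
            S[i, j] = S[j, i] = 1.0
        for i, j in cl_pairs:                 # cannot-link: zero similarity
            S[i, j] = S[j, i] = 0.0
        d = S.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
        # the k smallest eigenvectors of the normalized Laplacian span the embedding
        vals, vecs = np.linalg.eigh(L_sym)
        U = vecs[:, :k]
        U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
        _, labels = kmeans2(U, k, minit="points")
        return labels
    ```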
  • Source
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision, and computational biology. This paper represents collapsed LDA as a factor graph, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although the two commonly used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have been highly successful in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document datasets. Furthermore, the BP algorithm has the potential to become a generic scheme for learning variants of LDA-based topic models in the collapsed space. To this end, we show how to learn two typical variants of LDA-based topic models, the author-topic model (ATM) and the relational topic model (RTM), using BP based on their factor graph representations.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 05/2013; 35(5):1121-1134.
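    The message update behind this approach is compact enough to sketch. The following is a minimal synchronous sweep in the spirit of the paper's collapsed-space BP for LDA, with illustrative names (mu for messages, n_dw for the document-word count matrix, alpha and beta for the Dirichlet hyperparameters); it sketches the update's structure, not the released implementation.

    ```python
    # Minimal synchronous BP sweep for collapsed LDA (illustrative names).
    # Each nonzero (d, w) pair carries a message mu[d, w], a distribution
    # over K topics, updated from document-side and word-side statistics
    # with the current pair excluded.
    import numpy as np

    def bp_iteration(mu, n_dw, alpha, beta):
        D, W, K = mu.shape
        weighted = mu * n_dw[:, :, None]      # messages weighted by counts
        theta = weighted.sum(axis=1)          # (D, K) per-document topic mass
        phi = weighted.sum(axis=0)            # (W, K) per-word topic mass
        phi_norm = phi.sum(axis=0)            # (K,) total mass per topic
        new_mu = mu.copy()                    # zero-count pairs stay untouched
        for d, w in zip(*n_dw.nonzero()):
            excl = n_dw[d, w] * mu[d, w]      # exclude the current pair
            t = theta[d] - excl + alpha
            p = (phi[w] - excl + beta) / (phi_norm - excl + W * beta)
            msg = t * p
            new_mu[d, w] = msg / msg.sum()
        return new_mu
    ```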
  • Source
    ABSTRACT: Background: Gene duplication, followed by functional evolution of duplicate genes, is a primary engine of evolutionary innovation. In turn, gene expression evolution is a critical component of the overall functional evolution of paralogs. Inferring the evolutionary history of gene expression among paralogs is therefore a problem of considerable interest. It also presents significant challenges. Standard approaches to evolutionary reconstruction assume that at an internal node of the duplication tree, the two duplicates evolve independently. However, because of various selection pressures, the functional evolution of the two paralogs may be coupled. The coupling of paralog evolution corresponds to three major fates of gene duplicates: subfunctionalization (SF), conserved function (CF), or neofunctionalization (NF). Quantitative analysis of these fates is of great interest and clearly influences evolutionary inference of expression. These two interrelated problems of inferring gene expression and the evolutionary fates of gene duplicates have not been studied together previously, and they motivate the present study. Results: Here we propose a novel probabilistic framework and algorithm to simultaneously infer (i) ancestral gene expression and (ii) the likely fate (SF, NF, CF) at each duplication event during the evolution of a gene family. Using tissue-specific gene expression data, we develop a nonparametric belief propagation (NBP) algorithm to predict the ancestral expression level as a proxy for function, and we describe a novel probabilistic model that relates the predicted and known expression levels to the possible evolutionary fates. We validate our model using simulation and then apply it to a genome-wide set of gene duplicates in humans. Conclusions: Our results suggest that SF tends to be more frequent at the earlier stages of gene family expansion, while NF occurs more frequently later on.
    BMC Genomics 01/2013; 14(1). · 4.04 Impact Factor
  • Source
    ABSTRACT: Batch latent Dirichlet allocation (LDA) algorithms play important roles in probabilistic topic modeling, but they are not suitable for processing big data streams due to their high time and space complexity. Online LDA algorithms can not only extract topics from big data streams with constant memory requirements, but also detect topic shifts as the data stream flows. In this paper, we present a novel and easy-to-implement online belief propagation (OBP) algorithm that infers the topic distributions of previously unseen documents incrementally within the stochastic approximation framework. We discuss the intrinsic relations between OBP and online expectation-maximization (OEM) algorithms, and show that OBP can converge to a local stationary point of LDA's likelihood function. Extensive empirical studies confirm that OBP significantly reduces training time and memory usage while achieving a much lower predictive perplexity than current state-of-the-art online LDA algorithms. Due to its ease of use, fast speed, and low memory usage, OBP is a strong candidate for becoming the standard online LDA algorithm.
    10/2012;
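    The stochastic-approximation flavor of such online algorithms can be sketched in a few lines. The update below is a generic Robbins-Monro blend of global word-topic statistics with a mini-batch estimate, with assumed names and step-size schedule (tau, kappa); it is not the paper's exact OBP update.

    ```python
    # Generic online stochastic-approximation update (illustrative, not the
    # paper's exact rule): global statistics are blended with each
    # mini-batch's estimate under a decaying step size, so memory stays
    # constant as the stream flows.
    import numpy as np

    def online_update(phi_stats, batch_stats, t, tau=64.0, kappa=0.7):
        """phi_stats, batch_stats: (W, K) nonnegative sufficient statistics."""
        rho = (tau + t) ** (-kappa)       # Robbins-Monro step size
        return (1.0 - rho) * phi_stats + rho * batch_stats
    ```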
  • Source
    ABSTRACT: This paper presents a novel communication-efficient parallel belief propagation (CE-PBP) algorithm for training latent Dirichlet allocation (LDA). Based on the synchronous belief propagation (BP) algorithm, we first develop a parallel belief propagation (PBP) algorithm on a parallel architecture. Because extensive communication delays often make parallel topic modeling inefficient, we further use Zipf's law to reduce the total communication cost of PBP. Extensive experiments on different data sets demonstrate that CE-PBP achieves higher topic modeling accuracy and reduces communication cost by more than 80% compared with the state-of-the-art parallel Gibbs sampling (PGS) algorithm.
    06/2012;
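    The power-law property that CE-PBP exploits is easy to demonstrate. The snippet below illustrates Zipf's law itself, not the paper's communication protocol: under a Zipfian vocabulary, a small fraction of word types carries most of the token mass, so per-iteration payloads concentrate on a few rows of the word-topic matrix.

    ```python
    # Illustration of Zipf's law over a synthetic vocabulary: the top 5% of
    # word types account for roughly three quarters of all tokens.
    import numpy as np

    W = 100_000
    ranks = np.arange(1, W + 1)
    freq = 1.0 / ranks                    # Zipf's law with exponent 1
    freq /= freq.sum()
    top = int(0.05 * W)                   # top 5% most frequent word types
    print(f"top 5% of types carry {freq[:top].sum():.1%} of token mass")
    ```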
  • Source
    ABSTRACT: As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision, and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages in a large amount of memory, which grows linearly with the number of documents or the number of topics. Therefore, high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate message passing algorithms to non-negative matrix factorization (NMF) algorithms, absorbing the message update into the message passing process and thus avoiding the storage of previous messages. Experimental results on four large data sets confirm that TBP performs comparably to or even better than current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can perform topic modeling when a massive corpus cannot fit in computer memory, for example, extracting thematic topics from a 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory.
    06/2012;
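    The memory-saving idea lends itself to a short sketch: rather than storing a message per token, recompute each token's topic posterior on the fly from the current factor matrices and re-accumulate them, in the manner of an NMF sweep. The code below is illustrative (names theta, phi, n_dw assumed), not the authors' implementation, and keeps memory at O(DK + KW) regardless of corpus size.

    ```python
    # NMF-like sweep without stored messages (illustrative sketch): each
    # token's topic posterior is recomputed from theta (D, K) and phi (K, W)
    # and immediately folded back into fresh factor matrices.
    import numpy as np

    def tbp_sweep(n_dw, theta, phi, alpha, beta):
        new_theta = np.full_like(theta, alpha)
        new_phi = np.full_like(phi, beta)
        for d, w in zip(*n_dw.nonzero()):
            post = theta[d] * phi[:, w]   # recomputed on the fly, never stored
            post /= post.sum()
            new_theta[d] += n_dw[d, w] * post
            new_phi[:, w] += n_dw[d, w] * post
        return new_theta, new_phi / new_phi.sum(axis=1, keepdims=True)
    ```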
  • Source
    ABSTRACT: Fast convergence is a desired property for training latent Dirichlet allocation (LDA), especially in online and parallel topic modeling for massive data sets. This paper presents a novel residual belief propagation (RBP) algorithm to accelerate convergence when training LDA. The proposed RBP uses an informed scheduling scheme for asynchronous message passing, which passes fast-convergent messages with a higher priority to influence slow-convergent messages at each learning iteration. Extensive empirical studies confirm that RBP significantly reduces training time until convergence while achieving a much lower predictive perplexity than other state-of-the-art training algorithms for LDA, including variational Bayes (VB), collapsed Gibbs sampling (GS), loopy belief propagation (BP), and residual VB (RVB).
    04/2012;
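    Residual-based scheduling reduces to a priority queue keyed on how much each unit's messages changed. A schematic sketch follows, with invented names and documents as the scheduling unit (the paper may schedule at a different granularity):

    ```python
    # Schematic residual scheduling (illustrative, not the released code):
    # the document whose messages changed most is updated first, via a
    # max-heap keyed on the L1 residual of the last update.
    import heapq
    import numpy as np

    def residual_schedule(residuals):
        """Build a max-heap of (negative residual, doc id) pairs."""
        heap = [(-r, d) for d, r in enumerate(residuals)]
        heapq.heapify(heap)
        return heap

    def next_doc(heap):
        neg_r, d = heapq.heappop(heap)
        return d, -neg_r

    # after updating doc d's messages, push it back with its new residual:
    # heapq.heappush(heap, (-np.abs(new_msg - old_msg).sum(), d))
    ```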
  • Source
    ABSTRACT: Latent Dirichlet allocation (LDA) is a widely used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. This paper proposes a fast and accurate algorithm, active belief propagation (ABP), for training LDA. Training LDA usually requires repeatedly scanning the entire corpus and searching the complete topic space. Confronted with a massive corpus and a large number of topics, such training iterations are often inefficient and time-consuming. To accelerate training, ABP actively scans a subset of the corpus and searches a subset of the topic space, saving enormous training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP runs around 10 to 100 times faster than several major state-of-the-art algorithms for training LDA, while retaining comparable topic modeling accuracy.
    04/2012;
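    A hedged sketch of the active-scan selection step: each sweep visits only the R documents with the largest accumulated residuals and, within each, only the S highest-mass topics. The names and granularity below are assumptions for illustration, not the released ABP code.

    ```python
    # Active subset selection (illustrative): choose which documents and
    # topics to scan this sweep, instead of the full corpus and topic space.
    import numpy as np

    def active_subsets(doc_residuals, topic_mass, R, S):
        docs = np.argsort(doc_residuals)[::-1][:R]            # top-R residual docs
        topics = np.argsort(topic_mass, axis=1)[:, ::-1][:, :S]  # top-S topics/doc
        return docs, topics
    ```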
  • Source
    Jia Zeng
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision, and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in its BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project, and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend the BP algorithms to learn more complicated topic models. The source code is freely available under the GNU General Public Licence, Version 1.0, at https://mloss.org/software/view/399/.
    Journal of Machine Learning Research 01/2012; 13(1). · 2.85 Impact Factor
  • Source
    ABSTRACT: This paper studies the topic modeling problem for tagged documents and images. Higher-order relations among tagged documents and images are ubiquitous characteristics that play a positive role in extracting reliable and interpretable topics. In this paper, we propose tag-topic models (TTM) to depict such higher-order topic structural dependencies within the Markov random field (MRF) framework. First, we use a novel factor graph representation of latent Dirichlet allocation (LDA)-based topic models from the MRF perspective, and present an efficient loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Second, we propose a factor hypergraph representation of TTM, and focus on both pairwise and higher-order relation modeling among tagged documents and images. An efficient loopy BP algorithm is developed to learn TTM, which encourages topic labeling smoothness among tagged documents and images. Extensive experimental results confirm that incorporating higher-order relations is effective in enhancing overall topic modeling performance, when compared with current state-of-the-art topic models, on many text and image mining tasks of broad interest such as word and link prediction, document classification, and tag recommendation.
    09/2011;
  • Source
    ABSTRACT: This paper presents the coauthor network topic (CNT) model, constructed based on Markov random fields (MRFs) with higher-order cliques. Regularized by complex coauthor network structures, the CNT can simultaneously learn topic distributions and the expertise of authors from large document collections. Besides modeling pairwise relations, we also model higher-order coauthor relations and investigate their effects on topic and expertise modeling. We derive efficient inference and learning algorithms from the Gibbs sampling procedure. To confirm its effectiveness, we apply the CNT to the expert finding problem on a DBLP corpus of titles from six different computer science conferences. Experiments show that the higher-order relations among coauthors improve topic and expertise modeling performance over pairwise relations alone, and thus can find more relevant experts given a query topic or document.
    2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT); 10/2010
  • Source
    ABSTRACT: This paper integrates signal, context, and structure features for genome-wide human promoter recognition, which is important for improving genome annotation and analyzing transcriptional regulation without experimental support from ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters may have biological significance, which is captured by n-mers (sequences of n bases) and their statistics estimated from training samples. Third, sequence-dependent DNA flexibility originates from DNA 3D structure and plays an important role in guiding transcription factors to their target sites in promoters. Employing decision trees, we combine the above signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled data sets and the entire human genome demonstrate that SCS is significantly superior in terms of sensitivity and specificity compared with other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental material for academic use and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.
    IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 07/2010; 7(3):550-62. · 2.25 Impact Factor
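    The combination step can be sketched generically: a decision tree fuses per-window scores from the three feature groups into a promoter call. The feature columns and toy labels below are invented for illustration; this is not the SCS system's actual training pipeline.

    ```python
    # Generic decision-tree fusion of three feature-group scores
    # (signal, context, structure) into a promoter / non-promoter call.
    # Features and labels are synthetic stand-ins.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # columns: [signal_score, context_score, structure_score] per genomic window
    X = rng.random((200, 3))
    y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] > 0.5).astype(int)

    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    print(clf.predict(X[:5]))             # promoter calls for five windows
    ```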
  • Source
    ABSTRACT: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints, such as gene expression, Gene Ontology (GO) annotations, and gene network structure. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem. We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework widely used in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets to find a consistent solution in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure, referred to as Gene Log Likelihood (GLL), that accounts for genes having more than one function and hence belonging to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering, and the proposed GLL is an effective performance measure for soft clustering solutions.
    BMC Bioinformatics 03/2010; 11:164. · 2.67 Impact Factor
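    The POCS iteration itself is generic and worth a sketch: to find a point in the intersection of convex sets, project onto each set in turn until the iterate stabilizes. The example below uses a half-space and a ball as stand-ins for the paper's gene-clustering constraint sets, which are not reproduced here.

    ```python
    # Generic POCS loop: alternate projections onto two convex sets until
    # the iterate lies (approximately) in their intersection.
    import numpy as np

    def project_halfspace(x, a, b):       # {x : a.x <= b}
        v = a @ x - b
        return x if v <= 0 else x - v * a / (a @ a)

    def project_ball(x, c, r):            # {x : |x - c| <= r}
        d = np.linalg.norm(x - c)
        return x if d <= r else c + r * (x - c) / d

    x = np.array([5.0, 5.0])
    a, b = np.array([1.0, 1.0]), 4.0
    c, r = np.zeros(2), 3.0
    for _ in range(100):
        x = project_ball(project_halfspace(x, a, b), c, r)
    # x now satisfies both constraints simultaneously
    ```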
  • Source
    ABSTRACT: Extracting perceptually meaningful strokes plays an essential role in modeling the structure of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation process, we use a BU MRF model with a smoothness prior to segment the character skeleton into directional substrokes based on self-organization of pixel-based directional features. In the high-level stroke extraction process, the segmented substrokes are sent to a TD MRF-based character model that, in turn, feeds back to guide the merging of corresponding substrokes to produce reliable candidate strokes for character recognition. The merit of the cascade MRF model lies in its ability to encode the local statistical dependencies of neighboring stroke components as well as prior knowledge of Chinese character structure. Encouraging stroke extraction and character recognition results confirm the effectiveness of our method, which integrates the BU and TD vision processing streams within a unified MRF framework.
    Information Sciences 01/2010; 180:301-311. · 3.89 Impact Factor
  • Source
    ABSTRACT: Automatically tracking human body parts is a difficult problem because of background clutter, missing body parts, and the high degrees of freedom and complex kinematics of the articulated human body. This paper presents sequential Markov random fields (SMRFs) for automatically tracking and labeling moving human body parts by learning the spatio-temporal structure of human motion in the presence of occlusions and clutter. We employ a hybrid strategy, in which the temporal dependencies between two successive human poses are described by the sequential Monte Carlo method, and the spatial relationships between body parts within a pose are described by Markov random fields. Efficient inference and learning algorithms are developed based on relaxation labeling. Experimental results show that the SMRF can effectively track human body parts in natural scenes.
    20th International Conference on Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August 2010; 01/2010
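    The sequential Monte Carlo half of the hybrid strategy follows the standard predict-reweight-resample loop. Below is a minimal generic particle filter on a 1-D state with assumed names and noise models; the paper's MRF coupling between body parts is omitted.

    ```python
    # Minimal generic particle filter (sequential Monte Carlo): diffuse
    # particles through a random-walk motion model, reweight by a Gaussian
    # observation likelihood, then resample to avoid degeneracy.
    import numpy as np

    def smc_step(particles, weights, observation, rng,
                 motion_std=0.5, obs_std=1.0):
        particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
        weights = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
        weights /= weights.sum()
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        return particles[idx], np.full(len(particles), 1.0 / len(particles))

    rng = np.random.default_rng(0)
    p = rng.normal(0.0, 1.0, 500)
    w = np.full(500, 1.0 / 500)
    for z in [0.2, 0.5, 0.9]:             # toy observation stream
        p, w = smc_step(p, w, z, rng)
    ```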

Publication Stats

344 Citations
62.76 Total Impact Points

Institutions

  • 2010–2013
    • Soochow University (PRC)
      • Department of Computer Science and Technology
      Suzhou, Jiangsu, China
  • 2008–2010
    • Hong Kong Baptist University
      • Department of Computer Science
      Kowloon, Hong Kong
  • 2008–2009
    • Northwestern Polytechnical University
      • Department of Computer Science and Software
      Xi’an, Shaanxi, China
  • 2007–2008
    • The University of Hong Kong
      • Department of Electrical and Electronic Engineering
      Hong Kong, Hong Kong
  • 2004–2008
    • City University of Hong Kong
      • Department of Electronic Engineering
      • School of Creative Media
      Kowloon, Hong Kong