Jia Zeng

Soochow University (PRC), Suzhou, Jiangsu, China

Publications (45) · 52.07 Total Impact Points

  • Source
    ABSTRACT: To solve the big topic modeling problem, we need to reduce both the time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on the multi-processor architecture have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors for better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on the power law, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm, referred to as POBP, for big topic modeling tasks. Extensive empirical results confirm that POBP has the following advantages for solving the big topic modeling problem: 1) high accuracy, 2) low communication cost, 3) fast speed, and 4) constant memory usage when compared with recent state-of-the-art parallel LDA algorithms on the multi-processor architecture.
    11/2013;
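    The abstract does not spell out the communication protocol, but the power-law intuition admits a toy illustration: under a Zipfian vocabulary, a small fraction of words accounts for most token occurrences, so synchronizing only the word-topic counts of frequent words covers most of the probability mass at a fraction of the communication cost. A minimal Python sketch (all numbers are illustrative assumptions, not the paper's):

        import numpy as np

        V = 100_000                  # assumed vocabulary size
        ranks = np.arange(1, V + 1)
        freq = 1.0 / ranks           # Zipf's law: frequency ~ 1/rank
        freq /= freq.sum()

        top = 1_000                  # synchronize only the 1,000 most frequent words
        covered = freq[:top].sum()
        print(f"syncing {top / V:.1%} of word-topic rows covers {covered:.1%} of tokens")
        # -> syncing 1.0% of word-topic rows covers ~62% of tokens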
  • Source
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper represents the collapsed LDA as a factor graph, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although the two commonly used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have achieved great success in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document datasets. Furthermore, the BP algorithm has the potential to become a generic scheme for learning variants of LDA-based topic models in the collapsed space. To this end, we show how to learn two typical variants of LDA-based topic models, the author-topic model (ATM) and the relational topic model (RTM), using BP based on their factor graph representations.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 05/2013; 35(5):1121-1134.
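    The exact collapsed-space derivation is in the paper; loopy BP updates for collapsed LDA in this family typically take the following form (our notation, a reconstruction rather than a quotation). Here \mu_{w,d}(k) is the message that word w in document d sends for topic k, and the subscripts -w,d and w,-d exclude the current message's own contribution:

        \mu_{w,d}(k) \propto \left[ \hat{\theta}_{-w,d}(k) + \alpha \right] \cdot
                             \frac{\hat{\phi}_{w,-d}(k) + \beta}{\sum_{w'} \hat{\phi}_{w',-d}(k) + V\beta}

    where \hat{\theta} and \hat{\phi} accumulate the current messages into document-topic and word-topic statistics, \alpha and \beta are the Dirichlet hyperparameters, and V is the vocabulary size.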
  • Source
    ABSTRACT: For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collections, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents have used only one or two of these types of information and are not necessarily effective at integrating them. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining both the LC and MS information. However, a simple linear combination can be ineffective because of the limitation of the representation space for combining different types of information (similarities) of different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, for clustering over the LC similarities with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities, and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering, using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistically significant differences. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, compared with only 1% of the ML constraints.
    IEEE Transactions on Cybernetics 01/2013; 43(4):1265-1276.
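    A hedged sketch of the constraint-generation step described above: document pairs with high MS (or GC) similarity become must-links and pairs with very low similarity become cannot-links, which are then fed to the semisupervised spectral clusterer. The thresholds and names below are illustrative assumptions, not the paper's values:

        import numpy as np

        def build_constraints(ms_sim, ml_thresh=0.9, cl_thresh=0.05):
            """ms_sim: n x n MeSH-semantic similarity matrix."""
            ml, cl = [], []
            n = ms_sim.shape[0]
            for i in range(n):
                for j in range(i + 1, n):
                    if ms_sim[i, j] >= ml_thresh:
                        ml.append((i, j))    # must-link: cluster together
                    elif ms_sim[i, j] <= cl_thresh:
                        cl.append((i, j))    # cannot-link: keep apart
            return ml, cl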
  • Source
    ABSTRACT: Batch latent Dirichlet allocation (LDA) algorithms play important roles in probabilistic topic modeling, but they are not suitable for processing big data streams due to their high time and space complexity. Online LDA algorithms can not only extract topics from big data streams with constant memory requirements, but also detect topic shifts as the data stream flows. In this paper, we present a novel and easy-to-implement online belief propagation (OBP) algorithm that infers the topic distribution of previously unseen documents incrementally within the stochastic approximation framework. We discuss the intrinsic relations between OBP and online expectation-maximization (OEM) algorithms, and show that OBP can converge to a local stationary point of the LDA likelihood function. Extensive empirical studies confirm that OBP significantly reduces training time and memory usage while achieving a much lower predictive perplexity when compared with current state-of-the-art online LDA algorithms. Due to its ease of use, fast speed and low memory usage, OBP is a strong candidate for becoming the standard online LDA algorithm.
    10/2012;
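    OBP's actual update operates on belief-propagation messages; the "stochastic approximation framework" refers to the constant-memory skeleton shared by online LDA learners, sketched below with an assumed Robbins-Monro step size (tau and kappa are illustrative values, not the paper's):

        def online_update(global_stats, batch_stats, t, tau=64.0, kappa=0.7):
            """Blend sufficient statistics of mini-batch t into the running
            estimate; memory stays constant however long the stream is."""
            rho = (tau + t) ** (-kappa)   # decaying step size: sum(rho) diverges,
                                          # sum(rho^2) converges, so updates settle
            return (1.0 - rho) * global_stats + rho * batch_stats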
  • Source
    ABSTRACT: This paper presents a novel communication-efficient parallel belief propagation (CE-PBP) algorithm for training latent Dirichlet allocation (LDA). Based on the synchronous belief propagation (BP) algorithm, we first develop a parallel belief propagation (PBP) algorithm on the parallel architecture. Because extensive communication delay often causes low efficiency in parallel topic modeling, we further use Zipf's law to reduce the total communication cost in PBP. Extensive experiments on different data sets demonstrate that CE-PBP achieves higher topic modeling accuracy and reduces the communication cost by more than 80% compared with the state-of-the-art parallel Gibbs sampling (PGS) algorithm.
    06/2012;
  • Source
    ABSTRACT: As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, whose memory footprint increases linearly with the number of documents or the number of topics. Therefore, high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate the message passing algorithms to the non-negative matrix factorization (NMF) algorithms, which absorb the message update into the message passing process and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably to or even better than current state-of-the-art training algorithms for LDA, but with much less memory consumption. TBP can perform topic modeling when massive corpora cannot fit in the computer memory, for example, extracting thematic topics from a 7 GB PUBMED corpus on a common desktop computer with 2 GB memory.
    06/2012;
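    A toy sketch of the memory argument (not the paper's exact TBP update): as in multiplicative NMF, keep only the two factor matrices and fold each token's belief straight back into them during the sweep, so no per-token message is ever stored:

        import numpy as np

        def tbp_like_sweep(X, theta, phi, alpha=0.01, beta=0.01):
            """One EM-style sweep. X: D x V word counts (dense for brevity);
            theta: D x K document-topic counts; phi: K x V topic-word counts."""
            V = phi.shape[1]
            new_theta, new_phi = np.zeros_like(theta), np.zeros_like(phi)
            for d, w in zip(*X.nonzero()):
                mu = (theta[d] + alpha) * (phi[:, w] + beta) / (phi.sum(axis=1) + V * beta)
                mu /= mu.sum()                  # belief over K topics for this token
                new_theta[d] += X[d, w] * mu    # accumulate immediately instead of
                new_phi[:, w] += X[d, w] * mu   # storing a message per (d, w) pair
            return new_theta, new_phi

    Memory stays at O(DK + KV) for the two factor matrices, versus O(NK) for keeping one K-vector message per token as in standard message passing.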
  • Source
    ABSTRACT: Fast convergence speed is a desired property for training latent Dirichlet allocation (LDA), especially in online and parallel topic modeling for massive data sets. This paper presents a novel residual belief propagation (RBP) algorithm to accelerate the convergence speed of training LDA. The proposed RBP uses an informed scheduling scheme for asynchronous message passing, which passes fast-convergent messages with a higher priority so that they influence slow-convergent messages at each learning iteration. Extensive empirical studies confirm that RBP significantly reduces the training time until convergence while achieving a much lower predictive perplexity than other state-of-the-art training algorithms for LDA, including variational Bayes (VB), collapsed Gibbs sampling (GS), loopy belief propagation (BP), and residual VB (RVB).
    04/2012;
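    The scheduling idea can be sketched with a max-heap keyed by residual, i.e., by how much each message changed in its last update; the largest-residual message is always re-propagated first. A minimal illustration (in real RBP the residuals would be refreshed after every update rather than fixed as here):

        import heapq

        def residual_schedule(residuals):
            """Yield message indices in descending-residual order."""
            heap = [(-r, i) for i, r in enumerate(residuals)]   # negate for a max-heap
            heapq.heapify(heap)
            while heap:
                _, i = heapq.heappop(heap)
                yield i

        print(list(residual_schedule([0.1, 0.9, 0.4])))   # -> [1, 2, 0]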
  • Source
    ABSTRACT: Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. This paper proposes a fast and accurate algorithm, active belief propagation (ABP), for training LDA. Training LDA usually requires repeated scanning of the entire corpus and searching of the complete topic space. Confronted with a massive corpus and a large number of topics, such training iterations are often inefficient and time-consuming. To accelerate training, ABP actively scans a partial corpus and searches a partial topic space for topic modeling, saving enormous training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP performs around 10 to 100 times faster than some of the major state-of-the-art algorithms for training LDA, while retaining a comparable topic modeling accuracy.
    04/2012;
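    A sketch of the "active" selection step under stated assumptions: rank documents and topics by their current residuals and keep only the top fractions for the next sweep, so each iteration touches a shrinking working set. The fractions are illustrative, not the paper's settings:

        import numpy as np

        def active_subset(doc_residuals, topic_residuals, doc_frac=0.1, topic_frac=0.1):
            """Return indices of documents and topics with the largest residuals."""
            n_docs = max(1, int(len(doc_residuals) * doc_frac))
            n_topics = max(1, int(len(topic_residuals) * topic_frac))
            docs = np.argsort(doc_residuals)[::-1][:n_docs]
            topics = np.argsort(topic_residuals)[::-1][:n_topics]
            return docs, topics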
  • Source
    Jia Zeng
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/MATLAB/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project, and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend the BP algorithms to learn more complicated topic models. The source code is freely available under the GNU General Public License, Version 1.0, at https://mloss.org/software/view/399/.
    Journal of Machine Learning Research 01/2012; 13(1). · 3.42 Impact Factor
  • Source
    ABSTRACT: This paper studies the topic modeling problem of tagged documents and images. Higher-order relations among tagged documents and images are a major and ubiquitous characteristic, and play a positive role in extracting reliable and interpretable topics. In this paper, we propose the tag-topic models (TTM) to depict such higher-order topic structural dependencies within the Markov random field (MRF) framework. First, we use a novel factor graph representation of latent Dirichlet allocation (LDA)-based topic models from the MRF perspective, and present an efficient loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Second, we propose the factor hypergraph representation of TTM, and focus on modeling both pairwise and higher-order relations among tagged documents and images. An efficient loopy BP algorithm is developed to learn TTM, which encourages topic labeling smoothness among tagged documents and images. Extensive experimental results confirm that incorporating higher-order relations is effective in enhancing overall topic modeling performance, when compared with current state-of-the-art topic models, in many text and image mining tasks of broad interest such as word and link prediction, document classification, and tag recommendation.
    09/2011;
  • Source
    ABSTRACT: This paper presents the coauthor network topic (CNT) model, constructed from Markov random fields (MRFs) with higher-order cliques. Regularized by the complex coauthor network structures, the CNT can simultaneously learn topic distributions as well as the expertise of authors from large document collections. Besides modeling pairwise relations, we also model higher-order coauthor relations and investigate their effects on topic and expertise modeling. We derive efficient inference and learning algorithms from the Gibbs sampling procedure. To confirm its effectiveness, we apply the CNT to the expert finding problem on a DBLP corpus of titles from six different computer science conferences. Experiments show that the higher-order relations among coauthors can improve the topic and expertise modeling performance over the case with only pairwise relations, and thus can find more relevant experts given a query topic or document.
    2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT); 10/2010
  • Source
    ABSTRACT: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints, such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints for an optimal clustering solution remains an unsolved problem. We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure, referred to as Gene Log Likelihood (GLL), that considers genes having more than one function and hence belonging to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering, and the proposed GLL is an effective performance measure for soft clustering solutions.
    BMC Bioinformatics 03/2010; 11:164. · 3.02 Impact Factor
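    The generalized POCS iteration that the paper builds on cycles through one projector per constraint set until the solution stops moving, i.e., it lands in the intersection. A generic sketch (the gene-clustering projectors themselves are the paper's contribution and are not reproduced here):

        import numpy as np

        def pocs(x, projectors, tol=1e-6, max_iter=1000):
            """Alternating projections onto convex sets."""
            for _ in range(max_iter):
                x_prev = x.copy()
                for project in projectors:   # one projector per constraint set
                    x = project(x)
                if np.linalg.norm(x - x_prev) < tol:
                    break
            return x

        # Tiny usage: intersect the unit ball with the hyperplane x[0] = 0.5.
        unit_ball = lambda x: x / max(1.0, np.linalg.norm(x))
        hyperplane = lambda x: np.concatenate(([0.5], x[1:]))
        print(pocs(np.array([3.0, 4.0]), [unit_ball, hyperplane]))   # -> [0.5 0.8]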
  • Source
    ABSTRACT: Automatically tracking human body parts is a difficult problem because of background clutter, missing body parts, and the high degrees of freedom and complex kinematics of the articulated human body. This paper presents the sequential Markov random fields (SMRFs) for tracking and labeling moving human body parts automatically by learning the spatio-temporal structures of human motions in the presence of occlusions and clutter. We employ a hybrid strategy, where the temporal dependencies between two successive human poses are described by the sequential Monte Carlo method, and the spatial relationships between body parts in a pose are described by Markov random fields. Efficient inference and learning algorithms are developed based on relaxation labeling. Experimental results show that the SMRF can effectively track human body parts in natural scenes.
    20th International Conference on Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August 2010; 01/2010
  • Source
    ABSTRACT: This paper integrates signal, context, and structure features for genome-wide human promoter recognition, which is important in improving genome annotation and analyzing transcriptional regulation without experimental support from ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters may have biological significance, which is based on n-mers (sequences n bases long) and their statistics estimated from training samples. Third, sequence-dependent DNA flexibility originates from DNA 3D structures and plays an important role in guiding transcription factors to the target sites in promoters. Employing decision trees, we combine the above signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled data sets and the entire human genome demonstrate that SCS is significantly superior in terms of sensitivity and specificity compared to other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental material for academic use at the Computer Society Digital Library: http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.
    IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 01/2010; 7(3):550-62. · 2.25 Impact Factor
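    The combination step can be pictured as concatenating the three feature groups and letting a decision tree arbitrate. The sketch below uses scikit-learn with random stand-in arrays, since the actual feature extractors (CpG signal scores, n-mer context statistics, flexibility profiles) are the paper's contribution:

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        signal = rng.random((200, 2))      # stand-in for CpG-island signal scores
        context = rng.random((200, 16))    # stand-in for n-mer context statistics
        structure = rng.random((200, 8))   # stand-in for DNA-flexibility features
        X = np.hstack([signal, context, structure])
        y = rng.integers(0, 2, 200)        # promoter vs. non-promoter labels

        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)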
  • Source
    ABSTRACT: Extracting perceptually meaningful strokes plays an essential role in modeling the structures of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines both bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation process, we use a BU MRF model with a smoothness prior to segment the character skeleton into directional substrokes based on self-organization of pixel-based directional features. In the high-level stroke extraction process, the segmented substrokes are sent to a TD MRF-based character model that, in turn, feeds back to guide the merging of corresponding substrokes and produce reliable candidate strokes for character recognition. The merit of the cascade MRF model lies in its ability to encode the local statistical dependencies of neighboring stroke components as well as prior knowledge of Chinese character structures. Encouraging stroke extraction and character recognition results confirm the effectiveness of our method, which integrates both BU and TD vision processing streams within a unified MRF framework.
    Information Sciences 01/2010; 180:301-311.
  • Source
    ABSTRACT: DNA rigidity is an important physical property originating from the DNA three-dimensional structure. Although the general DNA rigidity patterns in human promoters have been investigated, their distinct roles in transcription are largely unknown. In this paper, we discover four highly distinct human promoter groups based on the similarity of their rigidity profiles. First, we find that all promoter groups conserve relatively rigid DNA at the canonical TATA box position [a consensus TATA(A/T)A(A/T) sequence], an important physical signal for binding transcription factors. Second, we find that the genes activated by each group of promoters share significant biological functions based on their gene ontology annotations. Finally, we find that these human promoter groups correlate with tissue-specific gene expression.
    Physical Review E 10/2009; 80(4 Pt 1):041917. · 2.31 Impact Factor
  • ABSTRACT: Clustering MEDLINE documents is usually conducted with the vector space model, which computes the content similarity between two documents essentially as the inner product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus has been applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors to be clustered. However, current approaches to using the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text is discarded. Our new strategy has three key points. First, we develop a sound method for measuring the semantic similarity between two documents over the MeSH thesaurus. Second, we combine both the semantic and content similarities to generate an integrated similarity matrix between documents. Third, we apply a spectral approach to clustering documents over the integrated similarity matrix. Using 100 different datasets of MEDLINE records, we conduct extensive experiments with varying alternative measures and parameters. Experimental results show that integrating the semantic and content similarities outperforms using only one of the two similarities, with statistical significance. We further find the best parameter setting, which is consistent over all experimental conditions. We finally show a typical example of the resultant clusters, confirming the effectiveness of our strategy in improving MEDLINE document clustering. Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2009; 25(15):1944-51. · 5.47 Impact Factor
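    The third step of the strategy maps directly onto spectral clustering with a precomputed affinity; how the semantic and content similarities are integrated is the paper's contribution, so the convex weight below is only an assumption for illustration:

        import numpy as np
        from sklearn.cluster import SpectralClustering

        def cluster_documents(semantic_sim, content_sim, n_clusters, weight=0.5):
            """Cluster over an integrated n x n similarity matrix."""
            integrated = weight * semantic_sim + (1.0 - weight) * content_sim
            model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
            return model.fit_predict(integrated)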
  • Jia Zeng, Shanfeng Zhu, Hong Yan
    ABSTRACT: This review describes important advances that have been made during the past decade in genome-wide human promoter recognition. Interest in promoter recognition algorithms on a genome-wide scale is worldwide, and they touch on a number of practical systems that are important in the analysis of gene regulation and in genome annotation without experimental support from ESTs, cDNAs or mRNAs. The main focus of this review is on feature extraction and model selection for accurate human promoter recognition, with descriptions of what they are, what has been accomplished, and what remains to be done.
    Briefings in Bioinformatics 07/2009; 10(5):498-508. · 5.30 Impact Factor

Publication Stats

315 Citations
52.07 Total Impact Points

Institutions

  • 2010–2013
    • Soochow University (PRC)
      • Department of Computer Science and Technology
      Suzhou, Jiangsu, China
  • 2008–2010
    • Hong Kong Baptist University
      • Department of Computer Science
      Kowloon, Hong Kong
  • 2008–2009
    • Northwestern Polytechnical University
      • Department of Computer Science and Software
      Xi’an, Shaanxi, China
  • 2007–2009
    • The University of Hong Kong
      • Department of Electrical and Electronic Engineering
      Hong Kong, Hong Kong
  • 2004–2008
    • City University of Hong Kong
      • Department of Electronic Engineering
      • School of Creative Media
      Kowloon, Hong Kong